
ETC5521: Exploratory Data Analysis


Initial data analysis

Lecturer: Emi Tanaka

ETC5521.Clayton-x@monash.edu

Week 3 - Session 1



Data Analysis

Data analysis is the process of cleaning, transforming, inspecting and modelling data with the aim of extracting information.


  • Data analysis includes:
    • exploratory data analysis,
    • confirmatory data analysis, and
    • initial data analysis.
  • Confirmatory data analysis focuses on statistical inference and includes processes such as hypothesis testing, model selection and predictive modelling... but today's focus is on initial data analysis.

Initial Data Analysis (IDA)

  • There are various definitions of IDA, much like there are numerous definitions of EDA.
  • Some people practice IDA without realising that it is IDA.
  • In other cases, a different name is used to describe the same process: Chatfield (1985) also refers to IDA as the "initial examination of data", Cox & Snell (1981) as "preliminary data analysis", and Rao (1983) as "cross-examination of data".

Chatfield (1985) The Initial Examination of Data. Journal of the Royal Statistical Society. Series A (General) 148.
Cox & Snell (1981) Applied Statistics. London: Chapman and Hall.
Rao (1983) Optimum balance between statistical theory and application in teaching. Proc. of the First Int. Conference on Teaching Statistics, 34-49.


So what is IDA?

What is IDA?

The two main objectives of IDA are:

  1. data description, and
  2. model formulation.

  • IDA differs from the main analysis (i.e. usually fitting the model, conducting significance tests, making inferences or predictions).
  • IDA often goes unreported in data analysis reports or scientific papers because it is seen as "uninteresting" or "obvious".
  • The role of the main analysis is to answer the intended question(s) that the data were collected for.
  • Sometimes IDA alone is sufficient.

1 Data Description Part 1/2

  • Data description should be one of the first steps in a data analysis, assessing the structure and quality of the data.
  • We occasionally refer to this as data sniffing or data scrutinizing.
  • These checks include using common or domain knowledge to see whether the recorded data have sensible values (a small sketch in R follows below). E.g.

    • Are positive quantities, e.g. height and weight, recorded as positive values within a plausible range?

    • If the data are counts, do the recorded values contain non-integer values?

    • For compositional data, do the values add up to 100% (or 1)? If not, is that a measurement error or due to rounding? Or is another variable missing?
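As a small illustration (not from the original slides), here is one way such checks might look in R. The data frame df_check and its columns are hypothetical, made up purely for this sketch:

library(dplyr)

# Hypothetical data with deliberate problems: a negative height,
# a non-integer count, and a composition that doesn't sum to 1.
df_check <- tibble::tibble(
  height = c(172, 165, -180, 154),   # cm; should be positive and plausible
  count  = c(3, 5, 2.5, 7),          # should be non-negative whole numbers
  p1     = c(0.2, 0.5, 0.3, 0.4),
  p2     = c(0.3, 0.2, 0.4, 0.3),
  p3     = c(0.5, 0.3, 0.3, 0.2)     # p1 + p2 + p3 should equal 1
)

df_check %>%
  summarise(
    height_ok = all(height > 0 & height < 250),
    count_ok  = all(count >= 0 & count == round(count)),
    comp_ok   = all(abs(p1 + p2 + p3 - 1) < 1e-6)
  )
# All three checks return FALSE here, flagging values to investigate.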

1 Data Description Part 2/2

  • In addition, numerical or graphical summaries may reveal that there is unwanted structure in the data. E.g.,
    • Does the treatment group have different demographic characteristics to the control group?
    • Does the distribution of the data imply violations of assumptions for the main analysis?
  • Data sniffing or data scrutinizing is a process that you get better at with practice and with familiarity with the domain area.

  • Aside from checking the data structure and data quality, it's important to check how the data are understood by the computer, i.e. checking the data types. E.g.,

    • Was a date read in as a character?
    • Was a factor read in as a numeric?

Next we'll see some illustrative examples and cases based on real data with some R code.


  • Note that there are a variety of ways to do IDA & EDA and you don't need to follow exactly what we show you.

Example 1 Checking the data type Part 1/2

lecture3-example.xlsx

library(readxl)
library(here)
df <- read_excel(here("data/lecture3-example.xlsx"))
df
## # A tibble: 5 x 4
## id date loc temp
## <dbl> <dttm> <chr> <dbl>
## 1 1 2010-01-03 00:00:00 New York 42
## 2 2 2010-02-03 00:00:00 New York 41.4
## 3 3 2010-03-03 00:00:00 New York 38.5
## 4 4 2010-04-03 00:00:00 New York 41.1
## 5 5 2010-05-03 00:00:00 New York 39.8

Any issues here?


Example 1 Checking the data type Part 2/2

library(dplyr)
library(lubridate)
df %>%
  mutate(id = as.factor(id),
         day = day(date),
         month = month(date),
         year = year(date)) %>%
  select(-date)
## # A tibble: 5 x 6
## id loc temp day month year
## <fct> <chr> <dbl> <int> <dbl> <dbl>
## 1 1 New York 42 3 1 2010
## 2 2 New York 41.4 3 2 2010
## 3 3 New York 38.5 3 3 2010
## 4 4 New York 41.1 3 4 2010
## 5 5 New York 39.8 3 5 2010
  • id is now a factor instead of an integer.
  • day, month and year are now extracted from the date.
  • Is it okay now?
  • In the United States, it's common to use the date format MM/DD/YYYY (gasps) while the rest of the world commonly uses DD/MM/YYYY or YYYY/MM/DD.
  • It's highly probable that the dates are the 1st-5th of March and not the 3rd of January-May.
  • You can validate this with other variables, say the temperature here.

Example 1 Checking the data type with R Part 1/3

  • You can robustify your workflow by ensuring you have a check for the expected data type in your code.
xlsx_df <- read_excel(here("data/lecture3-example.xlsx"),
                      col_types = c("text", "date", "text", "numeric")) %>%
  mutate(id = as.factor(id),
         date = as.character(date),
         date = as.Date(date, format = "%Y-%d-%m"))
  • read_csv has broader support for col_types:
csv_df <- read_csv(here("data/lecture3-example.csv"),
                   col_types = cols(
                     id = col_factor(),
                     date = col_date(format = "%m/%d/%y"),
                     loc = col_character(),
                     temp = col_double()))
  • The checks (or coercions) ensure that even if the data are updated, you can have some confidence that any data type error will be picked up before further analysis.

Example 1 Checking the data type with R Part 2/3

You can have a quick glimpse of the data type with:

dplyr::glimpse(xlsx_df)
## Rows: 5
## Columns: 4
## $ id <fct> 1, 2, 3, 4, 5
## $ date <date> 2010-03-01, 2010-03-02, 2010-03-03, 2010-03-04, 2010-03-05
## $ loc <chr> "New York", "New York", "New York", "New York", "New York"
## $ temp <dbl> 42.0, 41.4, 38.5, 41.1, 39.8
dplyr::glimpse(csv_df)
## Rows: 5
## Columns: 4
## $ id <fct> 1, 2, 3, 4, 5
## $ date <date> 2010-03-01, 2010-03-02, 2010-03-03, 2010-03-04, 2010-03-05
## $ loc <chr> "New York", "New York", "New York", "New York", "New York"
## $ temp <dbl> 42.0, 41.4, 38.5, 41.1, 39.8

Example 1 Checking the data type with R Part 3/3

You can also visualise the data type with:

library(visdat)
vis_dat(xlsx_df)

library(inspectdf)
inspect_types(xlsx_df) %>%
  show_plot()


Example 2 Checking the data quality

df2 <- read_csv(here("data/lecture3-example2.csv"),
                col_types = cols(id = col_factor(),
                                 date = col_date(format = "%m/%d/%y"),
                                 loc = col_character(),
                                 temp = col_double()))
df2
## # A tibble: 9 x 4
## id date loc temp
## <fct> <date> <chr> <dbl>
## 1 1 2010-03-01 New York 42
## 2 2 2010-03-02 New York 41.4
## 3 3 2010-03-03 New York 38.5
## 4 4 2010-03-04 New York 41.1
## 5 5 2010-03-05 New York 39.8
## 6 6 2020-03-01 Melbourne 30.6
## 7 7 2020-03-02 Melbourne 17.9
## 8 8 2020-03-03 Melbourne 18.6
## 9 9 2020-03-04 <NA> 21.3
  • Numerical or graphical summaries, or even just eye-balling the data, help to uncover data quality issues.
  • Any issues here?

  • There's a missing value in loc.
  • Temperature is in Fahrenheit for New York but Celsius for Melbourne (you can validate this again using external sources); a sketch of a fix follows below.
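As a sketch (not from the original slides) of how one might start fixing these issues, assuming df2 as read above; converting New York's Fahrenheit readings to Celsius is an illustrative choice, and rows with a missing loc still need follow-up:

library(dplyr)

df2_fixed <- df2 %>%
  mutate(
    # New York temperatures appear to be in Fahrenheit; convert to Celsius.
    # Rows where loc is missing become NA here, flagging them for follow-up.
    temp_c = if_else(loc == "New York", (temp - 32) * 5 / 9, temp)
  )
df2_fixed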

Case study 1 Soybean study in Brazil Part 1/3

data("lehner.soybeanmold", package = "agridat")
skimr::skim(lehner.soybeanmold)
## ── Data Summary ────────────────────────
## Values
## Name lehner.soybeanmold
## Number of rows 382
## Number of columns 9
## _______________________
## Column type frequency:
## factor 4
## numeric 5
## ________________________
## Group variables None
##
## ── Variable type: factor ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate ordered n_unique top_counts
## 1 study 0 1 FALSE 35 S01: 13, S02: 13, S03: 13, S04: 13
## 2 loc 0 1 FALSE 14 Sao: 56, Mon: 44, Pon: 44, Mau: 34
## 3 region 0 1 FALSE 2 Nor: 273, Sou: 109
## 4 trt 0 1 FALSE 13 T01: 35, T02: 35, T03: 35, T04: 35
##
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
## 1 year 0 1 2010. 0.981 2009 2010 2010 2011 2012 ▃▇▁▃▃
## 2 elev 0 1 951. 83.6 737 909 947 1027 1050 ▁▂▅▂▇
## 3 yield 0 1 3117. 764. 1451 2528. 2994. 3642. 4908 ▂▇▇▃▃
## 4 mold 0 1 19.7 18.1 0 7 13 27.2 90.3 ▇▃▂▁▁
## 5 sclerotia 66 0.827 1854. 2034. 0 421 1082 2606 11000 ▇▂▁▁▁

Lehner, M. S., Pethybridge, S. J., Meyer, M. C., & Del Ponte, E. M. (2016). Meta-analytic modelling of the incidence-yield and incidence-sclerotial production relationships in soybean white mould epidemics. Plant Pathology. doi:10.1111/ppa.12590


Case study 1 Soybean study in Brazil Part 2/3

vis_miss(lehner.soybeanmold)

inspect_na(lehner.soybeanmold) %>%
  show_plot()


Case study 1 Soybean study in Brazil Part 3/3

Checking if missing values have different yields:

library(dplyr)
library(ggplot2)
library(naniar)
library(colorspace)
ggplot(lehner.soybeanmold,
       aes(sclerotia, yield)) +
  geom_miss_point() +
  scale_color_discrete_qualitative()

Compare the new with the old data:

soy_old <- lehner.soybeanmold %>%
  filter(year %in% 2010:2011)
soy_new <- lehner.soybeanmold %>%
  filter(year == 2012)
inspect_cor(soy_old, soy_new) %>%
  show_plot()

Sanity check your data


Case study 2 Employment Data in Australia Part 1/3

Below is data from the ABS that shows the total number of people employed in a given month from February 1978 to December 2019, using the original time series.


glimpse(employed)
## Rows: 509
## Columns: 4
## $ date <date> 1978-02-01, 1978-03-01, 1978-04-01, 1978-05-01, 1978-06-01, 1978-07-01, 1978-08-01, 1978-09-01, 1978-10-01, 1978-11-01, 1978-12-01, 1979-01-01, 1979-02-01, 1979-03-01, 1979-04-01, 197…
## $ month <dbl> 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, …
## $ year <fct> 1978, 1978, 1978, 1978, 1978, 1978, 1978, 1978, 1978, 1978, 1978, 1979, 1979, 1979, 1979, 1979, 1979, 1979, 1979, 1979, 1979, 1979, 1979, 1980, 1980, 1980, 1980, 1980, 1980, 1980, 1980…
## $ value <dbl> 5985.660, 6040.561, 6054.214, 6038.265, 6031.342, 6036.084, 6005.361, 6024.313, 6045.855, 6033.797, 6125.360, 5971.329, 6050.693, 6096.175, 6087.654, 6075.611, 6095.734, 6103.922, 6078…

Australian Bureau of Statistics (2020) Labour force, Australia, Table 01. Labour force status by Sex, Australia - Trend, Seasonally adjusted and Original. Viewed 2021-08-09.

Case study 2 Employment Data in Australia Part 2/3

Do you notice anything? (A plotting sketch follows below.)

Why do you think the number of people employed is going up each year?

  • The Australian population was 25.39 million in 2019.
  • There is roughly a 1.5% annual increase in population.
  • The Victorian population was 6.681 million (Sep 2020) - 26%.
  • The NSW population was 8.166 million (Sep 2020) - 32%.
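The slides display a time-series plot at this point; below is a minimal sketch to reproduce something like it, assuming the employed data frame from the previous slide (the '000 unit in the axis label is an assumption based on the magnitudes shown and typical ABS reporting):

library(ggplot2)

# Plot total employment over time to eyeball the trend and any oddities
ggplot(employed, aes(date, value)) +
  geom_line() +
  labs(x = "Date", y = "Employed ('000)")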

Case study 2 Employment Data in Australia Part 3/3

  • There's a suspicious change in the August numbers from 2014.

  • A potential explanation for this is that there was a change in the survey from 2014.

Also see https://robjhyndman.com/hyndsight/abs-seasonal-adjustment-2/

Check if the data collection method has been consistent


Example 3 Experimental layout and data Part 1/2

lecture3-example3.csv

df3 <- read_csv(here::here("data/lecture3-example3.csv"),
                col_types = cols(
                  row = col_factor(),
                  col = col_factor(),
                  yield = col_double(),
                  trt = col_factor(),
                  block = col_factor()))
skimr::skim(df3)
## ── Data Summary ────────────────────────
## Values
## Name df3
## Number of rows 48
## Number of columns 5
## _______________________
## Column type frequency:
## factor 4
## numeric 1
## ________________________
## Group variables None
##
## ── Variable type: factor ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate ordered n_unique top_counts
## 1 row 0 1 FALSE 6 1: 8, 2: 8, 3: 8, 4: 8
## 2 col 0 1 FALSE 8 1: 6, 2: 6, 3: 6, 4: 6
## 3 trt 0 1 FALSE 9 non: 16, hi : 4, hi : 4, hi : 4
## 4 block 0 1 FALSE 4 B3: 12, B1: 12, B2: 12, B4: 12
##
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
## 1 yield 0 1 246. 16.0 204 237 248 257. 273 ▂▂▇▇▅

Example 3 Experimental layout and data Part 2/2

  • The experiment tests the effects of 9 fertilizer treatments on the yield of Brussels sprouts on a field laid out in a rectangular array of 6 rows and 8 columns.

  • High sulphur and high manure seem to be best for the yield of Brussels sprouts.
  • Any issues here?

Check if the experimental layout given in the data matches the description

In particular, check with a plot (see the sketch below) to see whether treatments are randomised.
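Below is a minimal sketch of such a layout check, assuming df3 as read in Example 3; using geom_tile with treatment as the fill is one illustrative choice, not necessarily what the original slides used:

library(ggplot2)

# One tile per field plot, coloured by treatment: in a randomised design
# treatments should appear scattered, not clustered by row or column.
ggplot(df3, aes(col, row, fill = trt)) +
  geom_tile(colour = "white") +
  geom_text(aes(label = trt), size = 2) +
  labs(x = "Column", y = "Row", fill = "Treatment")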

Statistical Value Chain

... a statistical value chain is constructed by defining a number of meaningful intermediate data products, for which a chosen set of quality attributes are well described ...

— van der Loo & de Jonge (2018)

Schema from Mark van der Loo and Edwin de Jonge. 2018. Statistical Data Cleaning with Applications in R. John Wiley and Sons Ltd.


Case study 3 Dutch supermarket revenue and cost Part 1/3

  • The data contain the revenue and cost (in Euros) for 60 supermarkets
  • The data have been anonymised and distorted
## Rows: 60
## Columns: 11
## $ id <fct> RET01, RET02, RET03, RET04, RET05, RET06, RET07, RET08, RET09, RET10, RET11, RET12, RET13, RET14, RET15, RET16, RET17, RET18, RET19, RET20, RET21, RET22, RET23, RET24, RET25, RET…
## $ size <fct> sc0, sc3, sc3, sc3, sc3, sc0, sc3, sc1, sc3, sc2, sc2, sc2, sc3, sc1, sc1, sc0, sc3, sc1, sc2, sc3, sc0, sc0, sc1, sc1, sc2, sc3, sc2, sc3, sc0, sc2, sc3, sc2, sc3, sc3, sc3, sc3…
## $ incl.prob <dbl> 0.02, 0.14, 0.14, 0.14, 0.14, 0.02, 0.14, 0.02, 0.14, 0.05, 0.05, 0.05, 0.14, 0.02, 0.02, 0.02, 0.14, 0.02, 0.05, 0.14, 0.02, 0.02, 0.02, 0.02, 0.05, 0.14, 0.05, 0.14, 0.02, 0.05…
## $ staff <int> 75, 9, NA, NA, NA, 1, 5, 3, 6, 5, 5, 5, 13, NA, 3, 52, 10, 4, 3, 8, 2, 3, 2, 4, 3, 6, 2, 16, 1, 6, 29, 8, 13, 9, 15, 14, 6, 53, 7, NA, 20, 2, NA, 1, 3, 1, 60, 8, 10, 12, 7, 24, 2…
## $ turnover <int> NA, 1607, 6886, 3861, NA, 25, NA, 404, 2596, NA, 645, 2872, 5678, 931397, 80000, 9067, 1500, 440, 690, 1852, 359, 839, 471, 933, 1665, 2318, 1175, 2946, 492, 1831, 7271, 971, 411…
## $ other.rev <int> NA, NA, -33, 13, 37, NA, NA, 13, NA, NA, NA, NA, 12, NA, NA, 622, 20, NA, NA, NA, 9, NA, NA, 2, NA, NA, 12, 7, NA, 1831, 30, NA, 11, NA, 33, 98350, 4, NA, 38, 98, 11, NA, NA, NA,…
## $ total.rev <int> 1130, 1607, 6919, 3874, 5602, 25, 1335, 417, 2596, NA, 645, 2872, 5690, 931397, NA, 9689, 1520, 440, 690, 1852, 368, 839, 471, 935, 1665, 2318, 1187, 2953, 492, 1831, 7301, 107, …
## $ staff.costs <int> NA, 131, 324, 290, 314, NA, 135, NA, 147, NA, 130, 182, 326, 36872, 40000, 1125, 195, 16, 19000, 120, NA, 2, 34, 31, 70, 184, 114, 245, NA, 53, 451, 28, 57, 106, 539, 221302, 64,…
## $ total.costs <int> 18915, 1544, 6493, 3600, 5530, 22, 136, 342, 2486, NA, 636, 2652, 5656, 841489, NA, 9911, 1384, 379, 464507, 1812, 339, 717, 411, 814, 186, 390, NA, 2870, 470, 1443, 7242, 95, 36…
## $ profit <int> 20045, 63, 426, 274, 72, 3, 1, 75, 110, NA, 9, 220, 34, 89908, NA, -222, 136, 60, 225493, 40, 29, 122, 60, 121, 1478, 86, 17, 83, 22, 388, 59, 100, 528, 160, 282, 22457, 37, -160…
## $ vat <int> NA, NA, NA, NA, NA, NA, 1346, NA, NA, NA, NA, NA, NA, 863, 813, 964, 733, 296, 486, 1312, 257, 654, 377, 811, 1472, 2082, 1058, 2670, 449, 1695, 6754, 905, 3841, 2668, 2758, 2548…

Case study 3 Dutch supermarket revenue and cost Part 2/3

  • Checking for completeness of records
library(validate)
data("SBS2000", package = "validate")
rules <- validator(
  is_complete(id),
  is_complete(id, turnover),
  is_complete(id, turnover, profit))
out <- confront(SBS2000, rules)
summary(out)
## name items passes fails nNA error warning expression
## 1 V1 60 60 0 0 FALSE FALSE is_complete(id)
## 2 V2 60 56 4 0 FALSE FALSE is_complete(id, turnover)
## 3 V3 60 52 8 0 FALSE FALSE is_complete(id, turnover, profit)

Case study 3 Dutch supermarket revenue and cost Part 3/3

  • Sanity check derived variables
library(validate)
rules <- validator(
  total.rev - profit == total.costs,
  turnover + other.rev == total.rev,
  profit <= 0.6 * total.rev)
out <- confront(SBS2000, rules)
summary(out)
## name items passes fails nNA error warning expression
## 1 V1 60 39 14 7 FALSE FALSE abs(total.rev - profit - total.costs) < 1e-08
## 2 V2 60 19 4 37 FALSE FALSE abs(turnover + other.rev - total.rev) < 1e-08
## 3 V3 60 49 6 5 FALSE FALSE (profit - 0.6 * total.rev) <= 1e-08

Take away messages

  • Sanity check your data:
    • by validating the variable types
    • with independent or external sources
    • by checking the data quality
  • Check if the data collection method has been consistent
  • Check if the experimental layout given in the data matches the description
  • Consider if or how the data were derived for a further sanity check of your data

Next we'll have a look at
2 Model formulation

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Lecturer: Emi Tanaka

ETC5521.Clayton-x@monash.edu

Week 3 - Session 1

