Lecturer: Emi Tanaka
ETC5521.Clayton-x@monash.edu
Week 3 - Session 1
Data analysis is a process of cleaning, transforming, inspecting and modelling data with the aim of extracting information.
Chatfield (1985) The Initial Examination of Data. Journal of the Royal Statistical Society. Series A (General) 148
Cox & Snell (1981) Applied Statistics. London: Chapman and Hall.
Rao (1983) Optimum balance between statistical theory and application in teaching. Proc. of the First Int Conference on Teaching Statistics 34-49
So what is IDA?
The two main objectives for IDA are:
These include using common or domain knowledge to check whether the recorded data have sensible values. E.g.
Are quantities that should be positive, e.g. height and weight, recorded as positive values within a plausible range?
If the data are counts, do the recorded values contain non-integer values?
For compositional data, do the values add up to 100% (or 1)? If not, is that a measurement error, an artefact of rounding, or is another variable missing?
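Checks like these are easy to script. Below is a minimal sketch; the `patients` data frame and its columns are hypothetical, invented purely to illustrate the range and integer checks described above:

```r
# Hypothetical data: one impossible height and one non-integer count
patients <- data.frame(height = c(172, -165, 180),
                       visits = c(2, 3.5, 1))

# Positive quantities should sit within a plausible range
all(patients$height > 0 & patients$height < 250)  # FALSE: -165 is impossible

# Counts should be whole numbers
all(patients$visits == round(patients$visits))    # FALSE: 3.5 is not a count
```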
Data sniffing, or data scrutinizing, is a process that you get better at with practice and as you gain familiarity with the domain area.
Aside from checking the data structure or data quality, it's important to check how the data are understood by the computer, i.e. to check each variable's data type. E.g.,
lecture3-example.xlsx
library(readxl)
library(here)
df <- read_excel(here("data/lecture3-example.xlsx"))
df
## # A tibble: 5 x 4
##      id date                loc       temp
##   <dbl> <dttm>              <chr>    <dbl>
## 1     1 2010-01-03 00:00:00 New York  42
## 2     2 2010-02-03 00:00:00 New York  41.4
## 3     3 2010-03-03 00:00:00 New York  38.5
## 4     4 2010-04-03 00:00:00 New York  41.1
## 5     5 2010-05-03 00:00:00 New York  39.8
Any issues here?
library(lubridate)
df %>%
  mutate(id = as.factor(id),
         day = day(date),
         month = month(date),
         year = year(date)) %>%
  select(-date)

## # A tibble: 5 x 6
##   id    loc       temp   day month  year
##   <fct> <chr>    <dbl> <int> <dbl> <dbl>
## 1 1     New York  42       3     1  2010
## 2 2     New York  41.4     3     2  2010
## 3 3     New York  38.5     3     3  2010
## 4 4     New York  41.1     3     4  2010
## 5 5     New York  39.8     3     5  2010

id is now a factor instead of an integer; day, month and year are now extracted from the date.
xlsx_df <- read_excel(here("data/lecture3-example.xlsx"),
                      col_types = c("text", "date", "text", "numeric")) %>%
  mutate(id = as.factor(id),
         date = as.character(date),
         date = as.Date(date, format = "%Y-%d-%m"))
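The format string is doing the repair here: Excel stored the dates month-first, so the character dates are re-parsed with day and month swapped back. A minimal illustration of how `as.Date` format strings control this (the date value is invented for illustration):

```r
# "2010-01-03" parsed as year-day-month gives March 1st, not January 3rd
as.Date("2010-01-03", format = "%Y-%d-%m")  # → "2010-03-01"
```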
read_csv has broader support for col_types:
csv_df <- read_csv(here("data/lecture3-example.csv"),
                   col_types = cols(
                     id = col_factor(),
                     date = col_date(format = "%m/%d/%y"),
                     loc = col_character(),
                     temp = col_double()))
You can get a quick glimpse of the data types with:
dplyr::glimpse(xlsx_df)
## Rows: 5
## Columns: 4
## $ id   <fct> 1, 2, 3, 4, 5
## $ date <date> 2010-03-01, 2010-03-02, 2010-03-03, 2010-03-04, 2010-03-05
## $ loc  <chr> "New York", "New York", "New York", "New York", "New York"
## $ temp <dbl> 42.0, 41.4, 38.5, 41.1, 39.8
dplyr::glimpse(csv_df)
## Rows: 5
## Columns: 4
## $ id   <fct> 1, 2, 3, 4, 5
## $ date <date> 2010-03-01, 2010-03-02, 2010-03-03, 2010-03-04, 2010-03-05
## $ loc  <chr> "New York", "New York", "New York", "New York", "New York"
## $ temp <dbl> 42.0, 41.4, 38.5, 41.1, 39.8
You can also visualise the data types with:
library(visdat)
vis_dat(xlsx_df)
library(inspectdf)
inspect_types(xlsx_df) %>% show_plot()
df2 <- read_csv(here("data/lecture3-example2.csv"),
                col_types = cols(id = col_factor(),
                                 date = col_date(format = "%m/%d/%y"),
                                 loc = col_character(),
                                 temp = col_double()))
df2

## # A tibble: 9 x 4
##   id    date       loc        temp
##   <fct> <date>     <chr>     <dbl>
## 1 1     2010-03-01 New York   42
## 2 2     2010-03-02 New York   41.4
## 3 3     2010-03-03 New York   38.5
## 4 4     2010-03-04 New York   41.1
## 5 5     2010-03-05 New York   39.8
## 6 6     2020-03-01 Melbourne  30.6
## 7 7     2020-03-02 Melbourne  17.9
## 8 8     2020-03-03 Melbourne  18.6
## 9 9     2020-03-04 <NA>       21.3
Note the missing value in loc.

data("lehner.soybeanmold", package = "agridat")
skimr::skim(lehner.soybeanmold)
## ── Data Summary ────────────────────────
##                            Values
## Name                       lehner.soybeanmold
## Number of rows             382
## Number of columns          9
## _______________________
## Column type frequency:
##   factor                   4
##   numeric                  5
## ________________________
## Group variables            None
##
## ── Variable type: factor ────────────────────────────────────────────
##   skim_variable n_missing complete_rate ordered n_unique top_counts
## 1 study                 0             1 FALSE         35 S01: 13, S02: 13, S03: 13, S04: 13
## 2 loc                   0             1 FALSE         14 Sao: 56, Mon: 44, Pon: 44, Mau: 34
## 3 region                0             1 FALSE          2 Nor: 273, Sou: 109
## 4 trt                   0             1 FALSE         13 T01: 35, T02: 35, T03: 35, T04: 35
##
## ── Variable type: numeric ───────────────────────────────────────────
##   skim_variable n_missing complete_rate  mean    sd   p0   p25   p50   p75  p100 hist
## 1 year                  0         1     2010. 0.981 2009  2010  2010  2011  2012 ▃▇▁▃▃
## 2 elev                  0         1      951. 83.6   737   909   947  1027  1050 ▁▂▅▂▇
## 3 yield                 0         1     3117. 764.  1451 2528. 2994. 3642.  4908 ▂▇▇▃▃
## 4 mold                  0         1      19.7  18.1    0     7    13  27.2  90.3 ▇▃▂▁▁
## 5 sclerotia            66         0.827 1854. 2034.    0   421  1082  2606 11000 ▇▂▁▁▁
Lehner, M. S., Pethybridge, S. J., Meyer, M. C., & Del Ponte, E. M. (2016). Meta-analytic modelling of the incidence-yield and incidence-sclerotial production relationships in soybean white mould epidemics. Plant Pathology. doi:10.1111/ppa.12590
vis_miss(lehner.soybeanmold)
inspect_na(lehner.soybeanmold) %>% show_plot()
Checking whether observations with missing sclerotia values have different yields:
library(naniar)
ggplot(lehner.soybeanmold, aes(sclerotia, yield)) +
  geom_miss_point() +
  scale_color_discrete_qualitative()
Compare the new data with the old:
soy_old <- lehner.soybeanmold %>% filter(year %in% 2010:2011)
soy_new <- lehner.soybeanmold %>% filter(year == 2012)
inspect_cor(soy_old, soy_new) %>% show_plot()
Below are data from the ABS showing the total number of people employed in a given month from February 1978 to December 2019, using the original time series.
glimpse(employed)
## Rows: 509
## Columns: 4
## $ date  <date> 1978-02-01, 1978-03-01, 1978-04-01, 1978-05-01, 1978-06-01, 1978-07-01, 1978-08-01, 1978-09-01, 1978-10-01, 1978-11-01, 1978-12-01, 1979-01-01, 1979-02-01, 1979-03-01, 1979-04-01, 197…
## $ month <dbl> 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, …
## $ year  <fct> 1978, 1978, 1978, 1978, 1978, 1978, 1978, 1978, 1978, 1978, 1978, 1979, 1979, 1979, 1979, 1979, 1979, 1979, 1979, 1979, 1979, 1979, 1979, 1980, 1980, 1980, 1980, 1980, 1980, 1980, 1980…
## $ value <dbl> 5985.660, 6040.561, 6054.214, 6038.265, 6031.342, 6036.084, 6005.361, 6024.313, 6045.855, 6033.797, 6125.360, 5971.329, 6050.693, 6096.175, 6087.654, 6075.611, 6095.734, 6103.922, 6078…
Australian Bureau of Statistics, 2020, Labour force, Australia, Table 01. Labour force status by Sex, Australia - Trend, Seasonally adjusted and Original, viewed 2021-08-09,
Do you notice anything?
Why do you think the number of people employed is going up each year?
lecture3-example3.csv
df3 <- read_csv(here::here("data/lecture3-example3.csv"),
                col_types = cols(
                  row = col_factor(),
                  col = col_factor(),
                  yield = col_double(),
                  trt = col_factor(),
                  block = col_factor()))
skimr::skim(df3)
## ── Data Summary ────────────────────────
##                            Values
## Name                       df3
## Number of rows             48
## Number of columns          5
## _______________________
## Column type frequency:
##   factor                   4
##   numeric                  1
## ________________________
## Group variables            None
##
## ── Variable type: factor ────────────────────────────────────────────
##   skim_variable n_missing complete_rate ordered n_unique top_counts
## 1 row                   0             1 FALSE          6 1: 8, 2: 8, 3: 8, 4: 8
## 2 col                   0             1 FALSE          8 1: 6, 2: 6, 3: 6, 4: 6
## 3 trt                   0             1 FALSE          9 non: 16, hi : 4, hi : 4, hi : 4
## 4 block                 0             1 FALSE          4 B3: 12, B1: 12, B2: 12, B4: 12
##
## ── Variable type: numeric ───────────────────────────────────────────
##   skim_variable n_missing complete_rate mean   sd  p0 p25 p50  p75 p100 hist
## 1 yield                 0             1 246. 16.0 204 237 248 257.  273 ▂▂▇▇▅
In particular, check with a plot to see whether treatments are randomised.
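One way to check this, as a sketch (assuming df3 has been read in as above), is to tile the field layout and colour each plot by its treatment; treatments that cluster within rows or columns suggest a lack of randomisation:

```r
library(ggplot2)
# Draw the field layout: one tile per plot, coloured by treatment
ggplot(df3, aes(col, row, fill = trt)) +
  geom_tile(colour = "black")
```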
... a statistical value chain is constructed by defining a number of meaningful intermediate data products, for which a chosen set of quality attributes are well described ...
— van der Loo & de Jonge (2018)
Schema from Mark van der Loo and Edwin de Jonge. 2018. Statistical Data Cleaning with Applications in R. John Wiley and Sons Ltd.
glimpse(SBS2000)

## Rows: 60
## Columns: 11
## $ id          <fct> RET01, RET02, RET03, RET04, RET05, RET06, RET07, RET08, RET09, RET10, RET11, RET12, RET13, RET14, RET15, RET16, RET17, RET18, RET19, RET20, RET21, RET22, RET23, RET24, RET25, RET…
## $ size        <fct> sc0, sc3, sc3, sc3, sc3, sc0, sc3, sc1, sc3, sc2, sc2, sc2, sc3, sc1, sc1, sc0, sc3, sc1, sc2, sc3, sc0, sc0, sc1, sc1, sc2, sc3, sc2, sc3, sc0, sc2, sc3, sc2, sc3, sc3, sc3, sc3…
## $ incl.prob   <dbl> 0.02, 0.14, 0.14, 0.14, 0.14, 0.02, 0.14, 0.02, 0.14, 0.05, 0.05, 0.05, 0.14, 0.02, 0.02, 0.02, 0.14, 0.02, 0.05, 0.14, 0.02, 0.02, 0.02, 0.02, 0.05, 0.14, 0.05, 0.14, 0.02, 0.05…
## $ staff       <int> 75, 9, NA, NA, NA, 1, 5, 3, 6, 5, 5, 5, 13, NA, 3, 52, 10, 4, 3, 8, 2, 3, 2, 4, 3, 6, 2, 16, 1, 6, 29, 8, 13, 9, 15, 14, 6, 53, 7, NA, 20, 2, NA, 1, 3, 1, 60, 8, 10, 12, 7, 24, 2…
## $ turnover    <int> NA, 1607, 6886, 3861, NA, 25, NA, 404, 2596, NA, 645, 2872, 5678, 931397, 80000, 9067, 1500, 440, 690, 1852, 359, 839, 471, 933, 1665, 2318, 1175, 2946, 492, 1831, 7271, 971, 411…
## $ other.rev   <int> NA, NA, -33, 13, 37, NA, NA, 13, NA, NA, NA, NA, 12, NA, NA, 622, 20, NA, NA, NA, 9, NA, NA, 2, NA, NA, 12, 7, NA, 1831, 30, NA, 11, NA, 33, 98350, 4, NA, 38, 98, 11, NA, NA, NA,…
## $ total.rev   <int> 1130, 1607, 6919, 3874, 5602, 25, 1335, 417, 2596, NA, 645, 2872, 5690, 931397, NA, 9689, 1520, 440, 690, 1852, 368, 839, 471, 935, 1665, 2318, 1187, 2953, 492, 1831, 7301, 107, …
## $ staff.costs <int> NA, 131, 324, 290, 314, NA, 135, NA, 147, NA, 130, 182, 326, 36872, 40000, 1125, 195, 16, 19000, 120, NA, 2, 34, 31, 70, 184, 114, 245, NA, 53, 451, 28, 57, 106, 539, 221302, 64,…
## $ total.costs <int> 18915, 1544, 6493, 3600, 5530, 22, 136, 342, 2486, NA, 636, 2652, 5656, 841489, NA, 9911, 1384, 379, 464507, 1812, 339, 717, 411, 814, 186, 390, NA, 2870, 470, 1443, 7242, 95, 36…
## $ profit      <int> 20045, 63, 426, 274, 72, 3, 1, 75, 110, NA, 9, 220, 34, 89908, NA, -222, 136, 60, 225493, 40, 29, 122, 60, 121, 1478, 86, 17, 83, 22, 388, 59, 100, 528, 160, 282, 22457, 37, -160…
## $ vat         <int> NA, NA, NA, NA, NA, NA, 1346, NA, NA, NA, NA, NA, NA, 863, 813, 964, 733, 296, 486, 1312, 257, 654, 377, 811, 1472, 2082, 1058, 2670, 449, 1695, 6754, 905, 3841, 2668, 2758, 2548…
library(validate)
rules <- validator(
  is_complete(id),
  is_complete(id, turnover),
  is_complete(id, turnover, profit))
out <- confront(SBS2000, rules)
summary(out)
##   name items passes fails nNA error warning                        expression
## 1   V1    60     60     0   0 FALSE   FALSE                   is_complete(id)
## 2   V2    60     56     4   0 FALSE   FALSE         is_complete(id, turnover)
## 3   V3    60     52     8   0 FALSE   FALSE is_complete(id, turnover, profit)
library(validate)
rules <- validator(
  total.rev - profit == total.costs,
  turnover + other.rev == total.rev,
  profit <= 0.6 * total.rev)
out <- confront(SBS2000, rules)
summary(out)
##   name items passes fails nNA error warning                                    expression
## 1   V1    60     39    14   7 FALSE   FALSE abs(total.rev - profit - total.costs) < 1e-08
## 2   V2    60     19     4  37 FALSE   FALSE abs(turnover + other.rev - total.rev) < 1e-08
## 3   V3    60     49     6   5 FALSE   FALSE           (profit - 0.6 * total.rev) <= 1e-08
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.