Lecturer: Emi Tanaka
ETC5521.Clayton-x@monash.edu
Week 3 - Session 1
Data analysis is a process of cleaning, transforming, inspecting and modelling data with the aim of extracting information.
Chatfield (1985) The Initial Examination of Data. Journal of the Royal Statistical Society. Series A (General) 148
Cox & Snell (1981) Applied Statistics. London: Chapman and Hall.
Rao (1983) Optimum balance between statistical theory and application in teaching. Proc. of the First Int Conference on Teaching Statistics 34-49
So what is IDA?
The two main objectives for IDA are:
These include using common or domain knowledge to check whether the recorded data have sensible values. E.g.
Are quantities that should be positive, e.g. height and weight, recorded as positive values within a plausible range?
If the data are counts, do the recorded values contain non-integer values?
For compositional data, do the values add up to 100% (or 1)? If not, is that a measurement error, an artefact of rounding, or is another variable missing?
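Checks like these are easy to script. Below is a minimal sketch; the `patients` data frame and its columns are hypothetical, invented purely to illustrate the range and integer checks described above:

```r
# Hypothetical data: one impossible height and one non-integer count
patients <- data.frame(height = c(172, -165, 180),
                       visits = c(2, 3.5, 1))

# Positive quantities should sit within a plausible range
all(patients$height > 0 & patients$height < 250)  # FALSE: -165 is impossible

# Counts should be whole numbers
all(patients$visits == round(patients$visits))    # FALSE: 3.5 is not a count
```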
Data sniffing, or data scrutinizing, is a process that you get better at with practice and as you gain familiarity with the domain area.
Aside from checking the data structure or data quality, it's important to check how the data are understood by the computer, i.e. to check each variable's data type. E.g.,
lecture3-example.xlsx
library(readxl)
library(here)
df <- read_excel(here("data/lecture3-example.xlsx"))
df
## # A tibble: 5 x 4
##      id date                loc       temp
##   <dbl> <dttm>              <chr>    <dbl>
## 1     1 2010-01-03 00:00:00 New York  42
## 2     2 2010-02-03 00:00:00 New York  41.4
## 3     3 2010-03-03 00:00:00 New York  38.5
## 4     4 2010-04-03 00:00:00 New York  41.1
## 5     5 2010-05-03 00:00:00 New York  39.8
Any issues here?
library(lubridate)
df %>%
  mutate(id = as.factor(id),
         day = day(date),
         month = month(date),
         year = year(date)) %>%
  select(-date)

## # A tibble: 5 x 6
##   id    loc       temp   day month  year
##   <fct> <chr>    <dbl> <int> <dbl> <dbl>
## 1 1     New York  42       3     1  2010
## 2 2     New York  41.4     3     2  2010
## 3 3     New York  38.5     3     3  2010
## 4 4     New York  41.1     3     4  2010
## 5 5     New York  39.8     3     5  2010

id is now a factor instead of an integer; day, month and year are now extracted from the date.
xlsx_df <- read_excel(here("data/lecture3-example.xlsx"),
                      col_types = c("text", "date", "text", "numeric")) %>%
  mutate(id = as.factor(id),
         date = as.character(date),
         date = as.Date(date, format = "%Y-%d-%m"))
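The format string is doing the repair here: Excel stored the dates month-first, so the character dates are re-parsed with day and month swapped back. A minimal illustration of how `as.Date` format strings control this (the date value is invented for illustration):

```r
# "2010-01-03" parsed as year-day-month gives March 1st, not January 3rd
as.Date("2010-01-03", format = "%Y-%d-%m")  # → "2010-03-01"
```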
read_csv has broader support for col_types:
csv_df <- read_csv(here("data/lecture3-example.csv"),
                   col_types = cols(
                     id = col_factor(),
                     date = col_date(format = "%m/%d/%y"),
                     loc = col_character(),
                     temp = col_double()))
You can get a quick glimpse of the data types with:
dplyr::glimpse(xlsx_df)
## Rows: 5
## Columns: 4
## $ id   <fct> 1, 2, 3, 4, 5
## $ date <date> 2010-03-01, 2010-03-02, 2010-03-03, 2010-03-04, 2010-03-05
## $ loc  <chr> "New York", "New York", "New York", "New York", "New York"
## $ temp <dbl> 42.0, 41.4, 38.5, 41.1, 39.8
dplyr::glimpse(csv_df)
## Rows: 5
## Columns: 4
## $ id   <fct> 1, 2, 3, 4, 5
## $ date <date> 2010-03-01, 2010-03-02, 2010-03-03, 2010-03-04, 2010-03-05
## $ loc  <chr> "New York", "New York", "New York", "New York", "New York"
## $ temp <dbl> 42.0, 41.4, 38.5, 41.1, 39.8
You can also visualise the data types with:
library(visdat)
vis_dat(xlsx_df)
library(inspectdf)
inspect_types(xlsx_df) %>% show_plot()
df2 <- read_csv(here("data/lecture3-example2.csv"),
                col_types = cols(id = col_factor(),
                                 date = col_date(format = "%m/%d/%y"),
                                 loc = col_character(),
                                 temp = col_double()))
df2

## # A tibble: 9 x 4
##   id    date       loc        temp
##   <fct> <date>     <chr>     <dbl>
## 1 1     2010-03-01 New York   42
## 2 2     2010-03-02 New York   41.4
## 3 3     2010-03-03 New York   38.5
## 4 4     2010-03-04 New York   41.1
## 5 5     2010-03-05 New York   39.8
## 6 6     2020-03-01 Melbourne  30.6
## 7 7     2020-03-02 Melbourne  17.9
## 8 8     2020-03-03 Melbourne  18.6
## 9 9     2020-03-04 <NA>       21.3
Note the missing value in loc.

data("lehner.soybeanmold", package = "agridat")
skimr::skim(lehner.soybeanmold)
## ── Data Summary ────────────────────────
##                            Values
## Name                       lehner.soybeanmold
## Number of rows             382
## Number of columns          9
## _______________________
## Column type frequency:
##   factor                   4
##   numeric                  5
## ________________________
## Group variables            None
##
## ── Variable type: factor ────────────────────────────────────────────
##   skim_variable n_missing complete_rate ordered n_unique top_counts
## 1 study                 0             1 FALSE         35 S01: 13, S02: 13, S03: 13, S04: 13
## 2 loc                   0             1 FALSE         14 Sao: 56, Mon: 44, Pon: 44, Mau: 34
## 3 region                0             1 FALSE          2 Nor: 273, Sou: 109
## 4 trt                   0             1 FALSE         13 T01: 35, T02: 35, T03: 35, T04: 35
##
## ── Variable type: numeric ───────────────────────────────────────────
##   skim_variable n_missing complete_rate  mean    sd   p0   p25   p50   p75  p100 hist
## 1 year                  0         1     2010. 0.981 2009  2010  2010  2011  2012 ▃▇▁▃▃
## 2 elev                  0         1      951. 83.6   737   909   947  1027  1050 ▁▂▅▂▇
## 3 yield                 0         1     3117. 764.  1451 2528. 2994. 3642.  4908 ▂▇▇▃▃
## 4 mold                  0         1      19.7  18.1    0     7    13  27.2  90.3 ▇▃▂▁▁
## 5 sclerotia            66         0.827 1854. 2034.    0   421  1082  2606 11000 ▇▂▁▁▁
Lehner, M. S., Pethybridge, S. J., Meyer, M. C., & Del Ponte, E. M. (2016). Meta-analytic modelling of the incidence-yield and incidence-sclerotial production relationships in soybean white mould epidemics. Plant Pathology. doi:10.1111/ppa.12590
vis_miss(lehner.soybeanmold)
inspect_na(lehner.soybeanmold) %>% show_plot()
Checking whether observations with missing sclerotia values have different yields:
library(naniar)
ggplot(lehner.soybeanmold, aes(sclerotia, yield)) +
  geom_miss_point() +
  scale_color_discrete_qualitative()
Compare the new data with the old:
soy_old <- lehner.soybeanmold %>% filter(year %in% 2010:2011)
soy_new <- lehner.soybeanmold %>% filter(year == 2012)
inspect_cor(soy_old, soy_new) %>% show_plot()
Below are data from the ABS showing the total number of people employed in a given month from February 1978 to December 2019, using the original time series.
glimpse(employed)
## Rows: 509
## Columns: 4
## $ date  <date> 1978-02-01, 1978-03-01, 1978-04-01, 1978-05-01, 1978-06-01, 1978-07-01, 1978-08-01, 1978-09-01, 1978-10-01, 1978-11-01, 1978-12-01, 1979-01-01, 1979-02-01, 1979-03-01, 1979-04-01, 197…
## $ month <dbl> 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, …
## $ year  <fct> 1978, 1978, 1978, 1978, 1978, 1978, 1978, 1978, 1978, 1978, 1978, 1979, 1979, 1979, 1979, 1979, 1979, 1979, 1979, 1979, 1979, 1979, 1979, 1980, 1980, 1980, 1980, 1980, 1980, 1980, 1980…
## $ value <dbl> 5985.660, 6040.561, 6054.214, 6038.265, 6031.342, 6036.084, 6005.361, 6024.313, 6045.855, 6033.797, 6125.360, 5971.329, 6050.693, 6096.175, 6087.654, 6075.611, 6095.734, 6103.922, 6078…
Australian Bureau of Statistics, 2020, Labour force, Australia, Table 01. Labour force status by Sex, Australia - Trend, Seasonally adjusted and Original, viewed 2021-08-09,
Do you notice anything?
Why do you think the number of people employed is going up each year?
lecture3-example3.csv
df3 <- read_csv(here::here("data/lecture3-example3.csv"),
                col_types = cols(
                  row = col_factor(),
                  col = col_factor(),
                  yield = col_double(),
                  trt = col_factor(),
                  block = col_factor()))
skimr::skim(df3)
## ── Data Summary ────────────────────────
##                            Values
## Name                       df3
## Number of rows             48
## Number of columns          5
## _______________________
## Column type frequency:
##   factor                   4
##   numeric                  1
## ________________________
## Group variables            None
##
## ── Variable type: factor ────────────────────────────────────────────
##   skim_variable n_missing complete_rate ordered n_unique top_counts
## 1 row                   0             1 FALSE          6 1: 8, 2: 8, 3: 8, 4: 8
## 2 col                   0             1 FALSE          8 1: 6, 2: 6, 3: 6, 4: 6
## 3 trt                   0             1 FALSE          9 non: 16, hi : 4, hi : 4, hi : 4
## 4 block                 0             1 FALSE          4 B3: 12, B1: 12, B2: 12, B4: 12
##
## ── Variable type: numeric ───────────────────────────────────────────
##   skim_variable n_missing complete_rate mean   sd  p0 p25 p50  p75 p100 hist
## 1 yield                 0             1 246. 16.0 204 237 248 257.  273 ▂▂▇▇▅
In particular, check with a plot to see whether treatments are randomised.
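One way to check this, as a sketch (assuming df3 has been read in as above), is to tile the field layout and colour each plot by its treatment; treatments that cluster within rows or columns suggest a lack of randomisation:

```r
library(ggplot2)
# Draw the field layout: one tile per plot, coloured by treatment
ggplot(df3, aes(col, row, fill = trt)) +
  geom_tile(colour = "black")
```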
... a statistical value chain is constructed by defining a number of meaningful intermediate data products, for which a chosen set of quality attributes are well described ...
— van der Loo & de Jonge (2018)
Schema from Mark van der Loo and Edwin de Jonge. 2018. Statistical Data Cleaning with Applications in R. John Wiley and Sons Ltd.
glimpse(SBS2000)

## Rows: 60
## Columns: 11
## $ id          <fct> RET01, RET02, RET03, RET04, RET05, RET06, RET07, RET08, RET09, RET10, RET11, RET12, RET13, RET14, RET15, RET16, RET17, RET18, RET19, RET20, RET21, RET22, RET23, RET24, RET25, RET…
## $ size        <fct> sc0, sc3, sc3, sc3, sc3, sc0, sc3, sc1, sc3, sc2, sc2, sc2, sc3, sc1, sc1, sc0, sc3, sc1, sc2, sc3, sc0, sc0, sc1, sc1, sc2, sc3, sc2, sc3, sc0, sc2, sc3, sc2, sc3, sc3, sc3, sc3…
## $ incl.prob   <dbl> 0.02, 0.14, 0.14, 0.14, 0.14, 0.02, 0.14, 0.02, 0.14, 0.05, 0.05, 0.05, 0.14, 0.02, 0.02, 0.02, 0.14, 0.02, 0.05, 0.14, 0.02, 0.02, 0.02, 0.02, 0.05, 0.14, 0.05, 0.14, 0.02, 0.05…
## $ staff       <int> 75, 9, NA, NA, NA, 1, 5, 3, 6, 5, 5, 5, 13, NA, 3, 52, 10, 4, 3, 8, 2, 3, 2, 4, 3, 6, 2, 16, 1, 6, 29, 8, 13, 9, 15, 14, 6, 53, 7, NA, 20, 2, NA, 1, 3, 1, 60, 8, 10, 12, 7, 24, 2…
## $ turnover    <int> NA, 1607, 6886, 3861, NA, 25, NA, 404, 2596, NA, 645, 2872, 5678, 931397, 80000, 9067, 1500, 440, 690, 1852, 359, 839, 471, 933, 1665, 2318, 1175, 2946, 492, 1831, 7271, 971, 411…
## $ other.rev   <int> NA, NA, -33, 13, 37, NA, NA, 13, NA, NA, NA, NA, 12, NA, NA, 622, 20, NA, NA, NA, 9, NA, NA, 2, NA, NA, 12, 7, NA, 1831, 30, NA, 11, NA, 33, 98350, 4, NA, 38, 98, 11, NA, NA, NA,…
## $ total.rev   <int> 1130, 1607, 6919, 3874, 5602, 25, 1335, 417, 2596, NA, 645, 2872, 5690, 931397, NA, 9689, 1520, 440, 690, 1852, 368, 839, 471, 935, 1665, 2318, 1187, 2953, 492, 1831, 7301, 107, …
## $ staff.costs <int> NA, 131, 324, 290, 314, NA, 135, NA, 147, NA, 130, 182, 326, 36872, 40000, 1125, 195, 16, 19000, 120, NA, 2, 34, 31, 70, 184, 114, 245, NA, 53, 451, 28, 57, 106, 539, 221302, 64,…
## $ total.costs <int> 18915, 1544, 6493, 3600, 5530, 22, 136, 342, 2486, NA, 636, 2652, 5656, 841489, NA, 9911, 1384, 379, 464507, 1812, 339, 717, 411, 814, 186, 390, NA, 2870, 470, 1443, 7242, 95, 36…
## $ profit      <int> 20045, 63, 426, 274, 72, 3, 1, 75, 110, NA, 9, 220, 34, 89908, NA, -222, 136, 60, 225493, 40, 29, 122, 60, 121, 1478, 86, 17, 83, 22, 388, 59, 100, 528, 160, 282, 22457, 37, -160…
## $ vat         <int> NA, NA, NA, NA, NA, NA, 1346, NA, NA, NA, NA, NA, NA, 863, 813, 964, 733, 296, 486, 1312, 257, 654, 377, 811, 1472, 2082, 1058, 2670, 449, 1695, 6754, 905, 3841, 2668, 2758, 2548…
library(validate)
rules <- validator(
  is_complete(id),
  is_complete(id, turnover),
  is_complete(id, turnover, profit))
out <- confront(SBS2000, rules)
summary(out)
##   name items passes fails nNA error warning                        expression
## 1   V1    60     60     0   0 FALSE   FALSE                   is_complete(id)
## 2   V2    60     56     4   0 FALSE   FALSE         is_complete(id, turnover)
## 3   V3    60     52     8   0 FALSE   FALSE is_complete(id, turnover, profit)
library(validate)
rules <- validator(
  total.rev - profit == total.costs,
  turnover + other.rev == total.rev,
  profit <= 0.6 * total.rev)
out <- confront(SBS2000, rules)
summary(out)
##   name items passes fails nNA error warning                                    expression
## 1   V1    60     39    14   7 FALSE   FALSE abs(total.rev - profit - total.costs) < 1e-08
## 2   V2    60     19     4  37 FALSE   FALSE abs(turnover + other.rev - total.rev) < 1e-08
## 3   V3    60     49     6   5 FALSE   FALSE           (profit - 0.6 * total.rev) <= 1e-08
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.