Processing math: 100%
+ - 0:00:00
Notes for current slide
Notes for next slide

These slides are viewed best by Chrome or Firefox and occasionally need to be refreshed if elements did not load properly. See here for the PDF .


Press the right arrow to progress to the next slide!

1/17

ETC5521: Exploratory Data Analysis


Using computational tools to determine whether what is seen in the data can be assumed to apply more broadly

Lecturer: Emi Tanaka

ETC5521.Clayton-x@monash.edu

Week 11 - Session 2


1/17

Visual inference with the nullabor 📦

2/17

nullabor + ggplot2

  • You can construct the null data "by hand" as you have done for Exercise 4 (d) in tutorial 9.
3/17

nullabor + ggplot2

  • You can construct the null data "by hand" as you have done for Exercise 4 (d) in tutorial 9.
  • You will then need to create null plots and then randomly place the data plot to present the lineup.
3/17

nullabor + ggplot2

  • You can construct the null data "by hand" as you have done for Exercise 4 (d) in tutorial 9.
  • You will then need to create null plots and then randomly place the data plot to present the lineup.
  • You'll need to know which one is the data plot so you can tell if viewer's chose the data plot or not.
3/17

nullabor + ggplot2

  • You can construct the null data "by hand" as you have done for Exercise 4 (d) in tutorial 9.
  • You will then need to create null plots and then randomly place the data plot to present the lineup.
  • You'll need to know which one is the data plot so you can tell if viewer's chose the data plot or not.
  • The nullabor package makes it easy to create the data for the lineup and you can use ggplot2 to construct the lineup.
library(nullabor)
library(tidyverse) # which includes ggplot2
3/17

Case study 2 Potato scab infection Part 1/4

data(cochran.crd, package = "agridat")
skimr::skim(cochran.crd)
## ── Data Summary ────────────────────────
## Values
## Name cochran.crd
## Number of rows 32
## Number of columns 4
## _______________________
## Column type frequency:
## factor 1
## numeric 3
## ________________________
## Group variables None
##
## ── Variable type: factor ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate ordered n_unique top_counts
## 1 trt 0 1 FALSE 7 O: 8, F12: 4, F3: 4, F6: 4
##
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
## 1 inf 0 1 15.7 8.22 4 9 16 19.5 32 ▇▃▇▃▃
## 2 row 0 1 2.5 1.14 1 1.75 2.5 3.25 4 ▇▇▁▇▇
## 3 col 0 1 4.5 2.33 1 2.75 4.5 6.25 8 ▇▃▇▃▇
cochran.crd %>%
ggplot(aes(factor(col), factor(row), fill = inf)) +
geom_tile(color = "black", size = 2) +
geom_text(aes(label = trt)) +
labs(x = "Column", y = "Row", fill = "Infection\npercent") +
scale_fill_continuous_sequential(palette = "Reds 3")
  • Experiment was conducted to investigate the effect of sulfur on controlling scab disease in potatoes.
  • There were seven treatments in total: control plus spring and fall application of 300, 600 or 1200 lbs/acres of sulfur.
  • Employs a completely randomised design with 8 replications for control and 4 replications for other treatments.

W.G. Cochran and G. Cox, 1957. Experimental Designs, 2nd ed. John Wiley, New York.

4/17

Case study 2 Potato scab infection Part 2/4

  • We are testing H0:μ1=μ2=...=μ7 vs. H1: at least one mean is different to others.
5/17

Case study 2 Potato scab infection Part 2/4

  • We are testing H0:μ1=μ2=...=μ7 vs. H1: at least one mean is different to others.
  • Here we don't have to many observation per treatment so we can use a dotplot.
5/17

Case study 2 Potato scab infection Part 2/4

  • We are testing H0:μ1=μ2=...=μ7 vs. H1: at least one mean is different to others.
  • Here we don't have to many observation per treatment so we can use a dotplot.
  • For the method to generate null, we consider permuting the treatment labels.
method <- null_permute("trt")
5/17

Case study 2 Potato scab infection Part 2/4

  • We are testing H0:μ1=μ2=...=μ7 vs. H1: at least one mean is different to others.
  • Here we don't have to many observation per treatment so we can use a dotplot.
  • For the method to generate null, we consider permuting the treatment labels.
method <- null_permute("trt")
  • Then we generate the null data, also embedding the actual data in a random position. Make sure to set.seed to get the same random instance.
set.seed(1)
line_df <- lineup(method, true = cochran.crd, n = 10)
## decrypt("bhMq KJPJ 62 sSQ6P6S2 ua")
5/17

Case study 2 Potato scab infection Part 3/4

glimpse(line_df)
## Rows: 320
## Columns: 5
## $ inf <int> 9, 12, 18, 10, 24, 17, 30, 16, 10, 7, 4, 10, 21, 24, 29, 12, 9, 7, 18, 30, 18, 16, 16, 4, 9, 18, 17, 19, 32, 5, 26, 4, 9, 12, 18, 10, 24, 17, 30, 16, 10, 7, 4, 10, 21, 24, 29, 12, 9,…
## $ trt <fct> S3, F12, S3, F3, O, F3, F12, O, S12, F6, O, F6, S3, F3, O, F12, O, O, S6, S12, S6, F6, S3, F12, S6, F6, S12, S12, O, S6, O, F3, S6, S12, S3, F12, O, O, O, O, F6, F3, S6, F6, F12, S12…
## $ row <int> 4, 4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 4, 4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1,…
## $ col <int> 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5,…
## $ .sample <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
  • The .sample variable has information of which sample it is.
  • One of the .sample number belongs to the real data.
line_df %>%
ggplot(aes(trt, inf)) +
geom_point(size = 3, alpha = 1/2) +
facet_wrap(~.sample, nrow = 2) +
theme(axis.text = element_blank(), # remove data context
axis.title = element_blank())
6/17

Case study 2 Potato scab infection Part 4/4

7/17

Case study 2 Potato scab infection Part 4/4

decrypt("bhMq KJPJ 62 sSQ6P6S2 ua")
## [1] "True data in position 5"
7/17

Case study 3 Black Cherry Trees Part 1/4

skimr::skim(trees)
## ── Data Summary ────────────────────────
## Values
## Name trees
## Number of rows 31
## Number of columns 3
## _______________________
## Column type frequency:
## numeric 3
## ________________________
## Group variables None
##
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
## 1 Girth 0 1 13.2 3.14 8.3 11.0 12.9 15.2 20.6 ▃▇▃▅▁
## 2 Height 0 1 76 6.37 63 72 76 80 87 ▃▃▆▇▃
## 3 Volume 0 1 30.2 16.4 10.2 19.4 24.2 37.3 77 ▇▅▁▂▁
g1 <- trees %>%
ggplot(aes(Girth, Volume)) +
geom_point() +
scale_x_log10() +
scale_y_log10()
g2 <- trees %>%
ggplot(aes(Height, Volume)) +
geom_point() +
scale_x_log10() +
scale_y_log10()
g1 + g2
  • Data measures the diameter, height and volume of timber in 31 felled black cherry trees.
  • We fit the model
fit <- lm(log(Volume) ~ log(Girth) + log(Height),
data = trees)
fit_df <- trees %>%
# below are needed for lineup
mutate(.resid = residuals(fit),
.fitted = fitted(fit))

Atkinson, A. C. (1985) Plots, Transformations and Regression. Oxford University Press.

8/17

Case study 3 Black Cherry Trees Part 2/4

  • We are testing H0: errors are NID(0,σ2) vs. H1: errors are not NID(0,σ2).
9/17

Case study 3 Black Cherry Trees Part 2/4

  • We are testing H0: errors are NID(0,σ2) vs. H1: errors are not NID(0,σ2).
  • We will use the residual plot as the visual statistic.
9/17

Case study 3 Black Cherry Trees Part 2/4

  • We are testing H0: errors are NID(0,σ2) vs. H1: errors are not NID(0,σ2).
  • We will use the residual plot as the visual statistic.
  • For the method to generate null, we generate residuals from random draws from N(0,ˆσ2).
method <- null_lm(log(Volume) ~ log(Girth) + log(Height),
method = "pboot")
9/17

Case study 3 Black Cherry Trees Part 2/4

  • We are testing H0: errors are NID(0,σ2) vs. H1: errors are not NID(0,σ2).
  • We will use the residual plot as the visual statistic.
  • For the method to generate null, we generate residuals from random draws from N(0,ˆσ2).
method <- null_lm(log(Volume) ~ log(Girth) + log(Height),
method = "pboot")
  • Then we generate the lineup data.
set.seed(2020)
line_df <- lineup(method, true = fit_df, n = 10)
## decrypt("bhMq KJPJ 62 sSQ6P6S2 uT")
9/17

Case study 3 Black Cherry Trees Part 3/4

line_df %>%
ggplot(aes(.fitted, .resid)) +
geom_point(size = 1.2) +
geom_hline(yintercept = 0, linetype = "dashed") +
facet_wrap(~.sample, nrow = 2) +
theme(axis.text = element_blank(), # remove data context
axis.title = element_blank())
10/17

Case study 3 Black Cherry Trees Part 4/4

  • We can have:
    • method = "pboot",
    • method = "boot" or
    • method = "rotate"
      for different (and valid) methods to generate null data when fitting a linear model.
method <- null_lm(log(Volume) ~ log(Girth) + log(Height),
method = "boot")
11/17

Case study 3 Black Cherry Trees Part 4/4

  • We can have:
    • method = "pboot",
    • method = "boot" or
    • method = "rotate"
      for different (and valid) methods to generate null data when fitting a linear model.
method <- null_lm(log(Volume) ~ log(Girth) + log(Height),
method = "boot")
  • We can also consider using a different visual statisitc, e.g. QQ-plot to assess normality.
11/17

Case study 4 Temperatures of stars Part 1/2

  • The data consists of the surface temperature in Kelvin degrees of 96 stars.
12/17

Case study 4 Temperatures of stars Part 1/2

  • The data consists of the surface temperature in Kelvin degrees of 96 stars.
  • We want to check if the surface temperature has an exponential distribution.
12/17

Case study 4 Temperatures of stars Part 1/2

  • The data consists of the surface temperature in Kelvin degrees of 96 stars.
  • We want to check if the surface temperature has an exponential distribution.
  • We use histogram with 30 bins as our visual test statistic.
12/17

Case study 4 Temperatures of stars Part 1/2

  • The data consists of the surface temperature in Kelvin degrees of 96 stars.
  • We want to check if the surface temperature has an exponential distribution.
  • We use histogram with 30 bins as our visual test statistic.
  • For the null data, we will generate from an exponential distribution.
line_df <- lineup(null_dist("temp", "exp", list(rate = 1/mean(dslabs::stars$temp))),
true = dslabs::stars,
n = 10)
## decrypt("bhMq KJPJ 62 sSQ6P6S2 ug")
  • Note: the rate in an exponential distribution can be estimated from the inverse of the sample mean.
12/17

Case study 4 Temperatures of stars Part 2/2

ggplot(line_df, aes(temp)) +
geom_histogram(color = "white") +
facet_wrap(~.sample, nrow = 2) +
theme(axis.text = element_blank(),
axis.title = element_blank())
13/17

Case study 5 Foreign exchange rate Part 1/2

  • The data contains the daily exchange rate of 1 AUD to 1 USD between 9th Jan 2018 to 21st Feb 2018.
  • Does the rate follow an ARIMA model?
data(aud, package = "nullabor")
line_df <- lineup(null_ts("rate", forecast::auto.arima), true = aud, n = 10)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## decrypt("bhMq KJPJ 62 sSQ6P6S2 um")
ggplot(line_df, aes(date, rate)) +
geom_line() +
facet_wrap(~ .sample, scales = "free_y", nrow = 2) +
theme(axis.title = element_blank(),
axis.text = element_blank())
14/17

Case study 5 Foreign exchange rate Part 2/2

15/17

Resources and Acknowledgement

  • Buja, Andreas, Dianne Cook, Heike Hofmann, Michael Lawrence, Eun-Kyung Lee, Deborah F. Swayne, and Hadley Wickham. 2009. “Statistical Inference for Exploratory Data Analysis and Model Diagnostics.” Philosophical Transactions. Series A, Mathematical, Physical, and Engineering Sciences 367 (1906): 4361–83.
  • Wickham, Hadley, Dianne Cook, Heike Hofmann, and Andreas Buja. 2010. “Graphical Inference for Infovis.” IEEE Transactions on Visualization and Computer Graphics 16 (6): 973–79.
  • Hofmann, H., L. Follett, M. Majumder, and D. Cook. 2012. “Graphical Tests for Power Comparison of Competing Designs.” IEEE Transactions on Visualization and Computer Graphics 18 (12): 2441–48.
  • Majumder, M., Heiki Hofmann, and Dianne Cook. 2013. “Validation of Visual Statistical Inference, Applied to Linear Models.” Journal of the American Statistical Association 108 (503): 942–56.
  • Data coding using tidyverse suite of R packages
  • Slides constructed with xaringan, remark.js, knitr, and R Markdown.
16/17

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Lecturer: Emi Tanaka

ETC5521.Clayton-x@monash.edu

Week 11 - Session 2


17/17

ETC5521: Exploratory Data Analysis


Using computational tools to determine whether what is seen in the data can be assumed to apply more broadly

Lecturer: Emi Tanaka

ETC5521.Clayton-x@monash.edu

Week 11 - Session 2


1/17
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow