class: middle center hide-slide-number monash-bg-gray80 .info-box.w-50.bg-white[ These slides are viewed best by Chrome or Firefox and occasionally need to be refreshed if elements did not load properly. See <a href=lecture-11B.pdf>here for the PDF <i class="fas fa-file-pdf"></i></a>. ] <br> .white[Press the **right arrow** to progress to the next slide!] --- class: title-slide count: false background-image: url("images/bg-01.png") # .monash-blue[ETC5521: Exploratory Data Analysis] <h1 class="monash-blue" style="font-size: 30pt!important;"></h1> <br> <h2 style="font-weight:900!important;">Using computational tools to determine whether what is seen in the data can be assumed to apply more broadly</h2> .bottom_abs.width100[ Lecturer: *Emi Tanaka* <i class="fas fa-envelope"></i> ETC5521.Clayton-x@monash.edu <i class="fas fa-calendar-alt"></i> Week 11 - Session 2 <br> ] <style type="text/css"> .gray80 { color: #505050!important; font-weight: 300; } .bg-gray80 { background-color: #DCDCDC!important; } .font18 { font-size: 18pt; } </style> --- class: transition # Visual inference with the nullabor π¦ --- # `nullabor` + `ggplot2` * You can construct the null data "by hand" as you have done for Exercise 4 (d) in tutorial 9. -- * You will then need to create null plots and then randomly place the data plot to present the lineup. -- * You'll need to know which one is the data plot so you can tell if viewer's chose the data plot or not. -- * The `nullabor` package makes it easy to create the data for the lineup and you can use `ggplot2` to construct the lineup. ```r library(nullabor) library(tidyverse) # which includes ggplot2 ``` --- # .orange[Case study] .circle.white.bg-orange[2] Potato scab infection .font_small[Part 1/4] .grid[ .item[ .panelset[ .panel[.panel-name[π] <img src="images/week11B/ex1-plot-1.png" width="432" style="display: block; margin: auto;" /> ] .panel[.panel-name[data] .h200.scroll-sign[ ```r data(cochran.crd, package = "agridat") skimr::skim(cochran.crd) ``` ``` ## ββ Data Summary ββββββββββββββββββββββββ ## Values ## Name cochran.crd ## Number of rows 32 ## Number of columns 4 ## _______________________ ## Column type frequency: ## factor 1 ## numeric 3 ## ________________________ ## Group variables None ## ## ββ Variable type: factor βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ ## skim_variable n_missing complete_rate ordered n_unique top_counts ## 1 trt 0 1 FALSE 7 O: 8, F12: 4, F3: 4, F6: 4 ## ## ββ Variable type: numeric ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ ## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist ## 1 inf 0 1 15.7 8.22 4 9 16 19.5 32 βββββ ## 2 row 0 1 2.5 1.14 1 1.75 2.5 3.25 4 βββββ ## 3 col 0 1 4.5 2.33 1 2.75 4.5 6.25 8 βββββ ``` ]] .panel[.panel-name[R] ```r cochran.crd %>% ggplot(aes(factor(col), factor(row), fill = inf)) + geom_tile(color = "black", size = 2) + geom_text(aes(label = trt)) + labs(x = "Column", y = "Row", fill = "Infection\npercent") + scale_fill_continuous_sequential(palette = "Reds 3") ``` ] ] ] .item[ * Experiment was conducted to investigate the effect of sulfur on controlling scab disease in potatoes. * There were seven treatments in total: control plus spring and fall application of 300, 600 or 1200 lbs/acres of sulfur. * Employs a completely randomised design with 8 replications for control and 4 replications for other treatments. ] ] .footnote[ W.G. Cochran and G. Cox, 1957. Experimental Designs, 2nd ed. John Wiley, New York. ] --- # .orange[Case study] .circle.white.bg-orange[2] Potato scab infection .font_small[Part 2/4] * We are testing `\(H_0: \mu_1 = \mu_2 = ... = \mu_7\)` vs. `\(H_1:\)` at least one mean is different to others. -- * Here we don't have to many observation per treatment so we can use a dotplot. -- * For the method to generate null, we consider permuting the treatment labels. ```r method <- null_permute("trt") ``` -- * Then we generate the null data, also embedding the actual data in a random position. .font_small[Make sure to `set.seed` to get the same random instance.] ```r set.seed(1) line_df <- lineup(method, true = cochran.crd, n = 10) ``` ``` ## decrypt("bhMq KJPJ 62 sSQ6P6S2 ua") ``` --- # .orange[Case study] .circle.white.bg-orange[2] Potato scab infection .font_small[Part 3/4] .f4[ ```r glimpse(line_df) ``` ``` ## Rows: 320 ## Columns: 5 ## $ inf <int> 9, 12, 18, 10, 24, 17, 30, 16, 10, 7, 4, 10, 21, 24, 29, 12, 9, 7, 18, 30, 18, 16, 16, 4, 9, 18, 17, 19, 32, 5, 26, 4, 9, 12, 18, 10, 24, 17, 30, 16, 10, 7, 4, 10, 21, 24, 29, 12, 9,β¦ ## $ trt <fct> S3, F12, S3, F3, O, F3, F12, O, S12, F6, O, F6, S3, F3, O, F12, O, O, S6, S12, S6, F6, S3, F12, S6, F6, S12, S12, O, S6, O, F3, S6, S12, S3, F12, O, O, O, O, F6, F3, S6, F6, F12, S12β¦ ## $ row <int> 4, 4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 4, 4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1,β¦ ## $ col <int> 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5,β¦ ## $ .sample <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,β¦ ``` ] * The `.sample` variable has information of which sample it is. * One of the `.sample` number belongs to the real data. .f4[ ```r line_df %>% ggplot(aes(trt, inf)) + geom_point(size = 3, alpha = 1/2) + facet_wrap(~.sample, nrow = 2) + theme(axis.text = element_blank(), # remove data context axis.title = element_blank()) ``` ] --- # .orange[Case study] .circle.white.bg-orange[2] Potato scab infection .font_small[Part 4/4] .grid[ .item[ <img src="images/week11B/ex1-lineplot-1.png" width="1008" style="display: block; margin: auto;" /> {{content}} ] ] -- ```r decrypt("bhMq KJPJ 62 sSQ6P6S2 ua") ``` ``` ## [1] "True data in position 5" ``` --- # .orange[Case study] .circle.white.bg-orange[3] Black Cherry Trees .font_small[Part 1/4] .grid[ .item[ .panelset[ .panel[.panel-name[π] <img src="images/week11B/ex2-plot-1.png" width="432" style="display: block; margin: auto;" /> ] .panel[.panel-name[data] .h200.scroll-sign[ ```r skimr::skim(trees) ``` ``` ## ββ Data Summary ββββββββββββββββββββββββ ## Values ## Name trees ## Number of rows 31 ## Number of columns 3 ## _______________________ ## Column type frequency: ## numeric 3 ## ________________________ ## Group variables None ## ## ββ Variable type: numeric ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ ## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist ## 1 Girth 0 1 13.2 3.14 8.3 11.0 12.9 15.2 20.6 ββββ β ## 2 Height 0 1 76 6.37 63 72 76 80 87 βββββ ## 3 Volume 0 1 30.2 16.4 10.2 19.4 24.2 37.3 77 ββ βββ ``` ]] .panel[.panel-name[R] ```r g1 <- trees %>% ggplot(aes(Girth, Volume)) + geom_point() + scale_x_log10() + scale_y_log10() g2 <- trees %>% ggplot(aes(Height, Volume)) + geom_point() + scale_x_log10() + scale_y_log10() g1 + g2 ``` ] ] ] .item[ * Data measures the diameter, height and volume of timber in 31 felled black cherry trees. * We fit the model ```r fit <- lm(log(Volume) ~ log(Girth) + log(Height), data = trees) fit_df <- trees %>% # below are needed for lineup mutate(.resid = residuals(fit), .fitted = fitted(fit)) ``` ] ] .footnote[ Atkinson, A. C. (1985) Plots, Transformations and Regression. Oxford University Press. ] --- # .orange[Case study] .circle.white.bg-orange[3] Black Cherry Trees .font_small[Part 2/4] * We are testing `\(H_0:\)` errors are `\(NID(0, \sigma^2)\)` vs. `\(H_1:\)` errors are not `\(NID(0, \sigma^2)\)`. -- * We will use the residual plot as the visual statistic. -- * For the method to generate null, we generate residuals from random draws from `\(N(0, \hat{\sigma}^2)\)`. ```r method <- null_lm(log(Volume) ~ log(Girth) + log(Height), method = "pboot") ``` -- * Then we generate the lineup data. ```r set.seed(2020) line_df <- lineup(method, true = fit_df, n = 10) ``` ``` ## decrypt("bhMq KJPJ 62 sSQ6P6S2 uT") ``` --- # .orange[Case study] .circle.white.bg-orange[3] Black Cherry Trees .font_small[Part 3/4] .grid[ .item[ .panelset[ .panel[.panel-name[π] <img src="images/week11B/ex2-lineplot-1.png" width="1008" style="display: block; margin: auto;" /> ] .panel[.panel-name[R] ```r line_df %>% ggplot(aes(.fitted, .resid)) + geom_point(size = 1.2) + geom_hline(yintercept = 0, linetype = "dashed") + facet_wrap(~.sample, nrow = 2) + theme(axis.text = element_blank(), # remove data context axis.title = element_blank()) ``` ]] ] ] --- # .orange[Case study] .circle.white.bg-orange[3] Black Cherry Trees .font_small[Part 4/4] * We can have: * `method = "pboot"`, * `method = "boot"` or * `method = "rotate"` for different (and valid) methods to generate null data when fitting a linear model. ```r method <- null_lm(log(Volume) ~ log(Girth) + log(Height), * method = "boot") ``` -- * We can also consider using a different visual statisitc, e.g. QQ-plot to assess normality. --- # .orange[Case study] .circle.white.bg-orange[4] Temperatures of stars .font_small[Part 1/2] * The data consists of the surface temperature in Kelvin degrees of 96 stars. -- * We want to check if the surface temperature has an exponential distribution. -- * We use histogram with 30 bins as our visual test statistic. -- * For the null data, we will generate from an exponential distribution. ```r line_df <- lineup(null_dist("temp", "exp", list(rate = 1/mean(dslabs::stars$temp))), true = dslabs::stars, n = 10) ``` ``` ## decrypt("bhMq KJPJ 62 sSQ6P6S2 ug") ``` * Note: the rate in an exponential distribution can be estimated from the inverse of the sample mean. --- # .orange[Case study] .circle.white.bg-orange[4] Temperatures of stars .font_small[Part 2/2] .grid[ .item[ .panelset[ .panel[.panel-name[π] <img src="images/week11B/stars-lineup-1.png" width="1008" style="display: block; margin: auto;" /> ] .panel[.panel-name[R] ```r ggplot(line_df, aes(temp)) + geom_histogram(color = "white") + facet_wrap(~.sample, nrow = 2) + theme(axis.text = element_blank(), axis.title = element_blank()) ``` ]] ]] --- # .orange[Case study] .circle.white.bg-orange[5] Foreign exchange rate .font_small[Part 1/2] * The data contains the daily exchange rate of 1 AUD to 1 USD between 9th Jan 2018 to 21st Feb 2018. * Does the rate follow an ARIMA model? ```r data(aud, package = "nullabor") line_df <- lineup(null_ts("rate", forecast::auto.arima), true = aud, n = 10) ``` ``` ## Registered S3 method overwritten by 'quantmod': ## method from ## as.zoo.data.frame zoo ``` ``` ## decrypt("bhMq KJPJ 62 sSQ6P6S2 um") ``` ```r ggplot(line_df, aes(date, rate)) + geom_line() + facet_wrap(~ .sample, scales = "free_y", nrow = 2) + theme(axis.title = element_blank(), axis.text = element_blank()) ``` --- # .orange[Case study] .circle.white.bg-orange[5] Foreign exchange rate .font_small[Part 2/2] .grid[ .item[ <img src="images/week11B/ts-plot-1.png" width="1008" style="display: block; margin: auto;" /> ]] --- # Resources and Acknowledgement .font18[ - Buja, Andreas, Dianne Cook, Heike Hofmann, Michael Lawrence, Eun-Kyung Lee, Deborah F. Swayne, and Hadley Wickham. 2009. βStatistical Inference for Exploratory Data Analysis and Model Diagnostics.β Philosophical Transactions. Series A, Mathematical, Physical, and Engineering Sciences 367 (1906): 4361β83. - Wickham, Hadley, Dianne Cook, Heike Hofmann, and Andreas Buja. 2010. βGraphical Inference for Infovis.β IEEE Transactions on Visualization and Computer Graphics 16 (6): 973β79. - Hofmann, H., L. Follett, M. Majumder, and D. Cook. 2012. βGraphical Tests for Power Comparison of Competing Designs.β IEEE Transactions on Visualization and Computer Graphics 18 (12): 2441β48. - Majumder, M., Heiki Hofmann, and Dianne Cook. 2013. βValidation of Visual Statistical Inference, Applied to Linear Models.β Journal of the American Statistical Association 108 (503): 942β56. - Data coding using [`tidyverse` suite of R packages](https://www.tidyverse.org) - Slides constructed with [`xaringan`](https://github.com/yihui/xaringan), [remark.js](https://remarkjs.com), [`knitr`](http://yihui.name/knitr), and [R Markdown](https://rmarkdown.rstudio.com). ] --- background-size: cover class: title-slide background-image: url("images/bg-01.png") <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>. .bottom_abs.width100[ Lecturer: *Emi Tanaka* <i class="fas fa-envelope"></i> ETC5521.Clayton-x@monash.edu <i class="fas fa-calendar-alt"></i> Week 11 - Session 2 <br> ]