ETC5521: Exploratory Data Analysis


Initial data analysis

Lecturer: Emi Tanaka

ETC5521.Clayton-x@monash.edu

Week 3 - Session 2


Linear models in R REVIEW Part 1/3

library(tidyverse)
glimpse(cars)
## Rows: 50
## Columns: 2
## $ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13, 14, 14, 14, 14, 15, 15, 15, 16, 16, 17, 17, 17, 18, 18, 18, 18, 19, 19, 19, 20, 20, 20, 20, 20, 22, 23, 24, 24, 24…
## $ dist <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, 34, 34, 46, 26, 36, 60, 80, 20, 26, 54, 32, 40, 32, 40, 50, 42, 56, 76, 84, 36, 46, 68, 32, 48, 52, 56, 64, 66, 54, 70, 92…
ggplot(cars, aes(speed, dist)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

Linear models in R REVIEW Part 2/3

  • We can fit linear models in R with the lm function:
    lm(dist ~ speed, data = cars)
    is the same as
    lm(dist ~ 1 + speed, data = cars)
  • The above model is mathematically written as $y_i = \beta_0 + \beta_1 x_i + e_i$ where
    • $y_i$ and $x_i$ are the stopping distance (in ft) and speed (in mph), respectively, of the $i$-th car;
    • $\beta_0$ and $\beta_1$ are the intercept and slope, respectively; and
    • $e_i$ is the random error; usually assuming $e_i \sim NID(0, \sigma^2)$.
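
To make the model statement concrete, here is a minimal sketch that simulates data from $y_i = \beta_0 + \beta_1 x_i + e_i$ with NID errors and then refits the model with lm. The parameter values, sample size and seed are arbitrary assumptions chosen for illustration; they are not taken from the cars data.

```r
# Simulate from y_i = beta0 + beta1 * x_i + e_i, with e_i ~ NID(0, sigma^2).
# All values below are illustrative assumptions.
set.seed(1)
n     <- 200
beta0 <- -17
beta1 <- 4
sigma <- 15

x <- runif(n, min = 4, max = 25)      # speeds roughly in the range seen in cars
e <- rnorm(n, mean = 0, sd = sigma)   # independent normal errors
y <- beta0 + beta1 * x + e

sim_fit <- lm(y ~ 1 + x)
coef(sim_fit)    # estimates should be close to beta0 and beta1
sigma(sim_fit)   # estimate of the error standard deviation
```

Because the data were generated from the model, the estimates should land near the chosen values; with real data there is no such guarantee.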
Linear models in R REVIEW Part 3/3

fit <- lm(dist ~ 1 + speed, data = cars)
coef(fit)
## (Intercept) speed
## -17.579095 3.932409
  • So $\hat{\beta}_0 \approx -17.58$ and $\hat{\beta}_1 \approx 3.93$.
  • Assuming this model is appropriate, the stopping distance increases by about 4 ft for each 1 mph increase in speed.
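
Both points from this review can be checked directly. The sketch below confirms that the two formulas give the same coefficients and that the slope is the change in predicted stopping distance per 1 mph. The speeds 10 and 11 mph are arbitrary values picked for illustration.

```r
# Both formulas specify the same model: the intercept is included by default.
fit_a <- lm(dist ~ speed, data = cars)
fit_b <- lm(dist ~ 1 + speed, data = cars)
all.equal(coef(fit_a), coef(fit_b))

# The slope is the change in predicted stopping distance per 1 mph.
# The speeds 10 and 11 mph are arbitrary illustrative values.
new_speeds <- data.frame(speed = c(10, 11))
pred <- predict(fit_b, newdata = new_speeds)
diff(pred)   # equals the estimated slope, coef(fit_b)[["speed"]]
```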
2 Model formulation Part 1/2

  • Say we are interested in characterising the price of a diamond in terms of its carat.
  • Looking at this plot, would you fit a linear model with formula

price ~ 1 + carat?
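
A plot of this kind could be drawn as in the sketch below. It assumes the data are the diamonds data shipped with ggplot2, which these notes do not state explicitly, and it is not necessarily the exact plot shown in the lecture.

```r
library(ggplot2)

# Assumption: the data are ggplot2::diamonds.
p <- ggplot(diamonds, aes(carat, price)) +
  geom_point(alpha = 0.1) +   # many points overlap, so use transparency
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
p
```

The straight line drawn by geom_smooth corresponds to the formula price ~ 1 + carat, so the plot lets you judge by eye whether that form is plausible.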

2 Model formulation Part 2/2

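  • What about
    price ~ poly(carat, 2)?
    which gives the same fitted values as fitting the following (the coefficients differ because poly() uses orthogonal polynomials unless raw = TRUE):

$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + e_i.$$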

  • Should the assumption for error distribution be modified if so?
  • Should we make some transformation before modelling?
  • Are there other candidate models?
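
These questions can be explored informally in R. The sketch below again assumes the ggplot2 diamonds data. It checks that the orthogonal and raw quadratic fits agree, then tries one possible transformation (log of both variables) as another candidate. The log-log choice is an illustrative assumption, not a recommendation from the lecture.

```r
library(ggplot2)  # for the diamonds data (an assumption about the data source)

# poly() uses orthogonal polynomials by default, so its coefficients differ
# from the raw quadratic, but the fitted values are the same.
fit_poly <- lm(price ~ poly(carat, 2), data = diamonds)
fit_raw  <- lm(price ~ 1 + carat + I(carat^2), data = diamonds)
all.equal(fitted(fit_poly), fitted(fit_raw))

# One possible transformation to consider: model log(price) by log(carat).
fit_log <- lm(log(price) ~ 1 + log(carat), data = diamonds)

# Informal graphical check of residuals against fitted values (no formal inference).
plot(fitted(fit_log), resid(fit_log), pch = ".",
     xlab = "Fitted log(price)", ylab = "Residual")
```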

  • Notice that we did no formal statistical inference while initially trying to formulate the model.

  • The goal of the main analysis is to characterise the price of a diamond by its carat. This may involve:

    • formal inference for model selection;
    • justification of the selected "final" model; and
    • fitting the final model.
  • There may in fact be many, many models considered but discarded at the IDA stage.

  • These discarded models are hardly ever reported. Consequently, the majority of reported statistics give a distorted view, and it is important to remind yourself of what might not be reported.
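
As one hedged illustration of the formal model-selection step listed above, nested candidates can be compared with an F-test once the IDA stage has narrowed the choices. The two candidates below, and the use of the ggplot2 diamonds data, are assumptions for illustration; they are not the models chosen in the lecture.

```r
library(ggplot2)  # for the diamonds data

# Two nested candidate models (illustrative choices only).
fit_linear    <- lm(price ~ 1 + carat, data = diamonds)
fit_quadratic <- lm(price ~ 1 + carat + I(carat^2), data = diamonds)

# F-test for whether the quadratic term improves on the linear model.
comparison <- anova(fit_linear, fit_quadratic)
comparison
```

Only the models passed to anova appear in the table; any candidates dropped earlier leave no trace in the reported result.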

Model selection

All models are approximate and tentative; approximate in the sense that no model is exactly true and tentative in that they may be modified in the light of further data

—Chatfield (1985)



All models are wrong but some are useful

—George Box

Case study 4 Wheat yield in South Australia Part 1/9

A wheat breeding trial to test 107 varieties (also called genotypes) was conducted as a field experiment laid out in a rectangular array with 22 rows and 15 columns.

data("gilmour.serpentine", package = "agridat")
skimr::skim(gilmour.serpentine)
## ── Data Summary ────────────────────────
##                            Values
## Name                       gilmour.serpentine
## Number of rows             330
## Number of columns          5
## _______________________
## Column type frequency:
##   factor                   2
##   numeric                  3
## ________________________
## Group variables            None
##
## ── Variable type: factor ──────────────────────────────────────────
##   skim_variable n_missing complete_rate ordered n_unique top_counts
## 1 rep                   0             1 FALSE          3 R1: 110, R2: 110, R3: 110
## 2 gen                   0             1 FALSE        107 TIN: 6, VF6: 6, WW1: 6, (WW: 3
##
## ── Variable type: numeric ─────────────────────────────────────────
##   skim_variable n_missing complete_rate  mean     sd  p0 p25   p50  p75 p100 hist
## 1 col                   0             1   8     4.33   1   4   8     12   15 ▇▇▇▇▇
## 2 row                   0             1  11.5   6.35   1   6  11.5   17   22 ▇▆▆▆▇
## 3 yield                 0             1 592.  154.   194 469 618.  714.  925 ▂▅▆▇▂

Gilmour, Cullis and Verbyla (1997) Accounting for natural and extraneous variation in the analysis of field experiments. Journal of Agricultural, Biological, and Environmental Statistics, 2, 269-293.

Case study 4 Wheat yield in South Australia Part 2/9

Experimental Design

  • The experiment employs what is referred to as a randomised complete block design (RCBD) (technically it is near-complete and not exactly an RCBD because check varieties have double the replicates of test varieties).
  • RCBD means that
    • there is an equal number of replicates for each treatment (here, gen);
    • each treatment appears exactly once in each block;
    • the blocks are of the same size; and
    • treatments are randomised within each block.
  • In agricultural field experiments, blocks are formed spatially by grouping plots within contiguous areas (called rep here).
  • The boundaries of blocks may be chosen arbitrarily.
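
The design properties listed above can be checked against the data as part of IDA. This is a minimal sketch, assuming the agridat and dplyr packages are installed; the object name plots_per_block is introduced here for illustration.

```r
library(dplyr)

data("gilmour.serpentine", package = "agridat")

# Number of plots per block: equal block sizes are one RCBD requirement.
count(gilmour.serpentine, rep)

# Number of plots for each genotype within each block.
# In a strict RCBD every value of n would be 1.
plots_per_block <- count(gilmour.serpentine, gen, rep)

# Genotypes appearing more than once in a block (the check varieties).
filter(plots_per_block, n > 1)
```

Any genotype returned by the last line is a departure from a strict RCBD, which is the near-complete feature described above.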

Case study 4 Wheat yield in South Australia Part 3/9

Experimental Design

Case study 4 Wheat yield in South Australia Part 4/9

Analysis

  • In the main analysis, people would commonly analyse this using what is called a two-way ANOVA model (with no interaction effect).
  • The two-way ANOVA model has the form
    yield = mean + block + treatment + error
  • So for these data,
fit <- lm(yield ~ 1 + rep + gen,
          data = gilmour.serpentine)
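
The same fit can also be summarised by its ANOVA table, which collects the block and treatment terms into one row each rather than one row per coefficient. A minimal sketch, assuming the agridat package is installed; the name aov_table is introduced here for illustration.

```r
data("gilmour.serpentine", package = "agridat")

# Two-way ANOVA model without interaction: yield = mean + block + treatment + error
fit <- lm(yield ~ 1 + rep + gen, data = gilmour.serpentine)

# Sequential ANOVA table with one row each for rep, gen and the residuals.
aov_table <- anova(fit)
aov_table
```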

Case study 4 Wheat yield in South Australia Part 5/9

Analysis

summary(fit)
##
## Call:
## lm(formula = yield ~ 1 + rep + gen, data = gilmour.serpentine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -245.070 -69.695 -1.182 71.427 250.652
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 720.248 67.335 10.697 < 2e-16 ***
## repR2 96.100 15.585 6.166 3.29e-09 ***
## repR3 -129.845 15.585 -8.331 8.44e-15 ***
## gen(WqKPWmH*3Ag 24.333 94.372 0.258 0.796766
## genAMERY -93.333 94.372 -0.989 0.323747
## genANGAS -132.667 94.372 -1.406 0.161192
## genAROONA -153.667 94.372 -1.628 0.104884
## genBATAVIA -175.333 94.372 -1.858 0.064513 .
## genBD231 -70.333 94.372 -0.745 0.456895
## genBEULAH -173.667 94.372 -1.840 0.067074 .
## genBLADE -270.000 94.372 -2.861 0.004628 **
## genBT_SCHOMBURG -49.000 94.372 -0.519 0.604125
## genCADOUX -223.333 94.372 -2.367 0.018820 *
## genCONDOR -124.333 94.372 -1.317 0.189041
## genCORRIGIN -217.667 94.372 -2.306 0.022010 *
## genCUNNINGHAM -254.667 94.372 -2.699 0.007502 **
## genDGR/MNX-9-9e -47.667 94.372 -0.505 0.613996
## genDOLLARBIRD -200.667 94.372 -2.126 0.034584 *
## genEXCALIBUR -55.000 94.372 -0.583 0.560621
## genGOROKE -141.667 94.372 -1.501 0.134743
## genHALBERD -53.333 94.372 -0.565 0.572551
## genHOUTMAN -209.333 94.372 -2.218 0.027560 *
## genJANZ -214.667 94.372 -2.275 0.023884 *
## genK2011-5* -87.333 94.372 -0.925 0.355758
## genKATUNGA -110.333 94.372 -1.169 0.243609
## genKIATA -165.667 94.372 -1.755 0.080565 .
## genKITE -180.000 94.372 -1.907 0.057772 .
## genKULIN -91.000 94.372 -0.964 0.335964
## genLARK -336.333 94.372 -3.564 0.000448 ***
## genLOWAN -152.333 94.372 -1.614 0.107915
## genM4997 -146.000 94.372 -1.547 0.123277
## genM5075 -194.667 94.372 -2.063 0.040304 *
## genM5097 -102.667 94.372 -1.088 0.277826
## genMACHETE -231.333 94.372 -2.451 0.015010 *
## genMEERING -247.667 94.372 -2.624 0.009286 **
## genMOLINEUX -165.667 94.372 -1.755 0.080565 .
## genOSPREY -162.000 94.372 -1.717 0.087451 .
## genOUYEN -136.667 94.372 -1.448 0.148986
## genOXLEY -221.667 94.372 -2.349 0.019713 *
## genPELSART -200.333 94.372 -2.123 0.034882 *
## genPEROUSE -283.667 94.372 -3.006 0.002955 **
## genRAC655 -112.667 94.372 -1.194 0.233813
## genRAC655'S' -113.667 94.372 -1.204 0.229702
## genRAC696 -3.667 94.372 -0.039 0.969042
## genRAC710 -51.000 94.372 -0.540 0.589455
## genRAC750 -77.333 94.372 -0.819 0.413410
## genRAC759 -42.000 94.372 -0.445 0.656721
## genRAC772 5.000 94.372 0.053 0.957794
## genRAC777 -172.333 94.372 -1.826 0.069183 .
## genRAC779 3.667 94.372 0.039 0.969042
## genRAC787 -118.000 94.372 -1.250 0.212486
## genRAC791 -72.667 94.372 -0.770 0.442120
## genRAC792 -102.333 94.372 -1.084 0.279385
## genRAC798 -1.667 94.372 -0.018 0.985926
## genRAC804 -45.000 94.372 -0.477 0.633949
## genRAC805 -43.000 94.372 -0.456 0.649093
## genRAC806 -35.333 94.372 -0.374 0.708462
## genRAC807 -91.333 94.372 -0.968 0.334201
## genRAC808 -54.000 94.372 -0.572 0.567765
## genRAC809 -43.333 94.372 -0.459 0.646559
## genRAC810 -131.667 94.372 -1.395 0.164359
## genRAC811 42.333 94.372 0.449 0.654174
## genRAC812 -94.000 94.372 -0.996 0.320310
## genRAC813 -83.333 94.372 -0.883 0.378179
## genRAC814 -72.333 94.372 -0.766 0.444214
## genRAC815 -111.000 94.372 -1.176 0.240781
## genRAC816 -66.333 94.372 -0.703 0.482862
## genRAC817 -100.000 94.372 -1.060 0.290466
## genRAC818 -107.000 94.372 -1.134 0.258101
## genRAC819 -121.333 94.372 -1.286 0.199895
## genRAC820 -1.000 94.372 -0.011 0.991555
## genRAC821 -98.333 94.372 -1.042 0.298560
## genROSELLA -184.333 94.372 -1.953 0.052050 .
## genSCHOMBURGK -132.333 94.372 -1.402 0.162242
## genSHRIKE -128.000 94.372 -1.356 0.176376
## genSPEAR -254.667 94.372 -2.699 0.007502 **
## genSTILETTO -157.000 94.372 -1.664 0.097603 .
## genSUNBRI -218.333 94.372 -2.314 0.021612 *
## genSUNFIELD -206.667 94.372 -2.190 0.029576 *
## genSUNLAND -182.667 94.372 -1.936 0.054192 .
## genSWIFT -197.000 94.372 -2.087 0.037990 *
## genTASMAN -161.000 94.372 -1.706 0.089410 .
## genTATIARA -64.333 94.372 -0.682 0.496142
## genTINCURRIN -19.000 81.728 -0.232 0.816382
## genTRIDENT -132.667 94.372 -1.406 0.161192
## genVF299 -66.333 94.372 -0.703 0.482862
## genVF300 -111.667 94.372 -1.183 0.237976
## genVF302 -108.333 94.372 -1.148 0.252234
## genVF508 11.667 94.372 0.124 0.901725
## genVF519 -1.000 94.372 -0.011 0.991555
## genVF655 -160.167 81.728 -1.960 0.051283 .
## genVF664 -106.667 94.372 -1.130 0.259583
## genVG127 -109.667 94.372 -1.162 0.246460
## genVG503 -43.000 94.372 -0.456 0.649093
## genVG506 -108.667 94.372 -1.151 0.250782
## genVG701 -19.333 94.372 -0.205 0.837867
## genVG714 -108.333 94.372 -1.148 0.252234
## genVG878 52.333 94.372 0.555 0.579767
## genWARBLER -217.000 94.372 -2.299 0.022415 *
## genWI216 4.000 94.372 0.042 0.966230
## genWI221 -17.333 94.372 -0.184 0.854440
## genWI231 -218.333 94.372 -2.314 0.021612 *
## genWI232 -56.333 94.372 -0.597 0.551165
## genWILGOYNE -131.000 94.372 -1.388 0.166496
## genWW1402 -117.333 94.372 -1.243 0.215071
## genWW1477 -185.667 81.728 -2.272 0.024064 *
## genWW1831 -86.667 94.372 -0.918 0.359435
## genWYUNA -176.667 94.372 -1.872 0.062524 .
## genYARRALINKA -245.000 94.372 -2.596 0.010061 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 115.6 on 221 degrees of freedom
## Multiple R-squared: 0.6226, Adjusted R-squared: 0.4381
## F-statistic: 3.375 on 108 and 221 DF, p-value: 1.081e-14

Case study 4 Wheat yield in South Australia Part 6/9

Case study 4 Wheat yield in South Australia Part 7/9

Do you notice anything from below?


Case study 4 Wheat yield in South Australia Part 8/9

Case study 4 Wheat yield in South Australia Part 9/9

  • It is well known that traits measured in agricultural field trials exhibit spatial variation; this could be because of fertility trends, management practices or other reasons.
  • At the IDA stage, you investigate the data to identify such spatial variation; you cannot simply fit a two-way ANOVA model!
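
One way to look for spatial variation is to plot the field layout and to summarise the residuals of the two-way ANOVA model by position. This is a minimal sketch, assuming the agridat and ggplot2 packages are installed; it is not necessarily the plot shown in the lecture, and the names p_field and col_means are introduced here for illustration.

```r
library(ggplot2)

data("gilmour.serpentine", package = "agridat")

# Field layout: one tile per plot, coloured by yield.
# Stripes or gradients across rows or columns suggest spatial variation.
p_field <- ggplot(gilmour.serpentine, aes(col, row, fill = yield)) +
  geom_tile(colour = "white") +
  scale_fill_viridis_c() +
  coord_equal() +
  labs(x = "Column", y = "Row", fill = "Yield")
p_field

# Residuals from the two-way ANOVA model, averaged by column.
fit <- lm(yield ~ 1 + rep + gen, data = gilmour.serpentine)
col_means <- tapply(resid(fit), gilmour.serpentine$col, mean)
col_means
```

If the model captured all systematic variation, the column means of the residuals would scatter around zero with no pattern; a regular pattern across columns points to structure the model has missed.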

"Teaching of Statistics should provide a more balanced blend of IDA and inference"

Chatfield (1985)



Yet there is still very little emphasis on it in teaching and, at times, in practice.


So don't forget to do IDA!

Take away messages

  • Initial data analysis (IDA) is a model-focussed exploration of data with two main objectives:
    • data description including scrutinizing for data quality, and
    • model formulation without any formal statistical inference.
  • IDA rarely sees the limelight even though it is the very foundation on which the main analysis is built.

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
