ETC5521: Exploratory Data Analysis

Initial data analysis

Lecturer: Emi Tanaka

ETC5521.Clayton-x@monash.edu

Week 3 - Session 2

Linear models in R REVIEW Part 1/3

library(tidyverse)
glimpse(cars)

## Rows: 50
## Columns: 2
## $ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13, 14, 14, 14, 14, 15, 15, 15, 16, 16, 17, 17, 17, 18, 18, 18, 18, 19, 19, 19, 20, 20, 20, 20, 20, 22, 23, 24, 24, 24…
## $ dist  <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, 34, 34, 46, 26, 36, 60, 80, 20, 26, 54, 32, 40, 32, 40, 50, 42, 56, 76, 84, 36, 46, 68, 32, 48, 52, 56, 64, 66, 54, 70, 92…

ggplot(cars, aes(speed, dist)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)

Linear models in R REVIEW Part 2/3

We can fit linear models in R with the lm function:

lm(dist ~ speed, data = cars)

is the same as

lm(dist ~ 1 + speed, data = cars)

The above model is written mathematically as

$y_i = \beta_0 + \beta_1 x_i + e_i$

where $y_i$ and $x_i$ are the stopping distance (in ft) and speed (in mph), respectively, of the $i$-th car;
$\beta_0$ and $\beta_1$ are the intercept and slope, respectively; and
$e_i$ is the random error, usually assumed to satisfy $e_i \sim NID(0, \sigma^2)$.
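The equivalence of the two formulas can be checked directly. The sketch below uses only base R and the built-in cars data; the object names (fit_implicit, fit_explicit, fit_no_intercept, X) are illustrative choices, not part of the original slides.

```r
# Fit the model with and without the explicit intercept term
fit_implicit <- lm(dist ~ speed, data = cars)
fit_explicit <- lm(dist ~ 1 + speed, data = cars)

# The design matrix has a column of ones (intercept) plus the speed column
X <- model.matrix(fit_explicit)
head(X, 3)

# Identical estimates: the two formulas specify the same model
all.equal(coef(fit_implicit), coef(fit_explicit))

# The intercept is only dropped when this is requested explicitly
fit_no_intercept <- lm(dist ~ 0 + speed, data = cars)
names(coef(fit_no_intercept))
```

Writing the 1 explicitly is a stylistic choice that makes the intercept visible in the formula, which mirrors the $\beta_0$ term in the equation.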

Linear models in R REVIEW Part 3/3

fit <- lm(dist ~ 1 + speed, data = cars)
coef(fit)

## (Intercept)       speed 
##  -17.579095    3.932409

So $\hat{\beta}_0 \approx -17.58$ and $\hat{\beta}_1 \approx 3.93$ .

Assuming this model is appropriate, the stopping distance increases by about 4 ft for every 1 mph increase in speed.
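This interpretation of the slope can be verified with predict. A minimal sketch, in which the two speeds (10 and 11 mph) and the names new_speeds, pred and slope_check are arbitrary choices for illustration:

```r
fit <- lm(dist ~ 1 + speed, data = cars)

# Predicted stopping distances at two speeds that differ by 1 mph
new_speeds <- data.frame(speed = c(10, 11))
pred <- predict(fit, newdata = new_speeds)
pred

# The difference between the two predictions is the estimated slope
slope_check <- unname(diff(pred))
slope_check
coef(fit)["speed"]
```

Because the model is linear in speed, the same difference is obtained for any pair of speeds that are 1 mph apart.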

2 Model formulation Part 1/2

Say we are interested in characterising the price of a diamond in terms of its carat.
Looking at this plot, would you fit a linear model with formula

price ~ 1 + carat?
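A plot of this kind can be drawn as follows. This is a sketch, not the original figure: it assumes the data are the diamonds data shipped with ggplot2, and the transparency setting and the name p_diamonds are illustrative.

```r
library(ggplot2)

# Assumption: the diamonds data from ggplot2 (one row per diamond)
p_diamonds <- ggplot(diamonds, aes(carat, price)) +
  geom_point(alpha = 0.1) +
  # Overlay the straight-line fit implied by price ~ 1 + carat
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

p_diamonds
```

Comparing the point cloud with the straight line is an informal, visual check of whether a linear relationship is plausible.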

2 Model formulation Part 2/2

What about price ~ poly(carat, 2)? This fits the same quadratic curve as:

$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + e_i.$
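One detail worth checking: by default poly() builds orthogonal polynomial terms, so the fitted curve matches the equation above but the reported coefficients are not $\beta_1$ and $\beta_2$ on the raw carat scale. The sketch below assumes the diamonds data from ggplot2; the object names are illustrative.

```r
# Assumption: the diamonds data shipped with ggplot2
data("diamonds", package = "ggplot2")

fit_orth <- lm(price ~ poly(carat, 2), data = diamonds)              # orthogonal basis (default)
fit_raw  <- lm(price ~ poly(carat, 2, raw = TRUE), data = diamonds)  # raw powers of carat
fit_I    <- lm(price ~ 1 + carat + I(carat^2), data = diamonds)      # same as raw, written out

# Same fitted curve under either basis
all.equal(fitted(fit_orth), fitted(fit_raw))

# The raw basis reproduces the coefficients of the written-out formula
all.equal(unname(coef(fit_raw)), unname(coef(fit_I)))

# The coefficients differ between bases
coef(fit_orth)
coef(fit_raw)
```

Use raw = TRUE (or the I(carat^2) form) when the coefficients themselves are to be read as $\beta_0$, $\beta_1$ and $\beta_2$ in the equation.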

If so, should the assumption about the error distribution be modified?

Should we make some transformation before modelling?

Are there other candidate models?
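These questions can be explored informally by fitting a few candidates and looking at their residuals, with no formal inference. The sketch below assumes the diamonds data from ggplot2; the three candidates (including the log-log model) are examples chosen for illustration, not a list given in the slides.

```r
# Assumption: the diamonds data shipped with ggplot2
data("diamonds", package = "ggplot2")

# Three illustrative candidate models
candidates <- list(
  linear    = lm(price ~ 1 + carat, data = diamonds),
  quadratic = lm(price ~ poly(carat, 2), data = diamonds),
  loglog    = lm(log(price) ~ 1 + log(carat), data = diamonds)
)

# Residuals against fitted values, one panel per candidate
op <- par(mfrow = c(1, 3))
for (nm in names(candidates)) {
  plot(fitted(candidates[[nm]]), resid(candidates[[nm]]),
       pch = ".", main = nm, xlab = "Fitted value", ylab = "Residual")
  abline(h = 0, lty = 2)
}
par(op)
```

Curvature or a fan shape in a panel suggests that the mean structure or the constant-variance assumption of that candidate needs rethinking. Note that the log-log residuals are on the log(price) scale, so the panels are compared by shape rather than by magnitude.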

2 Model formulation Part 2/2

Notice that we did no formal statistical inference while initially formulating the model.
The goal of the main analysis is to characterise the price of a diamond by its carat. This may involve:
- formal inference for model selection;
- justification of the selected "final" model; and
- fitting the final model.
There may in fact be many, many models considered but discarded at the IDA stage.
These discarded models are hardly ever reported. Consequently, the majority of reported statistics give a distorted view, and it's important to remind yourself of what might not be reported.

Model selection

All models are approximate and tentative; approximate in the sense that no model is exactly true and tentative in that they may be modified in the light of further data

—Chatfield (1985)

All models are wrong but some are useful

—George Box


Case study 4 Wheat yield in South Australia Part 1/9

A wheat breeding trial to test 107 varieties (also called genotypes) was conducted as a field experiment laid out in a rectangular array with 22 rows and 15 columns.

data("gilmour.serpentine", package = "agridat")
skimr::skim(gilmour.serpentine)

## ── Data Summary ────────────────────────
##                            Values            
## Name                       gilmour.serpentine
## Number of rows             330               
## Number of columns          5                 
## _______________________                      
## Column type frequency:                       
##   factor                   2                 
##   numeric                  3                 
## ________________________                     
## Group variables            None              
## 
## ── Variable type: factor ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate ordered n_unique top_counts                    
## 1 rep                   0             1 FALSE          3 R1: 110, R2: 110, R3: 110     
## 2 gen                   0             1 FALSE        107 TIN: 6, VF6: 6, WW1: 6, (WW: 3
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate  mean     sd    p0   p25   p50   p75  p100 hist 
## 1 col                   0             1   8     4.33     1     4   8     12     15 ▇▇▇▇▇
## 2 row                   0             1  11.5   6.35     1     6  11.5   17     22 ▇▆▆▆▇
## 3 yield                 0             1 592.  154.     194   469 618.   714.   925 ▂▅▆▇▂

Gilmour, Cullis and Verbyla (1997) Accounting for natural and extraneous variation in the analysis of field experiments. Journal of Agricultural, Biological, and Environmental Statistics 2, 269-293.

Case study 4 Wheat yield in South Australia Part 2/9

Experimental Design

The experiment employs what is referred to as a randomised complete block design (RCBD). Technically it is near-complete rather than exactly an RCBD, because the check varieties have double the replicates of the test varieties.
RCBD means that
- there is an equal number of replicates for each treatment (here it is gen);
- each treatment appears exactly once in each block;
- the blocks are of the same size; and
- the treatments are randomised within each block.

In agricultural field experiments, blocks are formed spatially by grouping plots within contiguous areas (called rep here).
The boundaries of blocks may be chosen arbitrarily.
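The departure from an exact RCBD can be checked by counting replicates. A minimal sketch, assuming the agridat package is installed; the names reps_per_gen and gen_by_rep are illustrative.

```r
data("gilmour.serpentine", package = "agridat")

# Number of plots per variety, then how many varieties share each count
reps_per_gen <- table(gilmour.serpentine$gen)
table(reps_per_gen)

# Number of times each variety appears in each block (rep)
gen_by_rep <- with(gilmour.serpentine, table(gen, rep))
table(gen_by_rep)
```

This agrees with the skim output: three varieties (the checks) have 6 plots and the remaining 104 have 3, giving 3 × 6 + 104 × 3 = 330 plots.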


Case study 4 Wheat yield in South Australia Part 3/9

Experimental Design
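The layout of the blocks in the field can be inspected with a tile plot of the plot positions. This is a sketch of one possible display, not a reproduction of the original figure; it assumes the ggplot2 and agridat packages, and the name p_layout and the styling are illustrative.

```r
library(ggplot2)
data("gilmour.serpentine", package = "agridat")

# One tile per field plot, positioned by column and row, coloured by block
p_layout <- ggplot(gilmour.serpentine, aes(col, row, fill = rep)) +
  geom_tile(colour = "white") +
  coord_equal() +
  labs(x = "Column", y = "Row", fill = "Block (rep)")

p_layout
```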


Case study 4 Wheat yield in South Australia Part 4/9

Analysis

In the main analysis, people would commonly analyse this using what is called a two-way ANOVA model (with no interaction effect).
The two-way ANOVA model has the form yield = mean + block + treatment + error.
So for this data,

fit <- lm(yield ~ 1 + rep + gen, 
          data = gilmour.serpentine)


Case study 4 Wheat yield in South Australia Part 5/9

Analysis

summary(fit)

## 
## Call:
## lm(formula = yield ~ 1 + rep + gen, data = gilmour.serpentine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -245.070  -69.695   -1.182   71.427  250.652 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      720.248     67.335  10.697  < 2e-16 ***
## repR2             96.100     15.585   6.166 3.29e-09 ***
## repR3           -129.845     15.585  -8.331 8.44e-15 ***
## gen(WqKPWmH*3Ag   24.333     94.372   0.258 0.796766    
## genAMERY         -93.333     94.372  -0.989 0.323747    
## genANGAS        -132.667     94.372  -1.406 0.161192    
## genAROONA       -153.667     94.372  -1.628 0.104884    
## genBATAVIA      -175.333     94.372  -1.858 0.064513 .  
## genBD231         -70.333     94.372  -0.745 0.456895    
## genBEULAH       -173.667     94.372  -1.840 0.067074 .  
## genBLADE        -270.000     94.372  -2.861 0.004628 ** 
## genBT_SCHOMBURG  -49.000     94.372  -0.519 0.604125    
## genCADOUX       -223.333     94.372  -2.367 0.018820 *  
## genCONDOR       -124.333     94.372  -1.317 0.189041    
## genCORRIGIN     -217.667     94.372  -2.306 0.022010 *  
## genCUNNINGHAM   -254.667     94.372  -2.699 0.007502 ** 
## genDGR/MNX-9-9e  -47.667     94.372  -0.505 0.613996    
## genDOLLARBIRD   -200.667     94.372  -2.126 0.034584 *  
## genEXCALIBUR     -55.000     94.372  -0.583 0.560621    
## genGOROKE       -141.667     94.372  -1.501 0.134743    
## genHALBERD       -53.333     94.372  -0.565 0.572551    
## genHOUTMAN      -209.333     94.372  -2.218 0.027560 *  
## genJANZ         -214.667     94.372  -2.275 0.023884 *  
## genK2011-5*      -87.333     94.372  -0.925 0.355758    
## genKATUNGA      -110.333     94.372  -1.169 0.243609    
## genKIATA        -165.667     94.372  -1.755 0.080565 .  
## genKITE         -180.000     94.372  -1.907 0.057772 .  
## genKULIN         -91.000     94.372  -0.964 0.335964    
## genLARK         -336.333     94.372  -3.564 0.000448 ***
## genLOWAN        -152.333     94.372  -1.614 0.107915    
## genM4997        -146.000     94.372  -1.547 0.123277    
## genM5075        -194.667     94.372  -2.063 0.040304 *  
## genM5097        -102.667     94.372  -1.088 0.277826    
## genMACHETE      -231.333     94.372  -2.451 0.015010 *  
## genMEERING      -247.667     94.372  -2.624 0.009286 ** 
## genMOLINEUX     -165.667     94.372  -1.755 0.080565 .  
## genOSPREY       -162.000     94.372  -1.717 0.087451 .  
## genOUYEN        -136.667     94.372  -1.448 0.148986    
## genOXLEY        -221.667     94.372  -2.349 0.019713 *  
## genPELSART      -200.333     94.372  -2.123 0.034882 *  
## genPEROUSE      -283.667     94.372  -3.006 0.002955 ** 
## genRAC655       -112.667     94.372  -1.194 0.233813    
## genRAC655'S'    -113.667     94.372  -1.204 0.229702    
## genRAC696         -3.667     94.372  -0.039 0.969042    
## genRAC710        -51.000     94.372  -0.540 0.589455    
## genRAC750        -77.333     94.372  -0.819 0.413410    
## genRAC759        -42.000     94.372  -0.445 0.656721    
## genRAC772          5.000     94.372   0.053 0.957794    
## genRAC777       -172.333     94.372  -1.826 0.069183 .  
## genRAC779          3.667     94.372   0.039 0.969042    
## genRAC787       -118.000     94.372  -1.250 0.212486    
## genRAC791        -72.667     94.372  -0.770 0.442120    
## genRAC792       -102.333     94.372  -1.084 0.279385    
## genRAC798         -1.667     94.372  -0.018 0.985926    
## genRAC804        -45.000     94.372  -0.477 0.633949    
## genRAC805        -43.000     94.372  -0.456 0.649093    
## genRAC806        -35.333     94.372  -0.374 0.708462    
## genRAC807        -91.333     94.372  -0.968 0.334201    
## genRAC808        -54.000     94.372  -0.572 0.567765    
## genRAC809        -43.333     94.372  -0.459 0.646559    
## genRAC810       -131.667     94.372  -1.395 0.164359    
## genRAC811         42.333     94.372   0.449 0.654174    
## genRAC812        -94.000     94.372  -0.996 0.320310    
## genRAC813        -83.333     94.372  -0.883 0.378179    
## genRAC814        -72.333     94.372  -0.766 0.444214    
## genRAC815       -111.000     94.372  -1.176 0.240781    
## genRAC816        -66.333     94.372  -0.703 0.482862    
## genRAC817       -100.000     94.372  -1.060 0.290466    
## genRAC818       -107.000     94.372  -1.134 0.258101    
## genRAC819       -121.333     94.372  -1.286 0.199895    
## genRAC820         -1.000     94.372  -0.011 0.991555    
## genRAC821        -98.333     94.372  -1.042 0.298560    
## genROSELLA      -184.333     94.372  -1.953 0.052050 .  
## genSCHOMBURGK   -132.333     94.372  -1.402 0.162242    
## genSHRIKE       -128.000     94.372  -1.356 0.176376    
## genSPEAR        -254.667     94.372  -2.699 0.007502 ** 
## genSTILETTO     -157.000     94.372  -1.664 0.097603 .  
## genSUNBRI       -218.333     94.372  -2.314 0.021612 *  
## genSUNFIELD     -206.667     94.372  -2.190 0.029576 *  
## genSUNLAND      -182.667     94.372  -1.936 0.054192 .  
## genSWIFT        -197.000     94.372  -2.087 0.037990 *  
## genTASMAN       -161.000     94.372  -1.706 0.089410 .  
## genTATIARA       -64.333     94.372  -0.682 0.496142    
## genTINCURRIN     -19.000     81.728  -0.232 0.816382    
## genTRIDENT      -132.667     94.372  -1.406 0.161192    
## genVF299         -66.333     94.372  -0.703 0.482862    
## genVF300        -111.667     94.372  -1.183 0.237976    
## genVF302        -108.333     94.372  -1.148 0.252234    
## genVF508          11.667     94.372   0.124 0.901725    
## genVF519          -1.000     94.372  -0.011 0.991555    
## genVF655        -160.167     81.728  -1.960 0.051283 .  
## genVF664        -106.667     94.372  -1.130 0.259583    
## genVG127        -109.667     94.372  -1.162 0.246460    
## genVG503         -43.000     94.372  -0.456 0.649093    
## genVG506        -108.667     94.372  -1.151 0.250782    
## genVG701         -19.333     94.372  -0.205 0.837867    
## genVG714        -108.333     94.372  -1.148 0.252234    
## genVG878          52.333     94.372   0.555 0.579767    
## genWARBLER      -217.000     94.372  -2.299 0.022415 *  
## genWI216           4.000     94.372   0.042 0.966230    
## genWI221         -17.333     94.372  -0.184 0.854440    
## genWI231        -218.333     94.372  -2.314 0.021612 *  
## genWI232         -56.333     94.372  -0.597 0.551165    
## genWILGOYNE     -131.000     94.372  -1.388 0.166496    
## genWW1402       -117.333     94.372  -1.243 0.215071    
## genWW1477       -185.667     81.728  -2.272 0.024064 *  
## genWW1831        -86.667     94.372  -0.918 0.359435    
## genWYUNA        -176.667     94.372  -1.872 0.062524 .  
## genYARRALINKA   -245.000     94.372  -2.596 0.010061 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 115.6 on 221 degrees of freedom
## Multiple R-squared:  0.6226,    Adjusted R-squared:  0.4381 
## F-statistic: 3.375 on 108 and 221 DF,  p-value: 1.081e-14


Case study 4 Wheat yield in South Australia Part 6/9


Case study 4 Wheat yield in South Australia Part 7/9

Do you notice anything from below?


Case study 4 Wheat yield in South Australia Part 8/9


Case study 4 Wheat yield in South Australia Part 9/9

It's well known that traits measured in agricultural field trials show spatial variation; this could be due to fertility trends, management practices or other reasons.
At the IDA stage, you investigate the data to identify these spatial variations: you cannot simply fit a two-way ANOVA model!
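One informal way to look for spatial variation is to map the residuals of the two-way ANOVA fit back onto the field. The sketch below assumes the ggplot2 and agridat packages; the names field and p_resid and the colour scale are illustrative choices, and the plot is not a reproduction of the original figures.

```r
library(ggplot2)
data("gilmour.serpentine", package = "agridat")

# Two-way ANOVA fit from the main analysis
fit <- lm(yield ~ 1 + rep + gen, data = gilmour.serpentine)

# Attach each residual to its plot position in the field
field <- transform(gilmour.serpentine, residual = resid(fit))

# Diverging colour scale centred at zero; spatial trends show up as bands or patches
p_resid <- ggplot(field, aes(col, row, fill = residual)) +
  geom_tile() +
  scale_fill_gradient2() +
  coord_equal() +
  labs(x = "Column", y = "Row", fill = "Residual")

p_resid
```

If the model captured all systematic variation, the residuals would look like unstructured noise across the field; visible stripes along rows or columns point to spatial effects that the block term alone does not absorb.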

"Teaching of Statistics should provide a more balanced blend of IDA and inference"

Chatfield (1985)

Yet there is still very little emphasis on it in teaching and, at times, in practice.

So don't forget to do IDA!

Take away messages

Initial data analysis (IDA) is a model-focussed exploration of data with two main objectives:
- data description, including scrutinising data quality; and
- model formulation without any formal statistical inference.

IDA hardly sees the limelight even though it is the very foundation on which the main analysis is built.


This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
