These slides are viewed best by Chrome or Firefox and occasionally need to be refreshed if elements did not load properly. See here for the PDF .

Press the right arrow to progress to the next slide!

1/18

ETC5521: Exploratory Data Analysis

Making comparisons between groups and strata

Lecturer: Emi Tanaka

ETC5521.Clayton-x@monash.edu

Week 6 - Session 2

1/18

Modelling and testing for comparisons2/18

Revisiting Case study 1 Pest resistance maize

data(gathmann.bt, package = "agridat")
df1 <- gathmann.bt %>% 
  pivot_longer(-gen, values_to = "abundance",
               names_to = "species") %>% 
  mutate(species = case_when(species=="thysan" ~ "Thrip",
                             TRUE ~ "Spider"))
skimr::skim(df1)

## ── Data Summary ────────────────────────
##                            Values
## Name                       df1   
## Number of rows             32    
## Number of columns          3     
## _______________________          
## Column type frequency:           
##   character                1     
##   factor                   1     
##   numeric                  1     
## ________________________         
## Group variables            None  
## 
## ── Variable type: character ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate   min   max empty n_unique whitespace
## 1 species               0             1     5     6     0        2          0
## 
## ── Variable type: factor ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate ordered n_unique top_counts     
## 1 gen                   0             1 FALSE          2 Bt: 16, ISO: 16
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate  mean    sd    p0   p25   p50   p75  p100 hist 
## 1 abundance             0             1  6.01  6.36     0   0.6   2.5  10.7  18.4 ▇▂▃▁▂

ggplot(df1, aes(gen, abundance, color = gen)) +
  ggbeeswarm::geom_quasirandom(size = 3) + 
  facet_wrap(~species, scales = "free") + 
  scale_color_discrete_qualitative() + 
  guides(color = FALSE) + 
  labs(x = "", y = "Abundance") + 
  theme(axis.text.x = element_text(angle = 90, 
                                   vjust = 0.5, 
                                   hjust = 1))

The experiment compared abundance of spiders and thrips on Bt variety to the abundance of those on isogenic control variety.

3/18

Revisiting Case study 1 Pest resistance maize

data(gathmann.bt, package = "agridat")
df1 <- gathmann.bt %>% 
  pivot_longer(-gen, values_to = "abundance",
               names_to = "species") %>% 
  mutate(species = case_when(species=="thysan" ~ "Thrip",
                             TRUE ~ "Spider"))
skimr::skim(df1)

## ── Data Summary ────────────────────────
##                            Values
## Name                       df1   
## Number of rows             32    
## Number of columns          3     
## _______________________          
## Column type frequency:           
##   character                1     
##   factor                   1     
##   numeric                  1     
## ________________________         
## Group variables            None  
## 
## ── Variable type: character ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate   min   max empty n_unique whitespace
## 1 species               0             1     5     6     0        2          0
## 
## ── Variable type: factor ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate ordered n_unique top_counts     
## 1 gen                   0             1 FALSE          2 Bt: 16, ISO: 16
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate  mean    sd    p0   p25   p50   p75  p100 hist 
## 1 abundance             0             1  6.01  6.36     0   0.6   2.5  10.7  18.4 ▇▂▃▁▂

ggplot(df1, aes(gen, abundance, color = gen)) +
  ggbeeswarm::geom_quasirandom(size = 3) + 
  facet_wrap(~species, scales = "free") + 
  scale_color_discrete_qualitative() + 
  guides(color = FALSE) + 
  labs(x = "", y = "Abundance") + 
  theme(axis.text.x = element_text(angle = 90, 
                                   vjust = 0.5, 
                                   hjust = 1))

The experiment compared abundance of spiders and thrips on Bt variety to the abundance of those on isogenic control variety.
Would you say that the abundance of spiders and/or thrips are comparable between Bt variety and isogenic variety?

3/18

Two-sample parametric tests: $t$ -test

Assumes the two samples are independent and from the $N(\mu_x, \sigma^2_x)$ and $N(\mu_y, \sigma^2_y)$ , respectively.

$H_0: \mu_x - \mu_y = 0~\text{ vs. }~H_1: \mu_x - \mu_y \neq 0$

$T^* = \frac{\bar{X} - \bar{Y}}{SE(\bar{X} - \bar{Y})}.$

Assuming $\sigma^2_x = \sigma^2_y$ , then $T^* \sim t_{n_x + n_y - 2}.$
A $100(1 - \alpha)\%$ confidence interval for $\mu_x - \mu_y$ is given as $(L, U)$ such that: $P(L<\mu_x - \mu_y < U) = 1 - \frac{\alpha}{2}.$
If $0 \in (L, U)$ , consistent with $H_0$ .

with(gathmann.bt,
     t.test(thysan[gen=="ISO"],
            thysan[gen=="Bt"],
            alternative = "two.sided",
            var.equal = TRUE,
            conf.level = 0.95))

## 
##     Two Sample t-test
## 
## data:  thysan[gen == "ISO"] and thysan[gen == "Bt"]
## t = -3.2182, df = 14, p-value = 0.006192
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -9.248813 -1.851187
## sample estimates:
## mean of x mean of y 
##     8.725    14.275

]

Note significance test suggested is different in Achim Gathmann et al. (2006) “Impact of Bt Maize Pollen (MON810) on Lepidopteran Larvae Living on Accompanying Weeds.” Molecular Ecology 15: 2677–85.

4/18

Confidence interval for two sample difference

In the right, a 95% confidence interval for population mean difference is constructed repeatedly, assuming population mean difference is Normally distributed, from 100 samples of the same population.
The population mean is zero.
Each confidence interval is calculated as $\bar{X} - \bar{Y} \pm \color{red}{t_{n-2, 0.975}}\times SE(\bar{X} - \bar{Y})$ where $t_{n-2, 0.975}$ is $t^*$ such that $P(t_{n-2} < t^*) = 0.975.$

5/18

Two sample non-parametric tests

Wilcoxon rank-sum test

Suppose that $X$ and $Y$ are randomly selected values from two populations.

$H_0: P(X > Y) = P(X < Y)$

vs.

$H_1: P(X>Y)\neq P(X<Y)$

All observations are ranked.
Test statistic is based on the sum of the ranks of one group.

with(gathmann.bt,
     wilcox.test(thysan[gen=="ISO"],
                 thysan[gen=="Bt"],
                 alternative = "two.sided",
                 conf.int = TRUE, 
                 conf.level = 0.95))

## 
##     Wilcoxon rank sum exact test
## 
## data:  thysan[gen == "ISO"] and thysan[gen == "Bt"]
## W = 7, p-value = 0.006993
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  -9.4 -2.4
## sample estimates:
## difference in location 
##                   -6.3

6/18

Equivalence of tests to testing model parameters

##    gen thysan aranei
## 1   Bt   16.6   0.80
## 2   Bt   16.4   0.80
## 3   Bt   11.0   0.60
## 4   Bt   16.8   0.40
## 5   Bt   10.6   0.60
## 6   Bt   18.4   0.80
## 7   Bt   14.2   0.00
## 8   Bt   10.2   0.60
## 9  ISO    6.2   0.75
## 10 ISO   10.0   0.20
## 11 ISO   11.8   1.00
## 12 ISO   15.6   0.80
## 13 ISO    7.6   0.00
## 14 ISO    7.4   0.00
## 15 ISO    7.2   0.60
## 16 ISO    4.0   0.40

$\texttt{thysan}_i = \beta_0 + \beta_1\mathbb{I}(\texttt{gen}_i=\texttt{ISO}) + e_i$ where $e_i \sim NID(0, \sigma^2)$ .

The least squares estimate for $\hat{\beta}_1 = \bar{X} - \bar{Y}.$

lm(thysan ~ gen, data = gathmann.bt) %>% 
  confint("genISO", level = 0.95)

##            2.5 %    97.5 %
## genISO -9.248813 -1.851187

7/18

Equivalence of tests to testing model parameters

##    gen thysan aranei
## 1   Bt   16.6   0.80
## 2   Bt   16.4   0.80
## 3   Bt   11.0   0.60
## 4   Bt   16.8   0.40
## 5   Bt   10.6   0.60
## 6   Bt   18.4   0.80
## 7   Bt   14.2   0.00
## 8   Bt   10.2   0.60
## 9  ISO    6.2   0.75
## 10 ISO   10.0   0.20
## 11 ISO   11.8   1.00
## 12 ISO   15.6   0.80
## 13 ISO    7.6   0.00
## 14 ISO    7.4   0.00
## 15 ISO    7.2   0.60
## 16 ISO    4.0   0.40

$\texttt{thysan}_i = \beta_0 + \beta_1\mathbb{I}(\texttt{gen}_i=\texttt{ISO}) + e_i$ where $e_i \sim NID(0, \sigma^2)$ .

The least squares estimate for $\hat{\beta}_1 = \bar{X} - \bar{Y}.$

lm(thysan ~ gen, data = gathmann.bt) %>% 
  confint("genISO", level = 0.95)

##            2.5 %    97.5 %
## genISO -9.248813 -1.851187

Notice that the above confidence interval is the same confidence interval from the $t$ -test!

7/18

Revisiting Case study 4 Weight gain of calves Part 1/3

67 calves born in 1975 across 11 herds are fed of one of three diets with low, medium or high energy with their initial and final weights recorded.

data(urquhart.feedlot, package = "agridat")
df4 <- urquhart.feedlot %>% 
pivot_longer(c(weight1, weight2), 
           names_to = "when",
           values_to = "weight") %>% 
mutate(when = factor(as.character(when), 
                   labels = c("initial", "final"),
                   levels = c("weight1", "weight2")),
     diet = factor(diet, levels = c("High", "Medium", "Low")))
skimr::skim(df4)

## ── Data Summary ────────────────────────
##                            Values
## Name                       df4   
## Number of rows             134   
## Number of columns          5     
## _______________________          
## Column type frequency:           
##   factor                   2     
##   numeric                  3     
## ________________________         
## Group variables            None  
## 
## ── Variable type: factor ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate ordered n_unique top_counts               
## 1 diet                  0             1 FALSE          3 Low: 64, Med: 46, Hig: 24
## 2 when                  0             1 FALSE          2 ini: 67, fin: 67         
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate  mean    sd    p0   p25   p50    p75  p100 hist 
## 1 animal                0             1  34    19.4     1  17.2   34    50.8    67 ▇▇▇▇▇
## 2 herd                  0             1  25.6  10.7     3  19     31    34      36 ▂▁▂▂▇
## 3 weight                0             1 863.  182.    530 705    842. 1027    1248 ▃▇▃▆▃

ggplot(df4, aes(when, weight, color = diet, group = animal)) + 
geom_point(size = 3) + 
facet_wrap(~herd, nrow = 2) + 
geom_line() + 
labs(x = "", y = "Weight", color = "Diet")

8/18

Revisiting Case study 4 Weight gain of calves Part 2/3

Modelling the response as weight gain with diet factor:

coef(lm((weight2 - weight1) ~ diet, data = urquhart.feedlot))

## (Intercept)     dietLow  dietMedium 
##  332.666667   -4.666667  -33.971014

The herd is thought to be an important factor contributing to the response.

Modelling the response as weight gain with diet and herd factor:

# herd needs to be factor not integer
dat4 <- mutate(urquhart.feedlot,  herdf = factor(herd))
coef(lm((weight2 - weight1) ~ herdf + diet, data = dat4))

## (Intercept)      herdf9     herdf16     herdf19     herdf24     herdf31     herdf32     herdf33     herdf34     herdf35     herdf36     dietLow  dietMedium 
##  354.257353  -91.148529  -51.312039    7.410059  -63.221311   -4.666667  -51.189338  -36.083555  -15.179622   -3.423367  -34.974083    2.872233  -23.580408

Last model is the same as modelling the final weight with the initial weight as a covariate with slope fixed to 1:

coef(lm(weight2 ~ offset(weight1) + herdf + diet, 
    data = dat4))

## (Intercept)      herdf9     herdf16     herdf19     herdf24     herdf31     herdf32     herdf33     herdf34     herdf35     herdf36     dietLow  dietMedium 
##  354.257353  -91.148529  -51.312039    7.410059  -63.221311   -4.666667  -51.189338  -36.083555  -15.179622   -3.423367  -34.974083    2.872233  -23.580408

Estimating slope for initial weight from the data:

coef(lm(weight2 ~ weight1 + herdf + diet,
    data = dat4))

## (Intercept)     weight1      herdf9     herdf16     herdf19     herdf24     herdf31     herdf32     herdf33     herdf34     herdf35     herdf36     dietLow  dietMedium 
##  200.440174    1.243238  -79.102111  -51.238137   -6.864643  -75.406093  -33.044411  -56.517848  -62.152563  -23.610465  -29.219641  -75.555713    1.212633  -30.912720

9/18

Revisiting Case study 4 Weight gain of calves Part 3/3

dat4 <- lm(weight2 ~ weight1 + herdf + diet, data = dat4) %>% broom::augment() 
ggplot(dat4, aes(.fitted, .resid)) +
  geom_point(data = select(dat4, -herdf), size = 2, color = "gray") + 
  geom_point(size = 2, aes(color = herdf)) + 
  geom_hline(yintercept = 0) +
  labs(x = "Fitted values", y = "Residual") +
  scale_color_discrete_qualitative() +
  facet_wrap(~herdf, nrow = 2) + guides(color = FALSE)

10/18

Case study 10 Soil nitrogen Part 1/3

data(gomez.nitrogen, package = "agridat")
skimr::skim(gomez.nitrogen)

## ── Data Summary ────────────────────────
##                            Values        
## Name                       gomez.nitrogen
## Number of rows             96            
## Number of columns          4             
## _______________________                  
## Column type frequency:                   
##   factor                   3             
##   numeric                  1             
## ________________________                 
## Group variables            None          
## 
## ── Variable type: factor ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate ordered n_unique top_counts                    
## 1 trt                   0             1 FALSE          8 T1: 12, T2: 12, T3: 12, T4: 12
## 2 rep                   0             1 FALSE          4 R1: 24, R2: 24, R3: 24, R4: 24
## 3 stage                 0             1 FALSE          3 P1: 32, P2: 32, P3: 32        
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate  mean    sd    p0   p25   p50   p75  p100 hist 
## 1 nitro                 0             1  2.24 0.796  1.11  1.46  2.21  2.93  3.84 ▇▅▅▅▃

ggplot(gomez.nitrogen, aes(trt, nitro, color = stage)) + 
  geom_point(size = 3) + 
  scale_color_discrete_qualitative() +
  labs(x = "Fertilizer treatment", 
       y = "Soil nitrogen content (%)",
       color = "Growth stage")

Soil nitrogen content with 8 different fertilizer treatment is measured at 3 growth stage:
- P1 = 15 days post transplanting
- P2 = 40 days post transplanting
- P3 = panicle initiation

11/18

Case study 10 Soil nitrogen Part 1/3

data(gomez.nitrogen, package = "agridat")
skimr::skim(gomez.nitrogen)

## ── Data Summary ────────────────────────
##                            Values        
## Name                       gomez.nitrogen
## Number of rows             96            
## Number of columns          4             
## _______________________                  
## Column type frequency:                   
##   factor                   3             
##   numeric                  1             
## ________________________                 
## Group variables            None          
## 
## ── Variable type: factor ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate ordered n_unique top_counts                    
## 1 trt                   0             1 FALSE          8 T1: 12, T2: 12, T3: 12, T4: 12
## 2 rep                   0             1 FALSE          4 R1: 24, R2: 24, R3: 24, R4: 24
## 3 stage                 0             1 FALSE          3 P1: 32, P2: 32, P3: 32        
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate  mean    sd    p0   p25   p50   p75  p100 hist 
## 1 nitro                 0             1  2.24 0.796  1.11  1.46  2.21  2.93  3.84 ▇▅▅▅▃

ggplot(gomez.nitrogen, aes(trt, nitro, color = stage)) + 
  geom_point(size = 3) + 
  scale_color_discrete_qualitative() +
  labs(x = "Fertilizer treatment", 
       y = "Soil nitrogen content (%)",
       color = "Growth stage")

Soil nitrogen content with 8 different fertilizer treatment is measured at 3 growth stage:
- P1 = 15 days post transplanting
- P2 = 40 days post transplanting
- P3 = panicle initiation
Clearly the growth stage affects the soil nitrogen content but this makes it hard to compare the fertilizer treatments.

11/18

Case study 10 Soil nitrogen Part 1/3

data(gomez.nitrogen, package = "agridat")
skimr::skim(gomez.nitrogen)

## ── Data Summary ────────────────────────
##                            Values        
## Name                       gomez.nitrogen
## Number of rows             96            
## Number of columns          4             
## _______________________                  
## Column type frequency:                   
##   factor                   3             
##   numeric                  1             
## ________________________                 
## Group variables            None          
## 
## ── Variable type: factor ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate ordered n_unique top_counts                    
## 1 trt                   0             1 FALSE          8 T1: 12, T2: 12, T3: 12, T4: 12
## 2 rep                   0             1 FALSE          4 R1: 24, R2: 24, R3: 24, R4: 24
## 3 stage                 0             1 FALSE          3 P1: 32, P2: 32, P3: 32        
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate  mean    sd    p0   p25   p50   p75  p100 hist 
## 1 nitro                 0             1  2.24 0.796  1.11  1.46  2.21  2.93  3.84 ▇▅▅▅▃

ggplot(gomez.nitrogen, aes(trt, nitro, color = stage)) + 
  geom_point(size = 3) + 
  scale_color_discrete_qualitative() +
  labs(x = "Fertilizer treatment", 
       y = "Soil nitrogen content (%)",
       color = "Growth stage")

Soil nitrogen content with 8 different fertilizer treatment is measured at 3 growth stage:
- P1 = 15 days post transplanting
- P2 = 40 days post transplanting
- P3 = panicle initiation
Clearly the growth stage affects the soil nitrogen content but this makes it hard to compare the fertilizer treatments.
Let's model the nitrogen content as:

lm(nitro ~ stage + trt, data = gomez.nitrogen)

11/18

Case study 10 Soil nitrogen Part 2/3

Considering just the stage effect:

fit1 <- lm(nitro ~ stage, data = gomez.nitrogen)
fit1data <- broom::augment(fit1) %>% 
left_join(gomez.nitrogen, by=c("nitro", "stage")) %>% 
mutate(trt = fct_reorder(trt, .resid))
ggplot(fit1data, aes(trt, .resid)) +
geom_boxplot() + 
labs(x = "Fertilizer treatment", 
   y = "Residual of fit1")

12/18

Case study 10 Soil nitrogen Part 2/3

Considering just the stage effect:

fit1 <- lm(nitro ~ stage, data = gomez.nitrogen)
fit1data <- broom::augment(fit1) %>% 
left_join(gomez.nitrogen, by=c("nitro", "stage")) %>% 
mutate(trt = fct_reorder(trt, .resid))
ggplot(fit1data, aes(trt, .resid)) +
geom_boxplot() + 
labs(x = "Fertilizer treatment", 
   y = "Residual of fit1")

Here we expect no pattern:

fit2 <- lm(nitro ~ stage + trt, 
           data = gomez.nitrogen)
fit2data <- broom::augment(fit2) %>% 
  mutate(trt = fct_reorder(trt, .resid))
ggplot(fit2data, aes(trt, .resid)) +
  geom_boxplot() + 
  labs(x = "Fertilizer treatment", 
       y = "Residual of fit2")

12/18

Case study 10 Soil nitrogen Part 3/3

library(emmeans)
confint(pairs(emmeans(fit2, "trt"), adjust="none"))

##  contrast estimate    SE df lower.CL  upper.CL
##  T1 - T2   -0.2117 0.116 86  -0.4420  0.018654
##  T1 - T3   -0.3375 0.116 86  -0.5678 -0.107180
##  T1 - T4   -0.2308 0.116 86  -0.4612 -0.000513
##  T1 - T5   -0.0717 0.116 86  -0.3020  0.158654
##  T1 - T6   -0.1492 0.116 86  -0.3795  0.081154
##  T1 - T7   -0.3592 0.116 86  -0.5895 -0.128846
##  T1 - T8   -0.2333 0.116 86  -0.4637 -0.003013
##  T2 - T3   -0.1258 0.116 86  -0.3562  0.104487
##  T2 - T4   -0.0192 0.116 86  -0.2495  0.211154
##  T2 - T5    0.1400 0.116 86  -0.0903  0.370320
##  T2 - T6    0.0625 0.116 86  -0.1678  0.292820
##  T2 - T7   -0.1475 0.116 86  -0.3778  0.082820
##  T2 - T8   -0.0217 0.116 86  -0.2520  0.208654
##  T3 - T4    0.1067 0.116 86  -0.1237  0.336987
##  T3 - T5    0.2658 0.116 86   0.0355  0.496154
##  T3 - T6    0.1883 0.116 86  -0.0420  0.418654
##  T3 - T7   -0.0217 0.116 86  -0.2520  0.208654
##  T3 - T8    0.1042 0.116 86  -0.1262  0.334487
##  T4 - T5    0.1592 0.116 86  -0.0712  0.389487
##  T4 - T6    0.0817 0.116 86  -0.1487  0.311987
##  T4 - T7   -0.1283 0.116 86  -0.3587  0.101987
##  T4 - T8   -0.0025 0.116 86  -0.2328  0.227820
##  T5 - T6   -0.0775 0.116 86  -0.3078  0.152820
##  T5 - T7   -0.2875 0.116 86  -0.5178 -0.057180
##  T5 - T8   -0.1617 0.116 86  -0.3920  0.068654
##  T6 - T7   -0.2100 0.116 86  -0.4403  0.020320
##  T6 - T8   -0.0842 0.116 86  -0.3145  0.146154
##  T7 - T8    0.1258 0.116 86  -0.1045  0.356154
## 
## Results are averaged over the levels of: stage 
## Confidence level used: 0.95

From above, the 6 pairs of treatments: T3 & T5, T1 & T4, T1 & T8, T6 & T7, T1 & T3, T1 & T7 are significantly different.
These confidence intervals are constructed without taking any regard for others.

13/18

Controlling the family-wise error rate

Unadjusted

Each interval has been constructed using a procedure so that when the model is correct, the probability that the "correct" population contrast is covered is 0.95. . . individually.

$\bar{X} - \bar{Y} \pm \color{red}{t_{n-t,1 - \alpha/2}}\times SE(\bar{X} - \bar{Y})$ where $\alpha = 0.05$ and $t$ is the number of treatments.

But, what is the probability that all intervals cover their corresponding true values simultaneously?

Bonferonni adjustment

We can adjust the individual $100(1-\alpha)\%$ confidence intervals so

$\bar{X} - \bar{Y} \pm \color{red}{t_{n-t,1 - \alpha/(2m)}}\times SE(\bar{X} - \bar{Y})$ where $m$ is the number of pairwise comparisons.

So for 8 treatments, the number of pairwise comparisons is

choose(8, 2)

## [1] 28

14/18

Bonferonni adjusted confidence interval

confint(pairs(emmeans(fit2, "trt"),
              adjust="bonferroni"))

##  contrast estimate    SE df lower.CL upper.CL
##  T1 - T2   -0.2117 0.116 86   -0.585   0.1619
##  T1 - T3   -0.3375 0.116 86   -0.711   0.0361
##  T1 - T4   -0.2308 0.116 86   -0.604   0.1427
##  T1 - T5   -0.0717 0.116 86   -0.445   0.3019
##  T1 - T6   -0.1492 0.116 86   -0.523   0.2244
##  T1 - T7   -0.3592 0.116 86   -0.733   0.0144
##  T1 - T8   -0.2333 0.116 86   -0.607   0.1402
##  T2 - T3   -0.1258 0.116 86   -0.499   0.2477
##  T2 - T4   -0.0192 0.116 86   -0.393   0.3544
##  T2 - T5    0.1400 0.116 86   -0.234   0.5136
##  T2 - T6    0.0625 0.116 86   -0.311   0.4361
##  T2 - T7   -0.1475 0.116 86   -0.521   0.2261
##  T2 - T8   -0.0217 0.116 86   -0.395   0.3519
##  T3 - T4    0.1067 0.116 86   -0.267   0.4802
##  T3 - T5    0.2658 0.116 86   -0.108   0.6394
##  T3 - T6    0.1883 0.116 86   -0.185   0.5619
##  T3 - T7   -0.0217 0.116 86   -0.395   0.3519
##  T3 - T8    0.1042 0.116 86   -0.269   0.4777
##  T4 - T5    0.1592 0.116 86   -0.214   0.5327
##  T4 - T6    0.0817 0.116 86   -0.292   0.4552
##  T4 - T7   -0.1283 0.116 86   -0.502   0.2452
##  T4 - T8   -0.0025 0.116 86   -0.376   0.3711
##  T5 - T6   -0.0775 0.116 86   -0.451   0.2961
##  T5 - T7   -0.2875 0.116 86   -0.661   0.0861
##  T5 - T8   -0.1617 0.116 86   -0.535   0.2119
##  T6 - T7   -0.2100 0.116 86   -0.584   0.1636
##  T6 - T8   -0.0842 0.116 86   -0.458   0.2894
##  T7 - T8    0.1258 0.116 86   -0.248   0.4994
## 
## Results are averaged over the levels of: stage 
## Confidence level used: 0.95 
## Conf-level adjustment: bonferroni method for 28 estimates

Now none are significantly different.
Note: Bonferroni adjustment is quite conservative.

15/18

Example 1 Mystery data Part 1/2

Many inferences, e.g. using confidence intervals or $p$ -values, are based on assumptions being met.
From the model fit below can we suggest the following model?

$\hat{Y} = -0.002 + 0.979x_1 + 0.998x_2 + 0.973x_3 + 0.995x_4$

lm(y ~ x1 + x2 + x3 + x4, data=mystery_data) %>% broom::tidy()

## # A tibble: 5 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept) -0.00204    0.0109    -0.187   0.852
## 2 x1           0.979      0.0151    64.8     0    
## 3 x2           0.998      0.0155    64.4     0    
## 4 x3           0.973      0.0154    63.1     0    
## 5 x4           0.995      0.0109    91.6     0

16/18

Example 1 Mystery data Part 2/2

lm(y ~ x1 + x2 + x3 + x4, data=mystery_data) %>% 
  broom::augment() %>% 
  ggplot(aes(.fitted, .resid)) + 
  geom_point() + 
  labs(x="Fitted Values", y="Residual")

Moral of the story:

Don't forget to check model diagnostics.

17/18

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Lecturer: Emi Tanaka

ETC5521.Clayton-x@monash.edu

Week 6 - Session 2

18/18

↑, ←, Pg Up, k	Go to previous slide
↓, →, Pg Dn, Space, j	Go to next slide
Home	Go to first slide
End	Go to last slide
Number + Return	Go to specific slide
b / m / f	Toggle blackout / mirrored / fullscreen mode
c	Clone slideshow
p	Toggle presenter mode
t	Restart the presentation timer
?, h	Toggle this help

ETC5521: Exploratory Data Analysis

Making comparisons between groups and strata

Modelling and testing for comparisons

Revisiting Case study 1 Pest resistance maize

Revisiting Case study 1 Pest resistance maize

Two-sample parametric tests: tt-test

Confidence interval for two sample difference

Two sample non-parametric tests

Wilcoxon rank-sum test

Equivalence of tests to testing model parameters

Equivalence of tests to testing model parameters

Revisiting Case study 4 Weight gain of calves Part 1/3

Revisiting Case study 4 Weight gain of calves Part 2/3

Revisiting Case study 4 Weight gain of calves Part 3/3

Case study 10 Soil nitrogen Part 1/3

Case study 10 Soil nitrogen Part 1/3

Case study 10 Soil nitrogen Part 1/3

Case study 10 Soil nitrogen Part 2/3

Case study 10 Soil nitrogen Part 2/3

Case study 10 Soil nitrogen Part 3/3

Controlling the family-wise error rate

Bonferonni adjusted confidence interval

Example 1 Mystery data Part 1/2

Example 1 Mystery data Part 2/2

Moral of the story:

Don't forget to check model diagnostics.

ETC5521: Exploratory Data Analysis

Making comparisons between groups and strata

Help

Two-sample parametric tests: $t$ -test