Case study: Calcium pot trial

  • An experiment was conducted to assess the impact of four different calcium concentrations (levels A = 1, B = 5, C = 10, D = 20) on the root growth of plants.
  • The study followed a completely randomized design, with each treatment assigned to five individual plants growing in separate pots, for a total of 20 pots.
  • At the end of the experiment, the total root length (in cm) was measured for each pot.
  • Is there a significant difference in root growth between the different calcium concentrations?

One-way ANOVA model

  • A one-way ANOVA model is just a linear model where the response variable is modelled as a function of a single categorical explanatory variable.
  • For \(i=1, \ldots, t\) and \(j = 1, \ldots, n_i\), \[y_{ij} = \beta_0 + \beta_{1i} + \epsilon_{ij}, \qquad \epsilon_{ij}\stackrel{iid}{\sim} N(0, \sigma^2),\] where
    • \(t\) is the number of levels of the categorical variable and
    • \(n_i\) is the number of observations in the \(i\)-th level of the factor.
  • In this model, \(\beta_{1i}\) is the effect of the \(i\)-th level of the factor treatment.

Dummy variables for categorical variables

  • When we fit a model with categorical explanatory variables, categorical variables are converted into a numerical representations, e.g. using a set of dummy variables.
  • A dummy variable is a binary representation of one level of a categorical variable where 1 indicates the presence of that level and 0 indicating the absence of that level.

Treatment constraint

  • For \(i=1, \ldots, t\) and \(j = 1, \ldots, n_i\),

\[y_{ij} = \beta_0 + \beta_{1i} + \epsilon_{ij}, \qquad \epsilon_{ij}\stackrel{iid}{\sim} N(0, \sigma^2).\]

  • So far the coefficient estimate for \(\beta_{11}\) or it’s associated dummy variable is not shown.
  • By default, lm() uses the treatment constraint, where the first level of the factor is the reference level (i.e. \(\beta_{11} = 0\)).
  • In this case,
    • \(\beta_0\) represents the mean response for the reference level (level 1), and
    • \(\beta_{1i}\) represents the difference in mean response between the \(i\)-th level and the reference level (level 1).
  • The constraint is necessary to make the model identifiable.

Sum constraint

  • Another common constraint is the sum constraint, where \(\sum_{i=1}^t \beta_{1i} = 0\).
  • This constraint can be implemented by using the contr.sum contrast in R.
  • The last coefficient estimate is not shown, but it can be calculated as \(\beta_{1t} = -\sum_{i=1}^{t-1} \beta_{1i}\).
  • In this case,
    • \(\beta_{0}\) represents the grand mean (i.e. the average response across all levels of the factor), and
    • \(\beta_{1i}\) represents the difference in mean response between the \(i\)-th level and the grand mean.

Contrast

  • A contrast is a linear combination of the regression coefficients where the weights sum to zero.
  • Contrasts are often used to compare specific combinations of the levels of a factor.
  • E.g. if we want to compare the mean response of level 2 with the average of levels 1 and 3, we can define the contrast as \[C = \beta_{12} - \frac{1}{2}(\beta_{11} + \beta_{13})\] where the weights are 0, 1, -0.5, -0.5 and 0 for \(\beta_{0}, \beta_{11}, \beta_{12}, \beta_{13}, \beta_{14}\).
  • The estimates of contrasts are independent of the constraint used.

One-way ANOVA table*

  • The ANOVA table shows the decomposition of the sum of squares into different sources of variation.
  • The F-tests in ANOVA show the significance of the different sources of variation \[\underbrace{\sum_{i=1}^t \sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2}_{\text{Total SS}} = \underbrace{\sum_{i=1}^t n_i (\bar{y}_i - \bar{y})^2}_{\text{Between SS}} + \underbrace{\sum_{i=1}^t \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2}_{\text{Within SS / Residual SS}}\] where \(\bar{y}_i\) is the mean of the \(i\)-th group and \(\bar{y}\) is the overall mean.
Source of variation Degrees of freedom Sum of squares Mean square F-value P-value
Between groups \(t-1\) \(\text{Between SS}\) \(MS_{between} = \frac{\text{Between SS}}{t-1}\) \(f = \frac{MS_{between}}{MS_{within}}\) \(P(F_{t- 1, n - t} > f)\)
Within groups \(n-t\) \(\text{Within SS}\) \(MS_{within} = \frac{\text{Within SS}}{n-t}\)
Total \(n-1\) \(\text{Total SS}\)

\(\chi^2\) distribution

If \(X = Z_1^2 + \cdots + Z_k^2\) where \(Z_1, \ldots, Z_k\) are independent \(N(0, 1)\) variables, then \(X\) is said to have a chi-squared distribution with \(k\) degrees of freedom, denoted by \(\chi^2_k\).

F distribution

If \(F = \dfrac{X_1 / k_1}{X_2 / k_2}\) where \(X_1 \sim \chi^2_{k_1}\) and \(X_2 \sim \chi^2_{k_2}\), then \(F\) is said to have an F-distribution with \(k_1\) and \(k_2\) degrees of freedom, denoted by \(F_{k_1, k_2}\).

ANOVA table in R*

Categorical or numerical variable?

  • We’ve been using Calcium group as a factor, but sometimes it is unclear as to whether a particular explanatory variable should be regarded as a factor (categorical explanatory variable) or a numerical covariate.
  • If a predictor is a factor then you cannot use the model to predict for factor levels that were not observed in the data.
  • If a predictor is numerical, then you can interpolate (and extrapolate) for values not within the data.
  • In fact, you’ll find that if a factor with \(t\) levels corresponds to numerical values, then a polynomial regression model of degree \(t-1\) will be equivalent to the ANOVA model.

ANOVA model

  • The above model is equivalent to predicting the red points below:

Linear regression model

  • The above model is equivalent to predicting the blue line below:

Polynomial regression model

  • The above model is equivalent to predicting the purple curve below:

ANOVA vs. linear regression vs. polynomial regression

  • Notice that the purple lines go through all the red points – this is because the polynomial regression model is equivalent to the ANOVA model.

Example: Fertilizer brands on wheat yield

  • 9 fertilizer brands (labelled A, B, C, D, E, F, G, H, I) and a control (labelled 0) were tested on 20 wheat plots each.
  • The study wanted to identify which fertilizer brand is the least to most effective, and whether they were signficantly different from each other.

Estimated marginal means

Pairwise comparisons (with no adjustments)

Multiple testing

Source: xkcd

  • When conducting multiple tests, the probability of making a Type I error increases.
  • The more tests you conduct, the more likely you are to find a significant result by chance even if there is no true effect.
  • The Bonferroni correction is a simple way to adjust for multiple testing by dividing the significance level by the number of tests.

Data generating process for wheat yield

  • Notice here that there is actually no real differences between the fertilizers!
  • Any significant difference found is due to random chance.

Summary

  • When fitting linear models, categorical variables (factors) are converted to numerical values under the hood, e.g. by converting to dummy variables.
  • The choice of constraint (e.g. treatment constraint, sum constraint) does not affect the estimates of contrasts, but it does affect the interpretation of the regression coefficients.
  • ANOVA table shows the decomposition of the sum of squares into different sources of variation.
  • F-tests in ANOVA show the significance of the different sources of variation.
  • A \(\chi^2\) distribution arises from the sum of squares of independent standard normal variables.
  • An F-distribution arises from the ratio of two scaled chi-squared variables.
  • ANOVA model is a special case of linear regression models.
  • If a factor can be represented numerically, ANOVA model is a special case of a polynomial regression model.
  • Multiple testing increases the probability of making a Type I error, so the significance level should be adjusted accordingly (e.g., Bonferroni correction).