Binary variables as predictors

STAT1003 – Statistical Techniques

Dr. Emi Tanaka

Australian National University

Case study: weight by gender

Different parameterisations of the same model

\[\texttt{weight}_i = \begin{cases} \mu_F + \varepsilon_i & \text{if female} \\ \mu_M + \varepsilon_i & \text{if male}\end{cases}\]

\[\texttt{weight}_i = \begin{cases} \gamma_0 + \varepsilon_i & \text{if female}\\ \gamma_0 + \gamma_1 + \varepsilon_i & \text{if male} \end{cases}\]

\[\texttt{weight}_i = \beta_1x_{1i} + \beta_2 x_{2i} + \varepsilon_i\] where

  • \(x_{1i} = 1\) if the \(i\)-th observation is female and \(0\) otherwise, and
  • \(x_{2i} = 1\) if the \(i\)-th observation is male and \(0\) otherwise.

Equivalence: \(\mu_F = \gamma_0 = \beta_1\) and \(\mu_M = \gamma_0 + \gamma_1 = \beta_2\).
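To see where the equivalence comes from, substitute the dummy variable values into the third form. A female observation has \((x_{1i}, x_{2i}) = (1, 0)\) and a male observation has \((x_{1i}, x_{2i}) = (0, 1)\), so

\[\begin{aligned} \text{female: } \texttt{weight}_i &= \beta_1 (1) + \beta_2 (0) + \varepsilon_i = \beta_1 + \varepsilon_i \\ \text{male: } \texttt{weight}_i &= \beta_1 (0) + \beta_2 (1) + \varepsilon_i = \beta_2 + \varepsilon_i \end{aligned}\]

Matching these line by line with the first two forms gives \(\beta_1 = \mu_F = \gamma_0\) and \(\beta_2 = \mu_M = \gamma_0 + \gamma_1\), and therefore \(\gamma_1 = \mu_M - \mu_F = \beta_2 - \beta_1\).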

Fitting the models

  • Note: -1 in the formula removes the intercept
  • I() allows us to include an expression as a predictor.
  • If a predictor is a factor variable, R automatically creates dummy variables for its categories; when an intercept is fitted, the first level is used as the reference category.
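
The points above can be tried out with a minimal sketch. It assumes a small made-up data frame `dat` with columns `weight` and `gender`; the values are hypothetical and are not the case-study data. It fits the three parameterisations with `lm()` and is only one way the fits could be written.

```r
# Hypothetical toy data (NOT the case-study data).
# Female mean is 62 and male mean is 75 by construction.
dat <- data.frame(
  weight = c(60, 62, 64, 70, 75, 80),
  gender = factor(c("female", "female", "female", "male", "male", "male"))
)

# Group-means form: -1 removes the intercept, so each level of the
# factor gets its own coefficient (mu_F and mu_M).
fit_means <- lm(weight ~ -1 + gender, data = dat)

# Reference-category form: intercept (gamma_0) plus a dummy for the
# second level (gamma_1); "female" is the reference level.
fit_ref <- lm(weight ~ gender, data = dat)

# Explicit dummy variables built with I(); as.numeric() turns the
# TRUE/FALSE comparison into 1/0 (beta_1 and beta_2).
fit_dummy <- lm(
  weight ~ -1 + I(as.numeric(gender == "female")) +
    I(as.numeric(gender == "male")),
  data = dat
)

coef(fit_means)   # mu_F, mu_M
coef(fit_ref)     # gamma_0, gamma_1
coef(fit_dummy)   # beta_1, beta_2
```

All three fits give the same fitted values; only the meaning of the coefficients changes.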

Fitting categorical predictors in R

  • The above model is equivalent to: \[\texttt{weight}_i = \gamma_0 + \gamma_1 x_i + \varepsilon_i\] where \(x_i = 1\) if the \(i\)-th observation is male and \(0\) otherwise.

  • Recall \(\gamma_1 = \mu_M - \mu_F\) is the difference in mean weight between males and females, and \(\gamma_0 = \mu_F\) is the mean weight for females.

  • So the estimated average weight for males is \(\hat{\gamma}_0 + \hat{\gamma}_1\) and the estimated average weight for females is \(\hat{\gamma}_0\).
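
As a rough illustration of how R handles a factor predictor, the sketch below uses a made-up data frame `dat` (hypothetical values, not the case-study data). It shows the dummy variable R constructs, recovers the two group means from \(\hat{\gamma}_0\) and \(\hat{\gamma}_1\), and changes the reference level with `relevel()`. The column name `gender_m` is introduced here purely for illustration.

```r
# Hypothetical toy data (NOT the case-study data).
dat <- data.frame(
  weight = c(60, 62, 64, 70, 75, 80),
  gender = factor(c("female", "female", "female", "male", "male", "male"))
)

# Levels are sorted alphabetically by default, so "female" comes first
# and acts as the reference category.
levels(dat$gender)

# The design matrix R builds: an intercept column and a dummy `gendermale`.
model.matrix(~ gender, data = dat)

fit <- lm(weight ~ gender, data = dat)
g <- unname(coef(fit))        # g[1] = gamma_0 hat, g[2] = gamma_1 hat

mean_female <- g[1]           # estimated mean weight for females
mean_male   <- g[1] + g[2]    # estimated mean weight for males

# These agree with the sample mean of each group.
tapply(dat$weight, dat$gender, mean)

# Changing the reference level changes the coefficients,
# but not the fitted group means.
dat$gender_m <- relevel(dat$gender, ref = "male")
fit_m <- lm(weight ~ gender_m, data = dat)
coef(fit_m)   # intercept is now the male mean; slope is female minus male
```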

Two sample t-test as a special case of regression

Do you notice anything in common between the two approaches?
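
One way to compare the two approaches numerically is sketched below, again on a made-up data frame `dat` (hypothetical values, not the case-study data). It runs `t.test()` with `var.equal = TRUE` and a regression on the binary predictor, then puts the test statistics and p-values side by side.

```r
# Hypothetical toy data (NOT the case-study data).
dat <- data.frame(
  weight = c(60, 62, 64, 70, 75, 80),
  gender = factor(c("female", "female", "female", "male", "male", "male"))
)

# Two-sample t-test assuming equal variances in the two groups.
tt <- t.test(weight ~ gender, data = dat, var.equal = TRUE)

# Regression with the binary predictor; the row `gendermale` tests gamma_1 = 0.
fit <- lm(weight ~ gender, data = dat)
reg <- summary(fit)$coefficients["gendermale", ]

# The t-statistics agree in magnitude. The sign differs because t.test()
# uses female minus male, while gamma_1 is male minus female.
c(t_test = unname(tt$statistic), regression = unname(reg["t value"]))

# The p-values and degrees of freedom are identical.
c(t_test = tt$p.value, regression = unname(reg["Pr(>|t|)"]))
c(t_test = unname(tt$parameter), regression = fit$df.residual)
```

Without `var.equal = TRUE`, `t.test()` runs Welch's test, which does not in general match the regression output.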

Summary

  • Binary (or categorical) predictors can be included in a regression model using dummy variables or by treating them as factors in R.
  • The interpretation of the coefficients for categorical predictors depends on the parameterisation of the model, but every parameterisation describes the same group means and gives the same fitted values.
  • When fitting regression models with categorical predictors, it is important to be mindful of the choice of reference category and how it affects the interpretation of coefficients.
  • The two-sample t-test with the equal-variance assumption is a special case of a regression model with a binary predictor: the test of \(\gamma_1 = 0\) gives the same t-statistic (up to sign) and the same p-value, so both approaches yield the same conclusion about the difference in means between groups.