Multiple Linear Regression

STAT1003 – Statistical Techniques

Dr. Emi Tanaka

Australian National University


Multiple Linear Regression

Case study: advertising experiment

An experiment was conducted to investigate the effect of advertising media on sales. The predictors are the advertising budgets (in thousands of dollars) for youtube, facebook and newspaper, and the response variable is sales (in thousands of units).

Multiple linear regression

Suppose we have a multivariable data set \(\{x_{i1}, x_{i2}, \ldots, x_{ik}, y_i\}_{i=1}^n\) where \(x_{ij}\) is the value of the \(j\)-th explanatory variable and \(y_i\) is the value of the response variable for the \(i\)-th observation.

The general multiple linear regression model is: \[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \varepsilon_i\] where:

\(\beta_0, \beta_1, \ldots, \beta_k\) are regression coefficients, and
\(\varepsilon_i\) is the random error term.

We let \(p = k + 1\) be the total number of coefficients (including the intercept) in the model.
We assume the \(\varepsilon_i\) are i.i.d. \(N(0, \sigma^2)\) random variables, independent of the \(x_{ij}\)’s.
The model assumptions are checked in the same way as in simple linear regression.

Multiple linear regression in matrix form

The multiple linear regression model can be expressed in matrix form as:

\[\boldsymbol{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\]

where:

\(\boldsymbol{y} = (y_1, y_2, \ldots, y_n)^\top\) is an \(n \times 1\) vector of responses,
\(\mathbf{X} = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1k} \\ 1 & x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nk} \end{pmatrix}\) is an \(n \times p\) design matrix,
\(\boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_k)^\top\) is a \(p \times 1\) vector of coefficients, and
\(\boldsymbol{\varepsilon} = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n)^\top\) is an \(n \times 1\) vector of errors and \(\boldsymbol{\varepsilon} \sim N(\mathbf{0}, \sigma^2 \mathbf{I}_n)\).

Least squares estimation

The least squares estimator of the coefficients \(\boldsymbol{\beta}\) is given by:

\[\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \boldsymbol{y}.\]
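As a minimal sketch of the estimator above, the following R code builds a design matrix from simulated data and compares the normal-equations solution with the coefficients returned by lm(). The variable names, sample size and true coefficients are made up for illustration and are not from the advertising case study.

```r
set.seed(1)
n  <- 100
x1 <- runif(n, 0, 10)
x2 <- runif(n, 0, 5)
y  <- 2 + 0.5 * x1 - 1.5 * x2 + rnorm(n)   # simulated response (assumed model)

# Design matrix with a leading column of ones for the intercept
X <- cbind(1, x1, x2)

# Solve the normal equations (X'X) beta = X'y instead of inverting X'X explicitly
beta_hat <- solve(crossprod(X), crossprod(X, y))

fit <- lm(y ~ x1 + x2)
cbind(manual = drop(beta_hat), lm = coef(fit))
```

The two columns should agree up to numerical error. Internally lm() uses a QR decomposition rather than the normal equations, which is generally more stable numerically.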

Fitting multiple linear regression in R

Interpretation of regression coefficients

Each coefficient \(\beta_j\) represents the expected change in \(y\) for a one-unit increase in \(x_j\), holding all other predictors constant.
This interpretation relies on the assumption that the model is correctly specified and that all relevant predictors are included in the model.

Inference for regression parameters

\[\text{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}.\]

\(H_0: \beta_j = 0\) vs \(H_A: \beta_j \neq 0\)

If the model assumptions are satisfied, the test statistic is:

\[t^* = \frac{\hat{\beta}_j - 0}{\text{SE}(\hat{\beta}_j)} = \frac{\hat{\beta}_j}{\text{SE}(\hat{\beta}_j)}\]

\(T \sim t_{n-p}\) under \(H_0\)
P-value \(= 2P(T > |t^*|)\)
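The calculation above can be sketched by hand in R and checked against summary(). This assumes the usual estimate of \(\sigma^2\), namely the Residual SS divided by \(n - p\), which is not stated explicitly above. The data are simulated and the variable names are illustrative only.

```r
set.seed(2)
n  <- 80
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 + rnorm(n)   # x2 has no true effect in this simulation

fit <- lm(y ~ x1 + x2)
X   <- model.matrix(fit)      # n x p design matrix used by lm()
p   <- ncol(X)

# Estimate sigma^2 by Residual SS / (n - p), then plug into sigma^2 (X'X)^{-1}
sigma2_hat <- sum(resid(fit)^2) / (n - p)
se         <- sqrt(diag(sigma2_hat * solve(crossprod(X))))

# Test statistic under H0: beta_j = 0 and the two-sided p-value
t_star  <- coef(fit) / se
p_value <- 2 * pt(abs(t_star), df = n - p, lower.tail = FALSE)

cbind(t_star, p_value)
coef(summary(fit))[, c("t value", "Pr(>|t|)")]   # should agree with the manual values
```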

Visual diagnostics

There are clearly a number of violations of the model assumptions!
We may consider transforming the response variable to improve the model fit.
As the newspaper budget doesn’t seem to have a significant effect on sales, we may consider removing it from the model.

Revised final model

Symbolic model formula

The model formula in R provides a convenient way to specify the structure of the regression model, including main effects and interaction effects.
Intercept included by default:
y ~ 1 + x1 + x2 is equivalent to y ~ x1 + x2.
Removing intercept:
y ~ 0 + x1 + x2 and y ~ -1 + x1 + x2 both remove the intercept term in the model.
Main and interaction effects:
y ~ x1 * x2 is equivalent to y ~ x1 + x2 + x1:x2.
y ~ x1 * x2 * x3 is equivalent to
y ~ x1 + x2 + x3 + x1:x2 + x1:x3 + x2:x3 + x1:x2:x3.
Main effects and two-way interaction effects only:
y ~ (x1 + x2 + x3)^2 and y ~ x1 * x2 * x3 - x1:x2:x3 are both equivalent to
y ~ x1 + x2 + x3 + x1:x2 + x1:x3 + x2:x3.
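One way to check these equivalences is to ask R how it expands each formula. The sketch below uses terms(), which works on a formula without any data; the helper name term_labels is made up for this example.

```r
# Helper (hypothetical name) returning the expanded term labels of a formula
term_labels <- function(f) attr(terms(f), "term.labels")

term_labels(y ~ x1 * x2)
term_labels(y ~ x1 * x2 * x3)
term_labels(y ~ (x1 + x2 + x3)^2)
term_labels(y ~ x1 * x2 * x3 - x1:x2:x3)

# The "intercept" attribute is 1 when an intercept is included and 0 otherwise
attr(terms(y ~ x1 + x2), "intercept")
attr(terms(y ~ 1 + x1 + x2), "intercept")
attr(terms(y ~ 0 + x1 + x2), "intercept")
attr(terms(y ~ -1 + x1 + x2), "intercept")
```

Formulas that produce the same term labels and the same intercept attribute specify the same model.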

Fitting interaction effects

Assuming that x1 and x2 are numerical variables, the following model

lm(log(y) ~ 1 + log(x1) + x2 + log(x1):x2)
is equivalent to fitting the model:

\[\log(y_i) = \beta_0 + \beta_1 \log(x_{i1}) + \beta_2 x_{i2} + \beta_3 \log(x_{i1}) \times x_{i2} + \varepsilon_i,\]

assuming \(\varepsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)\).
\(\beta_1\) and \(\beta_2\) are referred to as the main effects.
\(\beta_3\) is the interaction effect.
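To see which coefficients this formula produces, here is a small sketch on simulated data. The coefficient values and variable ranges are invented for illustration only.

```r
set.seed(3)
n  <- 200
x1 <- runif(n, 1, 100)
x2 <- runif(n, 0, 50)

# Simulate from the model with made-up coefficients
log_y <- 1 + 0.3 * log(x1) + 0.01 * x2 + 0.002 * log(x1) * x2 + rnorm(n, sd = 0.1)
y     <- exp(log_y)

fit_long  <- lm(log(y) ~ 1 + log(x1) + x2 + log(x1):x2)
fit_short <- lm(log(y) ~ log(x1) * x2)   # shorthand for the same model

coef(fit_long)                            # intercept, two main effects, one interaction
all.equal(coef(fit_long), coef(fit_short))
```

The four estimated coefficients correspond, in order, to \(\beta_0\), \(\beta_1\), \(\beta_2\) and \(\beta_3\).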

Equivalent models

Observations 131 and 156 were removed as they were deemed to be outliers.
The term log(youtube) * facebook allows the effect of (log of) youtube advertisement on sales to depend on the level of facebook advertising, and vice versa.
The three models below are equivalent:

Comparing models

Dissecting the final model

\[\hat{\log(\texttt{sales})} \approx 1.14 + 0.27\times \log(\texttt{youtube}) - 0.00025\times \texttt{facebook} + 0.0024 \times \log(\texttt{youtube})\times \texttt{facebook}\]

Taking the exponential of both sides:

\[\hat{\texttt{sales} }\approx\texttt{youtube}^{0.27 + 0.0024 \times \texttt{facebook}}\exp\left(1.14 - 0.00025\times \texttt{facebook}\right)\]

Prediction from the final model

The predicted log sales with a youtube advertising budget of \(50\) and a facebook advertising budget of \(30\) is:

\[\hat{\log(\texttt{sales}) }\approx 1.14 + 0.27\times \log(50) - 0.00025\times 30 + 0.0024 \times \log(50)\times 30 \approx 2.47\]

so the predicted sales is about \(\exp(2.47) \approx 11.8\) (in thousands of units).
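The arithmetic can be reproduced in R using the rounded coefficients from the fitted equation. This is only a sketch: the object names are made up, and with the full fitted model one would normally call predict() instead, which avoids rounding error.

```r
# Rounded coefficients taken from the fitted equation above
b0    <- 1.14
b_yt  <- 0.27
b_fb  <- -0.00025
b_int <- 0.0024

youtube  <- 50
facebook <- 30

# Prediction on the log scale, then back-transformed to the sales scale
log_sales_hat <- b0 + b_yt * log(youtube) + b_fb * facebook +
  b_int * log(youtube) * facebook
sales_hat <- exp(log_sales_hat)

# Same prediction via the back-transformed form of the equation
sales_hat_alt <- youtube^(b_yt + b_int * facebook) * exp(b0 + b_fb * facebook)

c(log_sales_hat = log_sales_hat, sales_hat = sales_hat, sales_hat_alt = sales_hat_alt)
```

Note that log() in R is the natural logarithm, matching the exponential used to back-transform.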

Coefficient of determination \(R^2\)

\[\begin{align*} \text{Total SS} &= \text{Regression SS} + \text{Residual SS}\\ \sum_{i=1}^n (y_i - \bar{y})^2 &= \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^n (y_i - \hat{y}_i)^2 \end{align*}\]

The coefficient of determination \(R^2\) measures the proportion of variability in the response explained by the model:

\[R^2 = 1 - \frac{\text{Residual SS}}{\text{Total SS}}\]

\(R^2\) is a common measure of goodness of fit for linear regression models.
For a least squares fit that includes an intercept, \(R^2\) ranges from 0 to 1, with higher values indicating a better fit of the model to the data.
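As a sketch, the sum of squares decomposition and \(R^2\) can be computed directly from a fitted model and compared with the value reported by summary(). The data below are simulated for illustration.

```r
set.seed(4)
n  <- 60
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 3 + x1 - 2 * x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

total_ss      <- sum((y - mean(y))^2)
regression_ss <- sum((fitted(fit) - mean(y))^2)
residual_ss   <- sum((y - fitted(fit))^2)

# The decomposition holds because the model includes an intercept
all.equal(total_ss, regression_ss + residual_ss)

r2 <- 1 - residual_ss / total_ss
c(manual = r2, lm = summary(fit)$r.squared)
```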

Alternative expressions for \(R^2\)

If the linear regression includes an intercept term, then \(R^2\) can be calculated as the square of the correlation between the observed and fitted values of the response variable: \[R^2 = r_{y, \hat{y}}^2 = \frac{\left(\sum_{i=1}^n (y_i - \bar{y})(\hat{y}_i - \bar{y})\right)^2}{\sum_{i=1}^n (y_i - \bar{y})^2\sum_{i=1}^n (\hat{y}_i - \bar{y})^2} = \dfrac{\sum_{i=1}^n (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^n (y_i - \bar{y})^2}\]

In a simple linear regression, \(R^2\) is also equal to \(r^2_{xy}\) (the square of the sample correlation between \(y\) and \(x\)).
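These identities are easy to check numerically. The sketch below fits a simple linear regression to simulated data (names and coefficients are illustrative) and computes \(R^2\) in three ways.

```r
set.seed(5)
n <- 50
x <- runif(n, 0, 10)
y <- 1 + 0.8 * x + rnorm(n)
fit <- lm(y ~ x)

r2_summary <- summary(fit)$r.squared
r2_fitted  <- cor(y, fitted(fit))^2   # squared correlation of observed and fitted
r2_xy      <- cor(x, y)^2             # squared correlation of x and y

c(summary = r2_summary, cor_y_yhat = r2_fitted, cor_x_y = r2_xy)
```

The last equality is specific to simple linear regression; with several predictors only the first two quantities coincide.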

Adjusted \(R^2\)

Adding predictors will never decrease \(R^2\), even if the predictors are not useful.

Adjusted \(R^2\) penalizes the inclusion of unnecessary predictors:

\[ R^2_{\text{adj}} = 1 - \frac{(1 - R^2)(n - 1)}{n - p} \]

This measure may decrease if a new predictor does not improve the model sufficiently.
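A small sketch of this behaviour: a predictor that is unrelated to the response by construction is added to a model, and both measures are compared. The helper adj_r2 is a made-up name that implements the formula above; the data are simulated.

```r
set.seed(6)
n     <- 40
x1    <- rnorm(n)
noise <- rnorm(n)                 # predictor unrelated to y by construction
y     <- 2 + 1.5 * x1 + rnorm(n)

fit_small <- lm(y ~ x1)
fit_big   <- lm(y ~ x1 + noise)

# Adjusted R^2 computed from the formula, with p counting the intercept
adj_r2 <- function(fit) {
  r2 <- summary(fit)$r.squared
  n  <- nobs(fit)
  p  <- length(coef(fit))
  1 - (1 - r2) * (n - 1) / (n - p)
}

r2  <- c(small = summary(fit_small)$r.squared, big = summary(fit_big)$r.squared)
adj <- c(small = adj_r2(fit_small), big = adj_r2(fit_big))
rbind(r2, adj)
```

\(R^2\) for the larger model is never below that of the smaller one, whereas the adjusted value can move in either direction depending on the sample.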

Multicollinearity

If two predictors are perfectly correlated, the design matrix \(\mathbf{X}\) will not be of full rank, and the least squares estimator \(\hat{\boldsymbol{\beta}}\) will not be uniquely defined.
In practice, even high but not perfect correlation can lead to issues with multicollinearity.

Multicollinearity occurs when predictors are highly correlated with each other. This can lead to:

Unstable coefficient estimates
Large standard errors of the coefficient estimates
Difficulty in interpreting the individual effects of predictors

If multicollinearity is present, it may be necessary to remove or combine correlated predictors, or to use other approaches (which are out of scope for this course).
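Both situations can be illustrated with simulated data. In this sketch, x2 is an exact multiple of x1 (perfect correlation), x3 is x1 plus a little noise (high correlation), and z is unrelated to x1; all names and values are invented for the example.

```r
set.seed(7)
n  <- 100
x1 <- rnorm(n)
x2 <- 2 * x1                       # perfectly correlated with x1
x3 <- x1 + rnorm(n, sd = 0.05)     # highly, but not perfectly, correlated with x1
z  <- rnorm(n)                     # unrelated to x1
y  <- 1 + x1 + rnorm(n)

# Perfect correlation: the design matrix is not of full rank,
# and lm() reports NA for the coefficient it cannot estimate
X <- cbind(1, x1, x2)
rank_X      <- qr(X)$rank          # 2, less than the 3 columns
fit_perfect <- lm(y ~ x1 + x2)
coef(fit_perfect)

# High correlation: compare the standard error of x1 with and without x3
se_x1 <- function(fit) coef(summary(fit))["x1", "Std. Error"]
se_compare <- c(with_z = se_x1(lm(y ~ x1 + z)), with_x3 = se_x1(lm(y ~ x1 + x3)))
se_compare
```

The standard error of the x1 coefficient is much larger when the nearly redundant x3 is in the model than when the unrelated z is.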

Pairwise correlation and scatterplots

Summary

Multiple linear regression allows us to model the relationship between a response variable and multiple predictors.
A symbolic model formula in R provides a convenient way to specify the structure of the regression model, including main effects and interaction effects.
Interpretation of coefficients must always consider that other variables are held constant.
Many of the same tools for model diagnostics and inference in simple linear regression can be extended to multiple linear regression.
The validity of conclusions depends on checking model assumptions and being cautious about multicollinearity.

Binary Variables as Predictors

Case study: weight by gender

Different parameterisations of the same model

\[\texttt{weight}_i = \begin{cases} \mu_F + \varepsilon_i & \text{if female} \\ \mu_M + \varepsilon_i & \text{if male}\end{cases}\]

\[\texttt{weight}_i = \begin{cases} \gamma_0 + \varepsilon_i & \text{if female}\\ \gamma_0 + \gamma_1 + \varepsilon_i & \text{if male} \end{cases}\]

\[\texttt{weight}_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i\] where

\(x_{i1} = 1\) if the \(i\)-th observation is female and \(0\) otherwise, and
\(x_{i2} = 1\) if the \(i\)-th observation is male and \(0\) otherwise.

Equivalence: \(\mu_F = \gamma_0 = \beta_1\) and \(\mu_M = \gamma_0 + \gamma_1 = \beta_2\).
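The equivalence can be verified numerically. The sketch below simulates weights for two groups (the group means, spread and sample sizes are made up) and fits the three parameterisations.

```r
set.seed(8)
gender <- factor(rep(c("female", "male"), each = 25))
weight <- ifelse(gender == "male", 78, 65) + rnorm(50, sd = 8)

# Parameterisation 1: group means (estimates of mu_F and mu_M)
group_means <- tapply(weight, gender, mean)

# Dummy variables for the other two parameterisations
x1 <- as.numeric(gender == "female")
x2 <- as.numeric(gender == "male")

fit_gamma <- lm(weight ~ x2)             # gamma_0 + gamma_1 * (male indicator)
fit_beta  <- lm(weight ~ -1 + x1 + x2)   # beta_1 * female + beta_2 * male, no intercept

group_means
coef(fit_gamma)
coef(fit_beta)
```

The estimates satisfy the same relations as the parameters: the intercept of fit_gamma equals the female mean, and the sum of its two coefficients equals the male mean.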

Fitting the models

Note: -1 in the formula removes the intercept.
I() allows us to include an expression as a predictor.
If a predictor is a factor variable, R will automatically create dummy variables for the categories and fit the model accordingly, with the first category as the reference if an intercept is fitted.

Fitting categorical predictors in R

The above model is equivalent to: \[\texttt{weight}_i = \gamma_0 + \gamma_1 x_i + \varepsilon_i\] where \(x_i = 1\) if the \(i\)-th observation is male and \(0\) otherwise.
Recall \(\gamma_1 = \mu_M - \mu_F\) is the difference in mean weight between males and females, and \(\gamma_0 = \mu_F\) is the mean weight for females.
So the average weight for males is \(\hat{\gamma}_0 + \hat{\gamma}_1\) and the average weight for females is \(\hat{\gamma}_0\).
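A minimal sketch of fitting a factor predictor, using simulated data in place of the case study data (the variable names gender and weight, and the group means, are assumptions for illustration):

```r
set.seed(9)
gender <- factor(rep(c("female", "male"), times = c(20, 30)))
weight <- ifelse(gender == "male", 80, 66) + rnorm(50, sd = 7)

fit_factor <- lm(weight ~ gender)                # "female" is the reference level
fit_dummy  <- lm(weight ~ I(gender == "male"))   # explicit indicator via I()
fit_means  <- lm(weight ~ -1 + gender)           # no intercept: one mean per level

coef(fit_factor)   # estimates of gamma_0 and gamma_1
coef(fit_dummy)    # same values, different coefficient names
coef(fit_means)    # estimated mean weight for each gender

# First rows of the design matrix R builds for the factor
head(model.matrix(fit_factor), 3)
```

Levels of a factor are ordered alphabetically by default, which is why "female" becomes the reference category here; relevel() can be used to change it.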

Two sample t-test as a special case of regression

Do you notice anything in common between the two approaches?
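The comparison can be sketched on simulated data as follows (group means and sample sizes are invented). It runs a two-sample t-test with the equal variance assumption and a regression on the same binary predictor.

```r
set.seed(10)
gender <- factor(rep(c("female", "male"), each = 30))
weight <- ifelse(gender == "male", 79, 67) + rnorm(60, sd = 9)

tt  <- t.test(weight ~ gender, var.equal = TRUE)   # pooled-variance t-test
fit <- lm(weight ~ gender)
reg <- coef(summary(fit))["gendermale", ]

# t.test() compares female minus male while the regression slope is male minus
# female, so the statistics differ in sign but not in magnitude
c(t_test = unname(tt$statistic), regression = unname(reg["t value"]))
c(t_test = tt$p.value, regression = unname(reg["Pr(>|t|)"]))
```

Both approaches use \(n - 2\) degrees of freedom and give the same p-value.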

Summary

Binary (or categorical) predictors can be included in a regression model using dummy variables or by treating them as factors in R.
The interpretation of coefficients for categorical predictors depends on the parameterisation of the model, but they all represent the same underlying relationships in the data.
When fitting regression models with categorical predictors, it is important to be mindful of the choice of reference category and how it affects the interpretation of coefficients.
The two-sample t-test with equal variance assumption is a special case of a regression model with a binary predictor, and both approaches will yield the same conclusions about the difference in means between groups.