Simple Linear Regression

STAT1003 – Statistical Techniques

Dr. Emi Tanaka

Australian National University

These slides are best viewed on a modern browser like Google Chrome on a desktop or laptop. Some interactive components may require some time to fully load.

Simple Linear Regression

Case: Seed weight from seed length

viewof b0 = Inputs.range([-50, 50], {step: 0.1, value: -27, label: "Intercept", style: {
    "--label-width": "300px!important",
    "width": "300px"
  }})
viewof b1 = Inputs.range([-50, 50], {step: 0.1, value: 17, label: "Slope", style: {
    "--label-width": "150px",
    "width": "300px"
  }})

Seed length is expected to be a major contributor to differences in seed weight for wheat.
190 seeds selected at random from a line of diploid wheat, Triticum monococcum for length and weight.

We seek to model the relationship between:

the mean of a response variable, \(y\), and
a single explanatory variable (or predictor/covariate) \(x\) as:

\[y = f(x) = \beta_0 + \beta_1 x.\]

Simple linear regression

Suppose we have a bivariate data set \(\{(x_i, y_i)\}_{i=1}^n\) where \(x_i\) is the value of the explanatory variable and \(y_i\) is the value of the response variable for the \(i\)-th observation.
For observations \(i = 1, 2, \ldots, n\):

\(y_i =\) \(\beta_0\) \(+\) \(\beta_1\)\(x_i +\) \(\epsilon_i\)

\(\boldsymbol{\beta} = \begin{bmatrix}\beta_0 \\ \beta_1\end{bmatrix}\) are referred to as the regression parameters/coefficients.
We often assume \(\epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)\), i.e. independent and identically distributed as Normal distribution of mean 0 and variance \(\sigma^2\).

intercept slope error for the \(i\)-th observation

Residual sum of squares

Find \(\hat \beta_0\) and \(\hat \beta_1\) that minimize the sum of squares:

\[\text{RSS}(\beta_0, \beta_1) = \sum_{i=1}^n \left(\underbrace{y_i - (\beta_0 + \beta_1 x_i)}_{\text{residual}}\right)^2\]

The least squares estimates can be found by using calculus.
Visually, we can try changing the regression parameters below:

viewof intercept_1 = Inputs.number({step: 0.1, value: -27.9, label: "β̂₀"})
viewof slope_1 = Inputs.number({step: 0.1, value: 17.2, label: "β̂1"})

Least squares estimates

\[\begin{align*} \frac{\partial \text{RSS}}{\partial \beta_0} &= -2\sum_{i=1}^n \left(y_i - (\beta_0 + \beta_1 x_i)\right) = 0\\ \frac{\partial \text{RSS}}{\partial \beta_1} &= -2\sum_{i=1}^n x_i \left(y_i - (\beta_0 + \beta_1 x_i)\right) = 0 \end{align*}\]

Solving the above equations gives the least squares estimates:

\[\begin{align*} \hat{\beta}_1 &= \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{s_{xy}}{s_x^2}\\ \hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x} \end{align*}\]

where:

\(s_{xy} = \frac{1}{n - 1}\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})\) is the sample covariance between \(x\) and \(y\), and
\(s_x^2 = \frac{1}{n - 1}\sum_{i=1}^n (x_i - \bar{x})^2\) is the sample variance of \(x\).
\(\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i\) and \(\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i\) are the sample means of \(x\) and \(y\), respectively.

Fitting linear models with R

\[\texttt{Weight}_i=\beta_0 + \beta_1\texttt{Length}_i + e_i\]

Recall: \(\hat{\beta}_1 = \frac{s_{xy}}{s_x^2}\) and \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\).

The least square estimates are: \(\hat{\beta}_0 = -27.93\) and \(\hat{\beta}_1 = 17.17\).

Model objects in R

When you fit a model, you often get a model object.
What does fit even contain?

There are often methods to extract the information you need from the model object, such as coef(), fitted(), predict(), residuals(), and sigma().

Extracting model parameter estimates

This gives us \(\class{highlight mark-yellow}{\hat{\beta}_0} = \bar{y} - \hat{\beta}_1 \bar{x}\) and \(\class{highlight mark-yellow}{\hat{\beta}_1} = \frac{s_{xy}}{s_x^2}\).

But what about \(\sigma^2\)? Recall \(e_i \sim NID(0, \sigma^2)\).

An unbiased estimate of \(\sigma^2 = \text{RSS} / (n - p)\) where \(p = 2\) for simple linear regression, can be obtained by:

Fitted values

The fitted values are the red points given as:

\[\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i.\]

Predicted values

The response can be predicted from the fitted model for a given \(x\) as:

\[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x.\]

Residuals

The residuals are the vertical distances from the fitted line to the observed points, given as:

\[\hat{\epsilon}_i = y_i - \hat{y}_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i).\]

Plotting linear models

geom_smooth() makes it easy to add the model to a scatter plot
ggpubr::stat_regline_equation() adds the regression line to the plot

Correlation coefficient and slope

Recall correlation coefficient \[r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}} = \frac{s_{xy}}{s_x s_y}\] is a measure of the strength and direction of the linear relationship between two variables.
Relationship between the slope and the correlation coefficient:

\[\hat{\beta}_1 = \frac{s_{xy}}{s_{x}^2} = \frac{s_{xy}}{s_x s_y}\frac{s_y}{s_x} = r \frac{s_y}{s_x}.\]

Since \(s_y \geq 0\) and \(s_x \geq 0\) (but \(s_x \neq 0\) here), the sign of \(\hat{\beta}_1\) is the same as the sign of \(r\).

viewof nsample = Inputs.number([20, 1000], {step: 20, value: 200, label: "n"})
viewof intercept = Inputs.range([-10, 10], {step: 0.05, value: 0.8, label: "β₀"})
viewof slope = Inputs.range([-10,10], {step: 0.05, value: 0.8, label: "β₁"})
viewof sigma = Inputs.range([0.1, 20], {step: 0.1, value: 3, label: "σ"})

Summary

\[y_i = \beta_0 + \beta_1 x_i + \epsilon_i\]

where \(\epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)\) for \(i = 1, 2, \ldots, n\).

\(\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \dfrac{s_{xy}}{s_{x}^2} = r \dfrac{s_y}{s_x}\) and \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\).
Fitted values: \(\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i\).
Residuals: \(\hat{\epsilon}_i = y_i - \hat{y}_i\).
Residual sum of squares: \(\text{RSS} = \sum_{i=1}^n \hat{\epsilon}_i^2.\)
\(\hat{\sigma}^2 = \text{RSS} / (n - p)\) where \(p = 2\).
The sign of \(\hat{\beta}_1\) is the same as the sign of the correlation coefficient \(r\).

Regression Diagnostics

Same regression line, different data

We can fit a regression model for any data but that doesn’t mean the model is appropriate for the data.
All four sets of data have the same regression line (same slope and intercept) but they look very different.
Model diagnostics is important to:
- check the assumptions of the linear regression model are satisfied and
- identify any potential issues with the data that may affect the validity of the model.

Checking Assumptions

Assumptions for linear regression

\[y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad i = 1, \ldots, n\]

Recall for \(i=1,\ldots,n\), \(\epsilon_i \stackrel{iid}{\sim} N(0,\sigma^2)\) which means:
- (A1) \(\text{E}(\epsilon_i) = 0\)
- (A2) \(\epsilon_1, \ldots ,\epsilon_n\) are independent.
- (A3) \(\text{Var}(\epsilon_i) = \sigma^2\) which is called the homoscedasticity assumption.
- (A4) \(\epsilon_1, \ldots ,\epsilon_n\) are normally distributed
- (A5) the predictors are known without error.

(A1) \(E(\epsilon_i) = 0\)

A1 is satisfied by definition for the least squares estimates if intercept is included.

\[\sum_{i=1}^n \hat{\epsilon}_i = \sum_{i=1}^n (y_i - \underbrace{(\hat{\beta}_0 + \hat{\beta}_1 x_i)}_{\hat{y}_i}) = n\bar{y} - n\hat{\beta}_0 - n\hat{\beta}_1 \bar{x}= n\underbrace{(\bar{y}- \hat{\beta}_1 \bar{x)}}_{\hat{\beta}_0} - n\hat{\beta}_0 = 0\]

(A2) Independence

A2 is often satisfied by design – if the data are collected in a way that ensures independence (e.g., random sampling, random assignment).
Sometimes plotting the residuals against the order of data entry can also be useful in identifying potential violations of A2.

(A3) Homoscedasticity

A3 can be checked by plotting the residuals against the fitted values and ensuring that there is no pattern or trends.

Also checked by plotting the residuals against the predictor(s) for no pattern or trends.

Example: Anscombe’s quartet

(A4) Normality

If the normal quantile-quantile plot of the residuals is roughly straight then A4 is satisfied.

(A5) Predictors known without error

There are no standard ways to check A5 easily.
If there is measurement error in the predictor, the estimates of the regression coefficients may be biased towards zero (attenuation bias).
If there is measurement error in the response, the estimates of the regression coefficients may be unbiased but the standard errors will be inflated, leading to less precise estimates and wider confidence intervals.

Influence Measures

Cook’s distance

The Cook’s distance for the \(i\)-th observation is defined as:

\[D_i = \frac{1}{p \hat{\sigma}^2} \sum_{j=1}^n (\hat{y}_j - \hat{y}_{j(i)})^2\]

where

\(\hat{y}_j\) is the fitted value for the \(j\)-th observation using all data,
\(\hat{y}_{j(i)}\) is the fitted value for the \(j\)-th observation when the \(i\)-th observation is removed from the dataset, and
\(p\) is the number of parameters in the model (including the intercept), and
\(\hat{\sigma}^2\) is the estimated error variance from the full model.

\(D_i \sim F_{p, n-p}\) if the model assumptions are satisfied and the \(i\)-th observation is not influential.
A common rule of thumb is that if \(D_i > 1\) or \(D_i > F_{0.5, p, n-p}\), then the \(i\)-th observation is considered an outlier and may warrant further investigation.

Leverage values for simple linear regression

For a simple linear regression model, the leverage values can be calculated as:

\[h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^n (x_j - \bar{x})^2}.\]

The leverage value \(h_i\) measures how far the predictor value(s) for the \(i\)-th observation is from the mean of the predictor values.

The leverage values range from 0 to 1.
A common rule of thumb is that if \(h_i > 3p/n\), then the observation is considered a
high leverage point.

Example: influential points in Anscombe’s quartet

Outliers based on Cook’s distance are highlighted in orange.
High leverage points are highlighted in red
These points may warrant further investigation to determine if they are data entry errors, measurement errors, or valid observations that should be retained in the analysis.
Do not delete these points without a good reason as they may contain important information about the data and the underlying relationships between the variables.

Transforming the Response

Box cox transformation

Box-cox transformation modifies the response for a given value of \(\lambda\) such that:

\[y(\lambda) = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0, \\ \log(y) & \text{if } \lambda = 0. \end{cases}\]

The transformation is equivalent to:

\(\lambda\)	Transformation
\(2\)	\(y^2\)
\(1\)	\(y\)
\(0.5\)	\(\sqrt{y}\)
\(0\)	\(\log(y)\)
\(-0.5\)	\(\frac{1}{\sqrt{y}}\)
\(-1\)	\(\frac{1}{y}\)
\(-2\)	\(\frac{1}{y^2}\)

Selecting \(\lambda\) for box-cox transformation

Profile log-likelihood plot suggests \(\lambda \approx 0.5\) which is equivalent to taking the square root of the response.

Transforming the response

Remember the fitted or predicted value need to be squared to get the original scale.

Visualizing the fitted line on the original scale

Visual Diagnostics

Quick diagnostic plots

An easy way to generate diagnostic plots at once is to use the ggResidpanel package.

But the QQ-plot still doesn’t look good enough??

Visual inference

scroll

When making inference from plots, it’s best to calibrate the plot with simulations.
Let’s assume that \(\sqrt{\texttt{Weight}} = \hat{\beta}_0 + \hat{\beta}_1 \texttt{Length} + \epsilon\), where \(\epsilon \sim N(0, \hat{\sigma}^2)\) is the correct model.
Then simulate from this model 17 times:

Visual inference for QQ-plot of residuals

Summary

Model diagnostics is crucial for checking the assumptions of the linear regression model and identifying potential issues with the data.
Common diagnostic plots include residuals vs fitted values, residuals vs predictors, and normal QQ plots of residuals.
Influence measures such as Cook’s distance and leverage values can help identify influential observations that may warrant further investigation.
Transforming the response variable using techniques like the Box-Cox transformation can help address violations of model assumptions.
Visual inference using simulations can provide a more calibrated way to assess the adequacy of the visual diagnostics.

Inference for Linear Regression

Making inferences

The standard deviation of an estimator is referred to as the standard error.
In making inferences, you may like to know:
- what is the standard error of \(\hat{\beta}_1\)?
- is \(\beta_1 \neq 0\)?
- what is the confidence interval of the average \(y\) for a given \(x\)?
- what is the prediction interval of \(y\) for a given \(x\)?
It’s important to note that making inferences require the assumptions of the linear regression model to be satisfied.
So check model assumptions first!

Hypothesis testing for regression parameters

Hypothesis: Suppose we want to test if the \(j\)-th regression parameter is significant: \[H_0: \beta_j = 0 \quad \text{vs} \quad H_A: \beta_j \neq 0\] where \(j \in \{1, ..., p\}\) and \(p\) is the number of regression parameters. Note for simple linear regression \(p = 2\).
Assumption: suppose the errors are independent and identically normally distributed with mean \(0\) and constant variance \(\sigma^2\).
Test statistic: The test statistic and its distribution under \(H_0\) is \[t = \dfrac{\hat{\beta}_j - \beta_j}{\text{SE}(\hat{\beta}_j)} \sim t_{n-p}.\] where \(\text{SE}(\hat{\beta}_j)\) is the standard error of \(\hat{\beta}_j\).

Hypothesis testing for regression parameters: P-value

P-value: \(P(|t_{n - p}| > |t|)\)

Model objects to tidy data

So how do I extract these summary values out?

Confidence interval for regression parameters

The \(100(1-\alpha)\%\) confidence interval for the \(j\)-th regression parameter is

\[\hat{\beta}_j \pm t_{n-p, \alpha/2} \times \text{SE}(\hat{\beta}_j).\]

For \(\alpha = 0.05\), the 95% confidence interval for the regression parameters are:

Confidence interval for the response

The 95% confidence interval for the mean response at \(x = 5\) is \[\hat{y} \pm t_{n-p, \alpha/2} \times \text{SE}(\hat{y})\] where \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 \times 5\).

Prediction interval for the response

The prediction interval considers the uncertainty in the error as well and is always wider than the corresponding confidence interval.
The 95% prediction interval for the response at \(x = 5\) is \[\hat{y} \pm t_{n-p, \alpha/2} \times \sqrt{\text{SE}(\hat{y})^2 + \hat{\sigma}^2}\] where \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 \times 5\).

Standard error for the response

\[\text{SE}(\hat{y}) = \sqrt{\text{Var}(\hat{\beta}_0 + \hat{\beta}_1 x)} = \sqrt{\text{Var}(\hat{\beta}_0) + x^2 \text{Var}(\hat{\beta}_1) + 2x \text{Cov}(\hat{\beta}_0, \hat{\beta}_1)}.\]

Summary

scroll

Source: xkcd

Fitting a linear model:

Getting the model summary:

Getting just the regression coefficient estimates

Getting the regression coefficient table in the model summary as a tibble:

Getting the fitted values \(\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i\):

Getting the residuals \((y_i - \hat{y}_i)\):

Augment the data with fitted values, residuals, etc:

Getting the deviance (residual sum of squares):

The estimate of the error standard deviation \(\hat{\sigma}\):

Getting the influence measures such as Cook’s distance and leverage values:

Selecting \(\lambda\) for box-cox transformation:

Quick diagnostic plots:

Confidence interval for regression parameters:

Prediction for mean response:

Confidence interval for mean response:

Prediction interval for response:

Standard error for prediction of mean response: