
STAT1003 – Statistical Techniques
Dr. Emi Tanaka
Australian National University
We seek to model the relationship between a response variable \(y\) and a predictor variable \(x\) as a straight line:
\[y = f(x) = \beta_0 + \beta_1 x.\]
\[y_i = \beta_0 + \beta_1 x_i + \epsilon_i\]
where \(\beta_0\) is the intercept, \(\beta_1\) is the slope, and \(\epsilon_i\) is the error for the \(i\)-th observation.
\[\text{RSS}(\beta_0, \beta_1) = \sum_{i=1}^n \left(\underbrace{y_i - (\beta_0 + \beta_1 x_i)}_{\text{residual}}\right)^2\]
\[\begin{align*} \frac{\partial \text{RSS}}{\partial \beta_0} &= -2\sum_{i=1}^n \left(y_i - (\beta_0 + \beta_1 x_i)\right) = 0\\ \frac{\partial \text{RSS}}{\partial \beta_1} &= -2\sum_{i=1}^n x_i \left(y_i - (\beta_0 + \beta_1 x_i)\right) = 0 \end{align*}\]
Solving the above equations gives the least squares estimates:
\[\begin{align*} \hat{\beta}_1 &= \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{s_{xy}}{s_x^2}\\ \hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x} \end{align*}\]
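where \(\bar{x}\) and \(\bar{y}\) are the sample means of \(x\) and \(y\), \(s_{xy}\) is the sample covariance of \(x\) and \(y\), and \(s_x^2\) is the sample variance of \(x\).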
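The closed-form estimates can be checked numerically. The sketch below uses a small made-up data set (the values of x and y are hypothetical, not from the slides) and compares the formulas with the output of lm().

```r
# Hypothetical toy data, not from the slides
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Least squares estimates from the closed-form expressions
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# The same slope via the sample covariance and sample variance
b1_cov <- cov(x, y) / var(x)

# Compare with the estimates from lm()
fit <- lm(y ~ x)
coef(fit)
c(b0, b1)
```

The two versions of the slope agree because the \(n - 1\) divisors in cov() and var() cancel.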
\[\texttt{Weight}_i=\beta_0 + \beta_1\texttt{Length}_i + \epsilon_i\]
What does fit even contain?
Useful extractor functions include coef(), fitted(), predict(), residuals(), and sigma().
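As a rough illustration of these extractor functions, the sketch below fits the model to simulated stand-in data. The data frame dat and its values are assumptions made for this example; only the variable names Length and Weight mirror the model above.

```r
set.seed(1)

# Simulated stand-in data (hypothetical, not the data used in the slides)
dat <- data.frame(Length = runif(30, 10, 50))
dat$Weight <- 5 + 2 * dat$Length + rnorm(30, sd = 4)

fit <- lm(Weight ~ Length, data = dat)

coef(fit)             # estimated intercept and slope
head(fitted(fit))     # fitted values for the observed data
head(residuals(fit))  # residuals: observed minus fitted
sigma(fit)            # residual standard error

# Predictions at new values of Length
predict(fit, newdata = data.frame(Length = c(20, 40)))
```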
\[\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i.\]

\[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x.\]

\[\hat{\epsilon}_i = y_i - \hat{y}_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i).\]
geom_smooth() makes it easy to add the fitted model to a scatter plot.
ggpubr::stat_regline_equation() adds the equation of the regression line to the plot.
Recall that the correlation coefficient \[r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}} = \frac{s_{xy}}{s_x s_y}\] is a measure of the strength and direction of the linear relationship between two variables.
Relationship between the slope and the correlation coefficient:
\[\hat{\beta}_1 = \frac{s_{xy}}{s_{x}^2} = \frac{s_{xy}}{s_x s_y}\frac{s_y}{s_x} = r \frac{s_y}{s_x}.\]
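This identity is easy to confirm on simulated data. The sketch below assumes arbitrary simulation settings chosen only for illustration.

```r
set.seed(2)

# Simulated data (hypothetical settings)
x <- rnorm(50)
y <- 1 + 0.5 * x + rnorm(50)

r <- cor(x, y)
slope_from_r <- r * sd(y) / sd(x)        # r * s_y / s_x
slope_lm <- unname(coef(lm(y ~ x))[2])   # least squares slope

all.equal(slope_from_r, slope_lm)
```

One consequence is that the slope and the correlation coefficient always share the same sign, since \(s_y / s_x\) is positive.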
[Interactive plot: controls for the sample size \(n\), intercept \(\beta_0\), slope \(\beta_1\) and error standard deviation \(\sigma\)]
\[y_i = \beta_0 + \beta_1 x_i + \epsilon_i\]
where \(\epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)\) for \(i = 1, 2, \ldots, n\).
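A minimal sketch of simulating from this model and refitting it is given below. The parameter values and the uniform distribution for x are assumptions made for this example.

```r
set.seed(3)

# Assumed parameter values for illustration
n <- 200
beta0 <- 0.8
beta1 <- 0.8
sigma_true <- 3

# Assumed design: x drawn uniformly on [0, 10]
x <- runif(n, 0, 10)

# Independent N(0, sigma^2) errors and the resulting response
eps <- rnorm(n, mean = 0, sd = sigma_true)
y <- beta0 + beta1 * x + eps

fit <- lm(y ~ x)
coef(fit)   # estimates of beta0 and beta1
sigma(fit)  # estimate of sigma
```

Rerunning with a larger n or a smaller sigma_true should bring the estimates closer to the values used to generate the data.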

\[y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad i = 1, \ldots, n\]
\[\sum_{i=1}^n \hat{\epsilon}_i = \sum_{i=1}^n (y_i - \underbrace{(\hat{\beta}_0 + \hat{\beta}_1 x_i)}_{\hat{y}_i}) = n\bar{y} - n\hat{\beta}_0 - n\hat{\beta}_1 \bar{x}= n\underbrace{(\bar{y}- \hat{\beta}_1 \bar{x})}_{\hat{\beta}_0} - n\hat{\beta}_0 = 0\]
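The zero-sum property of the residuals can be verified numerically, as in the sketch below on simulated data (the settings are hypothetical). It also checks the companion identity \(\sum_{i=1}^n x_i \hat{\epsilon}_i = 0\), which follows from setting \(\partial \text{RSS} / \partial \beta_1 = 0\).

```r
set.seed(4)

# Simulated data (hypothetical settings)
x <- runif(40, 0, 10)
y <- 2 - 1.5 * x + rnorm(40)

e <- residuals(lm(y ~ x))

sum(e)      # zero up to floating-point error
sum(x * e)  # also zero up to floating-point error
```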


The Cook’s distance for the \(i\)-th observation is defined as:
\[D_i = \frac{1}{p \hat{\sigma}^2} \sum_{j=1}^n (\hat{y}_j - \hat{y}_{j(i)})^2\]
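where \(p\) is the number of regression parameters, \(\hat{\sigma}^2\) is the estimated error variance, \(\hat{y}_j\) is the \(j\)-th fitted value from the full model, and \(\hat{y}_{j(i)}\) is the \(j\)-th fitted value when the model is fitted without the \(i\)-th observation.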
For a simple linear regression model, the leverage values can be calculated as:
\[h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^n (x_j - \bar{x})^2}.\]
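Both diagnostics can be computed directly from their definitions and compared with the built-in hatvalues() and cooks.distance(). The sketch below uses simulated data with hypothetical settings, and takes \(\hat{\sigma}^2\) from the full model.

```r
set.seed(5)

# Simulated data (hypothetical settings)
n <- 25
dat <- data.frame(x = runif(n, 0, 10))
dat$y <- 1 + 2 * dat$x + rnorm(n)
fit <- lm(y ~ x, data = dat)

# Leverage from the simple linear regression formula
h <- 1 / n + (dat$x - mean(dat$x))^2 / sum((dat$x - mean(dat$x))^2)
all.equal(unname(hatvalues(fit)), h)

# Cook's distance from the leave-one-out definition
p <- length(coef(fit))  # p = 2 for simple linear regression
s2 <- sigma(fit)^2      # estimated error variance from the full model
yhat <- fitted(fit)
D <- sapply(seq_len(n), function(i) {
  fit_i <- lm(y ~ x, data = dat[-i, ])     # refit without observation i
  yhat_i <- predict(fit_i, newdata = dat)  # fitted values for all n points
  sum((yhat - yhat_i)^2) / (p * s2)
})
all.equal(unname(cooks.distance(fit)), D)
```

Refitting the model n times is only practical for small data sets; cooks.distance() obtains the same values without refitting.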

\[y(\lambda) = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0, \\ \log(y) & \text{if } \lambda = 0. \end{cases}\]
| \(\lambda\) | Transformation |
|---|---|
| \(2\) | \(y^2\) |
| \(1\) | \(y\) |
| \(0.5\) | \(\sqrt{y}\) |
| \(0\) | \(\log(y)\) |
| \(-0.5\) | \(\frac{1}{\sqrt{y}}\) |
| \(-1\) | \(\frac{1}{y}\) |
| \(-2\) | \(\frac{1}{y^2}\) |
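The transformation can be written as a short function. In the sketch below, box_cox() is a hypothetical helper name, not a function from any package. The table lists the simplified power \(y^\lambda\), which differs from the formula only by a linear rescaling.

```r
# Hypothetical helper implementing the Box-Cox formula for positive y
box_cox <- function(y, lambda) {
  stopifnot(all(y > 0))
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}

y <- c(0.5, 1, 2, 4)

box_cox(y, 0)     # log transformation
box_cox(y, 0.5)   # rescaled square root
box_cox(y, -1)    # rescaled reciprocal
box_cox(y, 1e-6)  # close to log(y): the two cases join continuously
```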
Diagnostic plots can be produced with the ggResidpanel package.
Hypothesis: Suppose we want to test whether the \(j\)-th regression parameter differs significantly from zero: \[H_0: \beta_j = 0 \quad \text{vs} \quad H_A: \beta_j \neq 0\] where \(j \in \{0, 1, \ldots, p - 1\}\) and \(p\) is the number of regression parameters. Note that for simple linear regression \(p = 2\).
Assumption: Suppose the errors are independent and identically normally distributed with mean \(0\) and constant variance \(\sigma^2\).
Test statistic: The test statistic and its distribution under \(H_0\) are \[t = \dfrac{\hat{\beta}_j - 0}{\text{SE}(\hat{\beta}_j)} = \dfrac{\hat{\beta}_j}{\text{SE}(\hat{\beta}_j)} \sim t_{n-p},\] where \(\text{SE}(\hat{\beta}_j)\) is the standard error of \(\hat{\beta}_j\).
So how do I extract these summary values?
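One option in base R is coef(summary(fit)), which returns the coefficient table as a matrix. The sketch below uses simulated data (hypothetical settings) and reproduces the t statistic and two-sided p-value for the slope from the extracted estimate and standard error.

```r
set.seed(6)

# Simulated data (hypothetical settings)
x <- runif(30, 0, 10)
y <- 3 + 0.4 * x + rnorm(30)
fit <- lm(y ~ x)

# Matrix with columns Estimate, Std. Error, t value and Pr(>|t|)
tab <- coef(summary(fit))
tab

# Reproduce the t statistic and two-sided p-value for the slope
est <- tab["x", "Estimate"]
se <- tab["x", "Std. Error"]
t_stat <- est / se
p_val <- 2 * pt(abs(t_stat), df = df.residual(fit), lower.tail = FALSE)
```

Here df.residual(fit) gives \(n - p\), the degrees of freedom of the reference t distribution.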
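A \(100(1 - \alpha)\%\) confidence interval for \(\beta_j\) is \[\hat{\beta}_j \pm t_{n-p, \alpha/2} \times \text{SE}(\hat{\beta}_j),\] where \(t_{n-p, \alpha/2}\) is the upper \(\alpha/2\) quantile of the \(t_{n-p}\) distribution.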
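The interval can be built by hand and compared with confint(). The sketch below assumes simulated data and \(\alpha = 0.05\), both chosen only for illustration.

```r
set.seed(7)

# Simulated data (hypothetical settings)
x <- runif(30, 0, 10)
y <- 3 + 0.4 * x + rnorm(30)
fit <- lm(y ~ x)

alpha <- 0.05
est <- coef(fit)
se <- sqrt(diag(vcov(fit)))                         # standard errors
t_crit <- qt(1 - alpha / 2, df = df.residual(fit))  # upper alpha/2 quantile

ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
ci
confint(fit, level = 1 - alpha)
```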
\[\text{SE}(\hat{y}) = \sqrt{\text{Var}(\hat{\beta}_0 + \hat{\beta}_1 x)} = \sqrt{\text{Var}(\hat{\beta}_0) + x^2 \text{Var}(\hat{\beta}_1) + 2x \text{Cov}(\hat{\beta}_0, \hat{\beta}_1)}.\]
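The variance and covariance terms in this expression are available from vcov(). The sketch below evaluates the formula at an arbitrary point x0 (a hypothetical value) on simulated data and compares it with the standard error reported by predict().

```r
set.seed(8)

# Simulated data (hypothetical settings)
x <- runif(30, 0, 10)
y <- 3 + 0.4 * x + rnorm(30)
fit <- lm(y ~ x)

x0 <- 5         # point at which to evaluate the fitted line
V <- vcov(fit)  # 2 x 2 estimated covariance matrix of the coefficients

# Var(b0) + x0^2 Var(b1) + 2 x0 Cov(b0, b1)
se_manual <- sqrt(V[1, 1] + x0^2 * V[2, 2] + 2 * x0 * V[1, 2])

se_pred <- predict(fit, newdata = data.frame(x = x0), se.fit = TRUE)$se.fit
all.equal(se_manual, unname(se_pred))
```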
Source: xkcd
