The F-statistic is given as: F^* = \frac{(RSS_{\mathcal{M}_1}-RSS_{\mathcal{M}_2})/(p_{2}-p_{1})}{RSS_{\mathcal{M}_2}/(n-p_{2}-1)} \sim \mathcal{F}_{p_{2}-p_{1},\,n-p_{2}-1}
where p_k is the number of coefficients (excluding the intercept) in model \mathcal{M}_k.
The p-value for the F-test is given as P(F_{p_2 - p_1, n - p_2 - 1} > F^*).
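In R, this partial F-test for two nested models can be carried out with anova(). A minimal sketch (the predictor choices here are illustrative, not from the slides):
# sketch: partial F-test comparing nested models fitted with lm()
m_small <- lm(log10(charges) ~ age, data = insurance)        # reduced model M1
m_large <- lm(log10(charges) ~ age + bmi, data = insurance)  # larger model M2
anova(m_small, m_large)  # reports F* and the p-value P(F > F*)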
Selecting predictors
What predictors do we use?
For prediction, ideally those that improve prediction of new records.
Why not use all predictors?
If you add predictors that are irrelevant, you add noise to the parameter estimation.
This may reduce the predictive accuracy for new records.
We discuss two approaches:
exhaustive search and
subset selection algorithms.
Exhaustive search
In an exhaustive search, we fit models with all possible combinations of the predictors.
For p predictors these include:
all p models with one predictor,
all {p \choose 2} models with two predictors, and so on, …
a model with p predictors.
There are a total of p + {p \choose 2} + \dots + 1 = \sum_{k=1}^p {p \choose k} = 2^p - 1 models to consider.
We can then select the model with the best metrics.
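As a quick sanity check of this count in R (for, say, p = 4):
# 4 + 6 + 4 + 1 = 15 candidate models when p = 4
p <- 4
sum(choose(p, 1:p))  # 15
2^p - 1              # 15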
Exhaustive search with leaps package
The leaps package can do an exhaustive search for regression models.
library(leaps)
res <- regsubsets(log10(charges) ~ ., data = insurance,
                  method = "exhaustive",
                  nvmax = ncol(insurance), # maximum number of variables - use NULL for no limit
                  nbest = 1) # report the best model for each number of variables
resout <- summary(res)
The best model for each number of variables is found efficiently using the leaps-and-bounds approach (out of scope for this unit).
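The summary object exposes the goodness-of-fit measures for the best model of each size (components of summary.regsubsets), so the overall winner can be picked directly:
resout$adjr2                              # adjusted R^2, one value per model size
which.max(resout$adjr2)                   # size with the largest adjusted R^2
resout$which[which.max(resout$adjr2), ]   # predictors included in that model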
Caveats in automated searches
Selection algorithms treat variables without regard to their structure, which can result in undesirable selections.
Categorical variables are often converted to dummy variables, but no group selection occurs. E.g. the variable region has the levels northeast, northwest, southeast, and southwest, but the automated search may select only southeast, instead of the whole region variable.
Similarly, main effects may be omitted even when their interaction effect is included.
For polynomial terms, the lower-order terms may be excluded even when the higher-order terms are included.
Would the R^2 (or RSS) be a good accuracy measure?
No! As mentioned last week, the R^2 always increases with p, so it would simply choose the model with all p predictors.
Instead we can use the adjusted R^2.
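Recall its definition: \bar{R}^2 = 1 - \frac{RSS/(n - p - 1)}{TSS/(n - 1)}, where TSS = \sum_{i=1}^n (y_i - \bar{y})^2. The RSS is penalised through the residual degrees of freedom, so adding an uninformative predictor can decrease \bar{R}^2.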
Other goodness of fit criteria
Adjusted R^2 is a function of RSS = \sum_{i=1}^n\hat{e}_i^2 but why not use \sum_{i=1}^n|\hat{e}_i| or some other loss function instead?
We can construct a criterion with the structure \sum_{i=1}^n\rho(\hat{e}_i) + \lambda(p_\mathcal{M}, n), where \rho(z) is a non-negative loss function (e.g. z^2) and \lambda is a penalty term for the complexity of the model.
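For example, Mallows's \hat{C}_p below has exactly this structure, with \rho(z) = z^2 and \lambda(p_\mathcal{M}, n) = 2p_\mathcal{M}\hat{\sigma}^2, up to the 1/n scaling.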
Akaike information criterion
The Akaike information criterion (AIC) is defined as \text{AIC}(\mathcal{M}) = -2~\text{log-likelihood} + 2\,p_\mathcal{M}.
Using the maximum likelihood estimators for \beta_j and \sigma, we have RSS = \sum_{i=1}^n \left(y_i - \sum_{j=1}^p x_{ij}\hat{\beta}_j\right)^2 and \hat{\sigma}^2 = RSS / n.
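Substituting these estimators into the Normal log-likelihood gives \text{AIC}(\mathcal{M}) = n\log(RSS/n) + 2p_\mathcal{M} + \text{constant}, where the constant n(1 + \log 2\pi) does not depend on the model, so minimising AIC trades the fit term n\log(RSS/n) off against the penalty 2p_\mathcal{M}.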
A reasonable estimate of the test MSE is given by Mallows's C_p: \hat{C}_p = \frac{1}{n} (RSS+2p_\mathcal{M}\hat{\sigma}^2), where \hat{\sigma}^2 is an estimate of the variance of the error e in the full model containing all p predictors.
An alternative (normalised) form is given as
\hat{C}_p = \frac{RSS}{\hat{\sigma}^2} - n + 2(p_\mathcal{M} + 1).
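A minimal sketch of both forms in R, assuming full is the fitted model with all p predictors and m is a candidate sub-model (illustrative names; both fitted with lm as on the later slides):
sigma2 <- summary(full)$sigma^2      # \hat{\sigma}^2 from the full model
rss_m  <- sum(residuals(m)^2)        # RSS of the candidate model
p_m    <- length(coef(m)) - 1        # predictors, excluding the intercept
n      <- nrow(insurance)
(rss_m + 2 * p_m * sigma2) / n       # first form
rss_m / sigma2 - n + 2 * (p_m + 1)   # normalised form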
Mallows’s C_p and AIC
The two forms of Mallows’s C_p do not give equivalent values but the same model will be selected based on the smallest \hat{C}_p.
If \sigma is known, then C_p = AIC + \text{constant} so AIC is often considered equivalent to C_p for linear regression models with Normally distributed errors.
Bayesian information criterion
AIC and C_p have a tendency to include too many variables.
Bayesian information criterion (BIC) is defined as \text{BIC}(\mathcal{M}) = -2~\text{log-likelihood} + \log(n)\cdot p_\mathcal{M}.
BIC penalises the inclusion of more variables more heavily than AIC, since \log(n) > 2 for n \geq 8.
BIC is derived to find the “true” model amongst the candidate models.
Computing AIC and BIC in R
The computation of AIC (or Mallows’s C_p) and BIC can differ by a constant across different packages.
Therefore be cautious when comparing these metrics across different packages; they are often not comparable.
In R you can find the AIC and BIC using the extractAIC function:
extractAIC(M1, k = 2) # default is AIC
[1] 6.000 -4368.738
extractAIC(M1, k = log(nrow(insurance))) # BIC
[1] 6.000 -4337.544
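For lm objects, extractAIC returns the equivalent degrees of freedom together with n\log(RSS/n) + k \cdot \text{edf}. As a sketch, the output above can be reproduced by hand (M1 is the model fitted earlier in the deck):
n   <- nrow(insurance)
rss <- sum(residuals(M1)^2)
edf <- n - df.residual(M1)           # equivalent degrees of freedom
c(edf, n * log(rss / n) + 2 * edf)   # matches extractAIC(M1, k = 2)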
Subset selection algorithms
Too many models to consider
How many (additive) models to consider when p = 20?
2^{20} - 1 = 1,048,575 \text{ models}
It quickly becomes infeasible to consider all models.
Automated subset selection algorithms
Automated subset selection algorithms “walk along” a “path”:
Choose a model to start with (null or full model).
Test to see if there is an advantage to add/drop variables.
Repeat adding/dropping variables until there is no change in the model.
This search strategy requires us to consider only on the order of p^2 models, but there is no guarantee that it finds the optimal solution.
These algorithms include forward selection, backward elimination and stepwise regression.
Forward selection
Start with the model with no variables (the null model).
For each candidate variable, investigate the advantage of adding it to the current model.
Add the most informative or significant variable, unless even this variable does not supply significant information about the response.
Repeat steps 2-3 until none of the excluded variables are important.
Backward elimination
Start with the model with all variables (the full model).
For each included variable, investigate the advantage of removing it from the current model.
Remove the least informative variable, unless this variable supplies significant information about the response.
Repeat steps 2-3 until all remaining variables are important.
Stepwise regression
Start with either the null or full model.
For each included variable, investigate the advantage of removing it from the current model.
Remove the least informative variable, unless this variable supplies significant information about the response.
For each excluded variable, investigate the advantage of adding it to the current model.
Add the most informative or significant variable, unless even this variable does not supply significant information about the response.
Repeat steps 2-5 until there is no change in the model.
Variable selection with R
null <- lm(log10(charges) ~ 1, data = insurance)
full <- lm(log10(charges) ~ ., data = insurance)
scope <- list(lower = formula(null),
              upper = formula(full))
# forward selection with AIC
step(null, scope = scope, direction = "forward", k = 2)
Start: AIC=-2455.38
log10(charges) ~ 1
Df Sum of Sq RSS AIC
+ smoker 1 94.435 118.79 -3236.1
+ age 1 59.405 153.81 -2890.3
+ children 1 5.550 207.67 -2488.7
+ bmi 1 3.753 209.47 -2477.1
<none> 213.22 -2455.4
+ region 3 0.670 212.55 -2453.6
+ sex 1 0.007 213.21 -2453.4
Step: AIC=-3236.12
log10(charges) ~ smoker
Df Sum of Sq RSS AIC
+ age 1 63.252 55.534 -4251.4
+ children 1 5.205 113.581 -3294.1
+ bmi 1 3.613 115.173 -3275.4
+ sex 1 0.436 118.350 -3239.0
<none> 118.786 -3236.1
+ region 3 0.467 118.318 -3235.4
Step: AIC=-4251.43
log10(charges) ~ smoker + age
Df Sum of Sq RSS AIC
+ children 1 3.7780 51.756 -4343.7
+ bmi 1 1.0753 54.459 -4275.6
+ region 3 0.4363 55.098 -4256.0
+ sex 1 0.2590 55.275 -4255.7
<none> 55.534 -4251.4
Step: AIC=-4343.7
log10(charges) ~ smoker + age + children
Df Sum of Sq RSS AIC
+ bmi 1 1.04289 50.713 -4368.9
+ region 3 0.47102 51.285 -4349.9
+ sex 1 0.29478 51.461 -4349.3
<none> 51.756 -4343.7
Step: AIC=-4368.94
log10(charges) ~ smoker + age + children + bmi
Df Sum of Sq RSS AIC
+ region 3 0.87917 49.834 -4386.3
+ sex 1 0.35181 50.361 -4376.2
<none> 50.713 -4368.9
Step: AIC=-4386.33
log10(charges) ~ smoker + age + children + bmi + region
Df Sum of Sq RSS AIC
+ sex 1 0.35563 49.478 -4393.9
<none> 49.834 -4386.3
Step: AIC=-4393.92
log10(charges) ~ smoker + age + children + bmi + region + sex
Call:
lm(formula = log10(charges) ~ smoker + age + children + bmi +
region + sex, data = insurance)
Coefficients:
(Intercept) smokeryes age children
3.053333 0.675034 0.015019 0.044236
bmi regionnorthwest regionsoutheast regionsouthwest
0.005809 -0.027703 -0.068270 -0.056003
sexmale
-0.032753
# backward selection with BIC
step(full, scope = scope, direction = "backward", k = log(nrow(insurance)))
Start: AIC=-4347.13
log10(charges) ~ age + sex + bmi + children + smoker + region
Df Sum of Sq RSS AIC
<none> 49.478 -4347.1
- region 3 0.883 50.361 -4345.1
- sex 1 0.356 49.834 -4344.7
- bmi 1 1.516 50.994 -4313.9
- children 1 3.787 53.265 -4255.7
- age 1 58.546 108.024 -3309.6
- smoker 1 98.101 147.580 -2892.1
Call:
lm(formula = log10(charges) ~ age + sex + bmi + children + smoker +
region, data = insurance)
Coefficients:
(Intercept) age sexmale bmi
3.053333 0.015019 -0.032753 0.005809
children smokeryes regionnorthwest regionsoutheast
0.044236 0.675034 -0.027703 -0.068270
regionsouthwest
-0.056003
# stepwise regression with AIC starting with full model
step(full, scope = scope, direction = "both", k = 2)
Start: AIC=-4393.92
log10(charges) ~ age + sex + bmi + children + smoker + region
Df Sum of Sq RSS AIC
<none> 49.478 -4393.9
- sex 1 0.356 49.834 -4386.3
- region 3 0.883 50.361 -4376.2
- bmi 1 1.516 50.994 -4355.5
- children 1 3.787 53.265 -4297.2
- age 1 58.546 108.024 -3351.2
- smoker 1 98.101 147.580 -2933.7
Call:
lm(formula = log10(charges) ~ age + sex + bmi + children + smoker +
region, data = insurance)
Coefficients:
(Intercept) age sexmale bmi
3.053333 0.015019 -0.032753 0.005809
children smokeryes regionnorthwest regionsoutheast
0.044236 0.675034 -0.027703 -0.068270
regionsouthwest
-0.056003
Drawbacks of these methods
The main issue is that these subset selection procedures may identify models that are only locally optimal, so there is no guarantee that the selected model is the best one overall.
These methods might disagree with each other and select different models.
Backward elimination is not feasible when p > n, since the full model cannot be fitted.
Takeaways
Multiple linear regression is an intuitive and useful tool in prediction.
When selecting a model, there are many aspects such as non-linearity and interactions between predictors that must be considered.
Model selection algorithms can help, but they also have some limitations.