M1 <- lm(log10(charges) ~ bmi + age + bmi:age + children + smoker, data = insurance)
broom::tidy(M1)
where bmi30 is a feature-engineered binary variable indicating bmi > 30
M1 is a nested model of M2 (the special case where β6 = β7 = 0)
Comparing nested regression models
insurance30 <- insurance %>%
  mutate(bmi30 = as.numeric(bmi > 30))
M1 <- lm(log10(charges) ~ bmi + age + bmi:age + children + smoker, data = insurance)
M2 <- lm(log10(charges) ~ bmi + age + bmi:age + children + smoker + bmi30 + smoker:bmi30, data = insurance30)
anova(M1, M2)
Analysis of Variance Table
Model 1: log10(charges) ~ bmi + age + bmi:age + children + smoker
Model 2: log10(charges) ~ bmi + age + bmi:age + children + smoker + bmi30 +
smoker:bmi30
Res.Df RSS Df Sum of Sq F Pr(>F)
1 1332 50.645
2 1330 46.332 2 4.3126 61.898 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Comparing nested models using the F-test
H0: β6 = β7 = 0 vs H1: β6 ≠ 0 and/or β7 ≠ 0
The F-statistic is given as: F* = [(RSS_M1 − RSS_M2) / (p2 − p1)] / [RSS_M2 / (n − p2 − 1)] ~ F(p2 − p1, n − p2 − 1),
where p1 and p2 are the number of coefficients (minus one) in M1 and M2, respectively.
The p-value for the F-test is given as P(F(p2 − p1, n − p2 − 1) > F*).
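The F-statistic above can be computed by hand and checked against anova(). A minimal sketch, using the built-in mtcars data as a self-contained stand-in for the insurance models (the arithmetic is identical):

```r
# Nested models: M_small is the special case of M_big with two coefficients set to 0
M_small <- lm(mpg ~ wt, data = mtcars)
M_big   <- lm(mpg ~ wt + hp + qsec, data = mtcars)

rss1 <- sum(resid(M_small)^2)                          # RSS of the smaller model
rss2 <- sum(resid(M_big)^2)                            # RSS of the larger model
df_num <- length(coef(M_big)) - length(coef(M_small))  # p2 - p1
df_den <- nrow(mtcars) - length(coef(M_big))           # n - p2 - 1
Fstar  <- ((rss1 - rss2) / df_num) / (rss2 / df_den)
pval   <- pf(Fstar, df_num, df_den, lower.tail = FALSE)

# anova() reproduces the same statistic and p-value
anova(M_small, M_big)
```

The same computation applies directly to M1 and M2 from the slides.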
Selecting predictors
What predictors do we use?
For prediction, ideally those that improve prediction of new records.
Why not use all predictors?
If you add predictors that are irrelevant, you add noise to the parameter estimation.
This may affect the predictive accuracy of new records.
We discuss two approaches:
exhaustive search and
subset selection algorithms.
Exhaustive search
In an exhaustive search, we fit models with all possible combinations of the predictors.
For p predictors these include:
all p models with one predictor,
all (p choose 2) models with two predictors, and so on, …
a model with p predictors.
There are a total of p + (p choose 2) + ⋯ + 1 = Σ_{k=1}^{p} (p choose k) = 2^p − 1 models to consider.
We can then select the model with the best metrics.
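The count above is easy to verify numerically, e.g. for p = 5 predictors:

```r
# Total number of candidate additive models with p predictors:
# sum of (p choose k) over k = 1..p, which equals 2^p - 1
p <- 5
sum(choose(p, 1:p))  # 31 candidate models
2^p - 1              # same count
```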
Exhaustive search with leaps package
The leaps package can do an exhaustive search for regression models.
library(leaps)
res <- regsubsets(log10(charges) ~ ., data = insurance,
                  method = "exhaustive",
                  nvmax = ncol(insurance), # maximum number of variables - use NULL for no limit
                  nbest = 1)               # report the best model for each number of variables
resout <- summary(res)
The best model for each number of variables is found efficiently using the leaps-and-bounds approach (out of scope for this unit).
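The summary object returned by regsubsets() holds the metrics for each model size. A sketch of how to inspect it, shown on the built-in mtcars data as a stand-in for the insurance fit:

```r
library(leaps)
res <- regsubsets(mpg ~ ., data = mtcars, method = "exhaustive", nbest = 1)
out <- summary(res)

out$which   # logical matrix: which variables appear in the best model of each size
out$adjr2   # adjusted R^2 for each model size
out$cp      # Mallows's Cp for each model size
out$bic     # BIC for each model size

which.max(out$adjr2)  # the model size with the highest adjusted R^2
```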
Caveats in automated searches
Selection algorithms treat variables without regard to their structure, which can result in undesirable selections.
Categorical variables are often converted to dummy variables, but no group selection occurs. E.g. the variable region has the levels northeast, northwest, southeast, and southwest, but the automated search may select southeast only, instead of the whole region variable.
Similarly, main effects may be omitted when the interaction effect is included.
For polynomial effects, the lower order effects may be excluded even if the higher order effects are included.
R^2 and adjusted R^2
Can we simply pick the model with the highest R2? No! As mentioned last week, the R2 always increases with p, so it will just choose the model with all p predictors.
Instead we can use the adjusted R2.
Other goodness of fit criteria
Adjusted R2 is a function of RSS = Σ_{i=1}^{n} ê_i^2, but why not use Σ_{i=1}^{n} |ê_i| or some other loss function instead?
We can construct a criterion with the structure Σ_{i=1}^{n} ρ(ê_i) + λ(p_M, n), where ρ(z) is a non-negative loss function (e.g. z^2) and λ is a penalty term for the complexity of the model.
Akaike information criterion
The Akaike information criterion (AIC) is defined as AIC(M) = −2 log-likelihood + 2 p_M.
Using the maximum likelihood estimators for β_j and σ, we have RSS = Σ_{i=1}^{n} (y_i − Σ_{j=1}^{p} x_ij β̂_j)^2 and σ̂^2 = RSS/n, which gives
AIĈ(M) = n log(RSS(M)/n) + 2 p_M + constant.
Mallows’s Cp
A reasonable estimate of the test MSE is measured by Mallows's Cp as: Ĉ_p = (1/n)(RSS + 2 p_M σ̂^2), where σ̂^2 is an estimate of the variance of the error e in the full model containing all p predictors.
An alternative (normalised) form is given as
Ĉ_p = RSS/σ̂^2 − n + 2(p_M + 1).
Mallows’s Cp and AIC
The two forms of Mallows’s Cp do not give equivalent values but the same model will be selected based on the smallest C^p.
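A quick numeric check of this claim, using mtcars as a self-contained stand-in (with σ̂^2 taken from the full model, as in the definition above):

```r
# Compare the two forms of Mallows's Cp on two candidate models
n <- nrow(mtcars)
sigma2 <- summary(lm(mpg ~ ., data = mtcars))$sigma^2  # full-model error variance estimate

cp_forms <- function(M) {
  rss <- sum(resid(M)^2)
  pM  <- length(coef(M)) - 1  # number of predictors in M
  c(form1 = (rss + 2 * pM * sigma2) / n,
    form2 = rss / sigma2 - n + 2 * (pM + 1))
}

cp_forms(lm(mpg ~ wt, data = mtcars))
cp_forms(lm(mpg ~ wt + hp, data = mtcars))
# Since form2 = (n / sigma2) * form1 + (2 - n), an increasing linear
# transform, the two forms always order candidate models identically.
```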
If σ is known, AIC can also be written as
AIC = (1/σ^2) RSS + 2 p_M + constant.
If σ is known, then Cp=AIC+constant so AIC is often considered equivalent to Cp for linear regression models with Normally distributed errors.
Bayesian information criterion
AIC and Cp have a tendency to include too many variables.
The Bayesian information criterion (BIC) is defined as BIC(M) = −2 log-likelihood + log(n) · p_M.
BIC penalises the inclusion of more variables.
BIC is derived to find the “true” model amongst the candidate models.
Computing AIC and BIC in R
The computation of AIC (or Mallows’s Cp) and BIC can differ by a constant across different packages.
Therefore take caution comparing these metrics across different packages! They are often not comparable.
In R you can find the AIC and BIC using the extractAIC function:
extractAIC(M1, k = 2) # default is AIC
[1] 6.000 -4368.738
extractAIC(M1, k = log(nrow(insurance))) # BIC
[1] 6.000 -4337.544
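The "differs by a constant" caveat can be seen directly by comparing extractAIC() with the log-likelihood-based AIC() on the same models (mtcars used as a self-contained stand-in):

```r
# The two base-R AIC helpers differ by an additive constant for lm fits,
# so values are comparable within one function but not across functions.
M_a <- lm(mpg ~ wt + hp, data = mtcars)
M_b <- lm(mpg ~ wt, data = mtcars)

extractAIC(M_a)[2]  # n*log(RSS/n) + 2*edf, with the constant dropped
AIC(M_a)            # -2*logLik + 2*(edf + 1)

# The gap between the two is the same for every model fitted to the same
# data, so rankings agree -- only the raw values differ.
AIC(M_a) - extractAIC(M_a)[2]
AIC(M_b) - extractAIC(M_b)[2]
```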
Subset selection algorithms
Too many models to consider
How many (additive) models to consider when p=20?
2^20 = 1,048,576 models
It quickly becomes infeasible to consider all models.
Automated subset selection algorithms
Automated subset selection algorithms “walk along” a “path”:
Choose a model to start with (null or full model).
Test to see if there is an advantage to add/drop variables.
Repeat adding/dropping variables until there is no change in the model.
This search strategy requires us to consider only a quadratic (in p) number of models, but there is no guarantee it results in the optimal solution.
These algorithms include forward selection, backward elimination and stepwise regression.
Forward selection
Start with the model with no variables (the null model).
For each variable, investigate the advantage of adding into the current model.
Add the most informative or significant variable, unless this variable is not supplying significant information about the response.
Repeat steps 2-3 until none of the non-included variables are important.
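One forward step can be carried out by hand with add1(), which is what step() calls internally. A sketch on mtcars as a stand-in for the insurance data:

```r
# One forward-selection step: score every candidate addition to the null model
m_null <- lm(mpg ~ 1, data = mtcars)
m_full <- lm(mpg ~ ., data = mtcars)

add1(m_null, scope = formula(m_full), test = "F")
# The row with the smallest AIC (and a significant F-statistic) is the
# variable that forward selection would add first.
```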
Backward elimination
Start with the model with all variables (the full model).
For each variable, investigate the advantage of removing from the current model.
Remove the least informative variable, unless this variable is supplying significant information about the response.
Repeat 2-3 until all variables are important.
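Similarly, one backward step corresponds to a drop1() call (again shown on mtcars as a stand-in):

```r
# One backward-elimination step: score every candidate deletion from the full model
m_full <- lm(mpg ~ ., data = mtcars)

drop1(m_full, test = "F")
# The row with the smallest AIC is the variable whose removal hurts the
# fit least; backward elimination would drop it first.
```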
Stepwise regression
Start with either the null or full model.
For each variable, investigate the advantage of removing from the current model.
Remove the least informative variable, unless this variable is supplying significant information about the response.
For each variable, investigate the advantage of adding into the current model.
Add the most informative or significant variable, unless this variable is not supplying significant information about the response.
Repeat steps 2-5 until there is no change in the model.
Variable selection with R
null <- lm(log10(charges) ~ 1, data = insurance)
full <- lm(log10(charges) ~ ., data = insurance)
scope <- list(lower = formula(null), upper = formula(full))
# forward selection with AIC
step(null, scope = scope, direction = "forward", k = 2)
Start: AIC=-2455.38
log10(charges) ~ 1
Df Sum of Sq RSS AIC
+ smoker 1 94.435 118.79 -3236.1
+ age 1 59.405 153.81 -2890.3
+ children 1 5.550 207.67 -2488.7
+ bmi 1 3.753 209.47 -2477.1
<none> 213.22 -2455.4
+ region 3 0.670 212.55 -2453.6
+ sex 1 0.007 213.21 -2453.4
Step: AIC=-3236.12
log10(charges) ~ smoker
Df Sum of Sq RSS AIC
+ age 1 63.252 55.534 -4251.4
+ children 1 5.205 113.581 -3294.1
+ bmi 1 3.613 115.173 -3275.4
+ sex 1 0.436 118.350 -3239.0
<none> 118.786 -3236.1
+ region 3 0.467 118.318 -3235.4
Step: AIC=-4251.43
log10(charges) ~ smoker + age
Df Sum of Sq RSS AIC
+ children 1 3.7780 51.756 -4343.7
+ bmi 1 1.0753 54.459 -4275.6
+ region 3 0.4363 55.098 -4256.0
+ sex 1 0.2590 55.275 -4255.7
<none> 55.534 -4251.4
Step: AIC=-4343.7
log10(charges) ~ smoker + age + children
Df Sum of Sq RSS AIC
+ bmi 1 1.04289 50.713 -4368.9
+ region 3 0.47102 51.285 -4349.9
+ sex 1 0.29478 51.461 -4349.3
<none> 51.756 -4343.7
Step: AIC=-4368.94
log10(charges) ~ smoker + age + children + bmi
Df Sum of Sq RSS AIC
+ region 3 0.87917 49.834 -4386.3
+ sex 1 0.35181 50.361 -4376.2
<none> 50.713 -4368.9
Step: AIC=-4386.33
log10(charges) ~ smoker + age + children + bmi + region
Df Sum of Sq RSS AIC
+ sex 1 0.35563 49.478 -4393.9
<none> 49.834 -4386.3
Step: AIC=-4393.92
log10(charges) ~ smoker + age + children + bmi + region + sex
Call:
lm(formula = log10(charges) ~ smoker + age + children + bmi +
region + sex, data = insurance)
Coefficients:
(Intercept) smokeryes age children
3.053333 0.675034 0.015019 0.044236
bmi regionnorthwest regionsoutheast regionsouthwest
0.005809 -0.027703 -0.068270 -0.056003
sexmale
-0.032753
# backward selection with BIC
step(full, scope = scope, direction = "backward", k = log(nrow(insurance)))
Start: AIC=-4347.13
log10(charges) ~ age + sex + bmi + children + smoker + region
Df Sum of Sq RSS AIC
<none> 49.478 -4347.1
- region 3 0.883 50.361 -4345.1
- sex 1 0.356 49.834 -4344.7
- bmi 1 1.516 50.994 -4313.9
- children 1 3.787 53.265 -4255.7
- age 1 58.546 108.024 -3309.6
- smoker 1 98.101 147.580 -2892.1
Call:
lm(formula = log10(charges) ~ age + sex + bmi + children + smoker +
region, data = insurance)
Coefficients:
(Intercept) age sexmale bmi
3.053333 0.015019 -0.032753 0.005809
children smokeryes regionnorthwest regionsoutheast
0.044236 0.675034 -0.027703 -0.068270
regionsouthwest
-0.056003
# stepwise regression with AIC starting with full model
step(full, scope = scope, direction = "both", k = 2)
Start: AIC=-4393.92
log10(charges) ~ age + sex + bmi + children + smoker + region
Df Sum of Sq RSS AIC
<none> 49.478 -4393.9
- sex 1 0.356 49.834 -4386.3
- region 3 0.883 50.361 -4376.2
- bmi 1 1.516 50.994 -4355.5
- children 1 3.787 53.265 -4297.2
- age 1 58.546 108.024 -3351.2
- smoker 1 98.101 147.580 -2933.7
Call:
lm(formula = log10(charges) ~ age + sex + bmi + children + smoker +
region, data = insurance)
Coefficients:
(Intercept) age sexmale bmi
3.053333 0.015019 -0.032753 0.005809
children smokeryes regionnorthwest regionsoutheast
0.044236 0.675034 -0.027703 -0.068270
regionsouthwest
-0.056003
Drawbacks of these methods
The main issue is that these subset selection procedures may identify models that are only locally optimal, so there is no guarantee they provide the optimal solution.
These methods might disagree with each other and select different models.
Backward elimination is not feasible when p > n.
Takeaways
Multiple linear regression is an intuitive and useful tool in prediction.
When selecting a model, there are many aspects such as non-linearity and interactions between predictors that must be considered.
Model selection algorithms can help, but they also have some limitations.
ETC3250/5250 Variable selection