Introduction to Machine Learning
Lecturer: Emi Tanaka
Department of Econometrics and Business Statistics
\[y_i = \beta_1x_{i1} + \beta_2x_{i2} + e_i,\]
Data simulated with: \(\beta_1 = 2, \beta_2 = 3, e_i \sim N(0, 1)\)
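A minimal sketch of such a simulation in R (the sample size, seed and the distribution of the predictors are assumptions; only the coefficients and the error distribution are stated above):

# assumed simulation set-up; X and y are reused in a later sketch
set.seed(1)
n  <- 100
x1 <- runif(n)
x2 <- runif(n)
y  <- 2 * x1 + 3 * x2 + rnorm(n)   # beta1 = 2, beta2 = 3, e ~ N(0, 1)
X  <- cbind(x1, x2)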
[Figure: 3D surface of \(\text{RSS}(\beta_1, \beta_2)\) over a grid of \((\beta_1, \beta_2)\) values.]
\[\hat{\boldsymbol\beta}_{ridge} = \underset{\boldsymbol\beta\in\mathbb{R}^{p+1}}{\mbox{arg min}}\ \text{RSS}(\boldsymbol{\beta}) \color{#006DAE}{\quad \text{subject to } \sum_{j=1}^p\beta_j^2\leq s}.\]
\[\sum_{j=1}^p\beta_j^2\leq s\]
\(\beta_1^2 + \beta_2^2 \leq 1\)
[Figure: 3D surface of \(\text{RSS}(\beta_1, \beta_2)\) over a grid of \((\beta_1, \beta_2)\) values.]
\[\hat{\boldsymbol\beta} = \underset{\boldsymbol\beta\in\mathbb{R}^{2}}{\mbox{arg min}}\ \text{RSS}(\boldsymbol{\beta})\]
\[\text{subject to } \beta_1^2 + \beta_2^2\leq s\]
[Figure: the \(\text{RSS}(\beta_1, \beta_2)\) surface shown together with the constraint surface.]
Constrained optimization problems are much harder than unconstrained ones.
Finding points on one surface is much easier than finding points that lie on both surfaces simultaneously.
Ridge regression can be re-written as a Lagrange function: \[\hat{\boldsymbol\beta}_{\text{ridge}} = \underset{\boldsymbol\beta\in\mathbb{R}^{p+1}}{\mbox{arg min}}~\left\{\text{RSS}(\boldsymbol{\beta}) + \lambda \sum_{j=1}^p \beta_j^2\right\}.\]
\(\lambda \geq 0\) is called a tuning parameter and serves a similar role to \(s\).
\(\lambda \sum_{j=1}^p \beta_j^2\) is referred to as the shrinkage penalty.
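To make the unconstrained form concrete, here is a minimal sketch of the penalised objective for the two-predictor example, using the assumed X and y from the simulation sketch above (that model has no intercept, so every coefficient is penalised):

# ridge objective: RSS(beta) plus the shrinkage penalty lambda * sum(beta^2)
ridge_objective <- function(beta, X, y, lambda) {
  rss <- sum((y - X %*% beta)^2)
  rss + lambda * sum(beta^2)
}
# minimise it numerically for one value of lambda (illustration only)
optim(par = c(0, 0), fn = ridge_objective, X = X, y = y, lambda = 1)$par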
Constrained
Unconstrained
Rows: 6,738
Columns: 9
$ model <chr> "GT86", "GT86", "GT86", "GT86", "GT86", "GT86", "GT86", "…
$ year <dbl> 2016, 2017, 2015, 2017, 2017, 2017, 2017, 2017, 2020, 201…
$ price <dbl> 16000, 15995, 13998, 18998, 17498, 15998, 18522, 18995, 2…
$ transmission <chr> "Manual", "Manual", "Manual", "Manual", "Manual", "Manual…
$ mileage <dbl> 24089, 18615, 27469, 14736, 36284, 26919, 10456, 12340, 5…
$ fuelType <chr> "Petrol", "Petrol", "Petrol", "Petrol", "Petrol", "Petrol…
$ tax <dbl> 265, 145, 265, 150, 145, 260, 145, 145, 150, 265, 265, 14…
$ mpg <dbl> 36.2, 36.2, 36.2, 36.2, 36.2, 36.2, 36.2, 36.2, 33.2, 36.…
$ engineSize <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
recipes 📦
library(tidymodels)
library(rsample)
toyota_splits <- initial_split(toyota, prop = 0.75, strata = model)
toyota_recipe <- recipe(price ~ ., data = training(toyota_splits)) %>%
  # standardising predictors
  step_normalize(all_numeric_predictors()) %>%
  # change categorical variables to dummy variables
  step_dummy(all_nominal_predictors()) %>%
  # remove any predictors with zero variance (i.e. constant)
  step_zv(all_predictors()) %>%
  # log transform the response
  step_log(all_outcomes(), base = 10) %>%
  # now signal that you are done
  prep()
toyota_recipe
Recipe
Inputs:
role #variables
outcome 1
predictor 8
Training data contained 5052 data points and no missing data.
Operations:
Centering and scaling for year, mileage, tax, mpg, engineSize [trained]
Dummy variables from model, transmission, fuelType [trained]
Zero variance filter removed <none> [trained]
Log transformation on price [trained]
recipes 📦
toyota_train <- toyota_recipe %>%
  bake(new_data = NULL)
toyota_test <- toyota_recipe %>%
  bake(new_data = testing(toyota_splits))
toyota_test
# A tibble: 1,686 × 29
year mileage tax mpg engineSize price model_A…¹ model…² model…³ model…⁴
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 -0.342 0.0633 2.30 -1.72 1.20 4.20 0 0 0 0
2 -0.798 0.241 2.30 -1.72 1.20 4.15 0 0 0 0
3 0.114 -0.540 0.684 -1.72 1.20 4.26 0 0 0 0
4 0.570 0.648 0.684 -1.72 1.20 4.27 0 0 0 0
5 1.03 -1.06 0.752 -1.91 1.20 4.43 0 0 0 0
6 -0.798 0.400 2.30 -1.72 1.20 4.15 0 0 0 0
7 1.03 -1.02 0.684 -1.94 1.20 4.43 0 0 0 0
8 0.570 -0.924 0.684 -1.72 1.20 4.32 0 0 0 0
9 -1.71 1.45 2.30 -1.72 1.20 4.06 0 0 0 0
10 1.03 -1.20 0.684 -1.94 1.20 4.43 0 0 0 0
# … with 1,676 more rows, 19 more variables: model_Corolla <dbl>,
# model_GT86 <dbl>, model_Hilux <dbl>, model_IQ <dbl>,
# model_Land.Cruiser <dbl>, model_Prius <dbl>, model_PROACE.VERSO <dbl>,
# model_RAV4 <dbl>, model_Supra <dbl>, model_Urban.Cruiser <dbl>,
# model_Verso <dbl>, model_Verso.S <dbl>, model_Yaris <dbl>,
# transmission_Manual <dbl>, transmission_Other <dbl>,
# transmission_Semi.Auto <dbl>, fuelType_Hybrid <dbl>, …
Ridge regression can be fitted with the glmnet 📦 package. The glmnet package doesn't use a symbolic model formula and requires categorical variables to be converted to dummy variables first.
library(broom) # for tidy
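The plot code below uses a fitted object fit_ridge that isn't shown on this slide. A minimal sketch of how it might have been created, assuming a grid lambda_vec chosen to match the values appearing in the cross-validation output later:

library(glmnet)
# assumed grid of penalty values (glmnet reorders a user-supplied lambda from largest to smallest)
lambda_vec <- 10^seq(-10, 0, length.out = 50)
# alpha = 0 selects the ridge (L2) penalty
fit_ridge <- glmnet(x = as.matrix(select(toyota_train, -price)),
                    y = toyota_train$price,
                    alpha = 0,
                    lambda = lambda_vec)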
fit_ridge %>%
  tidy(return_zeros = TRUE) %>%
  filter(term != "(Intercept)") %>%
  ggplot(aes(lambda, estimate)) +
  geom_line(data = ~rename(., term2 = term),
            aes(group = term2),
            color = "grey") +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_line(color = "royalblue2") +
  facet_wrap(~term, nrow = 4) +
  labs(x = latex2exp::TeX("\\lambda"),
       y = "Standardised coefficients") +
  scale_x_log10()
cv_glmnet_to_toyota <- function(alpha) {
  # Note: you're applying cross validation to the training dataset!
  toyota_train %>%
    # make the 10-fold cross validation data set
    vfold_cv(v = 10) %>%
    # for each fold, calculate the model metrics
    mutate(metrics = map(splits, function(.split) {
      # get the fold training dataset
      fold_train_data <- training(.split)
      # fit the model to the fold training dataset (glmnet expects a numeric matrix)
      fold_fit <- glmnet(x = as.matrix(select(fold_train_data, -price)),
                         y = fold_train_data$price,
                         alpha = alpha,
                         lambda = lambda_vec)
      # now get the validation dataset
      fold_test_data <- testing(.split)
      # get the predictions for this fold and models
      fold_preds <- fold_fit %>%
        predict(as.matrix(select(fold_test_data, -price))) %>%
        as.data.frame() %>%
        add_column(price = fold_test_data$price) %>%
        pivot_longer(-price, values_to = ".pred", names_to = "name") %>%
        left_join(tibble(name = paste0("s", 1:length(lambda_vec) - 1),
                         lambda = rev(lambda_vec)),
                  by = "name")
      # get the model metrics for this fold
      fold_preds %>%
        group_by(name, lambda) %>%
        metric_set(rmse, mae, mape)(., price, .pred) %>%
        select(-.estimator) %>%
        arrange(.metric, lambda)
    })) %>%
    # summarise the model metrics for each value of lambda
    unnest(metrics) %>%
    group_by(name, .metric) %>%
    summarise(lambda = unique(lambda),
              mean = mean(.estimate),
              se = sd(.estimate))
}
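The output below shows the ridge tuning results, so the function was presumably called with alpha = 0 (the ridge penalty in glmnet). The call itself isn't shown on the slide, but it would look like:

toyota_tuning_ridge <- cv_glmnet_to_toyota(alpha = 0)
toyota_tuning_ridge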
# A tibble: 150 × 5
# Groups: name [50]
name .metric lambda mean se
<chr> <chr> <dbl> <dbl> <dbl>
1 s0 mae 1 0.104 0.00353
2 s0 mape 1 2.59 0.0903
3 s0 rmse 1 0.140 0.00490
4 s1 mae 0.625 0.0891 0.00292
5 s1 mape 0.625 2.22 0.0757
6 s1 rmse 0.625 0.121 0.00451
7 s10 mae 0.00910 0.0327 0.00130
8 s10 mape 0.00910 0.817 0.0360
9 s10 rmse 0.00910 0.0454 0.00366
10 s11 mae 0.00569 0.0325 0.00132
# … with 140 more rows
toyota_tuning_ridge_min <- toyota_tuning_ridge %>%
  group_by(.metric) %>%
  filter(mean == min(mean))
toyota_tuning_ridge %>%
  ggplot(aes(lambda, mean)) +
  geom_errorbar(aes(ymin = mean - se,
                    ymax = mean + se)) +
  geom_line() +
  geom_point(data = toyota_tuning_ridge_min,
             color = "red") +
  facet_wrap(~.metric, scales = "free") +
  scale_x_log10(name = latex2exp::TeX("\\lambda"))
| Metric | \(\lambda\) | Mean | Std. Error |
|---|---|---|---|
| rmse | 0.0001326 | 0.0445490 | 0.0039625 |
| mae | 0.0000000 | 0.0322999 | 0.0013496 |
| mape | 0.0000000 | 0.8054924 | 0.0367295 |
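The workflow below uses a value best_lambda_ridge that isn't defined on this slide. One plausible definition, consistent with the penalty of 0.000133 shown in the fitted coefficients below, is the \(\lambda\) that minimised the cross-validated RMSE:

# assumed: the lambda minimising cross-validated RMSE from the table above
best_lambda_ridge <- toyota_tuning_ridge_min %>%
  filter(.metric == "rmse") %>%
  pull(lambda)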
The glmnet interface is different to lm, which used a symbolic model formula.
tidymodels 📦
parsnip (part of tidymodels) standardises this interface.
workflow(toyota_recipe) %>%
  add_model(linear_reg(engine = "glmnet", penalty = best_lambda_ridge, mixture = 0)) %>%
  fit(testing(toyota_splits)) %>%
  tidy()
# A tibble: 27 × 3
term estimate penalty
<chr> <dbl> <dbl>
1 (Intercept) 4.11 0.000133
2 year 0.0827 0.000133
3 mileage -0.0463 0.000133
4 tax -0.00728 0.000133
5 mpg -0.00608 0.000133
6 engineSize 0.0369 0.000133
7 model_Avensis -0.0310 0.000133
8 model_Aygo -0.129 0.000133
9 model_C.HR 0.115 0.000133
10 model_Camry 0.0283 0.000133
# … with 17 more rows
\[{\hat{\boldsymbol\beta}_{lasso} = \underset{\boldsymbol\beta\in\mathbb{R}^{p+1}}{\mbox{arg min}}\ \text{RSS}(\boldsymbol{\beta})}\quad \color{#006DAE}{\text{subject to } \sum_{j=1}^p|\beta_j|\le s}.\]
[Figure: the lasso constraint region in the \((\beta_1, \beta_2)\) plane and the \(\text{RSS}(\beta_1, \beta_2)\) surface.]
\[\hat{\boldsymbol\beta} = \underset{\boldsymbol\beta\in\mathbb{R}^{p+1}}{\mbox{arg min}}\left\{~\text{RSS}(\boldsymbol{\beta}) + \underbrace{\lambda \sum_{j=1}^p |\beta_j|}_{\large\text{shrinkage penalty}}\right\}\]
The lasso can also be fitted with the glmnet 📦 package, this time with alpha = 1 (again, we'll explain why later).
library(broom) # for tidy
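As with ridge, the object fit_lasso used below isn't defined on this slide; a minimal sketch, reusing the assumed lambda_vec grid from earlier:

# alpha = 1 selects the lasso (L1) penalty
fit_lasso <- glmnet(x = as.matrix(select(toyota_train, -price)),
                    y = toyota_train$price,
                    alpha = 1,
                    lambda = lambda_vec)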
est_lasso <- fit_lasso %>%
  tidy(return_zeros = TRUE)
est_lasso %>%
  filter(term != "(Intercept)") %>%
  mutate(zero = estimate == 0) %>%
  ggplot(aes(lambda, estimate)) +
  geom_line(data = ~rename(., term2 = term),
            aes(group = term2),
            color = "grey") +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_line(aes(color = zero, group = term)) +
  facet_wrap(~term, nrow = 4) +
  labs(x = latex2exp::TeX("\\lambda"),
       y = "Standardised coefficients") +
  guides(color = "none") +
  scale_x_log10()
toyota_tuning_lasso <- cv_glmnet_to_toyota(alpha = 1)
toyota_tuning_lasso_min <- toyota_tuning_lasso %>%
  group_by(.metric) %>%
  filter(mean == min(mean))
toyota_tuning_lasso %>%
  ggplot(aes(lambda, mean)) +
  geom_errorbar(aes(ymin = mean - se,
                    ymax = mean + se)) +
  geom_line() +
  geom_point(data = toyota_tuning_lasso_min,
             color = "red") +
  facet_wrap(~.metric, scales = "free") +
  scale_x_log10(name = latex2exp::TeX("\\lambda"))
| Metric | \(\lambda\) | Mean | Std. Error |
|---|---|---|---|
| mae | 0 | 0.0323048 | 0.0017061 |
| mape | 0 | 0.8057043 | 0.0447169 |
| rmse | 0 | 0.0446230 | 0.0039084 |
\[\hat{\boldsymbol\beta}_{glasso} = \underset{\boldsymbol\beta\in\mathbb{R}^{p+1}}{\mbox{arg min}}\left\{~\text{RSS}(\boldsymbol{\beta}) + \lambda \sum_{j}||\boldsymbol{\beta}_j||\right\},\]
where \(\boldsymbol{\beta}_j\) is the sub-vector of coefficients in group \(j\) (e.g. all the dummy variables generated from one categorical predictor) and \(||\cdot||\) is the Euclidean norm.
The group lasso can be fitted with the grpreg package.
 [1] "year"         "mileage"      "tax"          "mpg"          "engineSize"
[6] "model" "model" "model" "model" "model"
[11] "model" "model" "model" "model" "model"
[16] "model" "model" "model" "model" "model"
[21] "model" "model" "transmission" "transmission" "transmission"
[26] "fuelType" "fuelType" "fuelType"
est_glasso <- fit_glasso %>%
  broom:::tidy.glmnet(return_zeros = TRUE)
est_glasso %>%
  mutate(zero = estimate == 0) %>%
  filter(term != "(Intercept)") %>%
  ggplot(aes(lambda, estimate)) +
  geom_line(data = ~rename(., term2 = term),
            aes(group = term2),
            color = "grey") +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_line(aes(color = zero, group = term)) +
  facet_wrap(~term, nrow = 4) +
  labs(x = latex2exp::TeX("\\lambda"),
       y = "Standardised coefficients") +
  scale_x_log10() +
  guides(color = "none")
bind_rows(mutate(est_lasso, type = "lasso"),
          mutate(est_glasso, type = "glasso")) %>%
  mutate(type = fct_inorder(type),
         zero = estimate == 0) %>%
  filter(str_detect(term, "^(model_|transmission_)")) %>%
  ggplot(aes(lambda, estimate)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_line(aes(group = interaction(term, type), color = type)) +
  facet_wrap(~term) +
  labs(x = latex2exp::TeX("\\lambda"),
       y = "Standardised coefficients") +
  scale_x_log10() +
  scale_color_manual("", values = c("royalblue2", "violetred2"))
Lasso
Group lasso
Ridge regression
\[\hat{\boldsymbol\beta}_{elastic} = \underset{\boldsymbol\beta\in\mathbb{R}^{p+1}}{\mbox{arg min}}\ \left\{\text{RSS}(\boldsymbol{\beta})+\lambda\left(\frac{1-\alpha}{2}\sum_{j=1}^p\beta_j^2+\alpha\sum_{j=1}^p|\beta_j|\right)\right\}.\]
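This is the penalty glmnet implements: its alpha argument is the mixing parameter \(\alpha\) above, so alpha = 0 gives ridge, alpha = 1 gives the lasso, and intermediate values give the elastic net, which is why those alpha values were used earlier. A minimal sketch (alpha = 0.5 is only an illustrative value; lambda_vec is the assumed grid from before):

# elastic net: a mixture of the ridge (L2) and lasso (L1) penalties
fit_elastic <- glmnet(x = as.matrix(select(toyota_train, -price)),
                      y = toyota_train$price,
                      alpha = 0.5,   # assumed illustrative mixing value
                      lambda = lambda_vec)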
ETC3250/5250 Week 3B