## ETC3250/5250

Introduction to Machine Learning

### Tree ensemble methods

Lecturer: Emi Tanaka

Department of Econometrics and Business Statistics

## Decision trees

• Often a single tree suffer from high variance, but low bias.
• This means that your fitted tree may be quite different if you fit to subsets of the data, say.
Code
library(tidyverse)
library(rpart)
library(rpart.plot)
set.seed(1)
mutate(diagnosis = factor(diagnosis, levels = c("M", "B"))) %>%
janitor::clean_names() %>%
select(-x33, -id)
n <- nrow(cancer)
index1 <- sample(1:n, size = n/2, replace = FALSE)
index2 <- setdiff(1:n, index1)
fit1 <- rpart(diagnosis ~ ., data = cancer[index1, ], method = "class")
fit2 <- rpart(diagnosis ~ ., data = cancer[index2, ], method = "class")

## Ensemble methods

• Ensemble methods combine predictions from multiple models.
• The idea is that the combining results from multiple models can (but not always) result in a better predictive performance than a single model by “averaging out” the individual weaknesses (i.e. errors).
• These aggregation generally reduces the variance.
• However, the aggregation process often loses the interpretation of the single models.

## Tree ensemble methods

• Some well-known tree ensemble methods include:
• Bagging: combining predictions based on bootstrap samples.
• Random forests: combining predictions based on bagging and random subset selection of predictors.
• Boosting: combining predictions from models sequentially fit to residuals from previous fit.

# Bagging

## Bagging ensemble learning

• Bootstrap aggregating, or bagging for short, is an ensemble learning method that combines predictions from trees to bootstrapped data.
• Recall boostrap involves resampling observations with replacement.

## Trees for boostrapped data

insurance <- read_csv("https://emitanaka.org/iml/data/insurance.csv") %>%
mutate(across(where(is.character), as.factor))

boostrap_tree <- function() {
insurance %>%
slice_sample(n = nrow(.), replace = TRUE) %>% # bootstrapping
rpart(charges ~ ., data = .)
}
rpart.plot(boostrap_tree())
rpart.plot(boostrap_tree())
• Note that the tree structures are similar here!

## Drawback of bagging

• In bagging we sample across observations.
• The same predictors are used across trees, resulting typically in a similar tree structure if there are any underlying strong relationships.
• So even though the model building processes are independent, the trees are often highly correlated.
• It can also be costly to compute if data includes large number of predictors as each bagged tree is constructed based on all predictors.

# Random forestss

## Random forests

• Random forest overcomes the drawback of bagging by sampling across observations and variables.
• This approach reduces the correlation between trees, and thus reducing variance.
• It only considers subsets of predictors for each tree so it’s faster to compute.

## Trees for random forest

set.seed(2023)
p <- ncol(insurance) - 1 # = 6
yindex <- which(names(insurance)=="charges")
m <- floor(sqrt(p)) # = 2
random_tree <- function() {
data <- insurance %>%
slice_sample(n = nrow(.), replace = TRUE) %>% # observation bootstrapping
select(charges, sample(setdiff(1:(p + 1), yindex), size = m)) # sample subset of features
print(colnames(data))
rpart(charges ~ ., data = data)
}
rpart.plot(random_tree())
 "charges" "bmi"     "age"    
rpart.plot(random_tree())
 "charges"  "sex"      "children"

# Boosting

## From random forests to boosting

• Random forests build an ensemble of deep independent trees.
• Boosted trees build an ensemble of shallow trees in sequence with each tree learning and improving on the previous one. ## Boosting for regression trees first tree

• Suppose we fit a (shallow) regression tree, T^1, to y_1, y_2, \dots, y_n for the set of predictors \boldsymbol{x}_1, \boldsymbol{x}_2, \dots, \boldsymbol{x}_p.
T1 <- rpart(charges ~ ., data = insurance, maxdepth = 1, cp = 0)
rpart.plot(T1)

## Boosting for regression trees first tree

• We denote
• \hat{y}_i^1 as the prediction (or fitted value) for the i-th observation from T^1,
• r_i^1 = y_i - \lambda\hat{y}_i^1 as the residual of T^1 for the i-th observation,
• \lambda is the shrinkage parameter (e.g. \lambda = 0.1 slows the learning rate).
pred1 <- predict(T1)
lambda <- 1
res1 <- insurance$charges - lambda * pred1 ggplot(tibble(res1), aes(res1)) + geom_histogram() ## Boosting for regression trees second tree • We then use r_1^1, r_2^1, \dots, r_n^1 as your “response” data to fit the next tree, T^2, using the same set of predictors \boldsymbol{x}_1, \boldsymbol{x}_2, \dots, \boldsymbol{x}_p. T2 <- rpart(res1 ~ . - charges, data = insurance, maxdepth = 1, cp = 0) rpart.plot(T2) ## Boosting for regression trees second tree • The prediction of r_i^1 from T^2 is denoted as \hat{r}_i^1. • The predicted value of y_i from T^2 is then given as \hat{y}_i^2 = \hat{y}_i^1 + \hat{r}_i^1. pred2 <- predict(T2) res2 <- insurance$charges - lambda * (pred1 + pred2)
ggplot(tibble(res2), aes(res2)) + geom_histogram()

## Boosting for regression trees third tree

T3 <- rpart(res2 ~ . - charges, data = insurance, maxdepth = 1, cp = 0)
rpart.plot(T3)
pred3 <- predict(T3)
res3 <- insurance$charges - lambda * (pred1 + pred2 + pred3) ggplot(tibble(res3), aes(res3)) + geom_histogram() ## Boosting for regression trees fourth tree T4 <- rpart(res3 ~ . - charges, data = insurance, maxdepth = 1, cp = 0) rpart.plot(T4) pred4 <- predict(T4) res4 <- insurance$charges - lambda * (pred1 + pred2 + pred3 + pred4)
ggplot(tibble(res4), aes(res4)) + geom_histogram()

## Boosting for regression trees fifth tree

T5 <- rpart(res4 ~ . - charges, data = insurance, maxdepth = 1, cp = 0)
rpart.plot(T5)
pred5 <- predict(T5)
res5 <- insurance$charges - lambda * (pred1 + pred2 + pred3 + pred4 + pred5) ggplot(tibble(res5), aes(res5)) + geom_histogram() ## Boosting for regression trees hundredth tree scroll res <- list(); Ts <- list(); pred_total <- rep(0, nrow(insurance)); J <- 100 for(i in 1:J) { res[[i]] <- insurance$charges - lambda * pred_total
Ts[[i]] <- rpart(res[[i]] ~ . - charges, data = insurance, maxdepth = 1, cp = 0)
pred_total <- pred_total + predict(Ts[[i]])
}
ggplot(tibble(res = res[[J]]), aes(res)) + geom_histogram()

## Boosting

• We repeat this iteration J times.
• Each iteration (tree) increases accuracy a little but too many iterations result in overfitting.
• Smaller \lambda slows the learning rate.
• Each tree is shallow – if the tree has only one split, it is called a stump.

# Other ensemble tree methods

• Adaptive boosting, or AdaBoost, is a type of boosting method where data for successive iterations of trees are based on weighted data.
• There is a higher weight put on observations that were wrongly classified or has large error.
• AdaBoost can be considered as a special case of (extreme) gradient boosting.

• Gradient boosting involves a loss function, L(y_i|f_m) where \hat{y}_i^m = f_m(\boldsymbol{x}_i|T^m).
• The choice of the loss function depends on the context, e.g. sum of the squared error may be used for regression problems and logarithmic loss function is used for classification problems.
• The loss function must be differentiable.
• Compute the residuals as r_i^m = - \frac{\partial L(y_i|f_m(\boldsymbol{x}_i))}{\partial f_m(\boldsymbol{x}_i)}.

• Extreme gradient boosted trees, or XGBoost, makes improvements to gradient boosting algorithms.
• XGBoost is one of the most dominant methods for Kaggle competitions.
• XGBoost implements many optimisation methods that allow for computationally fast fit of the model (e.g. parallelised tree building, cache awareness computing, efficient handling of missing data).
• XGBoost also implements algorithmic techniques to ensure better model fit (e.g. regularisation to avoid overfitting, in-build cross-validation, pruning trees based on depth-first approach).

# Combining predictions

• Suppose a corporation needs to make a decision, e.g. deciding
• between strategy A, B or C (classification problem) or
• how much to spend (regression problem).
• An executive board is presented with recommendations from experts.
Experts 👨‍✈️ 🕵️‍♀️ 🧑‍🎨 🧑‍🎤 🧑‍💻 👷 👨‍🔧
Classification A B C B A C B
Regression $9.3M$9.2M $8.9M$3.1M $9.2M$8.9M $9.4M • What would your final decision be for each problem? ## Ensemble learning Experts 👨‍✈️ 🕵️‍♀️ 🧑‍🎨 🧑‍🎤 🧑‍💻 👷 👨‍🔧 Classification A B C B A C B Regression$9.3M $9.2M$8.9M $3.1M$9.2M $8.9M$9.4M
• Combining predictions from multiple models depends on whether you have a regression or classification problem.
• The simplest approach for:
• regression problems is to take the average ($8.1M here), and • classification problems is to take the majority vote (B here). • Other methods to combine predictions may be used. # Applications in R ## Data library(rsample) set.seed(2023) 1. Regression problem with the insurance data insurance_split <- insurance %>% mutate(across(where(is.character), as.factor)) %>% initial_split(prop = 3/4) insurance_train <- training(insurance_split) insurance_test <- testing(insurance_split) 1. Classification problem with the cancer data cancer_split <- cancer %>% mutate(diagnosis = ifelse(diagnosis == "M", 1, 0)) %>% initial_split(prop = 3/4) cancer_train <- training(cancer_split) cancer_test <- testing(cancer_split) ## Bagging trees with ipred library(ipred) • For insurance data: reg_bag <- bagging(charges ~ ., data = insurance_train, nbagg = 1000, control = rpart.control(cp = 0)) • For cancer data: class_bag <- bagging(diagnosis ~ ., data = cancer_train, nbagg = 1000, control = rpart.control(cp = 0)) • nbagg is the number of bagged trees to use. ## Random forests with ranger library(ranger) • For insurance data: reg_rf <- ranger(charges ~ ., data = insurance_train, mtry = floor((ncol(insurance_train) - 1) / 3), importance = "impurity", num.trees = 500) • For cancer data: class_rf <- ranger(diagnosis ~ ., data = cancer_train, mtry = floor((ncol(cancer_train) - 1) / 3), importance = "impurity", num.trees = 500, classification = TRUE) • mtry is the number of predictors to use in each node ## Boosted trees with gbm scroll Note: this is not pure boosting alone (it implements bagging as well). library(gbm) • For insurance data: reg_boost <- gbm(charges ~ ., data = insurance_train, distribution = "gaussian", # squared error n.trees = 10000, shrinkage = 0.01, interaction.depth = 1, cv.folds = 10) • For cancer data: class_boost <- gbm(diagnosis ~ ., data = cancer_train, distribution = "bernoulli", # binary logistic regression n.trees = 10000, shrinkage = 0.01, interaction.depth = 1, cv.folds = 10) • shrinkage is \lambda, • n.trees are the number of iterations/trees, • interaction.depth is the maximum depth of tree, • cv.folds is the number of folds for internal cross-validation error calculation. ## Cross-validation to select J Jreg <- gbm.perf(reg_boost, method = "cv") Jreg  1130 Jclass <- gbm.perf(class_boost, method = "cv") Jclass  4886 ## Optimal boosted tree • For insurance data: reg_boost_optimal <- gbm(charges ~ ., data = insurance_train, distribution = "gaussian", n.trees = Jreg, shrinkage = 0.01, interaction.depth = 1) • For cancer data: class_boost_optimal <- gbm(diagnosis ~ ., data = cancer_train, distribution = "bernoulli", n.trees = Jclass, shrinkage = 0.01, interaction.depth = 1) ## Extreme gradient boosted trees with xgboost library(xgboost) • For insurance data: reg_xgb <- xgboost(data = model.matrix(~ . - charges, data = insurance_train)[, -1], label = insurance_train$charges,
max.depth = 2,
eta = 1,
nrounds = 10,
verbose = 0)
• For cancer data:
class_xgb <- xgboost(data = model.matrix(~ . - diagnosis, data = cancer_train)[, -1],
label = cancer_train$diagnosis, max.depth = 2, eta = 1, nrounds = 10, objective = "binary:logistic", verbose = 0) ## Comparison for insurance data library(yardstick) list(bagged = reg_bag, randomforest = reg_rf, boost = reg_boost_optimal, xgboost = reg_xgb) %>% imap_dfr(function(model, name) { insurance_test %>% mutate(pred = switch(name, randomforest = predict(model, .)$predictions,
xgboost = predict(model, model.matrix(~ . - charges, data = .)[, -1]),
predict(model, .))) %>%
metric_set(rmse, mae, mape)(., charges, pred) %>%
mutate(name = name) %>%
pivot_wider(id_cols = name, names_from = .metric, values_from = .estimate)
})
# A tibble: 4 × 4
name          rmse   mae  mape
<chr>        <dbl> <dbl> <dbl>
1 bagged       4479. 2510.  28.0
2 randomforest 4610. 2798.  36.6
3 boost        5956. 4261.  46.0
4 xgboost      4676. 2847.  32.5

## Comparison for cancer data

list(bagged = class_bag,
randomforest = class_rf,
boost = class_boost_optimal,
xgboost = class_xgb) %>%
imap_dfr(function(model, name) {
cancer_test %>%
mutate(prob = switch(name,
randomforest = predict(model, .)\$predictions,
xgboost = predict(model, model.matrix(~ . - diagnosis, data = .)[, -1]),
predict(model, ., type = "response")),
pred = factor(ifelse(prob > 0.5, 1, 0)),
diagnosis = factor(diagnosis)) %>%
metric_set(accuracy, bal_accuracy, kap, roc_auc)(., diagnosis, prob, estimate = pred, event_level = "second") %>%
mutate(name = name) %>%
pivot_wider(id_cols = name, names_from = .metric, values_from = .estimate)
})
# A tibble: 4 × 5
name         accuracy bal_accuracy   kap roc_auc
<chr>           <dbl>        <dbl> <dbl>   <dbl>
1 bagged          0.944        0.938 0.881   0.974
2 randomforest    0.951        0.944 0.896   0.944
3 boost           0.937        0.926 0.865   0.990
4 xgboost         0.951        0.944 0.896   0.990

# Diagnostics

## Variable importance

scroll

library(vip)
• For insurance data
vi(reg_boost_optimal)
# A tibble: 6 × 2
Variable Importance
<chr>         <dbl>
1 smoker      78.0
2 age         14.4
3 bmi          6.66
4 children     0.746
5 region       0.184
6 sex          0.0542
vi(reg_rf)
# A tibble: 6 × 2
Variable   Importance
<chr>           <dbl>
1 smoker   89013357987.
2 bmi      21547358187.
3 age      19695751919.
4 children  3404902595.
5 region    2621029173.
6 sex       1153341291.
vi(reg_xgb)
# A tibble: 6 × 2
Variable        Importance
<chr>                <dbl>
1 smokeryes         0.730
2 bmi               0.162
3 age               0.0996
4 children          0.00571
5 regionsouthwest   0.00142
6 sexmale           0.000703
vip(reg_boost_optimal)
vip(reg_rf)
vip(reg_xgb)
• For cancer data
vi(class_boost_optimal)
# A tibble: 31 × 2
Variable             Importance
<chr>                     <dbl>
1 concave_points_worst      31.8
2 perimeter_worst           18.7
3 area_worst                15.7
4 concave_points_mean       14.2
5 area_se                    5.76
7 texture_worst              1.62
8 smoothness_worst           1.41
9 concavity_worst            1.34
10 symmetry_worst             1.05
# ℹ 21 more rows
vi(class_rf)
# A tibble: 31 × 2
Variable             Importance
<chr>                     <dbl>
1 concave_points_worst      34.7
2 area_worst                31.8
3 perimeter_worst           30.8
4 concave_points_mean       23.9
6 concavity_mean             6.49
7 area_se                    6.43
8 area_mean                  4.76
10 texture_worst              3.50
# ℹ 21 more rows
vi(class_xgb)
# A tibble: 20 × 2
Variable             Importance
<chr>                     <dbl>
1 concave_points_worst   0.639
2 area_worst             0.112
3 perimeter_worst        0.0956
4 area_se                0.0492
5 texture_mean           0.0242
6 texture_worst          0.0162
7 concavity_worst        0.0134
8 concave_points_mean    0.0124
10 fractal_dimension_se   0.00678
11 smoothness_mean        0.00495
12 id                     0.00400
13 area_mean              0.00316
15 concavity_mean         0.00244
16 symmetry_se            0.00199
17 symmetry_worst         0.00175
18 smoothness_se          0.00133
19 compactness_se         0.000706
20 radius_worst           0.000341
vip(class_boost_optimal)
vip(class_rf)
vip(class_xgb)
• If the value is large, then the variable is very important.

# Takeaways

• Tree methods can be extended by using ensemble approaches, e.g. bagging, random forest and boosting.
• Each of these methods have pros and cons.
• All these methods can be applied to both regression and classification. 