Ensemble methods combine predictions from multiple models.
The idea is that combining the results from multiple models can (but does not always) give better predictive performance than a single model by “averaging out” the individual weaknesses (i.e. errors).
This aggregation generally reduces the variance.
However, the aggregation process often loses the interpretability of the individual models.
Tree ensemble methods
Some well-known tree ensemble methods include:
Bagging: combining predictions based on bootstrap samples.
Random forests: combining predictions based on bagging and random subset selection of predictors.
Boosting: combining predictions from models fit sequentially to the residuals of the previous fit (all three are compared in the sketch below).
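As a rough illustration of these three approaches (not part of the original notes), the sketch below fits each ensemble with scikit-learn on a simulated regression problem; the dataset and hyperparameter values are arbitrary choices for demonstration only.

```python
# A rough comparison of bagging, random forests and boosting on a simulated
# regression problem (illustrative only; all settings are arbitrary choices).
from sklearn.datasets import make_regression
from sklearn.ensemble import (BaggingRegressor, GradientBoostingRegressor,
                              RandomForestRegressor)
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=10, noise=10, random_state=1)

models = {
    "bagging": BaggingRegressor(n_estimators=200, random_state=1),
    "random forest": RandomForestRegressor(n_estimators=200, max_features="sqrt",
                                           random_state=1),
    "boosting": GradientBoostingRegressor(n_estimators=200, learning_rate=0.1,
                                          max_depth=1, random_state=1),
}
for name, model in models.items():
    score = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name:>13}: mean CV R^2 = {score:.3f}")
```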
Bagging
Bagging ensemble learning
Bootstrap aggregating, or bagging for short, is an ensemble learning method that combines predictions from trees fitted to bootstrapped data.
Recall that the bootstrap involves resampling observations with replacement.
Random forests build an ensemble of deep independent trees.
Boosted trees build an ensemble of shallow trees in sequence with each tree learning and improving on the previous one.
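A minimal hand-rolled sketch of bagging, assuming scikit-learn and simulated data (the number of bootstrap samples B = 100 is an arbitrary choice): resample the rows with replacement, fit a deep tree to each bootstrap sample, and average the predictions.

```python
# Bagging by hand: fit deep trees to bootstrap resamples (rows sampled with
# replacement) and average their predictions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=5, random_state=0)
rng = np.random.default_rng(0)

B = 100                                          # number of bootstrap samples
trees = []
for _ in range(B):
    idx = rng.integers(0, len(y), size=len(y))   # resample with replacement
    tree = DecisionTreeRegressor(random_state=0) # deep (unpruned) tree
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# The bagged prediction is the average of the B individual tree predictions.
y_hat = np.mean([tree.predict(X) for tree in trees], axis=0)
```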
Boosting for regression trees
Suppose we fit a (shallow) regression tree, T^1, to y_1, y_2, \dots, y_n for the set of predictors \boldsymbol{x}_1, \boldsymbol{x}_2, \dots, \boldsymbol{x}_p.
We then use the residuals from this fit, r_1^1, r_2^1, \dots, r_n^1, as the “response” data to fit the next tree, T^2, using the same set of predictors \boldsymbol{x}_1, \boldsymbol{x}_2, \dots, \boldsymbol{x}_p.
Each iteration (tree) increases accuracy a little, but too many iterations result in overfitting.
A smaller shrinkage parameter \lambda slows the learning rate, so each tree contributes less to the fit (see the sketch below).
Each tree is shallow – if the tree has only one split, it is called a stump.
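The following is a minimal sketch of this residual-fitting idea, assuming scikit-learn, simulated data, and illustrative values for the number of trees and the shrinkage \lambda.

```python
# Boosting for regression trees: each shallow tree (a stump here) is fit to
# the current residuals and added to the ensemble with shrinkage lam.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=5, random_state=0)

M, lam = 500, 0.01          # number of trees and shrinkage (illustrative values)
f_hat = np.zeros(len(y))    # current ensemble prediction, starts at zero
resid = y.copy()            # residuals start as the raw response
trees = []

for m in range(M):
    stump = DecisionTreeRegressor(max_depth=1)   # one split = a stump
    stump.fit(X, resid)                          # fit to the current residuals
    update = lam * stump.predict(X)
    f_hat += update                              # improve the fit a little
    resid -= update                              # residuals for the next tree
    trees.append(stump)
```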
Other ensemble tree methods
Adaptive boosting
Adaptive boosting, or AdaBoost, is a boosting method in which successive trees are fit to reweighted data.
A higher weight is put on observations that were wrongly classified or have large errors.
AdaBoost can be considered a special case of (extreme) gradient boosting.
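A minimal AdaBoost example, assuming scikit-learn and a simulated classification problem; the number of trees and the learning rate are illustrative choices.

```python
# AdaBoost on a simulated classification problem; by default scikit-learn's
# AdaBoostClassifier boosts decision stumps on reweighted data.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=1)
ada = AdaBoostClassifier(n_estimators=200, learning_rate=0.5, random_state=1)
print(cross_val_score(ada, X, y, cv=5).mean())
```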
Gradient boosting
Gradient boosting involves a loss function, L(y_i|f_m) where \hat{y}_i^m = f_m(\boldsymbol{x}_i|T^m).
The choice of the loss function depends on the context, e.g. the sum of squared errors may be used for regression problems and the logarithmic loss for classification problems.
The loss function must be differentiable.
Compute the residuals as r_i^m = - \frac{\partial L(y_i|f_m(\boldsymbol{x}_i))}{\partial f_m(\boldsymbol{x}_i)}.
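As a quick sanity check (not from the original notes): for the squared-error loss L(y_i|f) = \frac{1}{2}(y_i - f)^2, the negative gradient is y_i - f, so the residuals r_i^m reduce to the ordinary residuals used in the regression-tree boosting above. The sketch below verifies this numerically with a finite-difference approximation on toy values.

```python
# Check that for squared-error loss L(y_i|f) = (y_i - f)^2 / 2 the negative
# gradient equals the ordinary residual y_i - f.
import numpy as np

def loss(y, f):
    return 0.5 * (y - f) ** 2               # squared-error loss

y = np.array([3.0, 1.5, 2.2])               # toy responses (made up)
f_m = np.array([2.5, 1.0, 2.0])             # current predictions f_m(x_i)

eps = 1e-6                                  # central finite-difference step
grad = (loss(y, f_m + eps) - loss(y, f_m - eps)) / (2 * eps)
r_m = -grad                                 # residuals r_i^m = -dL/df

print(np.allclose(r_m, y - f_m))            # True: the ordinary residuals
```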
Extreme gradient boosting
Extreme gradient boosting, or XGBoost, makes several improvements to gradient boosting algorithms.
XGBoost implements many optimisation methods that allow the model to be fit quickly (e.g. parallelised tree building, cache-aware computing, efficient handling of missing data).
XGBoost also implements algorithmic techniques to ensure a better model fit (e.g. regularisation to avoid overfitting, built-in cross-validation, pruning trees using a depth-first approach).
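A minimal sketch, assuming a recent version of the xgboost package (which provides the XGBRegressor wrapper) and scikit-learn for the simulated data and train/validation split; the hyperparameter values are illustrative, not recommendations.

```python
# Fitting an XGBoost model with L2 regularisation and early stopping on a
# held-out validation set.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=10, random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=1)

model = XGBRegressor(
    n_estimators=500,
    max_depth=3,
    learning_rate=0.1,
    reg_lambda=1.0,              # L2 regularisation to limit overfitting
    early_stopping_rounds=20,    # stop once validation error stops improving
)
model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
print(model.best_iteration)      # index of the best round on the validation set
```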
Combining predictions
Business decisions
Suppose a corporation needs to make a decision, e.g. deciding
between strategy A, B or C (a classification problem), or
how much to spend (a regression problem).
An executive board is presented with recommendations from experts.
Experts:        👨‍✈️    🕵️‍♀️    🧑‍🎨    🧑‍🎤    🧑‍💻    👷    👨‍🔧
Classification: A      B      C      B      A      C      B
Regression:     $9.3M  $9.2M  $8.9M  $3.1M  $9.2M  $8.9M  $9.4M
What would your final decision be for each problem?
Ensemble learning
How you combine predictions from multiple models depends on whether you have a regression or a classification problem.
The simplest approach for:
regression problems is to take the average (about $8.3M here), and
classification problems is to take the majority vote (B here).
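The sketch below reproduces both combinations for the expert table above (assuming Python with numpy): a majority vote for the classification problem and a simple average for the regression problem.

```python
# Combining the experts' recommendations above: a majority vote for the
# classification problem and a simple average for the regression problem.
from collections import Counter
import numpy as np

votes = ["A", "B", "C", "B", "A", "C", "B"]
amounts = [9.3, 9.2, 8.9, 3.1, 9.2, 8.9, 9.4]    # in $M

print(Counter(votes).most_common(1)[0][0])       # 'B' (majority vote)
print(round(float(np.mean(amounts)), 1))         # 8.3 ($M, simple average)
```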