Department of Econometrics and Business Statistics
Resampling
Resampling is the process of generating new data based on the observed data.
Commonly used resampling methods include:
Cross validation: often used for model assessment.
Bootstrapping: often used to provide a measure of accuracy of a parameter estimate.
Recall: training, validation and testing data sets
In Lecture 1, we randomly split 75% of the data into a training set and the remaining 25% into a testing set, then measured the accuracy of the model trained on the training data using the testing data.
But the measured accuracy changes with a different random split.
Cross-validation extends this approach to:
several iterations with different splits, and
combining (typically by averaging) the model accuracy across the iterations.
Cross validation
\(k\)-fold cross validation
Partition the sample into \(k\) (nearly) equal-sized subsamples, referred to as folds.
Fit the model on \(k - 1\) folds, and compute a metric, e.g. RMSE, on the omitted fold.
Repeat \(k\) times, omitting a different fold each time.
Cross validation accuracy
Choice of \(k = 10\) is common.
Recall from Lecture 1 that there are a number of ways to measure predictive accuracy, e.g. RMSE, MAE, MAPE and MPE.
Cross-validation accuracy is calculated by:
computing the accuracy (based on one metric) on every fold, then
combining these into a single number, e.g. by taking a simple average (a by-hand sketch follows below).
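To make the procedure concrete, here is a minimal by-hand sketch of \(k\)-fold cross validation in base R. This is not from the lecture: the built-in mtcars data and the model mpg ~ wt are arbitrary illustrative choices.

set.seed(2023)
k <- 5
# randomly assign each observation to one of k folds
fold <- sample(rep(1:k, length.out = nrow(mtcars)))
rmse_per_fold <- sapply(1:k, function(j) {
  train <- mtcars[fold != j, ]  # fit on the k - 1 retained folds
  test <- mtcars[fold == j, ]   # assess on the omitted fold
  fit <- lm(mpg ~ wt, data = train)
  sqrt(mean((test$mpg - predict(fit, test))^2))  # RMSE for this fold
})
mean(rmse_per_fold)  # cross-validation RMSE (simple average across folds)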
\(k\)-fold cross validation with rsample
\(k = 5\)-fold cross validation (\(k\) corresponds to the argument v in the rsample package)
library(tidyverse)
library(rsample)
toyota <- read_csv("https://emitanaka.org/iml/data/toyota.csv")
set.seed(2023) # to replicate results
toyota_folds <- vfold_cv(toyota, v = 5)
toyota_folds
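The metrics code below assumes a toyota_models object with one fitted model per fold in columns reg, tree and knn; that fitting step is not shown in this extract. A minimal sketch of what it might look like, where the model formulas (and the log10(price) response, inferred from the 10^predict(...) back-transform below) are assumptions:

library(rpart) # decision trees
library(kknn)  # k-nearest neighbour regression
toyota_models <- toyota_folds %>%
  mutate(
    # fit each model to the analysis (training) portion of every fold;
    # the predictors year and mileage are illustrative choices only
    reg = map(splits, ~ lm(log10(price) ~ year + mileage, data = analysis(.x))),
    tree = map(splits, ~ rpart(log10(price) ~ year + mileage, data = analysis(.x))),
    knn = map(splits, ~ train.kknn(log10(price) ~ year + mileage, data = analysis(.x)))
  )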
Measuring accuracy for each fold
library(yardstick) # provides metric_set(), rmse, mae, mape, mpe, rsq
toyota_metrics <- toyota_models %>%
  mutate(across(c(reg, tree, knn), function(models) {
    # now for every fold and model,
    map2(splits, models, function(.split, .model) {
      testing(.split) %>%
        # compute predictions for the testing set
        mutate(.pred = 10^predict(.model, .)) %>%
        # then get the metrics
        metric_set(rmse, mae, mape, mpe, rsq)(., price, .pred) %>%
        # drop the .estimator column, then reshape into a one-row
        # data frame where column names are the metrics and
        # values are the accuracy measures
        select(-.estimator) %>%
        pivot_wider(names_from = .metric, values_from = .estimate)
    })
  }, .names = "{.col}_metrics"))
toyota_metrics
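As a hedged sketch of the final step (assuming toyota_metrics as constructed above), unnest one model's per-fold metrics and average each measure across the folds to obtain the cross-validation accuracy:

toyota_metrics %>%
  unnest(reg_metrics) %>% # one row of metrics per fold for the regression model
  summarise(across(c(rmse, mae, mape, mpe, rsq), mean))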
Bias-variance tradeoff for cross validation
\(k\)-fold cross validation with \(k < n\) has a computational advantage over LOOCV (\(k = n\)).
LOOCV is preferred to \(k\)-fold cross validation from the perspective of bias reduction (almost all of the data are used to estimate each model).
\(k\)-fold cross validation is preferred to LOOCV from the perspective of lower variance (the \(n\) fitted models in LOOCV are highly positively correlated with each other).
We usually select \(k = 5\) or \(k = 10\) to balance the bias-variance trade-off.
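For reference, rsample also provides leave-one-out splits directly; a minimal sketch using the toyota data from above:

toyota_loocv <- loo_cv(toyota) # one split per observation, i.e. k = n
toyota_loocv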
Bootstrap
Bootstrap samples
A bootstrap sample is created by sampling the original data with replacement, producing a sample with the same dimension as the original data.
set.seed(2023) # to replicate results
toyota %>%
  sample_n(size = n(), replace = TRUE)
In a bootstrap sample, some observations may appear more than once while others may not appear at all.
Out-of-bag samples for bootstraps
When splitting a bootstrap sample into training and testing datasets, be careful to ensure that the testing dataset contains only out-of-bag (OOB) samples.
OOB samples are observations that are not included in the bootstrap sample.
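As an aside (not from the lecture), a quick simulation illustrates that roughly a third of observations are OOB for any single bootstrap sample, since the chance that a given observation is never drawn is \((1 - 1/n)^n \approx e^{-1} \approx 0.368\):

set.seed(2023)
n <- 10000
boot_idx <- sample(n, replace = TRUE) # indices forming a bootstrap sample
mean(!seq_len(n) %in% boot_idx)       # proportion out-of-bag, close to 0.368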
The rsample::bootstraps function ensures that the testing data contains only OOB samples, as in the example below.
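For example (the number of resamples here is an arbitrary choice):

set.seed(2023)
toyota_boots <- bootstraps(toyota, times = 10)
# the assessment set of each split contains only the OOB observations
assessment(toyota_boots$splits[[1]])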
Nested cross validation
Some combinations of resampling schemes will result in the same observations appearing in both the training and testing/validation sets.
rsample::nested_cv gives a warning for bad combinations, but be cautious of this yourself!
toyota %>%
  nested_cv(outside = bootstraps(times = 10),
            inside = vfold_cv(v = 5))
Warning: Using bootstrapping as the outer resample is dangerous since the inner
resample might have the same data point in both the analysis and assessment
set.
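A sketch of a combination that avoids this issue: using cross validation for the outer resample means the outer analysis and assessment sets never share observations.

toyota %>%
  nested_cv(outside = vfold_cv(v = 5),
            inside = bootstraps(times = 10))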