Introduction to Machine Learning
Lecturer: Emi Tanaka
Department of Econometrics and Business Statistics
We cover the following methods:
Pricing a second-hand 2004 Toyota Yaris car
For now, don’t worry about how this is calculated; we will learn this later.
Method | Price |
Linear regression | €2,502.19 |
Regression tree | €2,172.45 |
k-nearest neighbour | €2,025.40 |
Neural network | €2,130.74 |
Image from American Cancer Society website
Image from Street et al. (1993) Nuclear feature extraction for breast tumor diagnosis. Biomedical Image Processing and Biomedical Visualization 1905
Again, don’t worry about how this is calculated; we will cover this in Week 4.
: Customer’s unique identifier
Year_Birth: Customer’s birth year
Education: Customer’s education level
Marital_Status: Customer’s marital status
Income: Customer’s yearly household income
Kidhome: Number of children in customer’s household
Teenhome: Number of teenagers in customer’s household
Dt_Customer: Date of customer’s enrollment with the company
Recency: Number of days since customer’s last purchase
Complain: 1 if the customer complained in the last 2 years, 0 otherwise
Products
: Amount spent on wine in last 2 years
MntFruits: Amount spent on fruits in last 2 years
MntMeatProducts: Amount spent on meat in last 2 years
MntFishProducts: Amount spent on fish in last 2 years
MntSweetProducts: Amount spent on sweets in last 2 years
MntGoldProds: Amount spent on gold in last 2 years
Promotion
: Number of purchases made with a discount
AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise
AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise
AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise
Response: 1 if customer accepted the offer in the last campaign, 0 otherwise
Place
: Number of purchases made through the company’s website
NumCatalogPurchases: Number of purchases made using a catalogue
NumStorePurchases: Number of purchases made directly in stores
NumWebVisitsMonth: Number of visits to company’s website in the last month
y = f(x_1, x_2, ..., x_p) + e.
Consider a multiple linear regression:
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + ... + \beta_p x_{ip} + e_i,\quad\text{for }i = 1, ..., n
Notice that we drop the subscript i when we refer to a generic variable; we use i to index observations and j to index features.
y_i = \beta_0 + \sum_{j=1}^p\beta_jx_{ij} + e_i = \sum_{j=0}^p\beta_jx_{ij} + e_i,\quad\text{with }x_{i0} = 1\text{ for }i = 1, ..., n
\boldsymbol{y} = \mathbf{X}_{n\times (p + 1)}\boldsymbol{\beta} + \boldsymbol{e}, \quad\text{assuming } \boldsymbol{e}\sim N(\boldsymbol{0}, \sigma^2\mathbf{I}_{n\times n}).
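The matrix form can be checked numerically. The following is a minimal sketch on simulated data, not the lecture's code: the values of n, p, beta and sigma are made up, and least squares is assumed as the estimation method. It builds the n × (p + 1) design matrix with a leading column of ones and compares the solution of the normal equations with the coefficients returned by lm().

```r
# Simulated data; n, p, beta and sigma are illustrative assumptions
set.seed(1)
n <- 100
p <- 2
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
beta <- c(1, 2, -0.5)                  # (beta_0, beta_1, beta_2)
X <- cbind(1, x)                       # n x (p + 1) design matrix, first column is 1
e <- rnorm(n, mean = 0, sd = 0.3)      # e ~ N(0, sigma^2 I)
y <- drop(X %*% beta + e)

# Least squares estimate from the normal equations: (X'X) b = X'y
beta_hat <- drop(solve(crossprod(X), crossprod(X, y)))

# The same estimate from lm(), which adds the intercept column itself
fit <- lm(y ~ x)
dim(X)
cbind(normal_eq = beta_hat, lm = unname(coef(fit)))
```

The two columns of the final output should agree up to numerical error, since lm() solves the same least squares problem with a different numerical method.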
\begin{bmatrix}2 & 3 \\4 & 3 \\ 1 & 1\end{bmatrix}\qquad\begin{bmatrix}-3 & 3 \\2 & 9 \end{bmatrix}\qquad\begin{bmatrix}1 & 3 & -1 \\ -4 & 0 & 0\end{bmatrix}
\mathbf{X}_{n\times p} = \begin{bmatrix}x_{11} & x_{12} & \cdots & x_{1p}\\ x_{21} & x_{22} & \cdots & x_{2p}\\ \vdots & \vdots & \ddots & \vdots\\ x_{n1} & x_{n2} & \cdots & x_{np}\end{bmatrix}
\mathbf{D} = \text{diag}(d_{1}, d_{2}, d_{3}, d_4) = \begin{bmatrix} d_{1} & 0 & 0 & 0\\0 & d_{2} & 0& 0\\ 0 & 0 & d_3 & 0 \\ 0 & 0 & 0 & d_4 \end{bmatrix}\qquad \mathbf{I}_3 = \begin{bmatrix}1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\boldsymbol{y} = \begin{bmatrix}y_1 \\ y_2\\ \vdots \\ y_m\end{bmatrix}
\mathbf{X} = \begin{bmatrix}x_{11} & x_{12} & x_{13}\\ x_{21} & x_{22} & x_{23}\end{bmatrix}\qquad \mathbf{X}^\top = \begin{bmatrix}x_{11} & x_{21} \\ x_{12} & x_{22} \\ x_{13} & x_{23} \end{bmatrix}
\boldsymbol{y} = \begin{bmatrix}y_1\\y_2 \\ \vdots\\ y_n\\\end{bmatrix}\qquad\boldsymbol{y}^\top = \begin{bmatrix}y_1 & y_2 & \cdots & y_n \end{bmatrix}
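These matrix objects map directly onto base R. A short sketch follows, using the example matrices above where their entries are given and made-up values for d_1, ..., d_4 and for the vector y:

```r
# A 3 x 2 matrix, filled row by row
A <- matrix(c(2, 3,
              4, 3,
              1, 1), nrow = 3, byrow = TRUE)
dim(A)                      # number of rows, number of columns

# Diagonal and identity matrices
D  <- diag(c(1, 2, 3, 4))   # diag(d_1, ..., d_4); the values are made up
I3 <- diag(3)               # 3 x 3 identity matrix

# Transpose swaps rows and columns
B <- matrix(c( 1, 3, -1,
              -4, 0,  0), nrow = 2, byrow = TRUE)
t(B)                        # a 3 x 2 matrix

# A column vector and its transpose (a row vector)
y <- matrix(c(5, 6, 7), ncol = 1)   # the values are made up
t(y)
```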
\begin{align*} MSE = E\left[y-\hat{y}\right]^2 &= E\left[f(x_1, ..., x_p) + e - \hat{f}(x_1, ..., x_p)\right]^2\\ &= \underbrace{E\left[f(x_1, ..., x_p) - \hat{f}(x_1, ..., x_p)\right]^2}_{\text{reducible}} + \underbrace{\text{var}(e)}_{\text{irreducible}} \\ &= \text{bias}[\hat{f}(x)]^2 + \text{var}[\hat{f}(x)] + \text{var}(e) \end{align*}
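The decomposition can be illustrated by simulation. The sketch below is an assumption-laden example rather than the lecture's code: the true function f, the noise level sigma, the evaluation point x0 and the choice of a straight-line fit are all made up. It repeatedly draws a training set, fits the model, predicts at x0, and then compares the simulated MSE with the sum of squared bias, variance and var(e).

```r
set.seed(1)
f <- function(x) sin(2 * x)   # assumed true f (hypothetical)
sigma <- 0.5                  # assumed standard deviation of e
x0 <- 1                       # point at which we predict
n <- 50                       # training set size
n_sim <- 2000                 # number of simulated training sets

# Fit a straight line to a fresh training set each time, predict at x0
fhat_x0 <- replicate(n_sim, {
  x <- runif(n, 0, 3)
  y <- f(x) + rnorm(n, sd = sigma)
  fit <- lm(y ~ x)
  unname(predict(fit, newdata = data.frame(x = x0)))
})

# New responses at x0, independent of the training sets
y0 <- f(x0) + rnorm(n_sim, sd = sigma)

mse      <- mean((y0 - fhat_x0)^2)
bias_sq  <- (mean(fhat_x0) - f(x0))^2
variance <- mean((fhat_x0 - mean(fhat_x0))^2)
c(mse = mse, decomposition = bias_sq + variance + sigma^2)
```

The two numbers should be close but not identical, because the expectations are approximated by averages over a finite number of simulations. A straight line cannot follow the assumed f, so most of the reducible error here comes from bias.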
Image from Emi’s blog
Some measures include (note: lower magnitude is better):
Rows: 6,738
Columns: 9
$ model <chr> "GT86", "GT86", "GT86", "GT86", "GT86", "GT86", "GT86", "…
$ year <dbl> 2016, 2017, 2015, 2017, 2017, 2017, 2017, 2017, 2020, 201…
$ price <dbl> 16000, 15995, 13998, 18998, 17498, 15998, 18522, 18995, 2…
$ transmission <chr> "Manual", "Manual", "Manual", "Manual", "Manual", "Manual…
$ mileage <dbl> 24089, 18615, 27469, 14736, 36284, 26919, 10456, 12340, 5…
$ fuelType <chr> "Petrol", "Petrol", "Petrol", "Petrol", "Petrol", "Petrol…
$ tax <dbl> 265, 145, 265, 150, 145, 260, 145, 145, 150, 265, 265, 14…
$ mpg <dbl> 36.2, 36.2, 36.2, 36.2, 36.2, 36.2, 36.2, 36.2, 33.2, 36.…
$ engineSize <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
\log_{10}(\texttt{price}_i) = \beta_0 + \beta_1 \texttt{year}_i + e_i
This model is fitted in R.
The broom package can be used to get the estimates of model parameters for many models.
log10(price) | year | .fitted | .hat | .sigma | .cooksd | .std.resid |
4.204120 | 2016 | 4.011465 | 0.0001655 | 0.1702762 | 0.0001060 | 1.1314938 |
4.203984 | 2017 | 4.062510 | 0.0001504 | 0.1702836 | 0.0000519 | 0.8308979 |
4.146066 | 2015 | 3.960421 | 0.0002418 | 0.1702773 | 0.0001438 | 1.0903664 |
4.278708 | 2017 | 4.062510 | 0.0001504 | 0.1702720 | 0.0001212 | 1.2697597 |
4.242988 | 2017 | 4.062510 | 0.0001504 | 0.1702781 | 0.0000845 | 1.0599745 |
4.204066 | 2017 | 4.062510 | 0.0001504 | 0.1702836 | 0.0000520 | 0.8313763 |
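The columns in the table above have base R counterparts. The sketch below is not the lecture's code: the toyota data is not reproduced here, so the data frame `cars_sim` and its values are simulated and hypothetical. It fits a simple linear regression of log10(price) on year with lm() and computes the quantities that correspond to .fitted, .hat, .sigma, .cooksd and .std.resid.

```r
# Simulated stand-in for the data; names and values are assumptions
set.seed(1)
cars_sim <- data.frame(year = sample(2010:2020, 200, replace = TRUE))
cars_sim$price <- 10^(3.2 + 0.05 * (cars_sim$year - 2000) +
                        rnorm(200, sd = 0.15))

# Simple linear regression of log10(price) on year
fit <- lm(log10(price) ~ year, data = cars_sim)
coef(fit)                      # estimates of beta_0 and beta_1

# Observation-level summaries, one row per observation
diagnostics <- data.frame(
  .fitted    = fitted(fit),            # fitted values
  .hat       = hatvalues(fit),         # leverage
  .sigma     = influence(fit)$sigma,   # residual SD with observation i dropped
  .cooksd    = cooks.distance(fit),    # Cook's distance
  .std.resid = rstandard(fit)          # standardised residuals
)
head(diagnostics)
```

With the broom package loaded, `augment(fit)` should return columns with these names alongside the model variables, and `tidy(fit)` the coefficient estimates.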
is a generic (S3) function that works for many kinds of model objects
Use rsample to split the data into the training and testing data:
library(rsample)
set.seed(1) # to replicate the result
toyota_split <- initial_split(toyota, prop = 0.75)
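To make explicit what such a split does, the same kind of 75/25 random split can be sketched in base R. The data frame `dat` below is simulated and hypothetical. With rsample, the two parts of a split object are extracted with `training()` and `testing()`.

```r
set.seed(1)                                    # to replicate the result
dat <- data.frame(id = 1:200, x = rnorm(200))  # made-up data

# Randomly pick 75% of the row indices for training
n_train   <- floor(0.75 * nrow(dat))
train_idx <- sample(nrow(dat), size = n_train)

dat_train <- dat[train_idx, ]
dat_test  <- dat[-train_idx, ]                 # the remaining 25%

c(train = nrow(dat_train), test = nrow(dat_test))
```

Every observation lands in exactly one of the two sets, so the testing data plays no part in fitting the models.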
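reg - a simple linear regression model,
tree - a regression tree, and
knn - a k-nearest neighbour
# Predict the test set price with each fitted model
# (imap_dfr() is from purrr; select() and mutate() are from dplyr)
results_test <- imap_dfr(model_fits, ~{
  toyota_test %>%
    select(year, price) %>%
    mutate(.model = .y,
           .pred = 10^predict(.x, .)) # back-transform from the log10 scale
})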
year | price | .model | .pred |
2017 | 15995 | reg | 11543.68 |
2017 | 17498 | reg | 11543.68 |
2017 | 18995 | reg | 11543.68 |
2020 | 27998 | reg | 16522.75 |
2016 | 13990 | reg | 10243.10 |
2017 | 17990 | reg | 11543.68 |
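The pattern behind this table is: loop over a named list of fitted models, predict on the test set, undo the log10 transformation, and stack the results with the model name in a .model column. A base R sketch of that pattern follows. The data are simulated, and the two models (a linear and a quadratic regression) are stand-ins chosen to avoid extra packages; they are not the reg, tree and knn models of the lecture.

```r
set.seed(1)
# Simulated training and test data; names and values are made up
make_data <- function(n) {
  year <- sample(2010:2020, n, replace = TRUE)
  data.frame(year = year,
             price = 10^(3.2 + 0.05 * (year - 2000) + rnorm(n, sd = 0.1)))
}
train <- make_data(150)
test  <- make_data(50)

# Named list of models fitted on the log10 scale (hypothetical stand-ins)
model_fits <- list(
  linear    = lm(log10(price) ~ year, data = train),
  quadratic = lm(log10(price) ~ poly(year, 2), data = train)
)

# For each model, predict on the test set and undo the log10 transform
results_list <- Map(function(fit, name) {
  data.frame(year   = test$year,
             price  = test$price,
             .model = name,
             .pred  = 10^predict(fit, newdata = test))
}, model_fits, names(model_fits))
results_test <- do.call(rbind, results_list)
head(results_test)
```

Because the models are fitted to log10(price), the predictions must be raised to the power of 10 before they are compared with the observed prices.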
# Predictions from each model over the observed range of years
results_pred <- imap_dfr(model_fits, ~{
  tibble(year = seq(min(toyota$year), max(toyota$year))) %>%
    mutate(.pred = 10^predict(.x, .),
           .model = .y)
})
# Training data with each model's predictions overlaid
gres <- ggplot(toyota_train, aes(year, price)) +
  geom_point(alpha = 0.5) +
  geom_line(data = results_pred, aes(y = .pred, color = .model)) +
  scale_color_manual(values = c("#C8008F", "#006DAE", "#008A25")) +
  scale_y_log10(labels = scales::dollar_format(prefix = "€", accuracy = 1)) +
  theme(legend.position = "bottom") +
  labs(title = "Training data", y = "", color = "") +
  guides(color = "none")
.model | rmse | mae | mape | mpe | rsq |
knn | 5723.4 | 4332.7 | 36.6 | -14.2 | 0.175 |
reg | 5761.3 | 4198.7 | 33.2 | -6.9 | 0.171 |
tree | 5691.4 | 4141.4 | 33.1 | -7.3 | 0.190 |
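The column names in this table suggest the following measures, sketched here in base R on made-up numbers. The definitions are assumptions: in particular, the sign convention for mpe (observed minus predicted) and rsq as the squared correlation between observed and predicted values are common choices, not ones confirmed by the lecture.

```r
# Hypothetical observed and predicted values
truth    <- c(100, 200, 300, 400)
estimate <- c(110, 190, 330, 380)

error <- truth - estimate

metrics <- c(
  rmse = sqrt(mean(error^2)),             # root mean squared error
  mae  = mean(abs(error)),                # mean absolute error
  mape = mean(abs(error / truth)) * 100,  # mean absolute percentage error
  mpe  = mean(error / truth) * 100,       # mean percentage error
  rsq  = cor(truth, estimate)^2           # squared correlation
)
round(metrics, 3)
```

For rmse, mae, mape and mpe a value closer to zero indicates better predictions, whereas for rsq a value closer to one is better.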
ETC3250/5250 Week 1