Introduction to Machine Learning
Lecturer: Emi Tanaka
Department of Econometrics and Business Statistics
We cover the following methods:
Pricing a second-hand 2004 Toyota Yaris car
For now, don't worry about how these prices are calculated; we will learn this later.
Method | Predicted price |
---|---|
Linear regression | €2,502.19 |
Regression tree | €2,172.45 |
k-nearest neighbour | €2,025.40 |
Neural network | €2,130.74 |
Image from American Cancer Society website
Image from Street et al. (1993), Nuclear feature extraction for breast tumor diagnosis, Biomedical Image Processing and Biomedical Visualization, 1905. https://doi.org/10.1117/12.148698
Again, don't worry about how this is calculated; we will cover this in Week 4.
People

- ID: Customer's unique identifier
- Year_Birth: Customer's birth year
- Education: Customer's education level
- Marital_Status: Customer's marital status
- Income: Customer's yearly household income
- Kidhome: Number of children in customer's household
- Teenhome: Number of teenagers in customer's household
- Dt_Customer: Date of customer's enrollment with the company
- Recency: Number of days since customer's last purchase
- Complain: 1 if the customer complained in the last 2 years, 0 otherwise

Products

- MntWines: Amount spent on wine in last 2 years
- MntFruits: Amount spent on fruits in last 2 years
- MntMeatProducts: Amount spent on meat in last 2 years
- MntFishProducts: Amount spent on fish in last 2 years
- MntSweetProducts: Amount spent on sweets in last 2 years
- MntGoldProds: Amount spent on gold in last 2 years

Promotion

- NumDealsPurchases: Number of purchases made with a discount
- AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise
- AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
- AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
- AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise
- AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise
- Response: 1 if customer accepted the offer in the last campaign, 0 otherwise

Place

- NumWebPurchases: Number of purchases made through the company's website
- NumCatalogPurchases: Number of purchases made using a catalogue
- NumStorePurchases: Number of purchases made directly in stores
- NumWebVisitsMonth: Number of visits to company's website in the last month

We assume the response y is related to the features through an unknown function f plus an error term e:

y = f(x_1, x_2, \ldots, x_p) + e.
Consider a multiple linear regression:
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + e_i,\quad\text{for }i = 1, \ldots, n
Notice that we do not use the subscript i when referring to a generic variable; we use i to index observations (i = 1, \ldots, n) and j to index features (j = 1, \ldots, p). Defining x_{i0} = 1 for all i, we can write the model compactly as

y_i = \beta_0 + \sum_{j=1}^p\beta_jx_{ij} + e_i = \sum_{j=0}^p\beta_jx_{ij} + e_i,\quad\text{for }i = 1, \ldots, n
\boldsymbol{y} = \mathbf{X}_{n\times (p + 1)}\boldsymbol{\beta} + \boldsymbol{e}, \quad\text{assuming } \boldsymbol{e}\sim N(\boldsymbol{0}, \sigma^2\mathbf{I}_{n\times n}).
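To make the matrix form concrete, here is a minimal R sketch on simulated data (all values are illustrative assumptions) computing the least-squares estimate \hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\boldsymbol{y}:

# Simulate a small dataset and estimate beta via the matrix form (illustration only)
set.seed(42)
n <- 100
X <- cbind(1, x1 = rnorm(n), x2 = rnorm(n))  # n x (p + 1); first column is the intercept
beta <- c(1, 2, -0.5)                        # "true" coefficients for the simulation
y <- X %*% beta + rnorm(n, sd = 0.3)         # y = X beta + e
solve(t(X) %*% X) %*% t(X) %*% y             # least-squares estimate of beta
coef(lm(y ~ X[, -1]))                        # lm() returns the same estimates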
\begin{bmatrix}2 & 3 \\4 & 3 \\ 1 & 1\end{bmatrix}\qquad\begin{bmatrix}-3 & 3 \\2 & 9 \end{bmatrix}\qquad\begin{bmatrix}1 & 3 & -1 \\ -4 & 0 & 0\end{bmatrix}
\mathbf{X}_{n\times p} = \begin{bmatrix}x_{11} & x_{12} & \cdots & x_{1p}\\ x_{21} & x_{22} & \cdots & x_{2p}\\ \vdots & \vdots & \ddots & \vdots\\ x_{n1} & x_{n2} & \cdots & x_{np}\end{bmatrix}
\mathbf{D} = \text{diag}(d_{1}, d_{2}, d_{3}, d_4) = \begin{bmatrix} d_{1} & 0 & 0 & 0\\0 & d_{2} & 0& 0\\ 0 & 0 & d_3 & 0 \\ 0 & 0 & 0 & d_4 \end{bmatrix}\qquad \mathbf{I}_3 = \begin{bmatrix}1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\boldsymbol{y} = \begin{bmatrix}y_1 \\ y_2\\ \vdots \\ y_m\end{bmatrix}
\mathbf{X} = \begin{bmatrix}x_{11} & x_{12} & x_{13}\\ x_{21} & x_{22} & x_{23}\end{bmatrix}\qquad \mathbf{X}^\top = \begin{bmatrix}x_{11} & x_{21} \\ x_{12} & x_{22} \\ x_{13} & x_{23} \end{bmatrix}
\boldsymbol{y} = \begin{bmatrix}y_1\\y_2 \\ \vdots\\ y_n\\\end{bmatrix}\qquad\boldsymbol{y}^\top = \begin{bmatrix}y_1 & y_2 & \cdots & y_n \end{bmatrix}
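These objects map directly onto R's matrix functions; a minimal sketch, using the third example matrix above:

X <- matrix(c(1, -4, 3, 0, -1, 0), nrow = 2)  # 2 x 3 matrix, filled column by column
t(X)                                          # its 3 x 2 transpose
diag(c(2, 5, 7))                              # a diagonal matrix
diag(3)                                       # the 3 x 3 identity matrix I_3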
\begin{align*} \text{MSE} = E\left[(y-\hat{y})^2\right] &= E\left[\left(f(x_1, \ldots, x_p) + e - \hat{f}(x_1, \ldots, x_p)\right)^2\right]\\ &= \underbrace{E\left[\left(f(x_1, \ldots, x_p) - \hat{f}(x_1, \ldots, x_p)\right)^2\right]}_{\text{reducible}} + \underbrace{\text{var}(e)}_{\text{irreducible}} \\ &= \text{bias}\left[\hat{f}(x)\right]^2 + \text{var}\left[\hat{f}(x)\right] + \text{var}(e) \end{align*}

The cross term vanishes in the second line because E(e) = 0 and e is independent of \hat{f}.
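A small simulation makes the decomposition tangible. In this sketch the true f, the error standard deviation and the evaluation point are all assumptions chosen for illustration:

# Estimate bias^2 and variance of f-hat at a fixed point x0 by repeatedly
# simulating training sets from a known f (illustration only)
set.seed(1)
f <- function(x) sin(2 * x)  # assumed "true" f
sigma <- 0.3                 # sd of the irreducible error e
x0 <- 0.5                    # point at which f-hat is evaluated
preds <- replicate(1000, {
  x <- runif(50, 0, 2)
  y <- f(x) + rnorm(50, sd = sigma)
  fit <- lm(y ~ x)           # a deliberately simple (hence biased) model
  predict(fit, newdata = data.frame(x = x0))
})
c(bias2       = (mean(preds) - f(x0))^2,  # squared bias of f-hat(x0)
  variance    = var(preds),               # variance of f-hat(x0)
  irreducible = sigma^2)                  # var(e)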
Image from Emi’s blog
Some measures of prediction accuracy include the following (note: a lower magnitude is better for the error measures, while a higher R^2 is better):
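With \hat{y}_i denoting the prediction for observation i, the standard definitions are:

\begin{align*} \text{RMSE} &= \sqrt{\frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2} & \text{MAE} &= \frac{1}{n}\sum_{i=1}^n \left|y_i - \hat{y}_i\right| \\ \text{MAPE} &= \frac{100}{n}\sum_{i=1}^n \left|\frac{y_i - \hat{y}_i}{y_i}\right| & \text{MPE} &= \frac{100}{n}\sum_{i=1}^n \frac{y_i - \hat{y}_i}{y_i} \end{align*}

We illustrate these measures on a dataset of used Toyota car listings: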
Rows: 6,738
Columns: 9
$ model <chr> "GT86", "GT86", "GT86", "GT86", "GT86", "GT86", "GT86", "…
$ year <dbl> 2016, 2017, 2015, 2017, 2017, 2017, 2017, 2017, 2020, 201…
$ price <dbl> 16000, 15995, 13998, 18998, 17498, 15998, 18522, 18995, 2…
$ transmission <chr> "Manual", "Manual", "Manual", "Manual", "Manual", "Manual…
$ mileage <dbl> 24089, 18615, 27469, 14736, 36284, 26919, 10456, 12340, 5…
$ fuelType <chr> "Petrol", "Petrol", "Petrol", "Petrol", "Petrol", "Petrol…
$ tax <dbl> 265, 145, 265, 150, 145, 260, 145, 145, 150, 265, 265, 14…
$ mpg <dbl> 36.2, 36.2, 36.2, 36.2, 36.2, 36.2, 36.2, 36.2, 33.2, 36.…
$ engineSize <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
\log_{10}(\texttt{price}_i) = \beta_0 + \beta_1 \texttt{year}_i + e_i
We fit this model with lm() in R, and use the broom package to extract the parameter estimates and diagnostics in a consistent way across many kinds of models:

log10(price) | year | .fitted | .hat | .sigma | .cooksd | .std.resid |
---|---|---|---|---|---|---|
4.204120 | 2016 | 4.011465 | 0.0001655 | 0.1702762 | 0.0001060 | 1.1314938 |
4.203984 | 2017 | 4.062510 | 0.0001504 | 0.1702836 | 0.0000519 | 0.8308979 |
4.146066 | 2015 | 3.960421 | 0.0002418 | 0.1702773 | 0.0001438 | 1.0903664 |
4.278708 | 2017 | 4.062510 | 0.0001504 | 0.1702720 | 0.0001212 | 1.2697597 |
4.242988 | 2017 | 4.062510 | 0.0001504 | 0.1702781 | 0.0000845 | 1.0599745 |
4.204066 | 2017 | 4.062510 | 0.0001504 | 0.1702836 | 0.0000520 | 0.8313763 |
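This diagnostic table matches the output of broom's augment(). A minimal sketch, assuming the data frame is named toyota as in the glimpse output above:

library(broom)
fit <- lm(log10(price) ~ year, data = toyota)
head(augment(fit))  # .fitted, .hat, .sigma, .cooksd, .std.resid as shown above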
predict() is a generic (S3) function that works for many kinds of model objects.

We use the rsample package to split the data into training and testing sets:

library(rsample)
set.seed(1) # to replicate the result
toyota_split <- initial_split(toyota, prop = 0.75)
toyota_split
<Training/Testing/Total>
<5053/1685/6738>
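The training and testing sets are then extracted with rsample's training() and testing(). The code creating model_fits (used below) is not shown on these slides; the following is a minimal sketch of one plausible way to fit the three models, where rpart, caret::knnreg and k = 5 are assumptions rather than the lecture's actual choices:

toyota_train <- training(toyota_split)  # 5,053 rows
toyota_test  <- testing(toyota_split)   # 1,685 rows

library(rpart)
library(caret)
model_fits <- list(
  reg  = lm(log10(price) ~ year, data = toyota_train),            # linear regression
  tree = rpart(log10(price) ~ year, data = toyota_train),         # regression tree
  knn  = knnreg(log10(price) ~ year, data = toyota_train, k = 5)  # k-NN (k assumed)
)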
The three models compared are:

- reg: a simple linear regression model,
- tree: a regression tree, and
- knn: a k-nearest neighbour model.

We then predict price on the test set, back-transforming the predictions from the log10 scale:

library(purrr)
library(dplyr)
results_test <- imap_dfr(model_fits, ~{
  toyota_test %>%
    select(year, price) %>%
    mutate(.model = .y,                 # name of the model in the list
           .pred = 10^predict(.x, .))  # back-transform from log10(price)
})

head(results_test)
year | price | .model | .pred |
---|---|---|---|
2017 | 15995 | reg | 11543.68 |
2017 | 17498 | reg | 11543.68 |
2017 | 18995 | reg | 11543.68 |
2020 | 27998 | reg | 16522.75 |
2016 | 13990 | reg | 10243.10 |
2017 | 17990 | reg | 11543.68 |
results_pred <- imap_dfr(model_fits, ~{
  # predictions over the observed range of year, for plotting
  tibble(year = seq(min(toyota$year), max(toyota$year))) %>%
    mutate(.pred = 10^predict(.x, .),
           .model = .y)
})
library(ggplot2)
gres <- ggplot(toyota_train, aes(year, price)) +
  geom_point(alpha = 0.5) +
  geom_line(data = results_pred, aes(y = .pred, color = .model)) +
  scale_color_manual(values = c("#C8008F", "#006DAE", "#008A25")) +
  scale_y_log10(labels = scales::dollar_format(prefix = "€", accuracy = 1)) +
  theme(legend.position = "bottom") +
  labs(title = "Training data", y = "", color = "") +
  guides(color = "none")
gres
.model | rmse | mae | mape | mpe | rsq |
---|---|---|---|---|---|
knn | 5723.4 | 4332.7 | 36.6 | -14.2 | 0.175 |
reg | 5761.3 | 4198.7 | 33.2 | -6.9 | 0.171 |
tree | 5691.4 | 4141.4 | 33.1 | -7.3 | 0.190 |
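A minimal sketch of how such a summary could be computed with the yardstick package (an assumption; yardstick returns the measures in long form, whereas the table above is in wide form):

library(yardstick)
library(dplyr)
price_metrics <- metric_set(rmse, mae, mape, mpe, rsq)
results_test %>%
  group_by(.model) %>%
  price_metrics(truth = price, estimate = .pred)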
Image from Emi’s blog
ETC3250/5250 Week 1