Introduction to Machine Learning
Lecturer: Emi Tanaka
Department of Econometrics and Business Statistics
We cover the following methods:
Pricing a second-hand 2004 Toyota Yaris car
For now don’t worry how this is calculated - we will learn this later.
Method | Price |
---|---|
Linear regression | €2,502.19 |
Regression tree | €2,172.45 |
k-nearest neighbour | €2,025.40 |
Neural network | €2,130.74 |
Image from American Cancer Society website
Image from Street et al. (1993) Nuclear feature extraction for breast tumor diagnosis. Biomedical Image Processing and Biomedical Visualization 1905 https://doi.org/10.1117/12.148698
Again don’t worry how this is calculated - we will cover this in Week 4.
People

- `ID`: Customer's unique identifier
- `Year_Birth`: Customer's birth year
- `Education`: Customer's education level
- `Marital_Status`: Customer's marital status
- `Income`: Customer's yearly household income
- `Kidhome`: Number of children in customer's household
- `Teenhome`: Number of teenagers in customer's household
- `Dt_Customer`: Date of customer's enrollment with the company
- `Recency`: Number of days since customer's last purchase
- `Complain`: 1 if the customer complained in the last 2 years, 0 otherwise

Products

- `MntWines`: Amount spent on wine in last 2 years
- `MntFruits`: Amount spent on fruits in last 2 years
- `MntMeatProducts`: Amount spent on meat in last 2 years
- `MntFishProducts`: Amount spent on fish in last 2 years
- `MntSweetProducts`: Amount spent on sweets in last 2 years
- `MntGoldProds`: Amount spent on gold in last 2 years

Promotion

- `NumDealsPurchases`: Number of purchases made with a discount
- `AcceptedCmp1`: 1 if customer accepted the offer in the 1st campaign, 0 otherwise
- `AcceptedCmp2`: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
- `AcceptedCmp3`: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
- `AcceptedCmp4`: 1 if customer accepted the offer in the 4th campaign, 0 otherwise
- `AcceptedCmp5`: 1 if customer accepted the offer in the 5th campaign, 0 otherwise
- `Response`: 1 if customer accepted the offer in the last campaign, 0 otherwise

Place

- `NumWebPurchases`: Number of purchases made through the company's website
- `NumCatalogPurchases`: Number of purchases made using a catalogue
- `NumStorePurchases`: Number of purchases made directly in stores
- `NumWebVisitsMonth`: Number of visits to company's website in the last month

$$y = f(x_1, x_2, \dots, x_p) + e.$$
Consider a multiple linear regression:
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + e_i, \quad \text{for } i = 1, \dots, n$$
Notice we do not use the subscript $i$ when we refer to a generic variable; we use $i$ for observations and $j$ for features.
$$y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + e_i = \sum_{j=0}^{p} \beta_j x_{ij} + e_i, \quad \text{for } i = 1, \dots, n,$$

where $x_{i0} = 1$ for all $i$.
$$\mathbf{y} = \underset{n \times (p+1)}{\mathbf{X}}\,\boldsymbol{\beta} + \mathbf{e}, \quad \text{assuming } \mathbf{e} \sim N(\mathbf{0}, \sigma^2 \mathbf{I}_{n \times n}).$$
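From the matrix form, the least squares estimator is $\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$. As a quick base-R sketch (illustrated on the built-in `mtcars` data, not the unit's data), solving the normal equations directly matches what `lm()` returns:

```r
# Least squares via the normal equations, checked against lm()
y <- mtcars$mpg
X <- cbind(1, mtcars$wt, mtcars$hp)  # n x (p + 1): first column of 1s for the intercept

# beta-hat = (X'X)^{-1} X'y, computed via solve() on the normal equations
beta_hat <- solve(crossprod(X), crossprod(X, y))

fit <- lm(mpg ~ wt + hp, data = mtcars)
all.equal(as.numeric(beta_hat), unname(coef(fit)))  # TRUE
```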
$$\begin{bmatrix} 2 & 3 \\ 4 & 3 \\ 1 & 1 \end{bmatrix} \qquad \begin{bmatrix} -3 & 2 \\ 3 & 9 \end{bmatrix} \qquad \begin{bmatrix} 1 & 3 & -1 \\ -4 & 0 & 0 \end{bmatrix}$$
$$\underset{n \times p}{\mathbf{X}} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}$$

$$\mathbf{D} = \text{diag}(d_1, d_2, d_3, d_4) = \begin{bmatrix} d_1 & 0 & 0 & 0 \\ 0 & d_2 & 0 & 0 \\ 0 & 0 & d_3 & 0 \\ 0 & 0 & 0 & d_4 \end{bmatrix} \qquad \mathbf{I}_3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

$$\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}$$

$$\mathbf{X} = \begin{bmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \end{bmatrix} \qquad \mathbf{X}^\top = \begin{bmatrix} x_{11} & x_{21} \\ x_{12} & x_{22} \\ x_{13} & x_{23} \end{bmatrix}$$

$$\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \qquad \mathbf{y}^\top = \begin{bmatrix} y_1 & y_2 & \cdots & y_n \end{bmatrix}$$
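These vector and matrix operations map directly onto base R, e.g. `matrix()`, `t()` and `diag()` (the numeric entries below are made up for illustration):

```r
X  <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)  # filled column-wise: a 2 x 3 matrix
Xt <- t(X)                                   # transpose: 3 x 2
D  <- diag(c(2, 5, 9))                       # diagonal matrix from a vector
I3 <- diag(3)                                # 3 x 3 identity matrix
```

Note that `%*%` (not `*`) is matrix multiplication, so `D %*% I3` returns `D` unchanged.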
$$\begin{aligned} \text{MSE} &= E[y - \hat{y}]^2 = E[f(x_1, \dots, x_p) + e - \hat{f}(x_1, \dots, x_p)]^2 \\ &= \underbrace{E[f(x_1, \dots, x_p) - \hat{f}(x_1, \dots, x_p)]^2}_{\text{reducible}} + \underbrace{\text{var}(e)}_{\text{irreducible}} \\ &= \text{bias}[\hat{f}(x)]^2 + \text{var}[\hat{f}(x)] + \text{var}(e) \end{aligned}$$
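The decomposition can be checked numerically with a small simulation. The assumptions below are all hypothetical choices for illustration: true function $f(x) = \sin(x)$, noise sd $\sigma = 0.3$, and a (deliberately biased) straight-line estimator of $f$ at one point $x_0$:

```r
set.seed(2024)
f     <- function(x) sin(x)  # assumed true regression function
sigma <- 0.3                 # assumed noise standard deviation
x0    <- 2.5                 # point at which we predict
M     <- 2000                # number of simulated training sets
n     <- 50                  # size of each training set

# Refit a straight line to M fresh training sets; record fhat(x0) each time
fhat <- replicate(M, {
  x <- runif(n, 0, 3)
  y <- f(x) + rnorm(n, sd = sigma)
  unname(predict(lm(y ~ x), newdata = data.frame(x = x0)))
})

# Monte Carlo estimates of each term in the decomposition
mse   <- mean((f(x0) + rnorm(M, sd = sigma) - fhat)^2)  # E[y - fhat]^2 at x0
bias2 <- (mean(fhat) - f(x0))^2                         # squared bias of fhat(x0)
varf  <- var(fhat)                                      # variance of fhat(x0)
c(mse = mse, decomposition = bias2 + varf + sigma^2)    # approximately equal
```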
Image from Emi’s blog
Some measures include RMSE, MAE, MAPE and MPE (note: lower magnitude is better).
Rows: 6,738
Columns: 9
$ model <chr> "GT86", "GT86", "GT86", "GT86", "GT86", "GT86", "GT86", "…
$ year <dbl> 2016, 2017, 2015, 2017, 2017, 2017, 2017, 2017, 2020, 201…
$ price <dbl> 16000, 15995, 13998, 18998, 17498, 15998, 18522, 18995, 2…
$ transmission <chr> "Manual", "Manual", "Manual", "Manual", "Manual", "Manual…
$ mileage <dbl> 24089, 18615, 27469, 14736, 36284, 26919, 10456, 12340, 5…
$ fuelType <chr> "Petrol", "Petrol", "Petrol", "Petrol", "Petrol", "Petrol…
$ tax <dbl> 265, 145, 265, 150, 145, 260, 145, 145, 150, 265, 265, 14…
$ mpg <dbl> 36.2, 36.2, 36.2, 36.2, 36.2, 36.2, 36.2, 36.2, 33.2, 36.…
$ engineSize <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$$\log_{10}(\text{price}_i) = \beta_0 + \beta_1 \text{year}_i + \epsilon_i$$
We fit this model with `lm()` in R and use the `broom` package to get the estimates of model parameters (and per-observation diagnostics) for many models:

log10(price) | year | .fitted | .hat | .sigma | .cooksd | .std.resid |
---|---|---|---|---|---|---|
4.204120 | 2016 | 4.011465 | 0.0001655 | 0.1702762 | 0.0001060 | 1.1314938 |
4.203984 | 2017 | 4.062510 | 0.0001504 | 0.1702836 | 0.0000519 | 0.8308979 |
4.146066 | 2015 | 3.960421 | 0.0002418 | 0.1702773 | 0.0001438 | 1.0903664 |
4.278708 | 2017 | 4.062510 | 0.0001504 | 0.1702720 | 0.0001212 | 1.2697597 |
4.242988 | 2017 | 4.062510 | 0.0001504 | 0.1702781 | 0.0000845 | 1.0599745 |
4.204066 | 2017 | 4.062510 | 0.0001504 | 0.1702836 | 0.0000520 | 0.8313763 |
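The columns above are observation-level diagnostics (this looks like `broom::augment()` output). The same quantities are also available from base R's `stats` functions; a sketch on the built-in `mtcars` data rather than the toyota data:

```r
fit <- lm(mpg ~ wt, data = mtcars)  # stand-in model; the slides use log10(price) ~ year

head(data.frame(
  .fitted    = fitted(fit),          # fitted values
  .hat       = hatvalues(fit),       # leverage
  .cooksd    = cooks.distance(fit),  # Cook's distance
  .std.resid = rstandard(fit)        # standardised residuals
))
```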
`predict` is a generic (S3) function that works for many kinds of model objects. We use `rsample` to split the data into the training and testing data:

library(rsample)
set.seed(1) # to replicate the result
toyota_split <- initial_split(toyota, prop = 0.75)
toyota_split
<Training/Testing/Total>
<5053/1685/6738>
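`training(toyota_split)` and `testing(toyota_split)` then extract the two sets. As a base-R illustration of what a 75/25 split does (using the built-in `mtcars` data, since the toyota data is not bundled here):

```r
set.seed(1)
n <- nrow(mtcars)
train_idx <- sample(n, size = floor(0.75 * n))  # randomly pick 75% of row indices
train <- mtcars[train_idx, ]                    # training set
test  <- mtcars[-train_idx, ]                   # the remaining 25% is the testing set
c(nrow(train), nrow(test), n)
```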
We fit three models:

- `reg` - a simple linear regression model,
- `tree` - a regression tree, and
- `knn` - a k-nearest neighbour.

results_test <- imap_dfr(model_fits, ~{
  toyota_test %>%
    select(year, price) %>%
    mutate(.model = .y, .pred = 10^predict(.x, .))
})
head(results_test)
year | price | .model | .pred |
---|---|---|---|
2017 | 15995 | reg | 11543.68 |
2017 | 17498 | reg | 11543.68 |
2017 | 18995 | reg | 11543.68 |
2020 | 27998 | reg | 16522.75 |
2016 | 13990 | reg | 10243.10 |
2017 | 17990 | reg | 11543.68 |
results_pred <- imap_dfr(model_fits, ~{
tibble(year = seq(min(toyota$year), max(toyota$year))) %>%
mutate(.pred = 10^predict(.x, .),
.model = .y)
})
gres <- ggplot(toyota_train, aes(year, price)) +
geom_point(alpha = 0.5) +
geom_line(data = results_pred, aes(y = .pred, color = .model)) +
scale_color_manual(values = c("#C8008F", "#006DAE", "#008A25")) +
scale_y_log10(label = scales::dollar_format(prefix = "€", accuracy = 1)) +
theme(legend.position = "bottom") +
labs(title = "Training data", y = "", color = "") +
guides(color = "none")
gres
.model | rmse | mae | mape | mpe | rsq |
---|---|---|---|---|---|
knn | 5723.4 | 4332.7 | 36.6 | -14.2 | 0.175 |
reg | 5761.3 | 4198.7 | 33.2 | -6.9 | 0.171 |
tree | 5691.4 | 4141.4 | 33.1 | -7.3 | 0.190 |
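These accuracy measures can be computed directly from observed and predicted values. A base-R sketch of the definitions (on made-up vectors, not the results above; note that sign conventions for MPE vary, this uses $(y - \hat{y})/y$):

```r
y    <- c(100, 200, 300, 400)   # hypothetical observed values
yhat <- c(110, 190, 330, 380)   # hypothetical predictions

rmse <- sqrt(mean((y - yhat)^2))         # root mean squared error
mae  <- mean(abs(y - yhat))              # mean absolute error
mape <- mean(abs((y - yhat) / y)) * 100  # mean absolute percentage error
mpe  <- mean((y - yhat) / y) * 100       # mean percentage error (signed)
rsq  <- cor(y, yhat)^2                   # squared correlation

c(rmse = rmse, mae = mae, mape = mape, mpe = mpe, rsq = rsq)
```

A negative MPE, as in the table above, indicates the predictions overshoot the observed prices on average.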
ETC3250/5250 Week 1