Introduction to Machine Learning
Lecturer: Emi Tanaka
Department of Econometrics and Business Statistics
We cover the following methods:
Pricing a second-hand 2004 Toyota Yaris car
For now, don't worry about how these prices are calculated; we will learn this later.
Method | Predicted price |
---|---|
Linear regression | €2,502.19 |
Regression tree | €2,172.45 |
k-nearest neighbour | €2,025.40 |
Neural network | €2,130.74 |
Image from American Cancer Society website
Image from Street et al. (1993), Nuclear feature extraction for breast tumor diagnosis, Biomedical Image Processing and Biomedical Visualization, 1905. https://doi.org/10.1117/12.148698
Again, don't worry about how this is calculated; we will cover this in Week 4.
People

- ID: Customer's unique identifier
- Year_Birth: Customer's birth year
- Education: Customer's education level
- Marital_Status: Customer's marital status
- Income: Customer's yearly household income
- Kidhome: Number of children in customer's household
- Teenhome: Number of teenagers in customer's household
- Dt_Customer: Date of customer's enrollment with the company
- Recency: Number of days since customer's last purchase
- Complain: 1 if the customer complained in the last 2 years, 0 otherwise

Products

- MntWines: Amount spent on wine in last 2 years
- MntFruits: Amount spent on fruits in last 2 years
- MntMeatProducts: Amount spent on meat in last 2 years
- MntFishProducts: Amount spent on fish in last 2 years
- MntSweetProducts: Amount spent on sweets in last 2 years
- MntGoldProds: Amount spent on gold in last 2 years

Promotion

- NumDealsPurchases: Number of purchases made with a discount
- AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise
- AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
- AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
- AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise
- AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise
- Response: 1 if customer accepted the offer in the last campaign, 0 otherwise

Place

- NumWebPurchases: Number of purchases made through the company's website
- NumCatalogPurchases: Number of purchases made using a catalogue
- NumStorePurchases: Number of purchases made directly in stores
- NumWebVisitsMonth: Number of visits to company's website in the last month

We assume the response y is related to the features through an unknown function f plus an error term e:

y = f(x_1, x_2, \ldots, x_p) + e.
Consider a multiple linear regression:
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + e_i,\quad\text{for }i = 1, \ldots, n
Notice that we do not use the subscript i when referring to a generic variable; we use i to index observations (i = 1, \ldots, n) and j to index features (j = 1, \ldots, p). Defining x_{i0} = 1 for all i, we can write the model compactly as

y_i = \beta_0 + \sum_{j=1}^p\beta_jx_{ij} + e_i = \sum_{j=0}^p\beta_jx_{ij} + e_i,\quad\text{for }i = 1, \ldots, n
\boldsymbol{y} = \mathbf{X}_{n\times (p + 1)}\boldsymbol{\beta} + \boldsymbol{e}, \quad\text{assuming } \boldsymbol{e}\sim N(\boldsymbol{0}, \sigma^2\mathbf{I}_{n\times n}).
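To make the matrix form concrete, here is a minimal R sketch on simulated data (all values are illustrative assumptions) computing the least-squares estimate \hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\boldsymbol{y}:

# Simulate a small dataset and estimate beta via the matrix form (illustration only)
set.seed(42)
n <- 100
X <- cbind(1, x1 = rnorm(n), x2 = rnorm(n))  # n x (p + 1); first column is the intercept
beta <- c(1, 2, -0.5)                        # "true" coefficients for the simulation
y <- X %*% beta + rnorm(n, sd = 0.3)         # y = X beta + e
solve(t(X) %*% X) %*% t(X) %*% y             # least-squares estimate of beta
coef(lm(y ~ X[, -1]))                        # lm() returns the same estimates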
\begin{bmatrix}2 & 3 \\4 & 3 \\ 1 & 1\end{bmatrix}\qquad\begin{bmatrix}-3 & 3 \\2 & 9 \end{bmatrix}\qquad\begin{bmatrix}1 & 3 & -1 \\ -4 & 0 & 0\end{bmatrix}
\mathbf{X}_{n\times p} = \begin{bmatrix}x_{11} & x_{12} & \cdots & x_{1p}\\ x_{21} & x_{22} & \cdots & x_{2p}\\ \vdots & \vdots & \ddots & \vdots\\ x_{n1} & x_{n2} & \cdots & x_{np}\end{bmatrix}
\mathbf{D} = \text{diag}(d_{1}, d_{2}, d_{3}, d_4) = \begin{bmatrix} d_{1} & 0 & 0 & 0\\0 & d_{2} & 0& 0\\ 0 & 0 & d_3 & 0 \\ 0 & 0 & 0 & d_4 \end{bmatrix}\qquad \mathbf{I}_3 = \begin{bmatrix}1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\boldsymbol{y} = \begin{bmatrix}y_1 \\ y_2\\ \vdots \\ y_m\end{bmatrix}
\mathbf{X} = \begin{bmatrix}x_{11} & x_{12} & x_{13}\\ x_{21} & x_{22} & x_{23}\end{bmatrix}\qquad \mathbf{X}^\top = \begin{bmatrix}x_{11} & x_{21} \\ x_{12} & x_{22} \\ x_{13} & x_{23} \end{bmatrix}
\boldsymbol{y} = \begin{bmatrix}y_1\\y_2 \\ \vdots\\ y_n\\\end{bmatrix}\qquad\boldsymbol{y}^\top = \begin{bmatrix}y_1 & y_2 & \cdots & y_n \end{bmatrix}
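These objects map directly onto R's matrix functions; a minimal sketch, using the third example matrix above:

X <- matrix(c(1, -4, 3, 0, -1, 0), nrow = 2)  # 2 x 3 matrix, filled column by column
t(X)                                          # its 3 x 2 transpose
diag(c(2, 5, 7))                              # a diagonal matrix
diag(3)                                       # the 3 x 3 identity matrix I_3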
\begin{align*} \text{MSE} = E\left[(y-\hat{y})^2\right] &= E\left[\left(f(x_1, \ldots, x_p) + e - \hat{f}(x_1, \ldots, x_p)\right)^2\right]\\ &= \underbrace{E\left[\left(f(x_1, \ldots, x_p) - \hat{f}(x_1, \ldots, x_p)\right)^2\right]}_{\text{reducible}} + \underbrace{\text{var}(e)}_{\text{irreducible}} \\ &= \text{bias}\left[\hat{f}(x)\right]^2 + \text{var}\left[\hat{f}(x)\right] + \text{var}(e) \end{align*}

The cross term vanishes in the second line because E(e) = 0 and e is independent of \hat{f}.
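A small simulation makes the decomposition tangible. In this sketch the true f, the error standard deviation and the evaluation point are all assumptions chosen for illustration:

# Estimate bias^2 and variance of f-hat at a fixed point x0 by repeatedly
# simulating training sets from a known f (illustration only)
set.seed(1)
f <- function(x) sin(2 * x)  # assumed "true" f
sigma <- 0.3                 # sd of the irreducible error e
x0 <- 0.5                    # point at which f-hat is evaluated
preds <- replicate(1000, {
  x <- runif(50, 0, 2)
  y <- f(x) + rnorm(50, sd = sigma)
  fit <- lm(y ~ x)           # a deliberately simple (hence biased) model
  predict(fit, newdata = data.frame(x = x0))
})
c(bias2       = (mean(preds) - f(x0))^2,  # squared bias of f-hat(x0)
  variance    = var(preds),               # variance of f-hat(x0)
  irreducible = sigma^2)                  # var(e)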
Image from Emi’s blog
Some measures of prediction accuracy include the following (note: a lower magnitude is better for the error measures, while a higher R^2 is better):
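With \hat{y}_i denoting the prediction for observation i, the standard definitions are:

\begin{align*} \text{RMSE} &= \sqrt{\frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2} & \text{MAE} &= \frac{1}{n}\sum_{i=1}^n \left|y_i - \hat{y}_i\right| \\ \text{MAPE} &= \frac{100}{n}\sum_{i=1}^n \left|\frac{y_i - \hat{y}_i}{y_i}\right| & \text{MPE} &= \frac{100}{n}\sum_{i=1}^n \frac{y_i - \hat{y}_i}{y_i} \end{align*}

We illustrate these measures on a dataset of used Toyota car listings: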
Rows: 6,738
Columns: 9
$ model <chr> "GT86", "GT86", "GT86", "GT86", "GT86", "GT86", "GT86", "…
$ year <dbl> 2016, 2017, 2015, 2017, 2017, 2017, 2017, 2017, 2020, 201…
$ price <dbl> 16000, 15995, 13998, 18998, 17498, 15998, 18522, 18995, 2…
$ transmission <chr> "Manual", "Manual", "Manual", "Manual", "Manual", "Manual…
$ mileage <dbl> 24089, 18615, 27469, 14736, 36284, 26919, 10456, 12340, 5…
$ fuelType <chr> "Petrol", "Petrol", "Petrol", "Petrol", "Petrol", "Petrol…
$ tax <dbl> 265, 145, 265, 150, 145, 260, 145, 145, 150, 265, 265, 14…
$ mpg <dbl> 36.2, 36.2, 36.2, 36.2, 36.2, 36.2, 36.2, 36.2, 33.2, 36.…
$ engineSize <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
\log_{10}(\texttt{price}_i) = \beta_0 + \beta_1 \texttt{year}_i + e_i
We fit this model with lm() in R, and use the broom package to extract the parameter estimates and diagnostics in a consistent way across many kinds of models:

log10(price) | year | .fitted | .hat | .sigma | .cooksd | .std.resid |
---|---|---|---|---|---|---|
4.204120 | 2016 | 4.011465 | 0.0001655 | 0.1702762 | 0.0001060 | 1.1314938 |
4.203984 | 2017 | 4.062510 | 0.0001504 | 0.1702836 | 0.0000519 | 0.8308979 |
4.146066 | 2015 | 3.960421 | 0.0002418 | 0.1702773 | 0.0001438 | 1.0903664 |
4.278708 | 2017 | 4.062510 | 0.0001504 | 0.1702720 | 0.0001212 | 1.2697597 |
4.242988 | 2017 | 4.062510 | 0.0001504 | 0.1702781 | 0.0000845 | 1.0599745 |
4.204066 | 2017 | 4.062510 | 0.0001504 | 0.1702836 | 0.0000520 | 0.8313763 |
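This diagnostic table matches the output of broom's augment(). A minimal sketch, assuming the data frame is named toyota as in the glimpse output above:

library(broom)
fit <- lm(log10(price) ~ year, data = toyota)
head(augment(fit))  # .fitted, .hat, .sigma, .cooksd, .std.resid as shown above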
predict() is a generic (S3) function that works for many kinds of model objects.

We use the rsample package to split the data into training and testing sets:

library(rsample)
set.seed(1) # to replicate the result
toyota_split <- initial_split(toyota, prop = 0.75)
toyota_split
<Training/Testing/Total>
<5053/1685/6738>
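The training and testing sets are then extracted with rsample's training() and testing(). The code creating model_fits (used below) is not shown on these slides; the following is a minimal sketch of one plausible way to fit the three models, where rpart, caret::knnreg and k = 5 are assumptions rather than the lecture's actual choices:

toyota_train <- training(toyota_split)  # 5,053 rows
toyota_test  <- testing(toyota_split)   # 1,685 rows

library(rpart)
library(caret)
model_fits <- list(
  reg  = lm(log10(price) ~ year, data = toyota_train),            # linear regression
  tree = rpart(log10(price) ~ year, data = toyota_train),         # regression tree
  knn  = knnreg(log10(price) ~ year, data = toyota_train, k = 5)  # k-NN (k assumed)
)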
The three models compared are:

- reg: a simple linear regression model,
- tree: a regression tree, and
- knn: a k-nearest neighbour model.

We then predict price on the test set, back-transforming the predictions from the log10 scale:

library(purrr)
library(dplyr)
results_test <- imap_dfr(model_fits, ~{
  toyota_test %>%
    select(year, price) %>%
    mutate(.model = .y,                 # name of the model in the list
           .pred = 10^predict(.x, .))  # back-transform from log10(price)
})

head(results_test)
year | price | .model | .pred |
---|---|---|---|
2017 | 15995 | reg | 11543.68 |
2017 | 17498 | reg | 11543.68 |
2017 | 18995 | reg | 11543.68 |
2020 | 27998 | reg | 16522.75 |
2016 | 13990 | reg | 10243.10 |
2017 | 17990 | reg | 11543.68 |
results_pred <- imap_dfr(model_fits, ~{
  # predictions over the observed range of year, for plotting
  tibble(year = seq(min(toyota$year), max(toyota$year))) %>%
    mutate(.pred = 10^predict(.x, .),
           .model = .y)
})
library(ggplot2)
gres <- ggplot(toyota_train, aes(year, price)) +
  geom_point(alpha = 0.5) +
  geom_line(data = results_pred, aes(y = .pred, color = .model)) +
  scale_color_manual(values = c("#C8008F", "#006DAE", "#008A25")) +
  scale_y_log10(labels = scales::dollar_format(prefix = "€", accuracy = 1)) +
  theme(legend.position = "bottom") +
  labs(title = "Training data", y = "", color = "") +
  guides(color = "none")
gres
.model | rmse | mae | mape | mpe | rsq |
---|---|---|---|---|---|
knn | 5723.4 | 4332.7 | 36.6 | -14.2 | 0.175 |
reg | 5761.3 | 4198.7 | 33.2 | -6.9 | 0.171 |
tree | 5691.4 | 4141.4 | 33.1 | -7.3 | 0.190 |
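A minimal sketch of how such a summary could be computed with the yardstick package (an assumption; yardstick returns the measures in long form, whereas the table above is in wide form):

library(yardstick)
library(dplyr)
price_metrics <- metric_set(rmse, mae, mape, mpe, rsq)
results_test %>%
  group_by(.model) %>%
  price_metrics(truth = price, estimate = .pred)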
Image from Emi’s blog
ETC3250/5250 Week 1