ETC3250/5250

Introduction to Machine Learning

k-nearest neightbours

Lecturer: Emi Tanaka

Department of Econometrics and Business Statistics

Motivation

When new training data becomes available, the models so far require re-estimation.
Re-estimation can be time consuming.
k-nearest neighbours (kNN) requires no explicit training or model.

Neighbours

An observation j is said to be a neighbour of observation i if its predictor values \boldsymbol{x}_j are similar to the predictor values \boldsymbol{x}_i.
The k-nearest neighbours to observation i are the k observations, with the most similar predictor values to \boldsymbol{x}_i.
But how do we measure similarity?

Distance metrics

Simple example

Consider 3 individuals and 2 predictors (age and income):
- Mr Orange is 37 years old and earns $75K per year.
- Mr Red is 31 years old and earns $67K per year.
- Mr Blue is 30 years old and earns $69K per year.

Which is the nearest neighbour to Mr Red?

Computing distances

For two variables, we can use Pythagoras’ theorem to calculate the distances between individuals.

Distances:
- \sqrt{6^2 + 8^2} = 10
- \sqrt{1^2 + 2^2} \approx 2.2
- \sqrt{7^2 + 6^2} \approx 9.2
Mr Red is closest to Mr Blue.

But how do we compute the distances between observations when there are more than two variables?

Euclidean distance

Suppose \boldsymbol{x}_{i} is the value of the predictors for observation i, \boldsymbol{x}_{j} is the value of the predictors for observation j.

The Euclidean distance (ED) between observations i and j can be computed as

D_{Euclidean}\left(\boldsymbol{x}_{i},\boldsymbol{x}_{j}\right)=\sqrt{\sum\limits_{s=1}^p \left(x_{is}-x_{js}\right)^2}.

Manhattan distance

There are other ways to compute the distances.

Another distance metric is the Manhattan distance, also known as the block distance:

D_{Manhattan}\left(\boldsymbol{x}_{i},\boldsymbol{x}_{j}\right)=\sum\limits_{s=1}^p |x_{is}-x_{js}|.

Chebyshev distance

The Chebyshev distance is the maximum distance between any variables:

D_{Chebyshev}\left(\boldsymbol{x}_{i},\boldsymbol{x}_{j}\right)=\max_{s = 1, \dots, p} |x_{is}-x_{js}|.

Canberra distance

The Canberra distance is a weighted version of the Manhattan distance:

D_{Canberra}\left(\boldsymbol{x}_{i},\boldsymbol{x}_{j}\right)=\sum\limits_{s=1}^p \dfrac{|x_{is}-x_{js}|}{|x_{is}| + |x_{js}|}.

If x_{is} = x_{js} = 0 then it is omitted from the sum.

Jaccard distance

The Jaccard distance, or binary distance, measures the proportion of differences in non-zero elements between two variables out of elements where at least one variable is non-zero:

D_{Jaccard}\left(\boldsymbol{x}_{i},\boldsymbol{x}_{j}\right) = 1 - \dfrac{|A\cap B|}{|A\cup B|}, where

A = \{s~|~x_{is} = 0 \text{ for }s =1, \dots, p\} and
B = \{s~|~x_{js} = 0 \text{ for }s =1, \dots, p\}.

Minkowski distance

The Minkowski distance is a generalisation of the Euclidean distance (q = 2) and Manhattan distance (q = 1):

D_{Minkowski}\left(\boldsymbol{x}_{i},\boldsymbol{x}_{j}\right)=\left(\sum\limits_{s=1}^p \left|x_{is}-x_{js}\right|^q\right)^{\frac{1}{q}}.

Computing distances with R

simple <- data.frame(age = c(37, 31, 30),
                     income = c(75, 67, 69),
                     row.names = c("Orange", "Red", "Blue"))
simple

       age income
Orange  37     75
Red     31     67
Blue    30     69

dist(simple, method = "euclidean")

        Orange       Red
Red  10.000000          
Blue  9.219544  2.236068

Other methods available:

maximum (Chebyshev distance)
manhattan
canberra

binary (Jaccard distance)
minkowski

Standardising the variables

scroll

Data

simple1000 <- simple %>% 
  mutate(income = 1000 * income) 

simple1000

       age income
Orange  37  75000
Red     31  67000
Blue    30  69000

simple1000 %>%  # measure income in $ instead of $1000
  dist(method = "euclidean")

       Orange      Red
Red  8000.002         
Blue 6000.004 2000.000

The units of measurements are now very different across variables.
Here income in dollars contributes greater to the distance than age in years.
We commonly calculate distance after standardising the variables so the variables are in a comparable scale.

dist(scale(simple))

        Orange       Red
Red  2.4907701          
Blue 2.3442542 0.5482123

dist(scale(simple1000))

        Orange       Red
Red  2.4907701          
Blue 2.3442542 0.5482123

k-nearest neighbours

Notations for neighbours

{ \mathcal{N}_i^k} denotes a set of index of k observations with the smallest distance to observation i.
For instance, { \mathcal{N}^2_{10} =\{3, 5\}} indicates that the nearest neighbours for observation 10 are observations 3 and 5.
Alternatively, it can also be defined as {\mathcal{N}^k_i = \{j~|~D\left(\boldsymbol{x}_{i},\boldsymbol{x}_{j}\right)<\epsilon_k\}}.
All observations whose distance to \boldsymbol{x}_i is less than the positive scalar \epsilon_k.
The value \epsilon_k is chosen so that only k neighbours are chosen.

Illustration of neighbours 1

For observation 7 with k=1, we have { \mathcal{N}^1_7} = \{3\}.
Here we use Euclidean distance and \epsilon_k = 0.4.

Illustration of neighbours 2

{ \mathcal{N}_7^3} = \{3, 8, 9\}.

Illustration of neighbours 3

\mathcal{N}_{15}^3 = \{5, 11, 16\}

Illustration of neighbours 4

\mathcal{N}_{8}^3 = \{1, 10, 16\}

Your turn

Can you find the 3-nearest neighbours to observation 9?

The boundary region for Euclidean distance

With one predictor finding the nearest neighbours is finding the k observations inside an interval around x_i: [x_i-\epsilon_k,x_i+\epsilon_k].
With two predictors it is finding the k observations inside a circle around \boldsymbol{x}_i.
With three predictors it is finding the k observations inside a sphere around \boldsymbol{x}_i.
In high dimensions ( { p>3} ) it is finding the k observations inside a hyper-sphere around \boldsymbol{x}_i.

Illustration of neighbours: Manhattan distance

{ \mathcal{N}_7^3} = \{3, 8, 9\}.

Neighbours are selected within the boundary of a tilted square.

Prediction with kNN

Assume that we want to predict a new record with predictor values \boldsymbol{x}_\text{new}.
For classification problems, kNN predicts the outcome class in three steps:
1. Find the k-nearest neighbours \mathcal{N}^k_{\text{new}}.
2. Count how many neighbours belong to class 1 and to class 2.
3. Take as your prediction the class with the majority of votes.

Prediction with kNN: New data

Suppose that the new customer has a (standardized) age 0 with (standardized) income of 0.
Let’s use 5-nearest neighbours to predict whether they can pay their loan.

Prediction with kNN: Step 1

{ \mathcal{N}_{\text{new}}^5 = \{1, 6, 8, 10, 20\}}.

Prediction with kNN: Steps 2 and 3

Step 2:
- 2 votes for paid: 6 and 20.
- 3 votes for default: 1, 8 and 10.

Step 3: The new record is predicted to default on their loan.

Prediction with kNN: Propensity score

We can also think about the proportion of class 1 observations as a propensity score.
The propensity score is: { \text{Pr}\left(y_{\text{new}}=1|\boldsymbol{x}_{\text{new}}\right)= \frac{1}{k}\sum_{\boldsymbol{x}_j\in\mathcal{N}^k_{\text{new}}}I(y_j=1) }
Proportion of neighbours in favor of class 1.

Prediction with kNN: Propensity score calculation

To compute the propensity score of the new record in the example, paid is class 1 and defaulted class 2.
Then \begin{align*}P\left(y_{\text{new}} =1|\boldsymbol{x}_{\text{new}}\right) &= \frac{1}{5} (I(y_1=1)+I(y_6=1)+I(y_{8}=1)\\ &\qquad+I(y_{10}=1)+I(y_{20}=1))\\&=\frac{1}{5}(0 + 1 + 0 + 0 + 1)\\&=0.4.\end{align*}

An application to caravan data

The business problem

The Insurance Company (TIC) Benchmark is interested in increasing their business.
Their salesperson must visit each potential customer.
The company wants to use data on old customers to maximize the insurance purchases.
The predictions from the trained model can help the salesperson spend their time on a customer that is more likely to purchase the caravan insurance policy.

Sales of insurance policy with `caravan` data

scroll

The data contains 5822 customer records.
The full description of data can be found here.
Variables 1 to 43 are sociodemographic. Obtained from zip codes. Customers with the same zip code have identical attributes.
Variables 44 to 85 product ownership data.
Variable 86 (Purchase) indicates whether the customer purchased a caravan insurance policy.

library(tidyverse)
caravan <- read_csv("https://emitanaka.org/iml/data/caravan.csv") %>% 
  mutate(across(-Purchase, scale),
         Purchase = factor(Purchase))
skimr::skim(caravan)

Data summary
Name	caravan
Number of rows	5822
Number of columns	86
_______________________
Column type frequency:
factor	1
numeric	85
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
Purchase	0	1	FALSE	2	No: 5474, Yes: 348

Variable type: numeric

skim_variable	complete_rate	sd	p0	p25	p50	p75	p100	hist
MOSTYPE	1	1	-1.81	-1.11	0.45	0.84	1.30	▅▂▂▂▇
MAANTHUI	1	1	-0.27	-0.27	-0.27	-0.27	21.90	▇▁▁▁▁
MGEMOMV	1	1	-2.13	-0.86	0.41	0.41	2.94	▁▆▇▂▁
MGEMLEEF	1	1	-2.44	-1.22	0.01	0.01	3.69	▅▇▃▁▁
MOSHOOFD	1	1	-1.67	-0.97	0.43	0.78	1.48	▃▃▃▇▃
MGODRK	1	1	-0.69	-0.69	-0.69	0.30	8.28	▇▂▁▁▁
MGODPR	1	1	-2.70	-0.37	0.22	0.80	2.55	▁▂▇▃▁
MGODOV	1	1	-1.05	-1.05	-0.07	0.91	3.86	▇▃▁▁▁
MGODGE	1	1	-2.04	-0.79	-0.16	0.46	3.59	▂▇▇▁▁
MRELGE	1	1	-3.24	-0.62	-0.10	0.43	1.48	▁▁▃▇▃
MRELSA	1	1	-0.91	-0.91	0.12	0.12	6.33	▇▂▁▁▁
MRELOV	1	1	-1.33	-0.75	-0.17	0.41	3.89	▅▇▂▁▁
MFALLEEN	1	1	-1.05	-1.05	0.06	0.62	3.95	▇▆▂▁▁
MFGEKIND	1	1	-1.99	-0.76	-0.14	0.48	3.56	▂▇▆▁▁
MFWEKIND	1	1	-2.14	-0.65	-0.15	0.85	2.34	▂▆▇▅▂
MOPLHOOG	1	1	-0.90	-0.90	-0.28	0.33	4.65	▇▃▁▁▁
MOPLMIDD	1	1	-1.90	-0.77	-0.20	0.37	3.21	▃▇▇▂▁
MOPLLAAG	1	1	-1.99	-0.68	0.19	0.62	1.93	▂▆▇▆▂
MBERHOOG	1	1	-1.05	-1.05	0.06	0.61	3.95	▇▆▂▁▁
MBERZELF	1	1	-0.51	-0.51	-0.51	0.78	5.94	▇▁▁▁▁
MBERBOER	1	1	-0.49	-0.49	-0.49	0.45	8.02	▇▁▁▁▁
MBERMIDD	1	1	-1.58	-0.49	0.05	0.60	3.32	▃▇▃▁▁
MBERARBG	1	1	-1.28	-0.70	-0.13	0.45	3.92	▆▇▃▁▁
MBERARBO	1	1	-1.36	-0.77	-0.18	0.41	3.95	▆▇▃▁▁
MSKA	1	1	-0.94	-0.94	-0.36	0.22	4.28	▇▅▁▁▁
MSKB1	1	1	-1.21	-0.46	0.30	0.30	5.56	▇▇▁▁▁
MSKB2	1	1	-1.44	-0.79	-0.13	0.52	4.44	▅▇▃▁▁
MSKC	1	1	-1.94	-0.91	0.12	0.64	2.71	▂▇▇▂▁
MSKD	1	1	-0.82	-0.82	-0.05	0.72	6.09	▇▂▁▁▁
MHHUUR	1	1	-1.37	-0.72	-0.08	0.89	1.54	▇▇▆▅▇
MHKOOP	1	1	-1.54	-0.90	0.07	0.72	1.37	▇▅▆▇▇
MAUT1	1	1	-3.89	-0.67	-0.03	0.62	1.91	▁▁▅▇▂
MAUT2	1	1	-1.09	-1.09	-0.26	0.57	4.72	▇▅▂▁▁
MAUT0	1	1	-1.22	-0.60	0.03	0.65	4.40	▆▇▂▁▁
MZFONDS	1	1	-3.17	-0.65	0.37	0.87	1.38	▁▂▅▇▅
MZPART	1	1	-1.38	-0.87	-0.37	0.64	3.16	▅▇▅▂▁
MINKM30	1	1	-1.23	-0.75	-0.28	0.68	3.08	▇▇▅▂▁
MINK3045	1	1	-1.88	-0.82	0.25	0.78	2.90	▂▇▇▂▁
MINK4575	1	1	-1.42	-0.90	0.14	0.66	3.25	▅▇▅▁▁
MINK7512	1	1	-0.68	-0.68	-0.68	0.18	7.06	▇▂▁▁▁
MINK123M	1	1	-0.37	-0.37	-0.37	-0.37	15.95	▇▁▁▁▁
MINKGEM	1	1	-2.87	-0.60	0.16	0.16	3.96	▁▇▇▂▁
MKOOPKLA	1	1	-1.61	-0.62	-0.12	0.88	1.88	▅▇▇▅▅
PWAPART	1	1	-0.80	-0.80	-0.80	1.28	2.32	▇▁▁▅▁
PWABEDR	1	1	-0.11	-0.11	-0.11	-0.11	16.43	▇▁▁▁▁
PWALAND	1	1	-0.14	-0.14	-0.14	-0.14	7.86	▇▁▁▁▁
PPERSAUT	1	1	-1.02	-1.02	0.69	1.04	1.72	▇▁▁▇▁
PBESAUT	1	1	-0.09	-0.09	-0.09	-0.09	13.08	▇▁▁▁▁
PMOTSCO	1	1	-0.20	-0.20	-0.20	-0.20	7.61	▇▁▁▁▁
PVRAAUT	1	1	-0.04	-0.04	-0.04	-0.04	36.74	▇▁▁▁▁
PAANHANG	1	1	-0.10	-0.10	-0.10	-0.10	23.40	▇▁▁▁▁
PTRACTOR	1	1	-0.15	-0.15	-0.15	-0.15	9.80	▇▁▁▁▁
PWERKT	1	1	-0.06	-0.06	-0.06	-0.06	26.15	▇▁▁▁▁
PBROM	1	1	-0.26	-0.26	-0.26	-0.26	7.11	▇▁▁▁▁
PLEVEN	1	1	-0.22	-0.22	-0.22	-0.22	9.80	▇▁▁▁▁
PPERSONG	1	1	-0.07	-0.07	-0.07	-0.07	28.61	▇▁▁▁▁
PGEZONG	1	1	-0.08	-0.08	-0.08	-0.08	15.51	▇▁▁▁▁
PWAOREG	1	1	-0.06	-0.06	-0.06	-0.06	18.59	▇▁▁▁▁
PBRAND	1	1	-0.97	-0.97	0.09	1.16	3.28	▇▅▃▁▁
PZEILPL	1	1	-0.02	-0.02	-0.02	-0.02	69.01	▇▁▁▁▁
PPLEZIER	1	1	-0.07	-0.07	-0.07	-0.07	21.91	▇▁▁▁▁
PFIETS	1	1	-0.16	-0.16	-0.16	-0.16	6.21	▇▁▁▁▁
PINBOED	1	1	-0.08	-0.08	-0.08	-0.08	29.25	▇▁▁▁▁
PBYSTAND	1	1	-0.12	-0.12	-0.12	-0.12	12.11	▇▁▁▁▁
AWAPART	1	1	-0.82	-0.82	-0.82	1.21	3.24	▇▁▆▁▁
AWABEDR	1	1	-0.11	-0.11	-0.11	-0.11	37.17	▇▁▁▁▁
AWALAND	1	1	-0.15	-0.15	-0.15	-0.15	6.89	▇▁▁▁▁
APERSAUT	1	1	-0.93	-0.93	0.72	0.72	10.65	▇▁▁▁▁
ABESAUT	1	1	-0.08	-0.08	-0.08	-0.08	30.69	▇▁▁▁▁
AMOTSCO	1	1	-0.18	-0.18	-0.18	-0.18	34.76	▇▁▁▁▁
AVRAAUT	1	1	-0.04	-0.04	-0.04	-0.04	47.72	▇▁▁▁▁
AAANHANG	1	1	-0.10	-0.10	-0.10	-0.10	23.75	▇▁▁▁▁
ATRACTOR	1	1	-0.14	-0.14	-0.14	-0.14	16.47	▇▁▁▁▁
AWERKT	1	1	-0.05	-0.05	-0.05	-0.05	48.26	▇▁▁▁▁
ABROM	1	1	-0.27	-0.27	-0.27	-0.27	7.28	▇▁▁▁▁
ALEVEN	1	1	-0.20	-0.20	-0.20	-0.20	20.99	▇▁▁▁▁
APERSONG	1	1	-0.07	-0.07	-0.07	-0.07	13.67	▇▁▁▁▁
AGEZONG	1	1	-0.08	-0.08	-0.08	-0.08	12.34	▇▁▁▁▁
AWAOREG	1	1	-0.06	-0.06	-0.06	-0.06	25.78	▇▁▁▁▁
ABRAND	1	1	-1.01	-1.01	0.76	0.76	11.44	▇▁▁▁▁
AZEILPL	1	1	-0.02	-0.02	-0.02	-0.02	44.04	▇▁▁▁▁
APLEZIER	1	1	-0.07	-0.07	-0.07	-0.07	24.43	▇▁▁▁▁
AFIETS	1	1	-0.15	-0.15	-0.15	-0.15	14.07	▇▁▁▁▁
AINBOED	1	1	-0.09	-0.09	-0.09	-0.09	22.02	▇▁▁▁▁
ABYSTAND	1	1	-0.12	-0.12	-0.12	-0.12	16.55	▇▁▁▁▁

kNN in R

First let’s separate the data into training and testing data.

library(rsample)
set.seed(124)
caravan_split <- caravan %>% 
  # standardise
  mutate(across(-Purchase, scale)) %>% 
  initial_split()

We apply kNN using kknn function from the kknn library:

library(kknn)
knn_pred <- kknn(Purchase ~ ., 
                 train = training(caravan_split),
                 test = testing(caravan_split),
                 k = 2,
                 # parameter of Minkowski distance 
                 # 2 = Euclidean distance 
                 # 1 = Manhattan distance
                 distance = 2)

Output object

The return object from kknn includes:
- fitted.values - vector of predictions
- prob - predicted class probabilities

knn_pred$fitted.values %>% head()

[1] Yes No  No  Yes No  No 
Levels: No Yes

knn_pred$prob %>% head()

            No       Yes
[1,] 0.1562373 0.8437627
[2,] 1.0000000 0.0000000
[3,] 1.0000000 0.0000000
[4,] 0.1562373 0.8437627
[5,] 1.0000000 0.0000000
[6,] 1.0000000 0.0000000

Selecting k

scroll

We can compute metrics, such as AUC for a range of ks.

library(yardstick)
kaucres <- map_dfr(2:100, function(k) {
    knn_pred <- kknn(Purchase ~ ., 
                   train = training(caravan_split),
                   test = testing(caravan_split),
                   k = k,
                   distance = 2)
    tibble(k = k, AUC = roc_auc_vec(testing(caravan_split)$Purchase, 
                                    knn_pred$prob[, 1]))
  })

kaucres

# A tibble: 99 × 2
       k   AUC
   <int> <dbl>
 1     2 0.569
 2     3 0.581
 3     4 0.580
 4     5 0.583
 5     6 0.598
 6     7 0.606
 7     8 0.602
 8     9 0.611
 9    10 0.601
10    11 0.605
# … with 89 more rows

Selecting k visually

scroll

Code

ggplot(kaucres, aes(k, AUC)) +
  geom_line() +
  scale_x_continuous(breaks = seq(5, 100, 5))

The AUC rises as k increases, however there is some diminishing return for larger k values.
We can see that the increase in AUC is not as sharp from k > 15, so we can suggest k = 15.
This visual selection of k is referred to as the elbow method.

kNN for other variable encodings

Categorical predictors

In the examples so far, the predictors were all numeric.

Categorical variables must be converted to dummy variables before distances can be computed.

kNN for numerical outcomes

We can also apply kNN to predict numerical responses.
The step of finding nearest neighbours is identical as to the case of categorical variables, however, the predictive step is different!
The predicted value is equal to the average of the outcome of the neighbours: {\hat{y}_{\text{new}} = f(\boldsymbol{x}_{\text{new}}) = \frac{1}{k}\sum_{j\in\mathcal{N}^k_{\text{new}}}y_j}.

Takeaways

kNN is simple and powerful – no complex parameters to tune.
No optimisation involved with kNN.
kNN can be however computationally expensive as the nearest neighbour for a new point requires computation of distance to all points.

ETC3250/5250

k-nearest neightbours

Motivation

Neighbours

Distance metrics

Simple example

Computing distances

Euclidean distance

Manhattan distance

Chebyshev distance

Canberra distance

Jaccard distance

Minkowski distance

Computing distances with R

Standardising the variables

k-nearest neighbours

Notations for neighbours

Illustration of neighbours 1

Illustration of neighbours 2

Illustration of neighbours 3

Illustration of neighbours 4

Your turn

The boundary region for Euclidean distance

Illustration of neighbours: Manhattan distance

Prediction with kNN

Prediction with kNN: New data

Prediction with kNN: Step 1

Prediction with kNN: Steps 2 and 3

Prediction with kNN: Propensity score

Prediction with kNN: Propensity score calculation

An application to caravan data

The business problem

Sales of insurance policy with caravan data

kNN in R

Output object

Selecting k

Selecting k visually

kNN for other variable encodings

Categorical predictors

kNN for numerical outcomes

Takeaways

Sales of insurance policy with `caravan` data