Hyperplane
In a p-dimensional space, a hyperplane is a flat affine subspace of dimension p - 1.
A hyperplane in p dimensions is defined by the equation
\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p = 0.
Hyperplane in 2D
For p = 2, a hyperplane is a line.
Hyperplane in 3D
For p = 3, a hyperplane is a plane.
Separating hyperplanes
A hyperplane divides the p-space into two sides.
Suppose we have a new observation \boldsymbol{x}_0 = (x_{01}, x_{02}, \dots, x_{0p})^\top with g(\boldsymbol{x}_0) = \beta_0 + \beta_1 x_{01} + \dots + \beta_p x_{0p}.
\boldsymbol{x}_0 is on side 1 if g(\boldsymbol{x}_0) > 0.
\boldsymbol{x}_0 is on side 2 if g(\boldsymbol{x}_0) < 0.
\boldsymbol{x}_0 is on the hyperplane if g(\boldsymbol{x}_0) = 0.
The magnitude of g(\boldsymbol{x}_0) tells us how far \boldsymbol{x}_0 is from the hyperplane.
Example
Suppose we have g(\boldsymbol{x}) = 1 + x_{1} - x_2.
Consider the points \boldsymbol{p}_1 = (0, 0)^\top and \boldsymbol{p}_2 = (0, 2)^\top.
\boldsymbol{p}_1 is on side 1 since g(\boldsymbol{p}_1) = 1 + 0 - 0 = 1 > 0.
\boldsymbol{p}_2 is on side 2 since g(\boldsymbol{p}_2) = 1 + 0 - 2 = -1 < 0.
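A minimal sketch of this side check in Python (the function name side and the use of NumPy are our choices, for illustration only):

```python
import numpy as np

# Side check for g(x) = 1 + x1 - x2 (beta0 = 1, beta = (1, -1)).
beta0, beta = 1.0, np.array([1.0, -1.0])

def side(x):
    """Report which side of the hyperplane g(x) = 0 the point x lies on."""
    g = beta0 + beta @ x
    if g > 0:
        return "side 1"
    if g < 0:
        return "side 2"
    return "on the hyperplane"

print(side(np.array([0.0, 0.0])))  # g(p1) =  1 -> side 1
print(side(np.array([0.0, 2.0])))  # g(p2) = -1 -> side 2
```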
SVM disambiguation
Support vector machine methods refer to three main approaches:
Maximal margin classifier.
Support vector classifier.
Support vector machines.
To avoid confusion, we refer to these collectively as support vector machine methods and reserve the name support vector machines for the third approach.
Maximal margin classifier
Toy breast cancer diagnosis data
Infinite hyperplanes
There are infinitely many hyperplanes that perfectly classify these points. E.g.,
28 - x_{1} - x_{2} = 0
30 - 1.4x_{1} - x_{2} = 0
18.5 - 0.5x_{1} - x_{2} = 0
All three separate the classes perfectly.
Which one should we choose?
Distance to hyperplane
We compute the (perpendicular) distance from each training observation to a given separating hyperplane.
The smallest such distance is known as the margin (M).
The maximal margin hyperplane is the separating hyperplane for which the margin is largest.
We can then classify a test observation based on which side of the maximal margin hyperplane it lies.
This is known as the maximal margin classifier.
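A small sketch of the margin computation: the perpendicular distance from \boldsymbol{x}_i to g(\boldsymbol{x}) = 0 is |g(\boldsymbol{x}_i)| / \lVert\boldsymbol{\beta}\rVert, and M is the smallest such distance. The four points below are made up for illustration (they are not the toy breast cancer data, so they do not reproduce the margins quoted on the following slides):

```python
import numpy as np

# Margin of a separating hyperplane g(x) = beta0 + beta' x = 0:
# the perpendicular distance of each point is |g(x_i)| / ||beta||,
# and the margin M is the smallest such distance.
def margin(beta0, beta, X):
    distances = np.abs(beta0 + X @ beta) / np.linalg.norm(beta)
    return distances.min()

# Hypothetical training points (two per class), not the toy data.
X = np.array([[10.0, 15.0], [12.0, 18.0],
              [15.0, 10.0], [18.0, 12.0]])
print(margin(28.0, np.array([-1.0, -1.0]), X))   # hyperplane 28 - x1 - x2 = 0
```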
Margin for hyperplane (a)
The margin for this hyperplane is 1.4.
Margin for hyperplane (b)
The margin for this hyperplane is 1.02.
Margin for hyperplane (c)
The margin for this hyperplane is 0.43.
Maximal margin classifier
The maximal margin hyperplane is the hyperplane that correctly classifies all observations and is also farthest from the closest observations.
Support vectors
There will always be at least two vectors equidistant from, and closest to, the maximal margin hyperplane.
These are called support vectors – there are three in this example.
Maximal margin hyperplane
Suppose that y_i = \begin{cases}1 & \text{if observation $i$ in class 1}\\-1& \text{if observation $i$ in class 2}\end{cases}.
The maximal margin hyperplane is found by solving the optimisation problem: \hat{\boldsymbol{\beta}} = \arg\max_{\boldsymbol{\beta}\in\mathbb{R}^{p+1}}M subject to \sum_{j=1}^p \beta_j^2 = 1 and y_i \times g(\boldsymbol{x}_i) \geq M for all i = 1, \dots, n.
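A sketch of how this could be fitted in practice. We use scikit-learn here (an illustrative choice, not necessarily the software used in this unit); its SVC solves the soft-margin problem, so a very large cost is used to approximate the hard-margin (maximal margin) problem, and it scales the constraint to y_i g(\boldsymbol{x}_i) \geq 1 rather than \sum_j \beta_j^2 = 1, so M = 1/\lVert\boldsymbol{\beta}\rVert. The data are made up:

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, perfectly separable data: X is (n, p), y takes values +1 / -1.
X = np.array([[1.0, 3.0], [2.0, 4.0], [3.0, 1.0], [4.0, 2.0]])
y = np.array([1, 1, -1, -1])

# A huge cost leaves (almost) no room for margin violations,
# approximating the hard-margin / maximal margin problem.
fit = SVC(kernel="linear", C=1e10).fit(X, y)
beta0, beta = fit.intercept_[0], fit.coef_.ravel()
print(beta0, beta)
print(1 / np.linalg.norm(beta))   # margin M under scikit-learn's scaling
```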
Non-separable case
The maximal margin classifier only works when we have perfect separability in our data.
What do we do if the data are not perfectly separable by a hyperplane?
The support vector classifier allows points to either lie on the wrong side of the margin, or on the wrong side of the hyperplane altogether.
Support vector classifier
Support vector classifier
The support vector classifier allows some observations to be closer to the hyperplane than the support vectors.
It also allows some observations to be on the incorrect side of the hyperplane (i.e. allows for some misclassification).
It tries to classify the remaining observations correctly.
Optimisation
For support vector classification, we solve the optimisation problem: \hat{\boldsymbol{\beta}} = \arg\max_{\boldsymbol{\beta}\in\mathbb{R}^{p+1}}M subject to \sum_{j=1}^p \beta_j^2 = 1 and y_i \times g(\boldsymbol{x}_i) \geq M(1 - \varepsilon_i) for all i = 1, \dots, n.
The \varepsilon_i \geq 0 are called slack variables, and their total is constrained by a budget: \sum_{i=1}^n\varepsilon_i \leq C.
If 0 < \varepsilon_i \leq 1, observation i is correctly classified but it violates the margin.
If \varepsilon_i > 1, observation i is incorrectly classified.
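A sketch of how the slack variables can be recovered from a fitted classifier, assuming the common rescaling in which the constraint reads y_i g(\boldsymbol{x}_i) \geq 1 - \varepsilon_i, so that \varepsilon_i = \max(0, 1 - y_i g(\boldsymbol{x}_i)); scikit-learn and the made-up data are our choices for illustration:

```python
import numpy as np
from sklearn.svm import SVC

# Made-up data; the last observation sits on the "wrong" side for its class.
X = np.array([[1.0, 3.0], [2.0, 4.0], [3.0, 1.0], [4.0, 2.0], [2.5, 2.6]])
y = np.array([1, 1, -1, -1, -1])

fit = SVC(kernel="linear", C=1.0).fit(X, y)
# Slack under the rescaled constraint y_i g(x_i) >= 1 - eps_i.
eps = np.maximum(0, 1 - y * fit.decision_function(X))
print(np.round(eps, 2))   # 0 < eps <= 1: violates the margin; eps > 1: misclassified
```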
Tuning parameter C
C restricts the total amount of slack \sum_{i=1}^n \varepsilon_i.
If C = 0, then all \varepsilon_i = 0, so the support vector classifier is equivalent to the maximal margin classifier.
You can select C using cross-validation.
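For example, a cross-validation search over a grid of cost values with scikit-learn (an illustrative choice; note that scikit-learn's C penalises violations, so it acts roughly inversely to the budget C above):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Made-up data with a roughly linear boundary and some class overlap.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=100) > 0, 1, -1)

# 5-fold cross-validation over a grid of cost values.
search = GridSearchCV(SVC(kernel="linear"),
                      param_grid={"C": [0.01, 0.1, 1, 10, 100]},
                      cv=5)
print(search.fit(X, y).best_params_)
```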
Nonlinear boundaries
The support vector classifier is a linear classifier.
It doesn’t work well for nonlinear boundaries.
Enlarging the feature space
We can make the support vector classifier more flexible by adding higher-order polynomial terms, e.g. x_1^2 and x_2^3.
We treat these new terms as predictors (referred to as enlarging the feature space).
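A sketch of this idea with made-up data whose true boundary is a circle; the degree-2 enlargement (adding x_1^2, x_2^2 and x_1 x_2) lets a linear support vector classifier capture it. The pipeline and scikit-learn are illustrative choices:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import SVC

# Made-up data: the true class boundary is the circle x1^2 + x2^2 = 1,
# which no linear classifier in (x1, x2) alone can capture.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1, 1, -1)

# Enlarge the feature space with degree-2 terms, then fit a linear classifier.
clf = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                    StandardScaler(),
                    SVC(kernel="linear", C=1.0))
print(clf.fit(X, y).score(X, y))   # training accuracy in the enlarged space
```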
Support vector machines
Support vector machines
Adding polynomial terms makes support vector classifiers more flexible, but we have to specify these terms explicitly a priori.
Support vector machines enlarge the feature space without explicit specification of nonlinear terms a priori.
To understand support vector machines, we need to know about the dual representation of the classifier and kernel functions.
In support vector machines, we write \beta_j = \sum_{k=1}^n\alpha_ky_kx_{kj}.
This re-expresses g(\boldsymbol{x}) as g(\boldsymbol{x}_i) = \beta_0 + \sum_{j=1}^p\beta_jx_{ij} = \beta_0 + \sum_{k=1}^n\alpha_ky_k\langle\boldsymbol{x}_i, \boldsymbol{x}_k\rangle.
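This re-expression can be checked numerically. The sketch below assumes scikit-learn, whose fitted linear SVC stores \alpha_k y_k for the support vectors in dual_coef_ (the \alpha_k are zero for the other observations), so the dual form should reproduce decision_function; the data are made up:

```python
import numpy as np
from sklearn.svm import SVC

# Made-up data with a roughly linear boundary.
rng = np.random.default_rng(3)
X = rng.normal(size=(60, 2))
y = np.where(X[:, 0] - X[:, 1] + rng.normal(scale=0.3, size=60) > 0, 1, -1)

fit = SVC(kernel="linear", C=1.0).fit(X, y)
# g(x) = beta0 + sum_k alpha_k y_k <x, x_k>, summing over the support vectors.
g_dual = fit.intercept_ + fit.dual_coef_ @ (fit.support_vectors_ @ X.T)
print(np.allclose(g_dual.ravel(), fit.decision_function(X)))   # expect True
```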
Dual representation
We can generalise the expression by replacing the inner product with a kernel function: g(\boldsymbol{x}_i) = \beta_0 + \sum_{k=1}^n\alpha_ky_k\color{#006DAE}{\mathcal{K}(\boldsymbol{x}_i, \boldsymbol{x}_k)}.
Under this representation, we don’t need to manually enlarge the feature space.
Instead we choose a kernel function.
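For instance, with scikit-learn (an illustrative choice) we simply name the kernel, reusing made-up data with a circular class boundary:

```python
import numpy as np
from sklearn.svm import SVC

# Made-up data with a circular class boundary (as before).
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1, 1, -1)

# No hand-built polynomial columns: just choose the kernel.
poly_svm = SVC(kernel="poly", degree=2, C=1.0).fit(X, y)     # polynomial kernel
rbf_svm = SVC(kernel="rbf", gamma="scale", C=1.0).fit(X, y)  # radial basis kernel
print(poly_svm.score(X, y), rbf_svm.score(X, y))
```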
Kernel functions
A kernel function is an inner product of vectors mapped to a (higher-dimensional) feature space:
\mathcal{K}(\boldsymbol{x}_i, \boldsymbol{x}_k) = \langle \psi(\boldsymbol{x}_i), \psi(\boldsymbol{x}_k) \rangle,
where \psi: \mathbb{R}^p \rightarrow \mathbb{R}^d
and d > p.
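A small numerical check of this definition, using the standard degree-2 polynomial kernel \mathcal{K}(\boldsymbol{x}, \boldsymbol{z}) = (1 + \langle\boldsymbol{x}, \boldsymbol{z}\rangle)^2 as an example (our choice, not necessarily a kernel used later); here p = 2 and d = 6:

```python
import numpy as np

# Explicit feature map psi for the degree-2 polynomial kernel on R^2:
# (1 + <x, z>)^2 = <psi(x), psi(z)> with
# psi(x) = (1, sqrt(2) x1, sqrt(2) x2, x1^2, x2^2, sqrt(2) x1 x2).
def psi(x):
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print((1 + x @ z) ** 2)   # kernel evaluated directly
print(psi(x) @ psi(z))    # same value via the enlarged (d = 6) feature space
```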
The optimisation problem in support vector machine methods is hard to solve for large sample sizes.
These methods often fail when the class overlap in the feature space is large.
They do not have a statistical foundation.
Takeaways
There are three types of support vector machine methods:
The maximal margin classifier is used when the data are perfectly separable by a hyperplane,
the support vector classifier (soft margin classifier) is used when the data are not perfectly separable by a hyperplane but still have a linear decision boundary, and
support vector machines are used when the data have nonlinear decision boundaries.