ETC3250/5250

Introduction to Machine Learning

Support vector machines

Lecturer: Emi Tanaka

Department of Econometrics and Business Statistics

Hyperplane

  • In a p-dimensional space, a hyperplane is a linear subspace of dimension p - 1.

  • Such a hyperplane is defined by the equation

\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p = 0.

Hyperplane in 2D

  • For p = 2, a hyperplane is a line.

Hyperplane in 3D

  • For p = 3, a hyperplane is a plane.

Separating hyperplanes

  • A hyperplane divides the p-space into two sides.
  • Suppose we have a new observation \boldsymbol{x}_0 = (x_{01}, x_{02}, \dots, x_{0p})^\top with g(x_0) = \beta_0 + \beta_1 x_{01} + \dots + \beta_p x_{0p}.
    • \boldsymbol{x}_0 is on side 1 if g(\boldsymbol{x}_0) > 0.
    • \boldsymbol{x}_0 is on side 2 if g(\boldsymbol{x}_0) < 0.
    • \boldsymbol{x}_0 is on the hyperplane if g(\boldsymbol{x}_0) = 0.
  • The magnitude of g(\boldsymbol{x}_0) tells us how far \boldsymbol{x}_0 is from the hyperplane.

Example

  • Suppose we have g(\boldsymbol{x}) = 1 + x_{1} - x_2.
  • Consider the points \boldsymbol{p}_1 = (0, 0)^\top and \boldsymbol{p}_2 = (0, 2)^\top.

  • \boldsymbol{p}_1 is on side 1 since g(\boldsymbol{p}_1) = 1 + 0 - 0 = 1 > 0.
  • \boldsymbol{p}_2 is on side 2 since g(\boldsymbol{p}_2) = 1 + 0 - 2 = -1 < 0.
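
A quick check of these values in R (a minimal sketch; the function g() below simply encodes the hyperplane above):

# g(x) = 1 + x1 - x2 evaluated at the two points
g <- function(x) 1 + x[1] - x[2]
g(c(0, 0))  #  1 > 0, so p1 is on side 1
g(c(0, 2))  # -1 < 0, so p2 is on side 2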

SVM disambiguation

  • Support vector machine methods refer to three main approaches:
    • Maximal margin classifier.
    • Support vector classifier.
    • Support vector machines.
  • To avoid confusion, we refer to these collectively as support vector machine methods and reserve the name support vector machines for the third approach.

Maximal margin classifier

Toy breast cancer diagnosis data

Infinite hyperplanes

  • There are infinitely many hyperplanes that perfectly classify these points. E.g.,
    1. 28 - x_{1} - x_{2} = 0
    2. 30 - 1.4x_{1} - x_{2} = 0
    3. 18.5 - 0.5x_{1} - x_{2} = 0
  • All three separate the classes perfectly (they are drawn in the sketch after this list).
  • Which one should we choose?
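
As a sketch of how these three candidates can be drawn, each hyperplane is rewritten as x_2 = intercept + slope × x_1 and passed to geom_abline(); the axis limits are arbitrary placeholders since the toy data is not reproduced here.

library(tidyverse)
# (1) x2 = 28 - x1,  (2) x2 = 30 - 1.4 x1,  (3) x2 = 18.5 - 0.5 x1
lines_df <- tibble(hyperplane = c("1", "2", "3"),
                   intercept  = c(28, 30, 18.5),
                   slope      = c(-1, -1.4, -0.5))
ggplot() +
  geom_abline(data = lines_df,
              aes(intercept = intercept, slope = slope, linetype = hyperplane)) +
  labs(x = "x1", y = "x2") +
  xlim(5, 30) + ylim(0, 30)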

Distance to hyperplane

  • We compute the (perpendicular) distance from each training observation to a given separating hyperplane.
  • The smallest such distance is known as the margin (M).
  • The maximal margin hyperplane is the separating hyperplane for which the margin is largest.
  • We can then classify a test observation based on which side of the maximal margin hyperplane it lies.
  • This is known as the maximal margin classifier.
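
A minimal sketch of these two computations, assuming an intercept beta0, a vector beta of slope coefficients (\beta_1, \dots, \beta_p) and a numeric matrix X of training observations, all hypothetical placeholders:

# Perpendicular distance of each row of X to the hyperplane beta0 + x'beta = 0
distance_to_hyperplane <- function(X, beta0, beta) {
  drop(abs(beta0 + X %*% beta)) / sqrt(sum(beta^2))
}
# The margin of this hyperplane is the smallest such distance:
# M <- min(distance_to_hyperplane(X, beta0, beta))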

Margin for hyperplane (a)

The margin for this hyperplane is 1.4.

Margin for hyperplane (b)

The margin for this hyperplane is 1.02.

Margin for hyperplane (c)

The margin for this hyperplane is 0.43.

Maximal margin classifier

The maximal margin hyperplane is the separating hyperplane that correctly classifies all observations and, among all such hyperplanes, is farthest from the closest observations.

Support vectors

  • There will always be at least two training observations that are equidistant from, and closest to, the maximal margin hyperplane.
  • These are called support vectors – there are three in this example.

Maximal margin hyperplane

  • Suppose that y_i = \begin{cases}1 & \text{if observation $i$ is in class 1}\\-1& \text{if observation $i$ is in class 2.}\end{cases}
  • The maximal margin hyperplane is found by solving the optimisation problem: \hat{\boldsymbol{\beta}} = \arg\max_{\boldsymbol{\beta}\in\mathbb{R}^{p+1}}M subject to \sum_{j=1}^p \beta_j^2 = 1 and y_i \times g(\boldsymbol{x}_i) \geq M for all i = 1, \dots, n. Since \sum_{j=1}^p \beta_j^2 = 1, y_i \times g(\boldsymbol{x}_i) is the perpendicular distance of observation i from the hyperplane (positive when correctly classified), so the constraints require every observation to lie at least M away from the hyperplane on its correct side (a software sketch follows below).
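
In practice this optimisation is handed off to software. As a hedged sketch with the e1071 package (introduced later in this lecture), the maximal margin classifier on perfectly separable data can be approximated by a linear-kernel svm() with a very large cost, which leaves essentially no room for margin violations; dat, y, x1 and x2 are hypothetical placeholders and y is assumed to be a two-level factor.

library(e1071)
# Approximate hard-margin fit: a very large cost heavily penalises any violation
hard_fit <- svm(y ~ x1 + x2, data = dat, kernel = "linear",
                cost = 1e5, scale = FALSE)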

Non-separable case

  • The maximal margin classifier only works when we have perfect separability in our data.

  • What do we do if data is not perfectly separable by a hyperplane?

  • The support vector classifier allows points to either lie on the wrong side of the margin, or on the wrong side of the hyperplane altogether.

Support vector classifier

Support vector classifier

  • The support vector classifier allows some observations to be closer to the hyperplane than the support vectors.
  • It also allows some observations to be on the incorrect side of the hyperplane (i.e. it allows some misclassification).
  • It then tries to classify the remaining observations correctly.

Optimisation

  • For the support vector classifier, we solve the optimisation problem: \hat{\boldsymbol{\beta}} = \arg\max_{\boldsymbol{\beta}\in\mathbb{R}^{p+1}}M subject to \sum_{j=1}^p \beta_j^2 = 1 and y_i \times g(\boldsymbol{x}_i) \geq M(1 - \varepsilon_i) for all i = 1, \dots, n.
  • The \varepsilon_i \geq 0 are called slack variables, with \sum_{i=1}^n\varepsilon_i \leq C.
    • If 0 < \varepsilon_i \leq 1, observation i is correctly classified but violates the margin.
    • If \varepsilon_i > 1, observation i is on the wrong side of the hyperplane (misclassified).

Tuning parameter C

  • C bounds the total slack \sum_{i=1}^n \varepsilon_i, i.e. the budget for margin violations.
  • If C = 0, then all \varepsilon_i = 0 and the support vector classifier is equivalent to the maximal margin classifier.
  • You can select C using cross-validation (a sketch follows below).
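
A hedged sketch of choosing the cost by 10-fold cross-validation with e1071::tune(); dat, y, x1 and x2 are hypothetical placeholders, and note that e1071's cost penalises margin violations, so it acts roughly inversely to the budget C above.

library(e1071)
set.seed(2022)
# 10-fold cross-validation (the default) over a grid of cost values
cv_fit <- tune(svm, y ~ x1 + x2, data = dat, kernel = "linear",
               ranges = list(cost = c(0.01, 0.1, 1, 10, 100)))
cv_fit$best.parameters  # cost with the smallest cross-validation error
cv_fit$best.model       # svm() refitted with that cost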

Nonlinear boundaries

  • The support vector classifier is a linear classifier.
  • It doesn’t work well for nonlinear boundaries.

Enlarging the feature space

  • We can make the support vector classifier more flexible by adding higher-order polynomial terms, e.g. x_1^2 and x_2^3, as extra predictors.
  • This is referred to as enlarging the feature space (a sketch follows after this list).
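
A hedged sketch of this manual enlargement with e1071, where dat, y, x1 and x2 are hypothetical placeholders; the kernel stays linear, only the set of predictors changes.

library(e1071)
# Support vector classifier on an enlarged feature space: squared and
# interaction terms are added explicitly through the model formula
svmfit_enlarged <- svm(y ~ x1 + x2 + I(x1^2) + I(x2^2) + I(x1 * x2),
                       data = dat, kernel = "linear", cost = 1)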

Support vector machines

Support vector machines

  • Adding polynomial terms makes the support vector classifier more flexible, but we have to specify these terms explicitly a priori.
  • Support vector machines enlarge the feature space without explicit specification of nonlinear terms a priori.
  • To understand support vector machines, we need to know about:
    • the inner product,
    • the dual representation, and
    • kernel functions.

The inner product

  • Consider two p-vectors \begin{align*} \boldsymbol{x}_i & = (x_{i1}, x_{i2}, \dots, x_{ip}) \in \mathbb{R}^p \\ \text{and} \quad \boldsymbol{x}_k & = (x_{k1}, x_{k2}, \dots, x_{kp}) \in \mathbb{R}^p. \end{align*}
  • The inner product is defined as

\langle \boldsymbol{x}_i, \boldsymbol{x}_k\rangle = \boldsymbol{x}_i^\top \boldsymbol{x}_k = x_{i1}x_{k1} + x_{i2}x_{k2} + \dots + x_{ip}x_{kp} = \sum_{j=1}^{p} x_{ij}x_{kj}.
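
In R, the inner product of two numeric vectors is simply the sum of their elementwise products, for example:

x_i <- c(1, 2, 3)
x_k <- c(4, 5, 6)
sum(x_i * x_k)             # 1*4 + 2*5 + 3*6 = 32
drop(crossprod(x_i, x_k))  # the same value, computed as t(x_i) %*% x_k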

Dual representation

  • In support vector machines, we write \beta_j = \sum_{k=1}^n\alpha_ky_kx_{kj}.
  • This re-expresses g(\boldsymbol{x}_i) as g(\boldsymbol{x}_i) = \beta_0 + \sum_{j=1}^p\beta_jx_{ij} = \beta_0 + \sum_{k=1}^n\alpha_ky_k\langle\boldsymbol{x}_i, \boldsymbol{x}_k\rangle.

Dual representation

  • We can generalise the expression by replacing the inner product with a kernel function: g(\boldsymbol{x}_i) = \beta_0 + \sum_{k=1}^n\alpha_ky_k\mathcal{K}(\boldsymbol{x}_i, \boldsymbol{x}_k).
  • Under this representation, we don’t need to manually enlarge the feature space.
  • Instead we choose a kernel function.
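
For a linear kernel, this dual form can be checked numerically against the decision values that svm() reports. A minimal sketch, assuming fit is a hypothetical svm() object fitted with kernel = "linear" and scale = FALSE, and X is the matrix of training predictors carrying the column names used in the formula (the svm() components used here are described later in this lecture):

# fit$coefs holds the alpha_k * y_k and fit$rho is -beta_0, so the dual form is
g_dual <- drop(X %*% t(fit$SV) %*% fit$coefs - fit$rho)
# which should agree with the decision values returned by predict():
g_svm <- attr(predict(fit, newdata = as.data.frame(X), decision.values = TRUE),
              "decision.values")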

Kernel functions

  • A kernel function is an inner product of vectors mapped to a (higher-dimensional) feature space: \mathcal{K}(\boldsymbol{x}_i, \boldsymbol{x}_k) = \langle \psi(\boldsymbol{x}_i), \psi(\boldsymbol{x}_k) \rangle, where \psi: \mathbb{R}^p \rightarrow \mathbb{R}^d and d > p.

Examples of kernels

  • Standard kernels include: \begin{align*} \text{Linear} \quad \mathcal{K}(\boldsymbol{x}_i, \boldsymbol{x}_k) & = \langle\boldsymbol{x}_i, \boldsymbol{x}_k \rangle \\ \text{Polynomial} \quad \mathcal{K}(\boldsymbol{x}_i, \boldsymbol{x}_k) & = (\langle\boldsymbol{x}_i, \boldsymbol{x}_k \rangle + 1)^d \\ \text{Radial} \quad \mathcal{K}(\boldsymbol{x}_i, \boldsymbol{x}_k) & = \exp(-\gamma\lVert\boldsymbol{x}_i-\boldsymbol{x}_k\rVert^2) \end{align*}
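
For illustration only (svm() computes these internally), the three kernels can be written directly in R; the degree d, the +1 offset and gamma below follow the formulas above.

kernel_linear     <- function(x, y) sum(x * y)
kernel_polynomial <- function(x, y, d = 2) (sum(x * y) + 1)^d
kernel_radial     <- function(x, y, gamma = 1) exp(-gamma * sum((x - y)^2))

x <- c(1, 2); y <- c(3, 4)
kernel_linear(x, y)      # 11
kernel_polynomial(x, y)  # (11 + 1)^2 = 144
kernel_radial(x, y)      # exp(-((1 - 3)^2 + (2 - 4)^2)) = exp(-8)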

An application to breast cancer diagnosis

Breast cancer diagnosis

Code
library(tidyverse)
# Read the breast cancer data, recode the diagnosis and clean the column names.
cancer <- read_csv("https://emitanaka.org/iml/data/cancer.csv") %>% 
  mutate(diagnosis_malignant = ifelse(diagnosis == "M", 1, 0),
         diagnosis = factor(diagnosis, levels = c("B", "M"))) %>% 
  janitor::clean_names()

# Scatter plot of the two predictors used throughout this example.
cancer %>% 
  ggplot(aes(radius_mean, concave_points_mean)) +
  geom_point(alpha = 0.25, size = 2, aes(color = diagnosis)) +
  scale_color_manual(values = c("forestgreen", "red2")) +
  labs(x = "Average radius",
       y = str_wrap("Average concave portions of the contours", 20),
       color = "Diagnosis")

Support vector machine with R

library(e1071)
# Support vector classifier (linear kernel) with two predictors
svmfitl <- svm(diagnosis ~ radius_mean + concave_points_mean,
               data = cancer,
               kernel = "linear",
               cost = 1,
               scale = FALSE)
  • cost is related to the budget C above; in e1071 it is the penalty on margin violations.
  • Large cost tolerates less classification error.
  • Small cost tolerates more classification error.

Main output of svm

  • svmfitl$index: The indexes of the support vectors.
  • svmfitl$coefs: The y_k\alpha_k from the dual representation.
  • svmfitl$SV: The support vectors.
  • svmfitl$rho: The negative of the intercept, i.e. -\hat{\beta}_0.
  • For linear kernels, the slope coefficients (\hat{\beta}_1, \dots, \hat{\beta}_p) can be calculated as t(svmfitl$coefs) %*% svmfitl$SV.
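
As a sketch, the fitted separating hyperplane for the linear-kernel model above can be recovered from these components (this is only meaningful here because kernel = "linear" and scale = FALSE):

beta  <- drop(t(svmfitl$coefs) %*% svmfitl$SV)  # (beta_hat_1, beta_hat_2)
beta0 <- -svmfitl$rho                           # beta_hat_0
# Decision boundary: beta0 + beta[1] * radius_mean + beta[2] * concave_points_mean = 0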

Classification plot for linear kernel

Code
# Prediction grid over the observed range of both predictors, with the
# predicted class from the linear-kernel fit attached.
cancer_grid <- expand_grid(radius_mean = seq(min(cancer$radius_mean),
                                             max(cancer$radius_mean),
                                             length.out = 50),
                           concave_points_mean = seq(min(cancer$concave_points_mean),
                                                     max(cancer$concave_points_mean),
                                                     length.out = 50)) %>% 
  mutate(predl = predict(svmfitl, .))

cancer %>% 
  mutate(support_vector = 1:n() %in% svmfitl$index) %>% 
  ggplot(aes(radius_mean, concave_points_mean)) +
  geom_tile(data = cancer_grid, aes(fill = predl), alpha = 0.5) +
  geom_point(alpha = 0.5, size = 3, aes(color = diagnosis, shape = support_vector)) +
  scale_color_manual(values = c("forestgreen", "red2")) +
  labs(x = "Average radius",
       y = str_wrap("Average concave portions of the contours", 20),
       color = "Diagnosis",
       shape = "Support vector",
       fill = "Prediction",
       title = "Linear kernel")

Support vector machine with other kernels

  • Polynomial kernel:
svmfitp <- svm(diagnosis ~ radius_mean + concave_points_mean,
               data = cancer,
               kernel = "polynomial",
               cost = 1,
               scale = FALSE)
  • Radial kernel:
svmfitr <- svm(diagnosis ~ radius_mean + concave_points_mean,
               data = cancer,
               kernel = "radial",
               cost = 1,
               scale = FALSE)

Classification plots for polynomial & radial kernels

Code
library(patchwork)
cancer_grid <- cancer_grid %>% 
  mutate(predp = predict(svmfitp, .),
         predr = predict(svmfitr, .))

p1 <- cancer %>% 
  mutate(support_vector = 1:n() %in% svmfitp$index) %>% 
  ggplot(aes(radius_mean, concave_points_mean)) +
  geom_tile(data = cancer_grid, aes(fill = predp), alpha = 0.5) +
  geom_point(alpha = 0.5, size = 3, aes(color = diagnosis, shape = support_vector)) +
  scale_color_manual(values = c("forestgreen", "red2")) +
  labs(x = "Average radius",
       y = str_wrap("Average concave portions of the contours", 20),
       color = "Diagnosis",
       shape = "Support vector",
       fill = "Prediction",
       title = "Polynomial kernel")

p2 <- cancer %>% 
  mutate(support_vector = 1:n() %in% svmfitr$index) %>% 
  ggplot(aes(radius_mean, concave_points_mean)) +
  geom_tile(data = cancer_grid, aes(fill = predr), alpha = 0.5) +
  geom_point(alpha = 0.5, size = 3, aes(color = diagnosis, shape = support_vector)) +
  scale_color_manual(values = c("forestgreen", "red2")) +
  labs(x = "Average radius",
       y = "",
       color = "Diagnosis",
       shape = "Support vector",
       fill = "Prediction",
       title = "Radial kernel")


p1 + p2 + plot_layout(guides = "collect")

Limitations of support vector machine methods

  • The optimisation problem in support vector machine methods is hard to solve for large sample sizes.
  • It often fails when class overlap in the feature space is large.
  • It does not have a statistical foundation.

Takeaways

  • There are three types of support vector machine methods:
    • The maximal margin classifier is used when the data is perfectly separated by a hyperplane,
    • the support vector classifier (also called the soft margin classifier) is used when the data is not perfectly separable by a hyperplane but still has a linear decision boundary, and
    • support vector machines are used when the data has nonlinear decision boundaries.