ETC3250/5250: Introduction to Machine Learning

Bayes theorem

Let $f_k(x)$ be the density function for predictor $x$ for class $k$ .
Bayes theorem (for $K$ classes) states that the posterior probability: $P(Y = k|X = x) = p_k(x) = \frac{\pi_kf_k(x)}{\sum_{k=1}^K \pi_kf_k(x)}$ where $\pi_k = P(Y = k)$ is the prior probability that the observation comes from class $k$ .

Univariate Normal (Gaussian) distribution

$f(x) = \frac{1}{\sqrt{2 \pi} \sigma} \text{exp}~ \left( - \frac{1}{2 \sigma^2} (x - \mu)^2 \right).$

Code

tibble(x = seq(-5, 7, length = 100)) %>% 
  mutate(f1 = dnorm(x, 2, 1),
         f2 = dnorm(x, -1, 1),
         f3 = dnorm(x, 3, 2)) %>% 
  pivot_longer(f1:f3, values_to = "f") %>% 
  ggplot(aes(x, f)) + 
  geom_line(aes(color = name), linewidth = 1.2) +
  labs(y = "f(x)", color = "") +
  scale_color_discrete(labels = c("N(2, 1)", "N(-1, 1)", "N(3, 4)"))

Multivariate Normal (Gaussian) distribution

$f(\boldsymbol{x}) = \frac{1}{(2\pi)^{\frac{p}{2}}|\mathbf{\Sigma}|^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu})^\top\mathbf{\Sigma}^{-1}(\boldsymbol{x}-\boldsymbol{\mu})\right)$

$\boldsymbol{\mu}$ is a $p$ -vector of means, and
$\mathbf{\Sigma}$ is a $p\times p$ variance-covariance matrix.

Code

expand_grid(x1 = seq(-5, 5, length = 100),
            x2 = seq(-5, 5, length = 100)) %>% 
  rowwise() %>% 
  mutate(f1 = mvtnorm::dmvnorm(c(x1, x2), mean = c(2, 1), sigma = matrix(c(1, 0.5, 0.5, 1), 2, 2)),
         f2 = mvtnorm::dmvnorm(c(x1, x2), mean = c(1, 2), sigma = matrix(c(1, 0, 0, 1), 2, 2)),,
         f3 = mvtnorm::dmvnorm(c(x1, x2), mean = c(-1, 1), sigma = matrix(c(2, 0.5, 0.5, 1), 2, 2)),) %>% 
  pivot_longer(f1:f3, values_to = "f") %>% 
  ggplot(aes(x1, x2, z = f)) + 
  geom_contour(aes(color = name), linewidth = 1.2) +
  labs(color = "") +
  guides(color = "none")

LDA

Recall, we assume that $\boldsymbol{X}_{k} \sim N(\boldsymbol{\mu}_{k}, \mathbf{\Sigma}_k)$ for $k = 1, \dots, K$ .
In LDA, we further assume that $\mathbf{\Sigma}_1 = \mathbf{\Sigma}_2 = \dots = \mathbf{\Sigma}_k = \mathbf{\Sigma}$ .
Then by Bayes theorem, the posterior is given as: $p_k(\boldsymbol{x}) = \frac{\pi_k\exp~ \left( - \frac{1}{2} (\boldsymbol{x} - \boldsymbol{\mu}_k)\mathbf{\Sigma}^{-1}(\boldsymbol{x} - \boldsymbol{\mu}_k) \right) }{ \sum_{k = 1}^K \pi_k \exp~ \left( - \frac{1}{2} (\boldsymbol{x} - \boldsymbol{\mu}_k)\mathbf{\Sigma}^{-1}(\boldsymbol{x} - \boldsymbol{\mu}_k) \right) }.$
We assign new observation $\boldsymbol{X} = \boldsymbol{x}_0$ to the class with the highest $p_k(\boldsymbol{x}_0)$ , referred to as the maximum a posteriori estimation (MAP).

LDA with R: $\boldsymbol{\mu}_k$

The LDA model can be fit using lda() via the maximum likelihood estimate in the MASS package as:

cancer_lda <- MASS::lda(diagnosis ~ radius_mean + concave_points_mean, 
                        data = cancer_train, 
                        method = "mle")

The estimate of $\begin{bmatrix}\boldsymbol{\mu}_{\text{A}}^\top\\ \boldsymbol{\mu}_{\text{B}}^\top \end{bmatrix}$ is found:

cancer_lda$means

  radius_mean concave_points_mean
M    17.71993          0.08796497
B    12.20888          0.02551768

LDA with R: $\mathbf{\Sigma}$

The estimate of $\mathbf{\Sigma}$ is found:

cancer_lda_sigma <- cancer_train %>% 
  # first center predictors based on MLE of class means
  mutate(x1 = radius_mean - cancer_lda$means[diagnosis, "radius_mean"],
         x2 = concave_points_mean - cancer_lda$means[diagnosis, "concave_points_mean"]) %>% 
  select(x1, x2) %>% 
  # find the sample variance-covariance matrix
  var()

cancer_lda_sigma

           x1           x2
x1 5.39571618 0.0348868055
x2 0.03488681 0.0005928822

QDA with R: $\boldsymbol{\mu}_k$

The QDA model can be fit using qda() via the maximum likelihood estimate in the MASS package as:

cancer_qda <- MASS::qda(diagnosis ~ radius_mean + concave_points_mean, 
                        data = cancer_train, 
                        method = "mle")

The estimate of $\begin{bmatrix}\boldsymbol{\mu}_{\text{A}}^\top\\ \boldsymbol{\mu}_{\text{B}}^\top \end{bmatrix}$ is found:

cancer_qda$means

  radius_mean concave_points_mean
M    17.71993          0.08796497
B    12.20888          0.02551768

QDA with R: $\mathbf{\Sigma}$

The estimate of $\mathbf{\Sigma}$ is found:

cancer_qda_sigma <- cancer_train %>% 
  # first center predictors based on MLE of class means
  mutate(x1 = radius_mean - cancer_qda$means[diagnosis, "radius_mean"],
         x2 = concave_points_mean - cancer_qda$means[diagnosis, "concave_points_mean"]) %>% 
  group_by(diagnosis) %>% 
  summarise(sigma = list(var(data.frame(x1, x2))))


cancer_qda_sigma$sigma[[1]]

           x1          x2
x1 9.91949800 0.076928052
x2 0.07692805 0.001252928

cancer_qda_sigma$sigma[[2]]

           x1          x2
x1 2.93888569 0.011998849
x2 0.01199885 0.000233707

1 / 19

No notes on this slide.

Discriminant analysis - table of contents ETC3250/5250 Discriminant analysis Discriminant analysis Logistic regression vs LDA vs QDA DA in a nutshell Background Bayes theorem Univariate Normal (Gaussian) distribution Multivariate Normal (Gaussian) distribution Linear Discriminant Analysis LDA LDA with R: \boldsymbol{\mu}_k LDA with R: \mathbf{\Sigma} Quadratic Discriminant Analysis QDA with R: \boldsymbol{\mu}_k QDA with R: \mathbf{\Sigma} Predictions from dicriminant models Classification metrics Takeaways