ETC3250/5250

Introduction to Machine Learning

Neural network I

Lecturer: Emi Tanaka

Department of Econometrics and Business Statistics

Supervised learning

  • So far you have learnt about:
    • linear, non-linear & logistic regression
    • linear & quadratic discriminant analysis
    • decision trees
    • tree-ensemble methods (bagging, boosting, random forest)
    • k-nearest neighbours
    • support vector machine methods
  • But these methods still don’t perform well for some tasks, e.g. image recognition.

Human brain

  • Our brains can dissect and process features of images, e.g. the shape, object, lighting, etc.

  • The human brain is made of billions of neurons that communicate via electrochemical signals.
  • So how do we mimic this in a program?

Artificial neuron

Biological neuron model

  • The artificial neural network, often referred to simply as a neural network, was inspired by the biological neural network.
  • In a biological neural network, a collection of neurons interconnected by synapses carry out a specific function when activated.
  • The dendrites receive synaptic inputs and propagate electrochemical stimulation to the cell body - if stimulated enough, the neuron fires an action potential (which in turn provides synaptic input to other neurons).

Artificial neuron: dendrites

  • The artificial neuron is the elementary unit of an artificial neural network.
  • The artificial neuron receives predictors \boldsymbol{x}_i = (1, x_{i1}, \dots,x_{ip})^\top, which are typically combined as a weighted sum:

z_i = \beta_0 + \sum_{j=1}^p\beta_jx_{ij} = \boldsymbol{\beta}^\top\boldsymbol{x}_i, \quad\text{where }\boldsymbol{\beta} = (\beta_0, \beta_1, \dots, \beta_p)^\top.
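  • As a quick sketch in R, the weighted sum z_i for a single observation (with illustrative values for \boldsymbol{\beta} and \boldsymbol{x}_i) can be computed as:
beta <- c(1, 0.5, -3)    # illustrative (beta_0, beta_1, beta_2)
x    <- c(1, 1, 3)       # (1, x_1, x_2): leading 1 for the intercept
z    <- sum(beta * x)    # equivalently drop(t(beta) %*% x)
z                        # -7.5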

Artificial neuron: action potential

  • The z_i then gets passed into the activation function, h(z_i).
  • Suppose a is the location parameter and w is the scale parameter.
  • Some common choices include:
    • Heaviside step: h(z_i) = \mathbb{I}(z_i > 0),
    • Linear: h(z_i) = z_i,
    • Sigmoid: h(z_i|a,w) = a + w(1+e^{-z_i})^{-1},
    • Tanh: h(z_i|a,w) = a + w \left(\frac{2}{1+e^{-2z_i}} - 1 \right), and
    • ReLU: h(z_i|a,w) = a + w \times \max(0, z_i).
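  • As a sketch, these can be written as R functions (with a the location and w the scale, defaulting to a = 0 and w = 1):
heaviside <- function(z) as.numeric(z > 0)
linear    <- function(z) z
sigmoid   <- function(z, a = 0, w = 1) a + w / (1 + exp(-z))
tanh_act  <- function(z, a = 0, w = 1) a + w * (2 / (1 + exp(-2 * z)) - 1)
relu      <- function(z, a = 0, w = 1) a + w * pmax(0, z)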

Visualising an artificial neuron

  • When x_1 = 1 and x_2 = 3, then \begin{align*}z &= \beta_0 + \beta_1x_1 + \beta_2 x_2\\ &= 1 + 0.5 \times 1 - 3\times 3 = -7.5.\end{align*}

  • Using ReLU with a = 1, w = 2, the prediction is 1 + 2 \times \max(0, z) = 1.
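  • A quick check of this in R (using the activation helpers sketched earlier):
z <- 1 + 0.5 * 1 - 3 * 3    # -7.5
relu(z, a = 1, w = 2)       # 1 + 2 * max(0, -7.5) = 1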

Predicting from an artificial neuron

Your turn!

  • What is the prediction when x_1 = 10? Answer: 3
  • What is the prediction when x_1 = -5? Answer: -11

Activation function

Heaviside step

h(z_i) = \mathbb{I}(z_i > 0)

  • This is also known as the perceptron and is used for classification.
  • For example: h(1 + 2x_i) and h(-12 + 4x_i).

Linear

h(z_i) = z_i

  • This is a regression model!
  • For example: h(1 + 2x_i) and h(-12 + 4x_i).

Sigmoid (Logistic)

h(z_i|a,w) = a + w(1+e^{-z_i})^{-1}

  • When a = 0 and w = 1, then 0 < h(z_i) < 1 for finite z_i.
  • E.g. h(1 + 2x_i|a = 0, w = 1) and h(-12 + 4x_i|a = -1, w = 2).

Hyperbolic tangent (Tanh)

h(z_i|a,w) = a + w \left(\frac{2}{1+e^{-2z_i}} - 1 \right)

  • Similar in shape to the Sigmoid, but ranges between -1 and 1 when a = 0 and w = 1.
  • E.g. h(1 + 2x_i|a = 0, w = 1) and h(-12 + 4x_i|a = -1, w = 2).

Rectified linear unit (ReLU)

h(z_i|a,w) = a + w \times \max(0, z_i)

  • Sigmoid and Tanh have largely been replaced by the rectified linear unit (ReLU) in modern networks.
  • E.g. h(1 + 2x_i|a = 0, w = 1) and h(-12 + 4x_i|a = -1, w = 2).
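  • To compare their shapes, a quick base R sketch using the helper functions defined earlier (with a = 0 and w = 1):
curve(relu(x),      from = -4, to = 4, ylim = c(-1, 4), xlab = "z", ylab = "h(z)")
curve(sigmoid(x),   add = TRUE, lty = 2)
curve(tanh_act(x),  add = TRUE, lty = 3)
curve(heaviside(x), add = TRUE, lty = 4)
legend("topleft", legend = c("ReLU", "Sigmoid", "Tanh", "Heaviside"), lty = 1:4, bty = "n")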

Neural network

Limitations of a single artificial neuron

  • Biological neurons are interconnected in complex networks that allow the brain to perform a wide range of functions.
  • A single artificial neuron has a limited ability to model complex relationships, so, much like its biological counterpart, we can connect many artificial neurons together to model complex relationships.
  • An artificial neural network, the interconnection of artificial neurons, began by mimicking the architecture of brain activity.
  • But neural networks are no longer true to their biological counterpart, and their development is motivated by empirical results.

Multiple artificial neurons

  • When combining artificial neurons, we always set a = 0 and w = 1 for Sigmoid, Tanh, and ReLU.
[Diagram: artificial neuron 1 and artificial neuron 2 combined into a single network]

Combining artificial neurons for regression

  • In general, we combine K neurons as f(\boldsymbol{x}_i) = b + \sum_{k=1}^Kw_kh(\boldsymbol{\beta}_k^\top\boldsymbol{x}_i)
    • \boldsymbol{\beta}_k is the coefficient of predictors in the k-th artificial neuron,
    • b is called the bias,
    • w_k is the weight corresponding to the k-th neuron.
  • The activation function h is always of the same type.

Example of combining artificial neurons

f(\boldsymbol{x}_i) = b + w_1{\color{#027EB6}{h(\boldsymbol{\beta}_1^\top\boldsymbol{x}_i)}} + w_2\color{#EE0220}{h(\boldsymbol{\beta}_2^\top\boldsymbol{x}_i)}

This represents a neural network with

  • 2 nodes in the input layer,
  • 2 nodes in the middle layer, and
  • 1 node in the output layer with parameters:
    • b = 0,
    • w_1 = 0.5,
    • w_2 = 0.9,
    • \boldsymbol{\beta}_1 = (3, -5)^\top,
    • \boldsymbol{\beta}_2 = (1, 0.5)^\top, and
    • h is the ReLU activation function with a = 0 and w = 1.
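  • A minimal sketch of this computation in R, for a hypothetical input \boldsymbol{x}_i = (2, 1)^\top (any input works the same way; relu() is the helper sketched earlier):
b <- 0; w1 <- 0.5; w2 <- 0.9
beta1 <- c(3, -5); beta2 <- c(1, 0.5)
x <- c(2, 1)                     # hypothetical input (x_1, x_2)
b + w1 * relu(sum(beta1 * x)) + w2 * relu(sum(beta2 * x))
# 0.5 * max(0, 1) + 0.9 * max(0, 2.5) = 2.75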

Combining artificial neurons for classification

  • The output of the previous example is only applicable for regression problems.
  • We can easily modify this for classification by changing the output layer to, say, the Sigmoid function, which gives a value between 0 and 1 that can be thought of as the propensity score: P(y_i = 1 | \boldsymbol{x}_i) = \frac{1}{1 + \exp\left(-(b + \sum_{k=1}^Kw_kh(\boldsymbol{\beta}_k^\top\boldsymbol{x}_i))\right)}.
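  • Continuing the sketch above, the only change for binary classification is the Sigmoid applied to the combined output:
eta <- b + w1 * relu(sum(beta1 * x)) + w2 * relu(sum(beta2 * x))
1 / (1 + exp(-eta))              # propensity score P(y_i = 1 | x_i); about 0.94 here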

Regression vs classification

  • In a neural network, initial layers can be identical for regression or classification.
  • It is the output layer that determines if it can be used for regression or classification!

Multi-class classification

Multi-class classification

  • The Sigmoid function allows computation of propensity scores for binary outcomes.
  • If you have more than two classes in your response, then you need to convert it to dummy variables, also referred to as one-hot encoding.
  • So for a categorical variable with m levels, y_i \in \{\text{Class 1}, \dots, \text{Class }m\}, we convert it as: y_{ik} = \begin{cases}1 & \text{if } y_i = \text{Class } k\\0 & \text{if } y_i \neq \text{Class } k\end{cases}

From categorical variable to dummy variables

dat <- tibble(pet = c("cat", "dog", "cat", "fish"))
model.matrix(~ pet - 1, data = dat)
  petcat petdog petfish
1      1      0       0
2      0      1       0
3      1      0       0
4      0      0       1
attr(,"assign")
[1] 1 1 1
attr(,"contrasts")
attr(,"contrasts")$pet
[1] "contr.treatment"

Alternatively,

class_levels <- as.numeric(factor(dat$pet)) - 1 # changes to cat = 0, dog = 1, fish = 2 
keras::to_categorical(class_levels, num_classes = 3)
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    1    0    0
[4,]    0    0    1

Softmax activation function

  • The Sigmoid function only works for m = 2.

  • For m > 2, we can use the Softmax activation function instead: P(y_{ij} = 1 | \boldsymbol{x}_i) = \frac{\exp(\boldsymbol{\beta}_j^\top\boldsymbol{x}_i)}{\sum_{k=1}^m\exp\left(\boldsymbol{\beta}_k^\top\boldsymbol{x}_i\right)}.

  • The number of neurons for the Softmax layer must be m.

  • Note that \sum_{j=1}^m P(y_{ij} = 1 | \boldsymbol{x}_i) = 1.
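  • As a sketch, the Softmax can be written as an R function that maps the m scores \boldsymbol{\beta}_j^\top\boldsymbol{x}_i to probabilities summing to 1:
softmax <- function(scores) {
  exp_s <- exp(scores - max(scores))   # subtracting the max improves numerical stability
  exp_s / sum(exp_s)
}
softmax(c(2, 1, 0.5))                  # 0.629 0.231 0.140 (sums to 1)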

An illustration of a Softmax layer

  • Suppose we have the income of a customer in thousands of dollars.
  • We want to predict whether the customer will buy a “cheap”, “average” or “expensive” brand of clothing.
  • What is the probability that customer with an income of $45K will buy a cheap brand based on this trained neural network?

Solution

Layer 2

  • ReLU: \max(0, 49.62 - 0.37 \times 45) = 32.97
  • ReLU: \max(0, 27.62 - 0.19 \times 45) = 19.07
  • ReLU: \max(0, -1.72 + 0.35 \times 45) = 14.03

Output layer

  • Cheap: \frac{\exp(32.97)}{\exp(32.97) + \exp(19.07) + \exp(14.03)} = 0.9999991
  • Average: \frac{\exp(19.07)}{\exp(32.97) + \exp(19.07) + \exp(14.03)} = 0.00000092
  • Expensive: \frac{\exp(14.03)}{\exp(32.97) + \exp(19.07) + \exp(14.03)} = 0.0000000059

Prediction: The customer will buy the cheap brand.
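  • A quick check of this arithmetic in R (using the softmax helper sketched earlier):
income <- 45
z <- pmax(0, c(49.62 - 0.37 * income,
               27.62 - 0.19 * income,
               -1.72 + 0.35 * income))   # ReLU layer: 32.97 19.07 14.03
softmax(z)                               # approx. 0.9999991, 9.2e-07, 5.9e-09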

Building a neural network structure with R

Installing keras

  • Keras is an open-source software library that uses the TensorFlow library to fit artificial neural networks.
  • We can use the Keras library through the keras package in R.
  • To install keras, run the following commands:
install.packages("keras")
library(keras)
install_keras(method = c("conda"), 
              conda = "auto", 
              tensorflow = "default",
              extra_packages = "tensorflow-hub")
  • Be warned that the installation often poses issues, typically due to keras looking in the wrong location for the Keras library.

Building a neural network structure in R

library(keras)
model <- keras_model_sequential() %>%
  layer_dense(units = 3,
              input_shape = 2,
              activation = "relu") %>% 
  layer_dense(units = 3, 
              activation = "softmax")
  • keras_model_sequential() must be used first to initialise the architecture.
  • layer_dense() indicates a new layer with:
    • units indicating the number of neurons in that layer,
    • input_shape indicating the number of predictors (only needed for the first layer_dense()),
    • activation specifying the activation function for that layer.

Examining the weights and biases

  • You can extract the weights and biases using get_weights():
get_weights(model)
[[1]]
           [,1]       [,2]       [,3]
[1,]  0.4419321 -0.3501425 -0.0361855
[2,] -0.9320850 -0.9673537  0.7750421

[[2]]
[1] 0 0 0

[[3]]
          [,1]       [,2]      [,3]
[1,] 0.8876681 -0.5565016 0.4099801
[2,] 0.8715661  0.2542915 0.3897784
[3,] 0.9583607  0.3416290 0.6392198

[[4]]
[1] 0 0 0
  • Every even-numbered element contains the biases of a layer's nodes (here all 0s).
  • The weights are given in the odd-numbered elements, in the order of the layers.

Manually setting the weights

  • Let’s manually set the weights to match the previous example:

w <- get_weights(model)
w[[1]] <- matrix(c(49.62, -0.37, 27.62, -0.19, -1.72, 0.35), nrow = 2)
w[[3]] <- diag(3)
set_weights(model, w)
get_weights(model)
[[1]]
      [,1]  [,2]  [,3]
[1,] 49.62 27.62 -1.72
[2,] -0.37 -0.19  0.35

[[2]]
[1] 0 0 0

[[3]]
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1

[[4]]
[1] 0 0 0

Prediction from neural network model

  • Normally we need to train the model but this will be covered next week.
  • Suppose the manually set weights are a result from a trained model.
  • You can predict the probability a customer with an income of $45K will buy a cheap, average or expensive brand with:
predict(model, cbind(1, 45))
         [,1]        [,2]         [,3]
[1,] 0.999999 9.18979e-07 5.949233e-09
  • Note that you need to have the new data in a matrix format with the intercept!

Takeaways

  • Neural networks are flexible models that can be used for both regression and classification problems.
  • The activation function in the output layer determines whether the neural network is used for regression or classification.
  • More on neural networks to come next week!