## ETC3250/5250

Introduction to Machine Learning

### Neural network I

Lecturer: Emi Tanaka

Department of Econometrics and Business Statistics

## Supervised learning

• So far you have learnt about:
• linear, non-linear & logistic regression
• linear & quadratic discriminant analysis
• decision trees
• tree-ensemble methods (bagging, boosting, random forest)
• k-nearest neighbours
• support vector machine methods
• But these methods still don’t perform well for some tasks, e.g. image recognition.

## Human brain

• Our brains can dissect and process features of images, e.g. the shape, object, lighting, etc.
• The human brain is made of billions of neurons that communicate via electrochemical signals.
• So how do we mimic this in a program?

# Artificial neuron

## Biological neuron model

• An artificial neural network, often referred to simply as a neural network, was inspired by biological neural networks.
• In a biological neural network, a collection of neurons interconnected by synapses carries out a specific function when activated.
• The dendrites receive synaptic inputs and propagate electrochemical stimulation to the cell body; if stimulated enough, the neuron fires an action potential, which in turn provides synaptic input to other neurons.

## Artificial neuron: dendrites

• The artificial neuron is the elementary unit of an artificial neural network.
• An artificial neuron receives predictors \boldsymbol{x}_i = (1, x_{i1}, \dots, x_{ip})^\top, which are typically combined as a weighted sum:

z_i = \beta_0 + \sum_{j=1}^p\beta_jx_{ij} = \boldsymbol{\beta}^\top\boldsymbol{x}_i, \quad\text{where }\boldsymbol{\beta} = (\beta_0, \beta_1, \dots, \beta_p)^\top.
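To make the weighted sum concrete, here is a minimal base-R sketch using the coefficients \boldsymbol{\beta} = (1, 0.5, -3)^\top from the worked example later in this deck:

```r
# Weighted sum z_i = beta^T x_i for a single artificial neuron.
beta <- c(1, 0.5, -3)   # (beta_0, beta_1, beta_2)
x    <- c(1, 1, 3)      # (1, x_1, x_2); the leading 1 matches beta_0
z    <- sum(beta * x)   # beta^T x
z
#> [1] -7.5
```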

## Artificial neuron: action potential

• The weighted sum z_i then gets passed into an activation function, h(z_i).
• In the forms below, a is a location parameter and w is a scale parameter.
• Some common choices include:
• Heaviside step: h(z_i) = \mathbb{I}(z_i > 0),
• Linear: h(z_i) = z_i,
• Sigmoid: h(z_i|a,w) = a + w(1+e^{-z_i})^{-1},
• Tanh: h(z_i|a,w) = a + w \left(\frac{2}{1+e^{-2z_i}} - 1 \right), and
• ReLU: h(z_i|a,w) = a + w \times \max(0, z_i).
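For reference, a minimal sketch of these activation functions in base R (the function names are illustrative):

```r
heaviside <- function(z) as.numeric(z > 0)
linear    <- function(z) z
sigmoid   <- function(z, a = 0, w = 1) a + w / (1 + exp(-z))
tanh_act  <- function(z, a = 0, w = 1) a + w * (2 / (1 + exp(-2 * z)) - 1)
relu      <- function(z, a = 0, w = 1) a + w * pmax(0, z)

sigmoid(0)            # 0.5: the sigmoid is centred at z = 0
heaviside(c(-1, 2))   # 0 1
```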

## Visualising an artificial neuron

• When x_1 = 1 and x_2 = 3, then \begin{align*}z &= \beta_0 + \beta_1x_1 + \beta_2 x_2\\ &= 1 + 0.5 \times 1 - 3\times 3 = -7.5.\end{align*}

• Using ReLU with a = 1, w = 2, the prediction is 1 + 2 \times \max(0, z) = 1.
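We can verify this computation with a short base-R sketch:

```r
# z = beta^T x with beta = (1, 0.5, -3) and x = (1, 1, 3)
z <- sum(c(1, 0.5, -3) * c(1, 1, 3))   # -7.5
# ReLU with location a = 1 and scale w = 2
1 + 2 * max(0, z)
#> [1] 1
```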

## Predicting from an artificial neuron

• What is the prediction when x_1 = 10? Answer: 3.

• What is the prediction when x_1 = -5? Answer: -11.

# Activation function

## Heaviside step

h(z_i) = \mathbb{I}(z_i > 0)

• This is also known as the perceptron and is used for classification.
• For example: h(1 + 2x_i) and h(-12 + 4x_i).

## Linear

h(z_i) = z_i

• This is a regression model!
• For example: h(1 + 2x_i) and h(-12 + 4x_i).

## Sigmoid (Logistic)

h(z_i|a,w) = a + w(1+e^{-z_i})^{-1}

• When a = 0 and w = 1, then 0 < h(z_i) < 1 for finite z_i.
• E.g. h(1 + 2x_i|a = 0, w = 1) and h(-12 + 4x_i|a = -1, w = 2).

## Hyperbolic tangent (Tanh)

h(z_i|a,w) = a + w \left(\frac{2}{1+e^{-2z_i}} - 1 \right)

• Similar in shape to the Sigmoid, but when a = 0 and w = 1 the output lies in (-1, 1) and is centred at zero.
• E.g. h(1 + 2x_i|a = 0, w = 1) and h(-12 + 4x_i|a = -1, w = 2).

## Rectified linear unit (ReLU)

h(z_i|a,w) = a + w \times \max(0, z_i)

• In modern practice, the Sigmoid and Tanh activations have largely been replaced by the rectified linear unit (ReLU).
• E.g. h(1 + 2x_i|a = 0, w = 1) and h(-12 + 4x_i|a = -1, w = 2).
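To see the shapes side by side, a quick base-R plotting sketch (with a = 0 and w = 1 throughout):

```r
z <- seq(-4, 4, by = 0.01)
plot(z, pmax(0, z), type = "l", ylab = "h(z)", ylim = c(-1, 2))  # ReLU
lines(z, 1 / (1 + exp(-z)), lty = 2)                             # Sigmoid
lines(z, tanh(z), lty = 3)                                       # Tanh
legend("topleft", legend = c("ReLU", "Sigmoid", "Tanh"), lty = 1:3)
```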

# Neural network

## Limitations of a single artificial neuron

• Biological neurons are interconnected in complex networks that allow the brain to perform a wide range of functions.
• A single artificial neuron has a limited ability to model complex relationships, so, much like its biological counterpart, we can connect artificial neurons together to model complex relationships.
• An artificial neural network, an interconnection of artificial neurons, began by mimicking the architecture of the brain.
• But neural networks are no longer true to their biological counterpart; their development is now motivated by empirical results.

## Multiple artificial neurons

• When combining artificial neurons, we always set a = 0 and w = 1 for Sigmoid, Tanh, and ReLU.
(Diagram: two artificial neurons whose outputs are combined into a single prediction.)

## Combining artificial neurons for regression

• In general, we combine K neurons as f(\boldsymbol{x}_i) = b + \sum_{k=1}^K w_k h(\boldsymbol{\beta}_k^\top\boldsymbol{x}_i), where
• \boldsymbol{\beta}_k is the coefficient vector of the predictors in the k-th artificial neuron,
• b is called the bias, and
• w_k is the weight corresponding to the k-th neuron.
• The same activation function h is used for every neuron (see the sketch below).
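As a sketch of this formula (assuming ReLU activation; `B`, `w`, and `b` are illustrative names):

```r
relu <- function(z) pmax(0, z)

# f(x) = b + sum_k w_k * h(beta_k^T x), with beta_k stored as columns of B
f <- function(x, B, w, b) {
  z <- drop(crossprod(B, x))   # z_k = beta_k^T x for k = 1, ..., K
  b + sum(w * relu(z))
}
```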

## Example of combining artificial neurons

f(\boldsymbol{x}_i) = b + w_1 h(\boldsymbol{\beta}_1^\top\boldsymbol{x}_i) + w_2 h(\boldsymbol{\beta}_2^\top\boldsymbol{x}_i)

This represents a neural network with

• 2 nodes in the input layer,
• 2 nodes in the middle layer, and
• 1 node in the output layer with parameters:
• b = 0,
• w_1 = 0.5,
• w_2 = 0.9,
• \boldsymbol{\beta}_1 = (3, -5)^\top,
• \boldsymbol{\beta}_2 = (1, 0.5)^\top, and
• h is the ReLU activation function with a = 0 and w = 1.
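Evaluating this network at an illustrative input, say x = (1, 2), using the sketch from the previous slide:

```r
relu <- function(z) pmax(0, z)
B <- cbind(c(3, -5), c(1, 0.5))   # beta_1 and beta_2 as columns
w <- c(0.5, 0.9); b <- 0
x <- c(1, 2)                      # illustrative input, not from the slides
b + sum(w * relu(drop(crossprod(B, x))))
#> [1] 1.8   (neuron 1 is inactive since 3 - 5 * 2 < 0; neuron 2 outputs 2)
```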

## Combining artificial neurons for classification

• The output of the previous example is only applicable to regression problems.
• We can easily modify it for classification by changing the output layer to, say, the Sigmoid function, which gives a numerical value between 0 and 1 and can be thought of as a propensity score:

P(y_i = 1 | \boldsymbol{x}_i) = \frac{1}{1 + \exp\left(-\left(b + \sum_{k=1}^K w_k h(\boldsymbol{\beta}_k^\top\boldsymbol{x}_i)\right)\right)}.
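A minimal sketch of this classification output, reusing the regression sketch from earlier:

```r
relu <- function(z) pmax(0, z)

# P(y = 1 | x): pass the combined neuron output through the sigmoid
p_class1 <- function(x, B, w, b) {
  f <- b + sum(w * relu(drop(crossprod(B, x))))
  1 / (1 + exp(-f))
}
```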

## Regression vs classification

• In a neural network, initial layers can be identical for regression or classification.
• It is the output layer that determines if it can be used for regression or classification!

# Multi-class classification

## Multi-class classification

• The Sigmoid function allows for the computation of propensity scores for binary outcomes.
• If you have more than two classes in your response, then you need to convert it to dummy variables, otherwise referred to as one-hot encoding.
• So for a categorical variable with m levels, y_i \in \{\text{Class } 1, \dots, \text{Class } m\}, we convert it as: y_{ik} = \begin{cases}1 & \text{if } y_i = \text{Class } k\\0 & \text{if } y_i \neq \text{Class } k\end{cases}

## From categorical variable to dummy variables

```r
library(tibble)   # for tibble()
dat <- tibble(pet = c("cat", "dog", "cat", "fish"))
model.matrix(~ pet - 1, data = dat)
```

```
  petcat petdog petfish
1      1      0       0
2      0      1       0
3      1      0       0
4      0      0       1
attr(,"assign")
[1] 1 1 1
attr(,"contrasts")
attr(,"contrasts")$pet
[1] "contr.treatment"
```

Alternatively:

```r
class_levels <- as.numeric(factor(dat$pet)) - 1  # cat = 0, dog = 1, fish = 2
keras::to_categorical(class_levels, num_classes = 3)
```

```
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    1    0    0
[4,]    0    0    1
```

## Softmax activation function

• The Sigmoid function only works for m = 2.

• For m > 2, we can use the Softmax activation function instead: P(y_{ij} = 1 | \boldsymbol{x}_i) = \frac{\exp(\boldsymbol{\beta}_j^\top\boldsymbol{x}_i)}{\sum_{k=1}^m\exp\left(\boldsymbol{\beta}_k^\top\boldsymbol{x}_i\right)}.

• The number of neurons for the Softmax layer must be m.

• Note that \sum_{j=1}^m P(y_{ij} = 1 | \boldsymbol{x}_i) = 1.
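A sketch of the Softmax in base R; subtracting max(z) before exponentiating leaves the result unchanged but avoids numerical overflow:

```r
softmax <- function(z) {
  ez <- exp(z - max(z))
  ez / sum(ez)
}
# The hidden-layer outputs from the worked example on the next slide:
softmax(c(32.97, 19.07, 14.03))
#> approx: 1.0e+00 9.2e-07 5.9e-09
```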

## An illustration of a Softmax layer

• Suppose we have the income of a customer in thousands of dollars.
• We want to predict whether the customer will buy a “cheap”, “average” or “expensive” brand of clothing.
• What is the probability that a customer with an income of $45K will buy a cheap brand based on this trained neural network?

## Solution

Layer 2

• ReLU: \max(0, 49.62 - 0.37 \times 45) = 32.97
• ReLU: \max(0, 27.62 - 0.19 \times 45) = 19.07
• ReLU: \max(0, -1.72 + 0.35 \times 45) = 14.03

Output layer

• Cheap: \frac{\exp(32.97)}{\exp(32.97) + \exp(19.07) + \exp(14.03)} = 0.9999991
• Average: \frac{\exp(19.07)}{\exp(32.97) + \exp(19.07) + \exp(14.03)} = 0.00000092
• Expensive: \frac{\exp(14.03)}{\exp(32.97) + \exp(19.07) + \exp(14.03)} = 0.0000000059

Prediction: The customer will buy the cheap brand.

# Building a neural network structure with R

## Installing keras

• Keras is an open-source software library that uses the TensorFlow library to fit artificial neural networks.
• We can use the Keras library through the keras package in R.
• To install keras, run the following commands:

```r
install.packages("keras")
library(keras)
install_keras(method = c("conda"), conda = "auto",
              tensorflow = "default",
              extra_packages = "tensorflow-hub")
```

• Be warned that the installation often poses issues, typically due to keras looking in the wrong location for the Keras library.

## Building a neural network structure in R

```r
library(keras)
model <- keras_model_sequential() %>%
  layer_dense(units = 3, input_shape = 2, activation = "relu") %>%
  layer_dense(units = 3, activation = "softmax")
```

• keras_model_sequential() must be used first to initialise the architecture.
• layer_dense() indicates a new layer with:
• units indicating the number of neurons in that layer,
• input_shape indicating the number of predictors (only needed for the first layer_dense()),
• activation specifying the activation function for that layer.

## Examining the weights and biases

• You can extract the weights and biases using get_weights():

```r
get_weights(model)
```

```
[[1]]
           [,1]       [,2]       [,3]
[1,]  0.4419321 -0.3501425 -0.0361855
[2,] -0.9320850 -0.9673537  0.7750421

[[2]]
[1] 0 0 0

[[3]]
          [,1]       [,2]      [,3]
[1,] 0.8876681 -0.5565016 0.4099801
[2,] 0.8715661  0.2542915 0.3897784
[3,] 0.9583607  0.3416290 0.6392198

[[4]]
[1] 0 0 0
```

• Every even entry is the bias of the nodes (here all 0s).
• The weights are given in every odd entry in the order of the layers.

## Manually setting the weights

• Let’s manually set the weights to those from the previous example:

```r
w <- get_weights(model)
w[[1]] <- matrix(c(49.62, -0.37, 27.62, -0.19, -1.72, 0.35), nrow = 2)
w[[3]] <- diag(3)
set_weights(model, w)
get_weights(model)
```

```
[[1]]
      [,1]  [,2]  [,3]
[1,] 49.62 27.62 -1.72
[2,] -0.37 -0.19  0.35

[[2]]
[1] 0 0 0

[[3]]
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1

[[4]]
[1] 0 0 0
```

## Prediction from neural network model

• Normally we need to train the model, but this will be covered next week.
• Suppose the manually set weights are the result of a trained model.
• You can predict the probability that a customer with an income of $45K will buy a cheap, average or expensive brand with:
```r
predict(model, cbind(1, 45))
```

```
         [,1]        [,2]         [,3]
[1,] 0.999999 9.18979e-07 5.949233e-09
```
• Note that you need to have the new data in a matrix format with the intercept!
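For intuition, the keras prediction can be reproduced by hand with the manually set weights (a base-R sketch; the output weights are the identity matrix, so the Softmax is applied directly to the hidden-layer outputs):

```r
x <- c(1, 45)   # intercept column and income (in $K)
B <- matrix(c(49.62, -0.37, 27.62, -0.19, -1.72, 0.35), nrow = 2)
z <- pmax(0, drop(x %*% B))   # hidden layer: ReLU(beta_k^T x)
ez <- exp(z - max(z))         # Softmax output layer (stable form)
ez / sum(ez)
#> approx: 1.0e+00 9.2e-07 5.9e-09
```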

# Takeaways

• Neural networks are flexible models that can be used for both regression and classification problems.
• The activation function in the output layer determines whether the neural network is used for regression or classification.
• More on neural networks next week!