Introduction to Machine Learning
Lecturer: Emi Tanaka
Department of Econometrics and Business Statistics
This lecture benefited from the lecture notes by Dr. Ruben Loaiza-Maya.
Here, the addition of 2 neurons increases the number of parameters by 10.
[Diagram: a feed-forward neural network built up layer by layer, with input layer \boldsymbol{x} = (1, x_1)^\top, hidden Layers 2 and 3, and an output layer.]
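One way to see where the 10 comes from (the layer sizes below are assumptions for illustration; the exact count depends on the diagram): if each new neuron receives the 2 inputs (1, x_1) and sends its output to 3 neurons in the next layer, it adds 2 + 3 = 5 parameters, so 2 new neurons add 2 × 5 = 10 parameters.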
Mean squared error (regression):
\text{MSE}(\boldsymbol{\theta}) = \frac{1}{n}\sum_{i=1}^n\left(y_i - f(\boldsymbol{x}_i~|~\boldsymbol{\theta})\right)^2
Binary cross-entropy (binary classification):
\text{BCE}(\boldsymbol{\theta}) = -\frac{1}{n}\sum_{i=1}^n\left\{y_i\log(P(y_i=1~|~\boldsymbol{x}_i,\boldsymbol{\theta})) + (1-y_i)\log(1 - P(y_i=1~|~\boldsymbol{x}_i,\boldsymbol{\theta}))\right\}
Categorical cross-entropy (multi-class classification with m classes):
\text{CE}(\boldsymbol{\theta}) = -\frac{1}{n}\sum_{i=1}^n\sum_{j=1}^m y_{ij}\log(P(y_{ij} = 1~|~\boldsymbol{x}_i,\boldsymbol{\theta}))
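As a minimal sketch (not from the lecture notes), these losses can be computed directly in base R; y, yhat, p, Y and P below are hypothetical observed responses and model outputs.
# Mean squared error for a numeric response
mse <- function(y, yhat) mean((y - yhat)^2)
# Binary cross-entropy: y is 0/1, p = P(y = 1 | x, theta)
bce <- function(y, p) -mean(y * log(p) + (1 - y) * log(1 - p))
# Categorical cross-entropy: Y is an n x m one-hot matrix, P an n x m matrix of probabilities
ce  <- function(Y, P) -mean(rowSums(Y * log(P)))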
The parameters are updated iteratively by gradient descent,
\theta^{[s+1]} = \theta^{[s]} - r \left.\frac{\partial\text{MSE}(\theta)}{\partial\theta}\right\vert_{\theta=\theta^{[s]}},
for s = 0, 1, 2, \cdots, where r is the learning rate.
[Illustration: the updates are repeated over successive passes through the data, Epoch 1, Epoch 2, …]
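A minimal sketch of mini-batch gradient descent for a simple linear model, to make the update rule and the notion of an epoch concrete; the data, model and tuning values (r, batch_size, number of epochs) are made up for illustration and are not the lecture's code.
set.seed(1)
n <- 200
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.1)           # simulated data

theta <- c(0, 0)                              # starting values theta^[0]
r <- 0.1                                      # learning rate
batch_size <- 20

for (epoch in 1:50) {                         # one epoch = one pass over the data
  idx <- sample(n)                            # shuffle the observations
  for (b in seq(1, n, by = batch_size)) {
    i <- idx[b:(b + batch_size - 1)]
    resid <- y[i] - (theta[1] + theta[2] * x[i])
    grad <- c(-2 * mean(resid),               # d MSE / d theta_0
              -2 * mean(resid * x[i]))        # d MSE / d theta_1
    theta <- theta - r * grad                 # theta^[s+1] = theta^[s] - r * gradient
  }
}
theta                                         # should end up near c(1, 2)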
In evaluating \nabla \hat{\text{MSE}}(\boldsymbol{\theta}), we must evaluate \nabla (y_i - f(\boldsymbol{x}_i~|~\boldsymbol{\theta}))^2.
Now we have: \nabla (y_i - f(\boldsymbol{x}_i~|~\boldsymbol{\theta}))^2 = - 2(y_i - f(\boldsymbol{x}_i~|~\boldsymbol{\theta}))\times \underbrace{\frac{\partial f(\boldsymbol{x}_i|\boldsymbol{\theta})}{\partial \boldsymbol{a}_i^{(L - 1)}}}_{\text{Layer }L -1}\times \underbrace{\frac{\partial \boldsymbol{a}_i^{(L - 1)}}{\partial \boldsymbol{a}_i^{(L - 2)}}}_{\text{Layer }L -2}\times \underbrace{\frac{\partial \boldsymbol{a}_i^{(L - 2)}}{\partial \boldsymbol{a}_i^{(L - 3)}}}_{\text{Layer }L -3}\times \cdots \times \underbrace{\frac{\partial \boldsymbol{a}_i^{(2)}}{\partial \boldsymbol{\theta}}}_{\text{Layer }1}.
The gradients for the later layers are calculated first, i.e. the gradient is propagated backwards through the network.
For parameter \theta, we can show \frac{\partial (y_i - f(\boldsymbol{x}_i~|~\boldsymbol{\theta}))^2}{\partial \theta} = \text{constant}\times \underbrace{\frac{\partial h_L(a_i^{(L-1)})}{\partial a_i^{(L-1)}}}_{\text{Layer }L -1}\times \underbrace{\frac{\partial h_{L-1}(a_i^{(L-2)})}{\partial a_i^{(L-2)}}}_{\text{Layer }L -2}\times \underbrace{\frac{\partial h_{L-2}(a_i^{(L-3)})}{\partial a_i^{(L-3)}}}_{\text{Layer }L -3}\times \cdots \times \underbrace{\frac{\partial h_2(x_i~|~\theta)}{\partial \theta}}_{\text{Layer }1}.
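A minimal sketch (not from the lecture) of this chain rule for a tiny network with one neuron per layer and sigmoid activations, checked against a finite-difference approximation; all names (w1, w2, w3, x, y) are made up for illustration.
sigmoid  <- function(z) 1 / (1 + exp(-z))
dsigmoid <- function(z) sigmoid(z) * (1 - sigmoid(z))

x <- 0.5; y <- 1
w1 <- 0.2; w2 <- -0.4; w3 <- 0.7

f <- function(w1, w2, w3, x) {
  a2 <- sigmoid(w1 * x)                       # layer 2 activation
  a3 <- sigmoid(w2 * a2)                      # layer 3 activation
  w3 * a3                                     # linear output layer
}

# Gradient of (y - f)^2 with respect to w1 via the chain rule
a2 <- sigmoid(w1 * x)
grad_w1 <- -2 * (y - f(w1, w2, w3, x)) *      # the "constant" term
  w3 * dsigmoid(w2 * a2) *                    # later layer, computed first
  w2 * dsigmoid(w1 * x) * x                   # earlier layer, computed last

# Finite-difference check: the two numbers should agree
eps <- 1e-6
num <- ((y - f(w1 + eps, w2, w3, x))^2 - (y - f(w1 - eps, w2, w3, x))^2) / (2 * eps)
c(backprop = grad_w1, numerical = num)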
Label | Description |
---|---|
0 | T-shirt/top |
1 | Trouser |
2 | Pullover |
3 | Dress |
4 | Coat |
5 | Sandal |
6 | Shirt |
7 | Sneaker |
8 | Bag |
9 | Ankle boot |
The data contains 784 predictor variables (pixel1, …, pixel784) and 1 response variable (label), labelled from 0-9. We fit the neural network with the keras package; a sketch of one way to prepare the data is shown below.
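A hedged sketch of loading and flattening the data using keras helpers (dataset_fashion_mnist(), array_reshape(), to_categorical()); the lecture may instead read a data frame with columns pixel1, …, pixel784 and label.
library(keras)
fashion <- dataset_fashion_mnist()
x_train <- array_reshape(fashion$train$x, c(nrow(fashion$train$x), 784)) / 255  # scale pixels to [0, 1] (assumed)
y_train <- to_categorical(fashion$train$y, 10)                                  # one-hot encode the 0-9 labels
x_test  <- array_reshape(fashion$test$x, c(nrow(fashion$test$x), 784)) / 255
y_test  <- to_categorical(fashion$test$y, 10)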
library(keras)
NN <- keras_model_sequential() %>%
# hidden layer
layer_dense(units = 128, # number of nodes in hidden layer
activation = "relu",
# number of predictors
input_shape = 784) %>%
# output layer
layer_dense(units = 10, # the number of classes
# we need to use softmax for multi-class classification
activation = "softmax")
NN
Model: "sequential"
________________________________________________________________________________
Layer (type) Output Shape Param #
================================================================================
dense_1 (Dense) (None, 128) 100480
dense (Dense) (None, 10) 1290
================================================================================
Total params: 101,770
Trainable params: 101,770
Non-trainable params: 0
________________________________________________________________________________
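The parameter counts in the summary are weights plus biases: 784 × 128 + 128 = 100,480 for the hidden layer and 128 × 10 + 10 = 1,290 for the output layer, giving 101,770 parameters in total.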
For regression, the loss is mean_squared_error. For binary classification, the loss is binary_crossentropy. For multi-class classification, the loss is categorical_crossentropy. See help("loss-functions") for other available loss functions.
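A minimal sketch of specifying the loss when compiling the model; the optimizer and metric shown are assumptions, not necessarily what was used in the lecture.
NN %>% compile(
  loss = "categorical_crossentropy",  # multi-class classification
  optimizer = "adam",                 # assumed optimizer
  metrics = "accuracy"
)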
This model takes long to fit if epochs is a large number. You can use the plot function on the training history to monitor the loss across epochs. Note that the predictions below use NNl, and not NN!
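A hedged sketch of fitting, plotting and predicting; x_train, y_train, x_test and the epochs, batch_size and validation_split values are assumptions.
history <- NN %>% fit(
  x = x_train, y = y_train,
  epochs = 30,                        # a large number of epochs makes this slow
  batch_size = 128,
  validation_split = 0.2
)
plot(history)                         # loss and accuracy across epochs
predict(NN, x_test[1:3, ])            # predicted class probabilities (the lecture uses the pre-fitted NNl)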
[,1] [,2] [,3] [,4] [,5]
[1,] 9.959991e-01 4.470216e-15 2.498414e-06 1.280446e-06 6.907984e-09
[2,] 2.188681e-21 1.000000e+00 5.769808e-32 9.360376e-28 1.490198e-26
[3,] 6.637161e-03 6.112183e-16 9.681513e-01 3.647272e-13 6.959190e-06
[,6] [,7] [,8] [,9] [,10]
[1,] 2.304835e-11 3.900097e-03 9.305310e-19 9.703544e-05 4.587546e-13
[2,] 0.000000e+00 8.628763e-31 0.000000e+00 4.123140e-29 1.746240e-37
[3,] 8.466854e-11 2.520457e-02 8.891557e-21 6.381754e-13 2.705846e-17
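Each row gives the predicted probability of the 10 classes for one image; assuming the matrix above is stored as probs, the predicted labels can be recovered by taking the most probable class.
predicted_label <- apply(probs, 1, which.max) - 1   # back to the 0-9 label coding
predicted_label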
ETC3250/5250 Week 12