MLP

MLP (Multi-Layer Perceptron) is another name for a feed-forward neural network.

Advantages

  • Good accuracy even on data that is far from linearly separable

  • Can learn complicated functions or concepts

Disadvantages

  • Danger of overfitting

  • Slow to train

Some Notation

Consider a simple 2-layer neural network (the input layer is not counted as a layer):

1 output unit, H hidden nodes (+1 dummy bias unit), d input nodes (+1 bias unit). Fully connected: every node in a layer is connected to every node in the previous layer.

(H+1) + H(d+1) weights to learn.
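For concreteness, here is a minimal sketch of the shapes involved, assuming NumPy and illustrative values H = 3, d = 4 (not from the notes):

```python
import numpy as np

H, d = 3, 4                            # hidden nodes, input dimensionality (illustrative)

W = np.random.randn(H, d + 1) * 0.01   # hidden-layer weights w_hj (the +1 column is the bias)
v = np.random.randn(H + 1) * 0.01      # output weights v_h (the +1 entry is the bias weight v_0)

n_weights = W.size + v.size
assert n_weights == (H + 1) + H * (d + 1)   # matches the count above
print(n_weights)                            # 19 for H=3, d=4
```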

$w_{hj}$ is the weight on the edge from input node $x_j$ to hidden node $h$; $v_h$ is the weight on the edge from hidden node $h$ to the output node.

$Z_0, Z_1, \dots, Z_H$ (with $Z_0 = 1$) are the activations from the hidden layer, usually sigmoid, i.e. $Z_h = \frac{1}{1 + e^{-w_h^T x}}$
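Continuing the sketch above, the hidden activations for a single input x could be computed like this (shapes and names are assumptions):

```python
def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x = np.random.randn(d)     # one input example
x1 = np.append(1.0, x)     # prepend the bias input x_0 = 1

Z = np.empty(H + 1)
Z[0] = 1.0                 # dummy bias unit Z_0 = 1
Z[1:] = sigmoid(W @ x1)    # Z_h = sigmoid(w_h^T x) for h = 1..H
```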

The Error Function for Regression is given by:

$E(W, v) = \frac{1}{2}\sum_t (r^t - y^t)^2$, i.e. the (half) sum of squared errors, with $W$ = the weights $w_{hj}$, $v$ = the weights $v_h$, and $y = v^T Z$, i.e. a _linear output activation_, i.e. $y = v_0 \cdot 1 + v_1 Z_1 + \dots + v_H Z_H$
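Continuing the same sketch, the regression output and error over a hypothetical training set (`Zs`: an N x (H+1) matrix of hidden activations, `r`: N targets; names are assumptions) could look like:

```python
y = v @ Z                                  # linear output: y = v_0*1 + v_1*Z_1 + ... + v_H*Z_H

def regression_error(v, Zs, r):
    y = Zs @ v                             # one output per training example
    return 0.5 * np.sum((r - y) ** 2)      # E(W, v) = 1/2 * sum_t (r^t - y^t)^2
```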

The Error Function for Classification is given by:

E(W,v)=trtlogyt+(1rt)log(1yt)E(W,v) = -\sum_t r^tlogy^t + (1-r^t)log(1-y^t) i.e. the cross entropy loss

where $y = \frac{1}{1 + e^{-v^T Z}}$, i.e. the sigmoid of $v^T Z$
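A corresponding sketch for the classification case, reusing `sigmoid` and the hypothetical `Zs` and 0/1 labels `r` from above:

```python
def classification_error(v, Zs, r):
    y = sigmoid(Zs @ v)                    # y^t = sigmoid(v^T Z^t)
    # cross-entropy: E = -sum_t [ r^t log y^t + (1 - r^t) log(1 - y^t) ]
    return -np.sum(r * np.log(y) + (1 - r) * np.log(1 - y))
```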

Batch Gradient Descent

We must find W, v that minimize the error.

1. Initialize W and v
2. Repeat until convergence:
       compute y^t for each x^t in the training set
       update each v_h and w_hj (see the NumPy sketch below):
       v_h  = v_h  - eta * dE/dv_h
       w_hj = w_hj - eta * dE/dw_hj
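A minimal NumPy sketch of that batch loop for the regression case; `gradients` is a hypothetical helper that evaluates the partial derivatives given below, and the learning rate, initialization scale, and fixed epoch count are illustrative choices, not from the notes:

```python
def train(X, r, H, eta=0.01, n_epochs=1000):
    N, d = X.shape
    X1 = np.hstack([np.ones((N, 1)), X])           # add the bias input x_0 = 1
    W = np.random.randn(H, d + 1) * 0.01           # initialize W and v
    v = np.random.randn(H + 1) * 0.01

    for _ in range(n_epochs):                      # "repeat until convergence" (fixed count here)
        Z = np.hstack([np.ones((N, 1)), sigmoid(X1 @ W.T)])  # hidden activations, Z_0 = 1
        y = Z @ v                                  # compute y^t for every x^t in the batch
        dE_dv, dE_dW = gradients(X1, Z, y, r, v)   # partial derivatives, see below
        v -= eta * dE_dv                           # v_h  = v_h  - eta * dE/dv_h
        W -= eta * dE_dW                           # w_hj = w_hj - eta * dE/dw_hj
    return W, v
```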

We have:

$\partial E/\partial v_h = -\sum_t (r^t - y^t)\, Z_h^t$

$\partial E/\partial w_{hj} = -\sum_t (r^t - y^t)\, v_h Z_h^t (1 - Z_h^t)\, x_j^t$

(computed using the chain rule, i.e. $\partial E/\partial w_{hj} = \sum_t \frac{\partial E}{\partial y^t} \cdot \frac{\partial y^t}{\partial Z_h^t} \cdot \frac{\partial Z_h^t}{\partial w_{hj}}$)
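The hypothetical `gradients` helper used in the training sketch above is then a direct vectorized transcription of these two expressions:

```python
def gradients(X1, Z, y, r, v):
    err = r - y                                     # (r^t - y^t), shape (N,)
    dE_dv = -(Z.T @ err)                            # dE/dv_h = -sum_t (r^t - y^t) Z_h^t
    # dE/dw_hj = -sum_t (r^t - y^t) v_h Z_h^t (1 - Z_h^t) x_j^t   (hidden units h = 1..H)
    delta = -(err[:, None] * v[1:] * Z[:, 1:] * (1 - Z[:, 1:]))   # shape (N, H)
    dE_dW = delta.T @ X1                            # shape (H, d+1), matches W
    return dE_dv, dE_dW
```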

This technique is called backpropagation.
