MLP
MLP (Multi-Layer Perceptron) is another name for a feedforward neural network.
Advantages
Good accuracy even on data that is far from linearly separable
Can learn complicated functions or concepts
Disadvantages
Danger of overfitting
Slow to train
Some Notation
Consider a simple 2-layer neural network (the input layer is not counted as a layer):
1 output unit, $H$ hidden nodes (+1 dummy bias unit), $d$ input nodes (+1 bias unit).
Fully connected: every node in a layer is connected to every node in the previous layer.
$(H+1) + H(d+1)$ weights to learn.
$w_{hj}$ is the weight on the edge from input node $x_j$ to hidden node $h$; $v_h$ is the weight on the edge from hidden node $h$ to the output node.
$Z_0, Z_1, \ldots, Z_H$ (with $Z_0 = 1$) are the activations from the hidden layer, usually sigmoid, i.e. $Z_h = \frac{1}{1 + e^{-\mathbf{w}_h^T \mathbf{x}}}$
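To make the notation concrete, here is a minimal NumPy sketch of the hidden-layer forward pass; the function name `hidden_activations` and the array layout are choices made for this illustration, not part of the original notes:

```python
import numpy as np

def hidden_activations(W, x):
    """Forward pass through the hidden layer.

    W : (H, d+1) array of weights w_hj; column 0 multiplies the bias input x_0 = 1
    x : (d,) raw input vector
    Returns Z : (H+1,) hidden activations, with the dummy bias unit Z_0 = 1 prepended.
    """
    x_aug = np.concatenate(([1.0], x))       # prepend bias input x_0 = 1
    Z = 1.0 / (1.0 + np.exp(-W @ x_aug))     # sigmoid: Z_h = 1 / (1 + e^{-w_h^T x})
    return np.concatenate(([1.0], Z))        # prepend dummy bias unit Z_0 = 1
```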
The Error Function for Regression is given by:
$E(\mathbf{W}, \mathbf{v}) = \frac{1}{2}\sum_t (r^t - y^t)^2$, i.e. the sum-of-squares error, with $\mathbf{W}$ = the weights $w_{hj}$, $\mathbf{v}$ = the weights $v_h$, and $y = \mathbf{v}^T \mathbf{Z}$, i.e. a _linear activation function_: $y = v_0 \cdot 1 + v_1 Z_1 + \ldots + v_H Z_H$
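A corresponding sketch of the regression output and error, reusing `hidden_activations` from the snippet above (the names `X` for an N×d batch of inputs and `r` for the N targets are assumptions of this example):

```python
def regression_error(W, v, X, r):
    """Sum-of-squares error E(W, v) = 1/2 * sum_t (r^t - y^t)^2."""
    Z = np.array([hidden_activations(W, x) for x in X])  # (N, H+1)
    y = Z @ v                                            # linear output y = v^T Z
    return 0.5 * np.sum((r - y) ** 2)
```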
The Error Function for Classification is given by:
$E(\mathbf{W}, \mathbf{v}) = -\sum_t \left[ r^t \log y^t + (1 - r^t) \log(1 - y^t) \right]$, i.e. the cross-entropy loss,
where $y = \frac{1}{1 + e^{-\mathbf{v}^T \mathbf{Z}}}$, i.e. the sigmoid function
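And the classification counterpart, with a sigmoid on the output and the cross-entropy loss (same assumed helper and names as in the sketches above):

```python
def classification_error(W, v, X, r):
    """Cross-entropy E(W, v) = -sum_t [r^t log y^t + (1 - r^t) log(1 - y^t)]."""
    Z = np.array([hidden_activations(W, x) for x in X])  # (N, H+1)
    y = 1.0 / (1.0 + np.exp(-(Z @ v)))                   # sigmoid output
    return -np.sum(r * np.log(y) + (1 - r) * np.log(1 - y))
```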
Batch Gradient Descent
We must find W, v that minimize the error.
We have:
$\frac{\partial E}{\partial v_h} = -\sum_t (r^t - y^t)\, Z_h^t$
$\frac{\partial E}{\partial w_{hj}} = -\sum_t (r^t - y^t)\, v_h\, Z_h^t (1 - Z_h^t)\, x_j^t$
(computed using the chain rule, i.e. $\frac{\partial E}{\partial w_{hj}} = \sum_t \frac{\partial E}{\partial y^t} \cdot \frac{\partial y^t}{\partial Z_h^t} \cdot \frac{\partial Z_h^t}{\partial w_{hj}}$)
This technique is called backpropagation.
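Putting it together, here is a minimal batch-gradient-descent sketch for the regression case that implements exactly the two gradients above; the learning rate `eta`, the initialization scale, and the iteration count are illustrative choices, not prescribed by these notes:

```python
def train_batch_gd(X, r, H, eta=0.01, n_iters=1000, seed=0):
    """Batch gradient descent for the 2-layer regression MLP."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=0.1, size=(H, d + 1))  # hidden weights w_hj
    v = rng.normal(scale=0.1, size=H + 1)       # output weights v_h (v_0 = bias)
    for _ in range(n_iters):
        Z = np.array([hidden_activations(W, x) for x in X])  # (N, H+1)
        y = Z @ v                                            # linear outputs
        err = r - y                                          # (r^t - y^t)
        # dE/dv_h = -sum_t (r^t - y^t) Z_h^t
        grad_v = -err @ Z
        # dE/dw_hj = -sum_t (r^t - y^t) v_h Z_h^t (1 - Z_h^t) x_j^t
        X_aug = np.hstack([np.ones((X.shape[0], 1)), X])     # inputs with bias x_0 = 1
        delta = err[:, None] * v[1:] * Z[:, 1:] * (1 - Z[:, 1:])  # (N, H)
        grad_W = -delta.T @ X_aug
        v -= eta * grad_v
        W -= eta * grad_W
    return W, v
```

Note that every iteration uses the entire training set to compute the gradients, which is what makes this *batch* (rather than stochastic) gradient descent.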