Shallow Neural Network
As discussed earlier, a neural network has layers of neurons.
Every neural network has an input layer and an output layer, with zero or more hidden layers in between.
The following image depicts a 2-layer neural network. It has an input layer, a hidden layer, and an output layer. Note that the input layer is not counted, so it is called a 2-layer neural network even though it technically has three layers.
Every node of each layer is connected to every node of the next layer.
Every edge is associated with a weight w and every node is associated with a bias b. At each node we compute z = w·x + b (the weighted sum of the node's inputs plus its bias) and then apply an activation function to z.
The overall computation in a 2-layer neural network is as follows:
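In equation form (using the sigmoid activation σ for both layers for now, and a bracketed superscript for the layer number):

$$
\begin{aligned}
z^{[1]} &= W^{[1]} x + b^{[1]}, \qquad a^{[1]} = \sigma\big(z^{[1]}\big) \\
z^{[2]} &= W^{[2]} a^{[1]} + b^{[2]}, \qquad a^{[2]} = \sigma\big(z^{[2]}\big) = \hat{y}
\end{aligned}
$$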
Here W[l] is the weight matrix of layer l, with one row of weights per neuron in that layer; b[l] is the column vector of biases for that layer; and a[l] is the column vector of activations of the layer, which becomes the input to the next layer. The final activation a[2] is the required probability that the given input belongs to the positive class.
The above equations are for a single training example. To compute across all m training examples at once, stack the examples column-wise: X is the matrix whose columns are the individual x vectors, and Z[l] and A[l] are built the same way from the z and a vectors, giving Z[1] = W[1]X + b[1], A[1] = σ(Z[1]), and so on.
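A minimal NumPy sketch of this vectorized forward pass (the variable names and layer sizes are illustrative, and sigmoid is used throughout, as in the discussion so far):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_pass(X, W1, b1, W2, b2):
    """Vectorized 2-layer forward pass.

    X  : (n_x, m) -- one training example per column
    W1 : (n_h, n_x), b1 : (n_h, 1)
    W2 : (1, n_h),   b2 : (1, 1)
    """
    Z1 = W1 @ X + b1          # broadcasting adds b1 to every column of W1 @ X
    A1 = sigmoid(Z1)          # hidden-layer activations
    Z2 = W2 @ A1 + b2
    A2 = sigmoid(Z2)          # probabilities of the positive class, shape (1, m)
    return A1, A2
```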
We have been using the sigmoid activation function, but there are other activation functions that may be more effective.
For the output layer we must keep the sigmoid activation function, because the network has to output a probability, i.e. a value between 0 and 1. For the intermediate hidden layers, however, we can use other activation functions.
Sigmoid Activation Function (σ(z) = 1 / (1 + e^(-z)))
This will always result in a value between 0 and 1.
tanh Activation Function
This will always result in a value between -1 and 1.
ReLU (Rectified Linear Unit) Activation Function
This outputs max(0, z): 0 for negative inputs and the input itself otherwise. Example code:
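A minimal NumPy sketch of the three activations discussed above (the function names are illustrative):

```python
import numpy as np

def sigmoid(z):
    """Maps any real value into (0, 1); used for the output layer."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Maps any real value into (-1, 1); zero-centered."""
    return np.tanh(z)

def relu(z):
    """Returns z for positive inputs and 0 otherwise: max(0, z)."""
    return np.maximum(0.0, z)
```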
Backpropagation is the procedure for computing the derivatives (gradients) of the loss with respect to the w and b values; gradient descent then uses these gradients to learn optimal w and b. To perform backpropagation, we move from right to left through the neural network (from the output layer back toward the input) and compute the derivatives layer by layer.
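As a sketch, assuming a tanh hidden layer, a sigmoid output with cross-entropy loss, and m training examples stacked column-wise, the gradients for the 2-layer network are:

$$
\begin{aligned}
dZ^{[2]} &= A^{[2]} - Y \\
dW^{[2]} &= \frac{1}{m}\, dZ^{[2]} A^{[1]\top}, \qquad db^{[2]} = \frac{1}{m} \sum_{i=1}^{m} dZ^{[2](i)} \\
dZ^{[1]} &= W^{[2]\top} dZ^{[2]} \odot \big(1 - (A^{[1]})^{2}\big) \\
dW^{[1]} &= \frac{1}{m}\, dZ^{[1]} X^{\top}, \qquad db^{[1]} = \frac{1}{m} \sum_{i=1}^{m} dZ^{[1](i)}
\end{aligned}
$$

where ⊙ is element-wise multiplication and 1 − (A[1])² is the derivative of tanh. The parameters are then updated as W := W − α·dW and b := b − α·db, where α is the learning rate.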
In the case of Logistic Regression, we could initialize w and b to 0.
However, for neural networks, w must be initialized randomly (b can still be initialized to 0). If we initialize w to 0, every neuron in a hidden layer performs exactly the same computation, receives exactly the same gradient, and therefore remains identical to its neighbours even after many iterations of gradient descent; the layer learns nothing more expressive than a single neuron.
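A small illustration of this (the layer sizes here are hypothetical):

```python
import numpy as np

n_x, n_h, m = 3, 4, 5                 # hypothetical input size, hidden size, batch size
X = np.random.randn(n_x, m)

W1 = np.zeros((n_h, n_x))             # symmetric (all-zero) initialization
b1 = np.zeros((n_h, 1))
A1 = np.tanh(W1 @ X + b1)

# Every hidden neuron produces exactly the same output, so each one receives
# the same gradient and the neurons never differentiate from one another.
print(np.allclose(A1, A1[0]))         # True
```

Initializing the weights with small random values breaks this symmetry: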
```python
import numpy as np

w = np.random.randn(*dimensions) * 0.01   # dimensions: shape of the layer's weight matrix, e.g. (n_h, n_x)
```
We usually multiply by 0.01 (for shallow neural networks) to keep the initial weights small: with sigmoid or tanh activations, large weights make z large, pushing the activations into the flat tails of the curve where the gradients are close to zero, which slows down training.
The superscript in square brackets (e.g. W[1], b[1]) denotes the layer number.
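Putting the initialization together for the 2-layer network (a minimal sketch; the layer sizes n_x, n_h, n_y are assumptions):

```python
import numpy as np

def initialize_parameters(n_x, n_h, n_y):
    """Small random weights and zero biases for a 2-layer network."""
    W1 = np.random.randn(n_h, n_x) * 0.01   # hidden-layer weights
    b1 = np.zeros((n_h, 1))
    W2 = np.random.randn(n_y, n_h) * 0.01   # output-layer weights
    b2 = np.zeros((n_y, 1))
    return {"W1": W1, "b1": b1, "W2": W2, "b2": b2}
```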
Using the ReLU activation function in the hidden layers usually makes the neural network learn much faster, because its gradient does not shrink toward zero for positive inputs the way the sigmoid and tanh gradients do. This is why it is the most commonly used hidden-layer activation.