Optimization Algorithms

This chapter deals with optimization algorithms that speed up the training of neural networks.

Mini-Batch Gradient Descent

Deep Learning needs extremely large training sets, and performing Gradient Descent on such large training sets is very slow.

To speed this up, we split the training set into smaller mini-batches and then perform Gradient Descent on the mini-batches in a loop. This is much faster than performing Gradient Descent over the entire training set at once (i.e. Batch Gradient Descent).

Note that when we train using the entire training set at once, the cost should go down with every iteration (assuming a suitable learning rate). When we train on mini-batches, the cost may not go down with every iteration, because each mini-batch is a different sample of the data, but it should still trend downwards.

If mini-batch size = m, we would be performing batch gradient descent i.e. processing the entire training set at once.

If mini-batch size = 1, we would be performing stochastic gradient descent i.e. processing one training example at a time.

We must choose mini-batch size between 1 and m.

Note that if the training set is small (m <= 2000), simply use batch gradient descent. Otherwise, try mini-batch sizes of 64, 128, 256, or 512 (powers of 2 tend to work well with how memory is laid out).

Make sure that the mini-batches fit in CPU/GPU memory!
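As a concrete illustration, here is a minimal NumPy sketch of shuffling a training set, splitting it into mini-batches, and running the gradient descent loop. The linear model, the mean squared error cost, and the function names are assumptions made just for this example, not something prescribed by these notes.

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    """Shuffle the training set (examples as columns) and split it into mini-batches."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]                                   # number of training examples
    permutation = rng.permutation(m)
    X_shuffled, Y_shuffled = X[:, permutation], Y[:, permutation]
    return [(X_shuffled[:, k:k + batch_size], Y_shuffled[:, k:k + batch_size])
            for k in range(0, m, batch_size)]

def train(X, Y, num_epochs=10, learning_rate=0.01):
    """Mini-batch gradient descent for a toy linear model y_hat = W @ X + b."""
    W = np.zeros((1, X.shape[0]))
    b = 0.0
    for epoch in range(num_epochs):
        for X_batch, Y_batch in random_mini_batches(X, Y, batch_size=64, seed=epoch):
            m_batch = X_batch.shape[1]
            Y_hat = W @ X_batch + b                          # forward pass
            dW = (Y_hat - Y_batch) @ X_batch.T / m_batch     # gradients of the MSE cost
            db = np.sum(Y_hat - Y_batch) / m_batch
            W -= learning_rate * dW                          # one gradient descent step per mini-batch
            b -= learning_rate * db
    return W, b
```

Note that the parameters are updated once per mini-batch, so one pass over the data performs roughly m / batch_size gradient descent steps instead of a single one.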

Gradient Descent with Momentum

The basic idea is to compute an exponentially weighted average of the gradients and use that average, rather than the raw gradient, to update the weights. This smooths out oscillations and speeds up gradient descent.

To calculate an exponentially weighted average of a sequence θ_1, θ_2, ..., we use:

v_t = β * v_(t-1) + (1 - β) * θ_t

where β controls how far back the average effectively looks (roughly the last 1/(1 - β) values).
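As a small self-contained sketch of the formula above (the noisy signal below is synthetic, made up just for the example):

```python
import numpy as np

def exp_weighted_average(theta, beta=0.9):
    """Compute v_t = beta * v_(t-1) + (1 - beta) * theta_t for each element of theta."""
    v, averages = 0.0, []
    for theta_t in theta:
        v = beta * v + (1 - beta) * theta_t
        averages.append(v)
    return np.array(averages)

# A noisy signal and its smoothed version (beta = 0.9 averages over roughly the last 10 values)
noisy = np.sin(np.linspace(0, 3, 100)) + 0.3 * np.random.randn(100)
smooth = exp_weighted_average(noisy, beta=0.9)
```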

The momentum algorithm is as follows:

for every iteration t:
    compute dW, db on the current mini-batch
    v_dW = β * v_dW + (1 - β) * dW,   v_db = β * v_db + (1 - β) * db
    W = W - α * v_dW,   b = b - α * v_db

Here α is the learning rate, β (commonly 0.9) is the momentum hyperparameter, and v_dW, v_db are initialized to zero.
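A minimal NumPy sketch of the update above, assuming dW and db are gradients already computed on the current mini-batch; the function names and default hyperparameter values are placeholders for illustration.

```python
import numpy as np

def initialize_velocity(W, b):
    """Velocities start at zero, with the same shapes as the parameters."""
    return np.zeros_like(W), np.zeros_like(b)

def momentum_update(W, b, dW, db, v_dW, v_db, beta=0.9, learning_rate=0.01):
    """One gradient descent with momentum step."""
    v_dW = beta * v_dW + (1 - beta) * dW   # exponentially weighted average of dW
    v_db = beta * v_db + (1 - beta) * db   # exponentially weighted average of db
    W = W - learning_rate * v_dW           # step in the direction of the smoothed gradient
    b = b - learning_rate * v_db
    return W, b, v_dW, v_db
```

Because the velocities carry over from one mini-batch to the next, oscillations in the gradient tend to cancel out while the consistent downhill direction accumulates.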

RMSProp

This stands for Root Mean Square prop. It is also used to speed up gradient descent.

for every iteration t:
    compute dW, db on the current mini-batch
    s_dW = β * s_dW + (1 - β) * dW²,   s_db = β * s_db + (1 - β) * db²      (squares are element-wise)
    W = W - α * dW / (sqrt(s_dW) + ε),   b = b - α * db / (sqrt(s_db) + ε)

Dividing by the square root damps the update in directions where the gradients are large (oscillating) and relatively amplifies it in directions where they are small, so a larger learning rate can be used. ε (e.g. 10⁻⁸) prevents division by zero.
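A minimal NumPy sketch of the update above, again assuming the gradients have already been computed; the function name and default values are placeholders.

```python
import numpy as np

def rmsprop_update(W, b, dW, db, s_dW, s_db,
                   beta=0.999, learning_rate=0.01, epsilon=1e-8):
    """One RMSProp step: scale each gradient by the root of its moving average of squares."""
    s_dW = beta * s_dW + (1 - beta) * dW ** 2   # element-wise squared gradients
    s_db = beta * s_db + (1 - beta) * db ** 2
    W = W - learning_rate * dW / (np.sqrt(s_dW) + epsilon)   # damp directions with large gradients
    b = b - learning_rate * db / (np.sqrt(s_db) + epsilon)
    return W, b, s_dW, s_db
```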

Adam Optimization

This is another algorithm used to speed up gradient descent. It stands for "Adaptive Moment Estimation" and is a combination of momentum and RMSProp.

for every iteration t:
    compute dW, db on the current mini-batch
    v_dW = β₁ * v_dW + (1 - β₁) * dW,   s_dW = β₂ * s_dW + (1 - β₂) * dW²      (likewise for v_db, s_db)
    v_dW_corrected = v_dW / (1 - β₁^t),   s_dW_corrected = s_dW / (1 - β₂^t)      (bias correction; likewise for db)
    W = W - α * v_dW_corrected / (sqrt(s_dW_corrected) + ε)      (likewise for b)

Common defaults are β₁ = 0.9, β₂ = 0.999, and ε = 10⁻⁸; the learning rate α still needs to be tuned.

(Note: an epoch denotes one pass through the data)
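A minimal NumPy sketch of one Adam step, combining the two moving averages above with bias correction; the dictionary layout, function name, and default values are choices made just for this example.

```python
import numpy as np

def adam_update(W, b, dW, db, v, s, t,
                beta1=0.9, beta2=0.999, learning_rate=0.001, epsilon=1e-8):
    """One Adam step; v and s hold the moving averages, t is the step count starting at 1."""
    # Momentum-style moving averages of the gradients
    v["dW"] = beta1 * v["dW"] + (1 - beta1) * dW
    v["db"] = beta1 * v["db"] + (1 - beta1) * db
    # RMSProp-style moving averages of the squared gradients
    s["dW"] = beta2 * s["dW"] + (1 - beta2) * dW ** 2
    s["db"] = beta2 * s["db"] + (1 - beta2) * db ** 2
    # Bias correction matters most during the first few iterations
    v_dW_corr = v["dW"] / (1 - beta1 ** t)
    v_db_corr = v["db"] / (1 - beta1 ** t)
    s_dW_corr = s["dW"] / (1 - beta2 ** t)
    s_db_corr = s["db"] / (1 - beta2 ** t)
    # Combined update
    W = W - learning_rate * v_dW_corr / (np.sqrt(s_dW_corr) + epsilon)
    b = b - learning_rate * v_db_corr / (np.sqrt(s_db_corr) + epsilon)
    return W, b, v, s
```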

