Optimization Algorithms
This chapter deals with algorithms that speed up the training process.
Deep Learning needs extremely large training sets, and performing Gradient Descent on such a large training set takes a very long time.
Instead, we split the training set into smaller training sets called mini-batches. Mini-batch $t$ is denoted $X^{\{t\}}, Y^{\{t\}}$.
Gradient Descent is then performed on the mini-batches in a loop. This is much faster than performing Gradient Descent over the entire training set at once (i.e. Batch Gradient Descent).
Note that when we train using the entire training set at once, the cost must go down with every iteration. However, when we train on mini-batches, the cost may not go down with every iteration, but must still trend downwards.
If mini-batch size = m, we would be performing batch gradient descent, i.e. processing the entire training set at once. Each iteration then takes too long.
If mini-batch size = 1, we would be performing stochastic gradient descent, i.e. processing one training example at a time. This loses the speedup from vectorization.
In practice, we choose a mini-batch size between 1 and m, which keeps the benefit of vectorization while still making fast progress.
Note that if the training set is small (m <= 2000), simply use batch gradient descent. Otherwise, try mini-batch sizes of 64, 128, 256, or 512 (powers of 2 tend to work well).
Make sure that the mini-batches fit in CPU/GPU memory!
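To make this concrete, here is a minimal NumPy sketch of one epoch of mini-batch gradient descent on a toy linear model. The helper name `random_mini_batches`, the data shapes, and the squared-error cost are assumptions made for the example, not anything prescribed by this chapter.

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    """Shuffle the (X, Y) columns together and split them into mini-batches.
    X has shape (n_features, m), Y has shape (1, m)."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    perm = rng.permutation(m)
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    return [(X_shuf[:, k:k + batch_size], Y_shuf[:, k:k + batch_size])
            for k in range(0, m, batch_size)]

# One epoch of mini-batch gradient descent on a toy linear-regression problem.
m, n = 10_000, 5
X, Y = np.random.randn(n, m), np.random.randn(1, m)
W, b, alpha = np.zeros((1, n)), 0.0, 0.01

for X_t, Y_t in random_mini_batches(X, Y, batch_size=128):
    batch_m = X_t.shape[1]
    A = W @ X_t + b                 # predictions on mini-batch t
    dZ = A - Y_t                    # gradient of the squared-error cost
    dW = (dZ @ X_t.T) / batch_m
    db = dZ.sum() / batch_m
    W -= alpha * dW                 # one gradient-descent step per mini-batch
    b -= alpha * db
```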
The next technique is Gradient Descent with Momentum. The basic idea is to calculate an exponentially weighted average of the gradients and use that average to update the weights, which speeds up gradient descent.
To calculate an exponentially weighted average, we use:
$v_t = \beta v_{t-1} + (1 - \beta)\,\theta_t$
where $\beta$ is a constant between 0 and 1 and $\theta_t$ is the value at time $t$.
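As a small, made-up illustration, the recursion takes only a few lines of NumPy; with $\beta = 0.9$ the average effectively covers roughly the last $1/(1-\beta) = 10$ values.

```python
import numpy as np

# Exponentially weighted average of a noisy sequence of readings around 5.
np.random.seed(0)
theta = 5.0 + np.random.randn(100)
beta, v = 0.9, 0.0
for theta_t in theta:
    v = beta * v + (1 - beta) * theta_t   # v_t = beta * v_{t-1} + (1 - beta) * theta_t
print(v)   # smoothed estimate, close to 5
```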
The momentum algorithm is as follows:
for every iteration t:
compute dW, db on the current mini-batch
$v_{dW} = \beta v_{dW} + (1 - \beta)\,dW$
$v_{db} = \beta v_{db} + (1 - \beta)\,db$
$W = W - \alpha\,v_{dW}$
$b = b - \alpha\,v_{db}$
Note that $\beta = 0.9$ is a pretty robust value.
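Here is a minimal sketch of the momentum update, assuming dW and db have already been computed for the current mini-batch; the function name `momentum_step` and the dictionary holding the velocities are illustrative choices.

```python
import numpy as np

def momentum_step(W, b, dW, db, v, alpha=0.01, beta=0.9):
    """One gradient-descent-with-momentum step.
    v holds the exponentially weighted averages v_dW and v_db."""
    v["dW"] = beta * v["dW"] + (1 - beta) * dW
    v["db"] = beta * v["db"] + (1 - beta) * db
    W = W - alpha * v["dW"]
    b = b - alpha * v["db"]
    return W, b, v

# Initialize the velocities to zero once, then call once per mini-batch.
W, b = np.random.randn(3, 2), np.zeros((3, 1))
v = {"dW": np.zeros_like(W), "db": np.zeros_like(b)}
dW, db = np.random.randn(*W.shape), np.random.randn(*b.shape)  # stand-in gradients
W, b, v = momentum_step(W, b, dW, db, v)
```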
RMSProp stands for Root Mean Square prop. It is also used to speed up gradient descent.
for every iteration t:
compute dW, db on the current mini-batch
$s_{dW} = \beta_2 s_{dW} + (1 - \beta_2)\,dW^2$ (the square is element-wise)
$s_{db} = \beta_2 s_{db} + (1 - \beta_2)\,db^2$
$W = W - \alpha\,\frac{dW}{\sqrt{s_{dW}} + \epsilon}$
$b = b - \alpha\,\frac{db}{\sqrt{s_{db}} + \epsilon}$
$\beta_2 = 0.999$ is commonly used, and a small $\epsilon$ (e.g. $10^{-8}$) is added to the denominator for numerical stability.
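A similar sketch for the RMSProp update, again assuming the gradients for the current mini-batch are already available; the name `rmsprop_step` and the default hyperparameters are illustrative.

```python
import numpy as np

def rmsprop_step(W, b, dW, db, s, alpha=0.001, beta2=0.999, eps=1e-8):
    """One RMSProp step. s holds the exponentially weighted averages of the
    squared gradients; eps keeps the division numerically stable."""
    s["dW"] = beta2 * s["dW"] + (1 - beta2) * dW ** 2   # element-wise square
    s["db"] = beta2 * s["db"] + (1 - beta2) * db ** 2
    W = W - alpha * dW / (np.sqrt(s["dW"]) + eps)
    b = b - alpha * db / (np.sqrt(s["db"]) + eps)
    return W, b, s

# Initialize s to zeros once, then call once per mini-batch.
W, b = np.random.randn(3, 2), np.zeros((3, 1))
s = {"dW": np.zeros_like(W), "db": np.zeros_like(b)}
dW, db = np.random.randn(*W.shape), np.random.randn(*b.shape)  # stand-in gradients
W, b, s = rmsprop_step(W, b, dW, db, s)
```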
Adam is another algorithm used to speed up gradient descent. It stands for "adaptive moment estimation" and is a combination of momentum and RMSProp.
for every iteration t:
compute dW, db on the current mini-batch
$v_{dW} = \beta_1 v_{dW} + (1 - \beta_1)\,dW$, $v_{db} = \beta_1 v_{db} + (1 - \beta_1)\,db$ (the momentum part)
$s_{dW} = \beta_2 s_{dW} + (1 - \beta_2)\,dW^2$, $s_{db} = \beta_2 s_{db} + (1 - \beta_2)\,db^2$ (the RMSProp part)
$v_{dW}^{corrected} = \frac{v_{dW}}{1 - \beta_1^t}$, $v_{db}^{corrected} = \frac{v_{db}}{1 - \beta_1^t}$, $s_{dW}^{corrected} = \frac{s_{dW}}{1 - \beta_2^t}$, $s_{db}^{corrected} = \frac{s_{db}}{1 - \beta_2^t}$ (this is called bias correction)
$W = W - \alpha\,\frac{v_{dW}^{corrected}}{\sqrt{s_{dW}^{corrected}} + \epsilon}$, $b = b - \alpha\,\frac{v_{db}^{corrected}}{\sqrt{s_{db}^{corrected}} + \epsilon}$
Commonly used values: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$ ($\alpha$ still needs to be tuned).
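A sketch of the Adam update for a single parameter matrix W (b would be handled identically); the function name `adam_step` and the way the moments are stored are illustrative, not a reference implementation.

```python
import numpy as np

def adam_step(W, dW, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step. v and s are the running first and second moments of the
    gradient (updated in place); t counts update steps starting from 1."""
    v[:] = beta1 * v + (1 - beta1) * dW        # momentum part
    s[:] = beta2 * s + (1 - beta2) * dW ** 2   # RMSProp part
    v_corr = v / (1 - beta1 ** t)              # bias correction
    s_corr = s / (1 - beta2 ** t)
    return W - alpha * v_corr / (np.sqrt(s_corr) + eps)

# t must start at 1 so the bias correction is well defined.
W = np.random.randn(3, 2)
v, s = np.zeros_like(W), np.zeros_like(W)
for t in range(1, 4):
    dW = np.random.randn(*W.shape)             # stand-in gradient
    W = adam_step(W, dW, v, s, t)
```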
The learning rate should decay over time (this is called learning rate decay). There are several ways to set the value of $\alpha$ during training:
(Note: an epoch denotes one pass through the data)
$\alpha = \frac{1}{1 + \text{decay\_rate} \times \text{epoch\_num}} \cdot \alpha_0$, where decay_rate is a hyperparameter that must be tuned (usually set to 1) and $\alpha_0$ is the initial learning rate
$\alpha = 0.95^{\text{epoch\_num}} \cdot \alpha_0$ (this is called exponential decay; any constant slightly below 1 can be used in place of 0.95)
$\alpha = \frac{k}{\sqrt{\text{epoch\_num}}} \cdot \alpha_0$, where k is a constant
$\alpha = \frac{k}{\sqrt{t}} \cdot \alpha_0$, where k is a constant and t is the mini-batch number
and so on
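These schedules can be written as small helper functions; the names and default values below are illustrative.

```python
def lr_inverse_decay(alpha0, epoch_num, decay_rate=1.0):
    """alpha = alpha0 / (1 + decay_rate * epoch_num)."""
    return alpha0 / (1 + decay_rate * epoch_num)

def lr_exponential_decay(alpha0, epoch_num, base=0.95):
    """alpha = base**epoch_num * alpha0, with base slightly below 1."""
    return (base ** epoch_num) * alpha0

def lr_sqrt_decay(alpha0, num, k=1.0):
    """alpha = k / sqrt(num) * alpha0, where num is the epoch or mini-batch number."""
    return k / (num ** 0.5) * alpha0

# How the first schedule shrinks alpha over the first few epochs (alpha0 = 0.2):
for epoch_num in range(1, 5):
    print(epoch_num, round(lr_inverse_decay(0.2, epoch_num), 4))
# 1 0.1, 2 0.0667, 3 0.05, 4 0.04
```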