Optimization Algorithms
This chapter deals with algorithms that speed up the training process.
Deep Learning needs extremely large training sets, and performing Gradient Descent on such a large training set takes a very long time.
Instead, we split the training set into smaller training sets called mini-batches. Mini-batch $t$ is denoted $X^{\{t\}}, Y^{\{t\}}$.
Gradient Descent is then performed on the mini-batches in a loop. This is much faster than performing Gradient Descent over the entire training set at once (i.e. Batch Gradient Descent).
Note that when we train using the entire training set at once, the cost must go down with every iteration. However, when we train on mini-batches, the cost may not go down with every iteration, but must still trend downwards.
If mini-batch size = m, we would be performing batch gradient descent, i.e. processing the entire training set at once. Each iteration then takes too long.
If mini-batch size = 1, we would be performing stochastic gradient descent, i.e. processing one training example at a time. This loses the speedup from vectorization.
In practice, we choose a mini-batch size between 1 and m, which keeps the benefit of vectorization while still making fast progress.
Note that if the training set is small (m <= 2000), simply use batch gradient descent. Otherwise, try mini-batch sizes of 64, 128, 256, or 512 (powers of 2 tend to work well).
Make sure that the mini-batches fit in CPU/GPU memory!
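To make this concrete, here is a minimal NumPy sketch of one epoch of mini-batch gradient descent on a toy linear model. The helper name `random_mini_batches`, the data shapes, and the squared-error cost are assumptions made for the example, not anything prescribed by this chapter.

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    """Shuffle the (X, Y) columns together and split them into mini-batches.
    X has shape (n_features, m), Y has shape (1, m)."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    perm = rng.permutation(m)
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    return [(X_shuf[:, k:k + batch_size], Y_shuf[:, k:k + batch_size])
            for k in range(0, m, batch_size)]

# One epoch of mini-batch gradient descent on a toy linear-regression problem.
m, n = 10_000, 5
X, Y = np.random.randn(n, m), np.random.randn(1, m)
W, b, alpha = np.zeros((1, n)), 0.0, 0.01

for X_t, Y_t in random_mini_batches(X, Y, batch_size=128):
    batch_m = X_t.shape[1]
    A = W @ X_t + b                 # predictions on mini-batch t
    dZ = A - Y_t                    # gradient of the squared-error cost
    dW = (dZ @ X_t.T) / batch_m
    db = dZ.sum() / batch_m
    W -= alpha * dW                 # one gradient-descent step per mini-batch
    b -= alpha * db
```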
The next technique is Gradient Descent with Momentum. The basic idea is to calculate an exponentially weighted average of the gradients and use that average to update the weights, which speeds up gradient descent.
To calculate an exponentially weighted average, we use:
$v_t = \beta v_{t-1} + (1 - \beta)\,\theta_t$
where $\beta$ is a constant between 0 and 1 and $\theta_t$ is the value at time $t$.
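As a small, made-up illustration, the recursion takes only a few lines of NumPy; with $\beta = 0.9$ the average effectively covers roughly the last $1/(1-\beta) = 10$ values.

```python
import numpy as np

# Exponentially weighted average of a noisy sequence of readings around 5.
np.random.seed(0)
theta = 5.0 + np.random.randn(100)
beta, v = 0.9, 0.0
for theta_t in theta:
    v = beta * v + (1 - beta) * theta_t   # v_t = beta * v_{t-1} + (1 - beta) * theta_t
print(v)   # smoothed estimate, close to 5
```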
The momentum algorithm is as follows:
for every iteration t:
compute dW, db on the current mini-batch
$v_{dW} = \beta v_{dW} + (1 - \beta)\,dW$
$v_{db} = \beta v_{db} + (1 - \beta)\,db$
$W = W - \alpha\,v_{dW}$
$b = b - \alpha\,v_{db}$
Note that $\beta = 0.9$ is a pretty robust value.
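Here is a minimal sketch of the momentum update, assuming dW and db have already been computed for the current mini-batch; the function name `momentum_step` and the dictionary holding the velocities are illustrative choices.

```python
import numpy as np

def momentum_step(W, b, dW, db, v, alpha=0.01, beta=0.9):
    """One gradient-descent-with-momentum step.
    v holds the exponentially weighted averages v_dW and v_db."""
    v["dW"] = beta * v["dW"] + (1 - beta) * dW
    v["db"] = beta * v["db"] + (1 - beta) * db
    W = W - alpha * v["dW"]
    b = b - alpha * v["db"]
    return W, b, v

# Initialize the velocities to zero once, then call once per mini-batch.
W, b = np.random.randn(3, 2), np.zeros((3, 1))
v = {"dW": np.zeros_like(W), "db": np.zeros_like(b)}
dW, db = np.random.randn(*W.shape), np.random.randn(*b.shape)  # stand-in gradients
W, b, v = momentum_step(W, b, dW, db, v)
```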
RMSProp stands for Root Mean Square prop. It is also used to speed up gradient descent.
for every iteration t:
compute dW, db on the current mini-batch
$s_{dW} = \beta_2 s_{dW} + (1 - \beta_2)\,dW^2$ (the square is element-wise)
$s_{db} = \beta_2 s_{db} + (1 - \beta_2)\,db^2$
$W = W - \alpha\,\frac{dW}{\sqrt{s_{dW}} + \epsilon}$
$b = b - \alpha\,\frac{db}{\sqrt{s_{db}} + \epsilon}$
$\beta_2 = 0.999$ is commonly used, and a small $\epsilon$ (e.g. $10^{-8}$) is added to the denominator for numerical stability.
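A similar sketch for the RMSProp update, again assuming the gradients for the current mini-batch are already available; the name `rmsprop_step` and the default hyperparameters are illustrative.

```python
import numpy as np

def rmsprop_step(W, b, dW, db, s, alpha=0.001, beta2=0.999, eps=1e-8):
    """One RMSProp step. s holds the exponentially weighted averages of the
    squared gradients; eps keeps the division numerically stable."""
    s["dW"] = beta2 * s["dW"] + (1 - beta2) * dW ** 2   # element-wise square
    s["db"] = beta2 * s["db"] + (1 - beta2) * db ** 2
    W = W - alpha * dW / (np.sqrt(s["dW"]) + eps)
    b = b - alpha * db / (np.sqrt(s["db"]) + eps)
    return W, b, s

# Initialize s to zeros once, then call once per mini-batch.
W, b = np.random.randn(3, 2), np.zeros((3, 1))
s = {"dW": np.zeros_like(W), "db": np.zeros_like(b)}
dW, db = np.random.randn(*W.shape), np.random.randn(*b.shape)  # stand-in gradients
W, b, s = rmsprop_step(W, b, dW, db, s)
```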
Adam is another algorithm used to speed up gradient descent. It stands for "adaptive moment estimation" and is a combination of momentum and RMSProp.
for every iteration t:
compute dW, db on the current mini-batch
$v_{dW} = \beta_1 v_{dW} + (1 - \beta_1)\,dW$, $v_{db} = \beta_1 v_{db} + (1 - \beta_1)\,db$ (the momentum part)
$s_{dW} = \beta_2 s_{dW} + (1 - \beta_2)\,dW^2$, $s_{db} = \beta_2 s_{db} + (1 - \beta_2)\,db^2$ (the RMSProp part)
$v_{dW}^{corrected} = \frac{v_{dW}}{1 - \beta_1^t}$, $v_{db}^{corrected} = \frac{v_{db}}{1 - \beta_1^t}$, $s_{dW}^{corrected} = \frac{s_{dW}}{1 - \beta_2^t}$, $s_{db}^{corrected} = \frac{s_{db}}{1 - \beta_2^t}$ (this is called bias correction)
$W = W - \alpha\,\frac{v_{dW}^{corrected}}{\sqrt{s_{dW}^{corrected}} + \epsilon}$, $b = b - \alpha\,\frac{v_{db}^{corrected}}{\sqrt{s_{db}^{corrected}} + \epsilon}$
Commonly used values: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$ ($\alpha$ still needs to be tuned).
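A sketch of the Adam update for a single parameter matrix W (b would be handled identically); the function name `adam_step` and the way the moments are stored are illustrative, not a reference implementation.

```python
import numpy as np

def adam_step(W, dW, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step. v and s are the running first and second moments of the
    gradient (updated in place); t counts update steps starting from 1."""
    v[:] = beta1 * v + (1 - beta1) * dW        # momentum part
    s[:] = beta2 * s + (1 - beta2) * dW ** 2   # RMSProp part
    v_corr = v / (1 - beta1 ** t)              # bias correction
    s_corr = s / (1 - beta2 ** t)
    return W - alpha * v_corr / (np.sqrt(s_corr) + eps)

# t must start at 1 so the bias correction is well defined.
W = np.random.randn(3, 2)
v, s = np.zeros_like(W), np.zeros_like(W)
for t in range(1, 4):
    dW = np.random.randn(*W.shape)             # stand-in gradient
    W = adam_step(W, dW, v, s, t)
```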
The learning rate should decay over time (this is called learning rate decay). There are several ways to set the value of $\alpha$ during training:
(Note: an epoch denotes one pass through the data)
$\alpha = \frac{1}{1 + \text{decay\_rate} \times \text{epoch\_num}} \cdot \alpha_0$, where decay_rate is a hyperparameter that must be tuned (usually set to 1) and $\alpha_0$ is the initial learning rate
$\alpha = 0.95^{\text{epoch\_num}} \cdot \alpha_0$ (this is called exponential decay; any constant slightly below 1 can be used in place of 0.95)
$\alpha = \frac{k}{\sqrt{\text{epoch\_num}}} \cdot \alpha_0$, where k is a constant
$\alpha = \frac{k}{\sqrt{t}} \cdot \alpha_0$, where k is a constant and t is the mini-batch number
and so on
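These schedules can be written as small helper functions; the names and default values below are illustrative.

```python
def lr_inverse_decay(alpha0, epoch_num, decay_rate=1.0):
    """alpha = alpha0 / (1 + decay_rate * epoch_num)."""
    return alpha0 / (1 + decay_rate * epoch_num)

def lr_exponential_decay(alpha0, epoch_num, base=0.95):
    """alpha = base**epoch_num * alpha0, with base slightly below 1."""
    return (base ** epoch_num) * alpha0

def lr_sqrt_decay(alpha0, num, k=1.0):
    """alpha = k / sqrt(num) * alpha0, where num is the epoch or mini-batch number."""
    return k / (num ** 0.5) * alpha0

# How the first schedule shrinks alpha over the first few epochs (alpha0 = 0.2):
for epoch_num in range(1, 5):
    print(epoch_num, round(lr_inverse_decay(0.2, epoch_num), 4))
# 1 0.1, 2 0.0667, 3 0.05, 4 0.04
```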