Deep Learning Specialization - Coursera
Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization

Optimization Algorithms

This chapter covers optimization algorithms that make training faster.

Mini-Batch Gradient Descent

Deep Learning needs extremely large training sets, and performing Gradient Descent on such a large training set takes a very long time.

Instead, we split the training set into mini-training sets called mini-batches. Mini-batch $t$ is denoted as $(X^{\{t\}}, Y^{\{t\}})$.

Gradient Descent is then performed on the mini-batches in a loop. This is much faster than performing Gradient Descent over the entire training set at once (i.e. Batch Gradient Descent).

Note that when we train on the entire training set at once, the cost should go down on every iteration. When we train on mini-batches, the cost may not go down on every iteration, but it should still trend downwards (with some noise).
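As a rough sketch of the idea (not code from the course), one epoch of mini-batch gradient descent in NumPy might look like the following; `compute_grads`, the `params` dictionary, and the batching helper are illustrative placeholders:

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    """Shuffle the training set and split it into mini-batches (X: (n_x, m), Y: (1, m))."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    perm = rng.permutation(m)
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    return [(X_shuf[:, s:s + batch_size], Y_shuf[:, s:s + batch_size])
            for s in range(0, m, batch_size)]

def one_epoch(params, X, Y, compute_grads, alpha=0.01, batch_size=64):
    """One pass over the data: a gradient step per mini-batch (X^{t}, Y^{t})."""
    for X_t, Y_t in random_mini_batches(X, Y, batch_size):
        dW, db = compute_grads(params, X_t, Y_t)  # forward + backward prop on mini-batch t
        params["W"] -= alpha * dW                 # ordinary gradient descent update
        params["b"] -= alpha * db
    return params
```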

Choosing Mini-Batch Size

If mini-batch size = m, we would be performing batch gradient descent i.e. processing the entire training set at once.

If mini-batch size = 1, we would be performing stochastic gradient descent i.e. processing one training example at a time.

We must choose mini-batch size between 1 and m.

Note that if the training set is small ($m \le 2000$), simply use batch gradient descent. Otherwise, try mini-batch sizes of 64, 128, 256, or 512 (powers of 2 tend to run faster because of how memory is laid out).

Make sure that the mini-batches fit in CPU/GPU memory!

Gradient Descent with Momentum

The basic idea is to compute an exponentially weighted average of the gradients and use that average to update the weights. This damps the oscillations in the gradient steps and speeds up gradient descent.

To calculate an exponentially weighted average, we use:

$v_t = \beta v_{t-1} + (1-\beta)\, \theta_t$

where $\beta$ is a constant between 0 and 1 and $\theta_t$ is the value at time $t$.
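Intuitively, $v_t$ with $\beta = 0.9$ behaves like an average over roughly the last $\frac{1}{1-\beta} \approx 10$ values. A tiny sketch of the recurrence on made-up data:

```python
import numpy as np

def exp_weighted_average(theta, beta=0.9):
    """Apply v_t = beta * v_{t-1} + (1 - beta) * theta_t along a 1-D sequence."""
    v, out = 0.0, []
    for theta_t in theta:
        v = beta * v + (1 - beta) * theta_t
        out.append(v)
    return np.array(out)

noisy = np.sin(np.linspace(0, 3, 100)) + 0.3 * np.random.randn(100)  # illustrative noisy series
smooth = exp_weighted_average(noisy, beta=0.9)  # smooths over roughly the last 10 points
```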

The momentum algorithm is as follows:

for every iteration t:

compute dW, db on current mini-batch

$V_{dW} = \beta V_{dW} + (1-\beta)\, dW$

$V_{db} = \beta V_{db} + (1-\beta)\, db$

$W = W - \alpha V_{dW}$

$b = b - \alpha V_{db}$

Note that $\beta = 0.9$ is a pretty robust value.
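A minimal sketch of the update, assuming `grads` holds `dW` and `db` already computed on the current mini-batch and `v` holds the zero-initialized running averages (the dictionary layout is an assumption for this sketch):

```python
import numpy as np

def initialize_velocity(params):
    """Zero-initialize V_dW and V_db with the same shapes as the parameters."""
    return {"dW": np.zeros_like(params["W"]), "db": np.zeros_like(params["b"])}

def momentum_update(params, grads, v, alpha=0.01, beta=0.9):
    """One step of gradient descent with momentum."""
    v["dW"] = beta * v["dW"] + (1 - beta) * grads["dW"]  # V_dW
    v["db"] = beta * v["db"] + (1 - beta) * grads["db"]  # V_db
    params["W"] -= alpha * v["dW"]
    params["b"] -= alpha * v["db"]
    return params, v
```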

RMSProp

This stands for Root Mean Square prop. It is also used to speed up gradient descent.

for every iteration t:

compute dW, db on current mini-batch

$S_{dW} = \beta_2 S_{dW} + (1-\beta_2)\, dW^2$

$S_{db} = \beta_2 S_{db} + (1-\beta_2)\, db^2$

$W = W - \alpha \frac{dW}{\sqrt{S_{dW}}}$

$b = b - \alpha \frac{db}{\sqrt{S_{db}}}$

$\beta_2 = 0.999$ is commonly used. Note that the squaring and square root are element-wise, and in practice a small $\epsilon$ (e.g. $10^{-8}$) is added to the denominator so we never divide by zero.
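A sketch in the same style as the momentum example, with `s` holding the zero-initialized running averages of the squared gradients (again an assumed layout, not course code):

```python
import numpy as np

def rmsprop_update(params, grads, s, alpha=0.001, beta2=0.999, epsilon=1e-8):
    """One RMSProp step; squaring and square root are element-wise."""
    s["dW"] = beta2 * s["dW"] + (1 - beta2) * grads["dW"] ** 2  # S_dW
    s["db"] = beta2 * s["db"] + (1 - beta2) * grads["db"] ** 2  # S_db
    params["W"] -= alpha * grads["dW"] / (np.sqrt(s["dW"]) + epsilon)  # epsilon avoids /0
    params["b"] -= alpha * grads["db"] / (np.sqrt(s["db"]) + epsilon)
    return params, s
```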

Adam Optimization

This is another algorithm used to speed up gradient descent. It stands for "Adaptive Moment Estimation" and is a combination of momentum and RMSProp.

for every iteration t:

compute dW, db on current mini-batch

$V_{dW} = \beta_1 V_{dW} + (1-\beta_1)\, dW$

$V_{db} = \beta_1 V_{db} + (1-\beta_1)\, db$

$S_{dW} = \beta_2 S_{dW} + (1-\beta_2)\, dW^2$

$S_{db} = \beta_2 S_{db} + (1-\beta_2)\, db^2$

$V_{dW}^{corrected} = \frac{V_{dW}}{1-\beta_1^t}$ (this is called bias correction)

$V_{db}^{corrected} = \frac{V_{db}}{1-\beta_1^t}$

$S_{dW}^{corrected} = \frac{S_{dW}}{1-\beta_2^t}$

$S_{db}^{corrected} = \frac{S_{db}}{1-\beta_2^t}$

$W = W - \alpha \frac{V_{dW}^{corrected}}{\sqrt{S_{dW}^{corrected}}+\epsilon}$

$b = b - \alpha \frac{V_{db}^{corrected}}{\sqrt{S_{db}^{corrected}}+\epsilon}$

Commonly used values: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.
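Putting the two together, a sketch of one Adam step (the dictionary layout mirrors the examples above and is an assumption; `t` counts update steps starting from 1 for the bias correction):

```python
import numpy as np

def adam_update(params, grads, v, s, t, alpha=0.001,
                beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam step; v and s start as zeros, t is the update-step counter (>= 1)."""
    # momentum-style running averages of the gradients
    v["dW"] = beta1 * v["dW"] + (1 - beta1) * grads["dW"]
    v["db"] = beta1 * v["db"] + (1 - beta1) * grads["db"]
    # RMSProp-style running averages of the squared gradients
    s["dW"] = beta2 * s["dW"] + (1 - beta2) * grads["dW"] ** 2
    s["db"] = beta2 * s["db"] + (1 - beta2) * grads["db"] ** 2
    # bias correction
    v_corr_W, v_corr_b = v["dW"] / (1 - beta1 ** t), v["db"] / (1 - beta1 ** t)
    s_corr_W, s_corr_b = s["dW"] / (1 - beta2 ** t), s["db"] / (1 - beta2 ** t)
    # combined update
    params["W"] -= alpha * v_corr_W / (np.sqrt(s_corr_W) + epsilon)
    params["b"] -= alpha * v_corr_b / (np.sqrt(s_corr_b) + epsilon)
    return params, v, s
```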

Deciding the Learning Rate $\alpha$

Slowly reducing the learning rate over time (called learning rate decay) helps training converge, since smaller steps are taken as we approach the minimum. There are several ways to set the value of $\alpha$ during training (a code sketch of these schedules follows the list):

(Note: an epoch denotes one pass through the data)

  • $\alpha = \frac{1}{1+decay\_rate \cdot epoch\_num}\, \alpha_0$ where decay_rate is a hyperparameter that must be tuned (usually set to 1) and $\alpha_0$ is the initial learning rate

  • $\alpha = 0.95^{epoch\_num} \cdot \alpha_0$ (this is called exponential decay)

  • $\alpha = \frac{k}{\sqrt{epoch\_num}}\, \alpha_0$ where k is a constant

  • $\alpha = \frac{k}{\sqrt{t}}\, \alpha_0$ where k is a constant and t is the mini-batch number

  • and so on
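For illustration, the schedules above can be written as small functions (the names and the `alpha0` argument are just for this sketch):

```python
def decay_inverse(alpha0, epoch_num, decay_rate=1.0):
    """alpha = alpha_0 / (1 + decay_rate * epoch_num)"""
    return alpha0 / (1.0 + decay_rate * epoch_num)

def decay_exponential(alpha0, epoch_num, base=0.95):
    """alpha = base^epoch_num * alpha_0"""
    return base ** epoch_num * alpha0

def decay_inverse_sqrt(alpha0, epoch_num, k=1.0):
    """alpha = k / sqrt(epoch_num) * alpha_0 (epoch_num >= 1)"""
    return k / epoch_num ** 0.5 * alpha0
```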