Deep Learning Specialization - Coursera
Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization

Hyperparameter Tuning, Batch Normalization and Programming Frameworks

Hyperparameter Tuning

As discussed earlier, hyperparameters are the settings that control how the parameters of a deep network are learned.

It is, therefore, important to set the right values for these hyperparameters. Doing so can be a time-consuming process.

Some hyperparameters include $\alpha$ (the learning rate), $\beta$ (for the momentum algorithm), the number of layers, the number of hidden units, the mini-batch size, the learning rate decay, $(\beta_1, \beta_2, \epsilon)$ for Adam optimization, etc.

In traditional ML, we had far fewer hyperparameters, which allowed us to use grid search. In deep learning, however, we have a large number of hyperparameters and must instead search over a random set of values. A coarse-to-fine approach may be employed, where we first search over random values and then narrow the search to the region where the most promising values were found.

Note that we must use an appropriate scale while choosing hyperparameters randomly.

For example, to get random values between 0.0001 and 1, i.e. in $[10^{-4}, 10^0]$, sampled uniformly on a log scale, do:

import numpy as np

r = -4 * np.random.rand()  # random exponent, uniformly distributed in [-4, 0]
x = 10 ** r                # random value in [10^-4, 10^0]
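
The same idea applies to hyperparameters such as $\beta$ for momentum, where the interesting values are close to 1. A minimal sketch following the same log-scale idea (the range $[0.9, 0.999]$ is an assumption for illustration): sample $1 - \beta$ rather than $\beta$ itself.

r = np.random.uniform(-3, -1)  # random exponent in [-3, -1]
beta = 1 - 10 ** r             # beta in [0.9, 0.999], sampled more densely near 1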

If we have limited computational resources, we may have to restrict ourselves to tuning a single model over several hours or days. However, with sufficient computational resources, we can train models with different hyperparameter settings in parallel and choose the one that works best.
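
Putting these ideas together, a coarse random search can be sketched as follows. This is only an illustration: train_and_evaluate is a hypothetical helper that trains a model with the given hyperparameters and returns its dev-set metric, and the sampling ranges are assumptions.

import numpy as np

best_config, best_metric = None, -np.inf
for trial in range(20):                                # coarse search over random configurations
    alpha = 10 ** np.random.uniform(-4, 0)             # learning rate, sampled on a log scale
    hidden_units = np.random.randint(16, 257)          # number of hidden units, linear scale
    batch_size = 2 ** np.random.randint(5, 10)         # mini-batch size in {32, ..., 512}
    metric = train_and_evaluate(alpha, hidden_units, batch_size)  # hypothetical helper
    if metric > best_metric:
        best_config, best_metric = (alpha, hidden_units, batch_size), metric
# Coarse-to-fine: re-sample in a narrower range around best_config and repeat.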

Batch Normalization

It was earlier discussed that normalizing the inputs could speed up training.

Batch normalization normalizes the z values (the pre-activation values) of each layer, which are then passed through an activation function and become the input to the next layer of the network. This speeds up training.

Given some intermediate values $z^{(1)}, z^{(2)}, z^{(3)}, \ldots, z^{(m)}$ for a layer:

$$\mu = \frac{1}{m}\sum_{i=1}^{m} z^{(i)}$$

$$\sigma^2 = \frac{1}{m}\sum_{i=1}^{m} (z^{(i)} - \mu)^2$$

$$z^{(i)}_{norm} = \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \epsilon}}$$

$$\tilde{z}^{(i)} = \gamma\, z^{(i)}_{norm} + \beta$$

where $\gamma, \beta$ are learnable parameters that allow $\tilde{z}$ to have a mean and variance other than the zero mean and unit variance imposed by normalization.

(We use $\epsilon$ to avoid division by zero.)

Note that batch normalization is usually applied per mini-batch. When we use batch normalization, we can eliminate the $b$ values for each layer, since subtracting the mean cancels any constant added to $z$ (the shift is handled by $\beta$ instead). We must, however, also calculate $d\gamma, d\beta$ during backpropagation and update $\gamma, \beta$ as well, while updating the $W$ values for each layer.

While testing, since we have only one test example at a time, we can't calculate the mean and standard deviation. Instead, we must use $\mu$ and $\sigma^2$ estimated using an exponentially weighted average across the mini-batches seen during training.
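
To make the mechanics concrete, here is a minimal numpy sketch of the batch-norm computation for one layer, including the exponentially weighted averages used at test time. The variable names, shapes, and the momentum value are assumptions for illustration.

import numpy as np

def batchnorm_forward(Z, gamma, beta, running_mean, running_var,
                      eps=1e-8, momentum=0.9, training=True):
    # Z has shape (n_units, m): one column per example in the mini-batch.
    # gamma, beta, running_mean, running_var have shape (n_units, 1).
    if training:
        mu = Z.mean(axis=1, keepdims=True)     # per-unit mean over the mini-batch
        var = Z.var(axis=1, keepdims=True)     # per-unit variance over the mini-batch
        # Exponentially weighted averages, used later at test time.
        running_mean = momentum * running_mean + (1 - momentum) * mu
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        mu, var = running_mean, running_var    # statistics estimated during training
    Z_norm = (Z - mu) / np.sqrt(var + eps)     # zero mean, unit variance
    Z_tilde = gamma * Z_norm + beta            # learnable scale and shift
    return Z_tilde, running_mean, running_var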

Softmax Regression

Logistic Regression is used for binary classification, whereas Softmax Regression can be used for multi-class classification.

Say we have C class labels. Softmax Regression must output C probabilities, one for each class.

So, in the last layer, we use the softmax activation function, which is as follows:

First calculate $t = e^{z^{[L]}}$ (element-wise).

Then, $a^{[L]} = \frac{t}{\sum_i t_i}$

The class with the highest a value i.e. highest probability is the predicted class.

For Softmax Regression, we have the following loss and cost functions:

$$L(\hat{y}, y) = -\sum_{i=1}^{C} y_i \log \hat{y}_i$$

$$J(w^{[1]}, b^{[1]}, \ldots) = \frac{1}{m}\sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})$$
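
As a quick illustration, here is a minimal numpy sketch of the softmax activation and the loss for a single example (variable names are assumptions; a production implementation would also guard against numerical overflow):

import numpy as np

def softmax(z_L):
    # z_L: pre-activation values of the last layer, shape (C,)
    t = np.exp(z_L)           # t = e^{z^[L]}, element-wise
    return t / np.sum(t)      # a^[L]: C probabilities summing to 1

def cross_entropy_loss(y_hat, y):
    # y: one-hot label vector, y_hat: predicted probabilities, both shape (C,)
    return -np.sum(y * np.log(y_hat))

z_L = np.array([2.0, 1.0, 0.1])         # example with C = 3 classes
y = np.array([1.0, 0.0, 0.0])
y_hat = softmax(z_L)
predicted_class = np.argmax(y_hat)      # class with the highest probability
loss = cross_entropy_loss(y_hat, y)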

Programming Frameworks

There are several deep learning frameworks that make it easier to apply deep learning, without having to implement everything from scratch (see the sketch after the list below). Some of them include:

  • Caffe/Caffe2

  • TensorFlow

  • Torch

  • Keras

  • Theano

  • CNTK

  • DL4J

  • Lasagne

  • mxnet

  • PaddlePaddle
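
For example, here is a minimal sketch (assuming TensorFlow 2) of what such a framework provides: we only define the cost as a function of the parameter, and the framework computes the gradients and runs gradient descent for us. The toy cost $J(w) = w^2 - 10w + 25$ is just an illustration.

import tensorflow as tf

w = tf.Variable(0.0, dtype=tf.float32)                   # the parameter to learn
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)   # plain gradient descent

for _ in range(1000):
    with tf.GradientTape() as tape:
        cost = w ** 2 - 10 * w + 25                      # J(w) = (w - 5)^2
    grads = tape.gradient(cost, [w])                     # framework computes dJ/dw
    optimizer.apply_gradients(zip(grads, [w]))           # w := w - 0.1 * dJ/dw

print(w.numpy())  # close to 5.0, the minimizer of J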
