Practical Aspects of Deep Learning

Setting up your Machine Learning Application

Train / Dev / Test Sets

In traditional ML, we usually have a relatively small number of training examples, so a 60-20-20 split across the train, dev (validation), and test sets is common. Deep learning, however, works with far larger datasets, where such a split would waste data that could be used for training. Instead, we use a 98-1-1 split, a 99.5-0.25-0.25 split, or even a 99.5-0.4-0.1 split: with millions of examples, even a fraction of a percent yields dev and test sets large enough to evaluate on.
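A minimal numpy sketch of such a split (the function name and the 99-0.5-0.5 fractions are illustrative):

```python
import numpy as np

def split_dataset(X, Y, dev_frac=0.005, test_frac=0.005, seed=0):
    """Shuffle, then carve off small dev/test sets (a 99-0.5-0.5 split)."""
    m = X.shape[0]
    idx = np.random.default_rng(seed).permutation(m)
    n_dev, n_test = int(m * dev_frac), int(m * test_frac)
    dev = idx[:n_dev]
    test = idx[n_dev:n_dev + n_test]
    train = idx[n_dev + n_test:]
    return (X[train], Y[train]), (X[dev], Y[dev]), (X[test], Y[test])

# With m = 1,000,000 examples, dev and test each get 5,000 examples --
# plenty for evaluation, while 99% of the data remains for training.
X, Y = np.zeros((1_000_000, 10)), np.zeros(1_000_000)
(X_tr, Y_tr), (X_dev, Y_dev), (X_te, Y_te) = split_dataset(X, Y)
print(X_tr.shape, X_dev.shape, X_te.shape)  # (990000, 10) (5000, 10) (5000, 10)
```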

It is important that the dev and test sets come from the same distribution, so that the dev set we tune against is representative of the data the model will finally be evaluated on; otherwise, progress on the dev set may not translate to the test set.

Sometimes it's okay to have just a train and dev set, and no test set. We use the dev set to compare the performance of different algorithms/models and then deploy the best one, accepting that without a held-out test set we no longer have an unbiased estimate of its performance.

Bias / Variance

Bias refers to the assumptions a model makes about the data; a model with high bias oversimplifies the problem and underfits.

Variance denotes how much the model's learned function changes when the training set is varied; a model with high variance fits noise in the training data and overfits.

We can determine whether our model has high bias and/or high variance by comparing the training and dev set errors (assuming human-level error is close to 0%):

| Training Set Error | Dev Set Error | Bias | Variance |
| --- | --- | --- | --- |
| High | High (close to the training error) | High | Low |
| Low | High | Low | High |
| High | Even higher | High | High |
| Low | Low | Low | Low |
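To make this concrete with numbers along the lines of those used in the lectures (human-level error near 0%): 1% training / 11% dev error signals high variance; 15% / 16% signals high bias; 15% / 30% signals both. A minimal sketch of this read-out, with illustrative thresholds:

```python
def diagnose(train_err, dev_err, bayes_err=0.0, tol=0.05):
    """Rough bias/variance read-out from error rates (as fractions)."""
    high_bias = (train_err - bayes_err) > tol    # far above achievable error
    high_variance = (dev_err - train_err) > tol  # large train-to-dev gap
    return {"high bias": high_bias, "high variance": high_variance}

print(diagnose(0.01, 0.11))   # high variance
print(diagnose(0.15, 0.16))   # high bias
print(diagnose(0.15, 0.30))   # both
print(diagnose(0.005, 0.01))  # neither
```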

Basic Recipe for Machine Learning

First, identify if the model has high bias. This can be confirmed by a high training set error. If the model has high bias, do one or more of the following:

  • Increase the size of the network

  • Train the network longer

  • Try a different neural network architecture

Then, identify if the model has high variance. This can be confirmed by a high dev set error. If the model has high variance, do one or more of the following:

  • Use more training data

  • Perform Regularization (to reduce overfitting)

  • Try a different neural network architecture

In traditional ML, we spoke of a bias-variance trade-off: there was no way to reduce just the bias or just the variance without the other being affected. In the deep learning era, however, training a bigger network (with appropriate regularization) reduces bias without hurting variance, and getting more data reduces variance without hurting bias, so by applying the steps above we can address each problem separately and the trade-off matters much less.
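A minimal sketch of this recipe's decision logic (the thresholds and suggested actions are illustrative, not a definitive procedure):

```python
def next_step(train_err, dev_err, target_err=0.01, tol=0.02):
    """Suggest the next move in the basic recipe: fix bias first, then variance."""
    if train_err > target_err + tol:   # high bias
        return "bigger network, train longer, or try another architecture"
    if dev_err - train_err > tol:      # high variance
        return "more data, regularization, or try another architecture"
    return "done: acceptably low bias and variance"

print(next_step(0.10, 0.12))    # tackle bias first
print(next_step(0.005, 0.09))   # then variance
print(next_step(0.005, 0.012))  # done
```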

Regularizing your Neural Network

Regularization

Regularization is a means to reduce overfitting, thus solving the high variance problem.

To perform regularization, we add a regularization term to the cost function.

For Logistic Regression, we have the following cost function, as discussed earlier:

J(w, b) = \frac{1}{m}\sum_{i=1}^{m}L(\hat{y}^{(i)}, y^{(i)})

To perform regularization, we modify it as follows:

J(w, b) = \frac{1}{m}\sum_{i=1}^{m}L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}||w||_2^2

\frac{\lambda}{2m}||w||_2^2 is the L2 regularization term (||w||_2^2 being the squared L2 norm of w), and \lambda is called the Regularization Parameter. It is chosen via cross-validation on the dev set.

L2 regularization is the most common type of regularization.

L1 regularization would instead add the term \frac{\lambda}{2m}||w||_1 to the cost function; it tends to make w sparse and isn't as commonly used as L2 regularization.

Note: ||w||_2^2 = \sum_{j=1}^{n_x} w_j^2 = w^Tw and ||w||_1 = \sum_{j=1}^{n_x} |w_j|

For a neural network, where each w^{[l]} is a matrix, the analogous penalty uses the Frobenius norm (the matrix counterpart of the L2 norm):

J(w^{[1]}, b^{[1]}, w^{[2]}, b^{[2]}, ..., w^{[L]}, b^{[L]}) = \frac{1}{m}\sum_{i=1}^{m}L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\sum_{l=1}^{L}||w^{[l]}||_F^2

where ||w^{[l]}||_F^2 = \sum_{i=1}^{n^{[l-1]}}\sum_{j=1}^{n^{[l]}} (w_{ij}^{[l]})^2
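As a minimal numpy sketch (the function name and toy values are illustrative), the penalty is just the sum of squared entries of every weight matrix, scaled by \frac{\lambda}{2m}:

```python
import numpy as np

def l2_regularized_cost(cross_entropy_cost, weights, lambd, m):
    """Add the Frobenius-norm penalty to an already-computed cost.

    weights: list of the W[l] matrices; lambd: the regularization
    parameter (named to avoid Python's `lambda` keyword).
    """
    penalty = (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)
    return cross_entropy_cost + penalty

Ws = [np.random.randn(5, 3), np.random.randn(1, 5)]
print(l2_regularized_cost(0.3, Ws, lambd=0.7, m=1000))
```

During backprop, this penalty contributes an extra (\lambda / m) \, w^{[l]} to each gradient dw^{[l]}, which shrinks the weights on every update; this is why L2 regularization is also called "weight decay".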

Dropout Regularization

In this type of regularization, we randomly "turn off" (zero out) a fraction of the neurons in each layer on every training iteration. Turning off a neuron removes its incoming and outgoing edges, so each iteration effectively trains a smaller network, which has a regularizing effect. No neurons are dropped at test time.
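A minimal numpy sketch of "inverted dropout", the standard implementation (keep_prob, the probability of keeping a neuron, is a hyperparameter; dividing by keep_prob keeps the expected activations unchanged, which is why nothing special is needed at test time):

```python
import numpy as np

def inverted_dropout(a, keep_prob=0.8, training=True, rng=None):
    """Apply inverted dropout to the activations `a` of one layer."""
    if not training:
        return a                              # no dropout at test time
    rng = rng or np.random.default_rng()
    mask = rng.random(a.shape) < keep_prob    # keep each unit with prob keep_prob
    return a * mask / keep_prob               # rescale to preserve expected value

a = np.ones((4, 3))
print(inverted_dropout(a, rng=np.random.default_rng(0)))  # zeros and 1.25s
```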

Other Regularization Methods

In addition to L2 and dropout regularization, there are other methods to reduce/prevent overfitting. Some of them include:

  • Data Augmentation: This refers to increasing the effective amount of training data without collecting new examples, by applying transformations such as mirroring, random cropping, or small distortions to the existing training images. The additional data helps the model generalize and can reduce overfitting significantly.

  • Early Stopping: Early stopping means halting training once the dev set error starts to increase (i.e. when the dev set accuracy starts to decrease), even if the training loss is still falling (see the sketch after this list).
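A minimal sketch of the early-stopping logic (the patience parameter, i.e. how many non-improving epochs to tolerate before stopping, is an assumption of this sketch):

```python
def early_stopping_epoch(dev_errors, patience=3):
    """Epoch to roll back to: dev error hasn't improved for `patience` epochs."""
    best_err, best_epoch = float("inf"), 0
    for epoch, err in enumerate(dev_errors):
        if err < best_err:
            best_err, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            break   # stop training; keep the checkpoint from best_epoch
    return best_epoch

# Dev error improves, then starts rising -> stop and keep epoch 2.
print(early_stopping_epoch([0.30, 0.22, 0.18, 0.19, 0.21, 0.25]))  # 2
```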

Optimizing Your Learning Problem

These are a few ways to optimize and speed up the training process.

Normalizing Inputs

One way of speeding up training is to normalize the input features:

x := \frac{x - \mu}{\sigma}

where \mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)} (the mean) and \sigma^2 = \frac{1}{m}\sum_{i=1}^{m} (x^{(i)} - \mu)^2 (the variance), both computed element-wise over the training examples.

Normalizing the inputs gives them zero mean and unit variance, putting all features on a similar scale. This makes the contours of the cost function more symmetric, so gradient descent can take larger steps and converge faster.

Note: Even the test set must be normalized using the same μ\mu and σ2\sigma^2 values used for the training set!
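A minimal numpy sketch (the toy data and the (m, n_x) row-per-example layout are assumptions of this example; note that the test set reuses the training-set statistics):

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(5.0, 2.0, size=(1000, 3))  # toy data, shape (m, n_x)
X_test = rng.normal(5.0, 2.0, size=(100, 3))

mu = X_train.mean(axis=0)                   # per-feature mean
sigma = X_train.std(axis=0) + 1e-8          # sqrt(variance); epsilon avoids /0
X_train_norm = (X_train - mu) / sigma       # zero mean, unit variance
X_test_norm = (X_test - mu) / sigma         # reuse the TRAINING mu and sigma

print(X_train_norm.mean(axis=0).round(3), X_train_norm.std(axis=0).round(3))
```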

Vanishing/Exploding Gradients

Sometimes, while training a very deep network, the gradients can become exponentially small (vanishing) or exponentially large (exploding) as they are propagated through many layers. Vanishing gradients make training extremely slow, while exploding gradients can make it numerically unstable. To partially address this issue, we must carefully initialize the weights of the network.

We could initialize them as follows:

W^{[l]} = \text{np.random.randn(shape)} * \text{np.sqrt}\left(\frac{c}{n^{[l-1]}}\right)

c = 1 if the tanh activation function is used (called Xavier initialization)

c = 2 if the ReLU activation function is used (called He initialization)
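A minimal sketch putting this together, assuming the convention that W^{[l]} has shape (n^{[l]}, n^{[l-1]}) (the helper name and layer sizes are illustrative):

```python
import numpy as np

def initialize_weights(layer_dims, activation="relu", seed=0):
    """He initialization (c = 2) for ReLU; Xavier (c = 1) for tanh."""
    rng = np.random.default_rng(seed)
    c = 2.0 if activation == "relu" else 1.0
    params = {}
    for l in range(1, len(layer_dims)):
        # W[l] has shape (n[l], n[l-1]), scaled by sqrt(c / n[l-1])
        W = rng.standard_normal((layer_dims[l], layer_dims[l - 1]))
        params[f"W{l}"] = W * np.sqrt(c / layer_dims[l - 1])
        params[f"b{l}"] = np.zeros((layer_dims[l], 1))
    return params

params = initialize_weights([3, 5, 1])  # n_x = 3, one hidden layer of 5, 1 output
print(params["W1"].std())               # roughly sqrt(2/3) ≈ 0.82
```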
