Practical Aspects of Deep Learning
Setting up your Machine Learning Application
Train / Dev / Test Sets
In traditional ML, we usually have a relatively small number of training samples, so a 60-20-20 split for the train-dev(val)-test sets works well. For deep learning, however, we work with much larger datasets, and even 1% of the data is usually more than enough to evaluate models, so most of the data should go to training. Instead, we use a 98-1-1 split, a 99.5-0.25-0.25 split, or even a 99.5-0.4-0.1 split.
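For example, with 1,000,000 examples, a 98-1-1 split still leaves 10,000 examples each for the dev and test sets. Below is a minimal NumPy sketch of such a split; the function name, array shapes, and default fractions are illustrative assumptions, not part of the notes.

```python
import numpy as np

def split_dataset(X, Y, dev_frac=0.01, test_frac=0.01, seed=0):
    """Shuffle the examples and split them into train/dev/test sets.

    Assumes X has shape (m, n_features) and Y has shape (m,); the
    default 98-1-1 split matches the large-dataset case above."""
    m = X.shape[0]
    rng = np.random.default_rng(seed)
    idx = rng.permutation(m)                      # shuffle before splitting
    n_dev, n_test = int(m * dev_frac), int(m * test_frac)
    dev_idx = idx[:n_dev]
    test_idx = idx[n_dev:n_dev + n_test]
    train_idx = idx[n_dev + n_test:]
    return (X[train_idx], Y[train_idx]), (X[dev_idx], Y[dev_idx]), (X[test_idx], Y[test_idx])
```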
It is important that the dev and test sets come from the same distribution. This ensures that the progress measured on the dev set carries over to the test set.
Sometimes, it's okay to have just a train and dev set, and no test set. We could use the dev set to compare the performance of different algorithms/models and then deploy the best model.
Bias / Variance
Bias refers to the error caused by overly simple assumptions the model makes about the training data. High bias leads to underfitting.
Variance denotes how much the model's performance changes when the training set is varied. High variance leads to overfitting.
We can determine if our model has high bias/variance by calculating training and dev set errors:
| Training Set Error | Dev Set Error | Bias | Variance |
| --- | --- | --- | --- |
| High | Low | High | Low |
| Low | High | Low | High |
| High | High | High | High |
| Low | Low | Low | Low |

(Here, a "Low" dev set error means it is not much higher than the training set error, while "High" means it is substantially higher.)
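As a rough way to read this table in code, here is a hypothetical helper that flags high bias and high variance from the two error rates; the threshold is an arbitrary illustrative choice (in practice the errors are judged relative to the achievable, roughly human-level, error).

```python
def diagnose(train_error, dev_error, acceptable_gap=0.02):
    """Rough bias/variance diagnosis from error rates given as fractions
    (e.g. 0.15 = 15%). The acceptable_gap threshold is an arbitrary
    illustrative assumption."""
    high_bias = train_error > acceptable_gap                   # underfits the training set
    high_variance = (dev_error - train_error) > acceptable_gap # fails to generalize to the dev set
    return {"high bias": high_bias, "high variance": high_variance}

# Example: 15% training error, 16% dev error -> high bias, low variance
print(diagnose(0.15, 0.16))
```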
Basic Recipe for Machine Learning
First, identify if the model has high bias. This can be confirmed by a high training set error. If the model has high bias, do one or more of the following:
Increase the size of the network
Train the network longer
Try a different neural network architecture
Then, identify if the model has high variance. This can be confirmed by a high dev set error. If the model has high variance, do one or more of the following:
Use more training data
Perform Regularization (to reduce overfitting)
Try a different neural network architecture
In traditional ML, we spoke of a bias-variance trade-off, because there was no reliable way to reduce the bias or the variance without affecting the other. In the deep learning era, however, applying the steps above lets us largely decouple the two: training a bigger network (with proper regularization) reduces bias with little effect on variance, and getting more data reduces variance with little effect on bias, so the trade-off is far less of a constraint than it used to be.
Regularizing your Neural Network
Regularization
Regularization is a means to reduce overfitting, thus solving the high variance problem.
To perform regularization, we add a regularization term to the cost function.
For Logistic Regression, we have the following cost function, as discussed earlier:

$$J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\left(\hat{y}^{(i)}, y^{(i)}\right)$$

To perform regularization, we modify it as follows:

$$J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\left(\hat{y}^{(i)}, y^{(i)}\right) + \frac{\lambda}{2m} \|w\|_2^2$$

$\|w\|_2^2$ is the (squared) L2 norm of $w$, and $\lambda$ is called the Regularization Parameter. It is a hyperparameter chosen using the dev (cross-validation) set.
L2 regularization is the most common type of regularization.
L1 regularization would be done by adding the term $\frac{\lambda}{2m} \|w\|_1$ to the cost function instead, but it isn't as commonly used as L2 regularization.
Note: $\|w\|_2^2 = \sum_{j=1}^{n_x} w_j^2 = w^T w$ and $\|w\|_1 = \sum_{j=1}^{n_x} |w_j|$
For a neural network, L2 regularization (here the matrix norm is referred to as the Frobenius norm) is performed as follows:

$$J\left(W^{[1]}, b^{[1]}, \ldots, W^{[L]}, b^{[L]}\right) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\left(\hat{y}^{(i)}, y^{(i)}\right) + \frac{\lambda}{2m} \sum_{l=1}^{L} \left\|W^{[l]}\right\|_F^2$$

where $\left\|W^{[l]}\right\|_F^2 = \sum_{i=1}^{n^{[l]}} \sum_{j=1}^{n^{[l-1]}} \left(w_{ij}^{[l]}\right)^2$
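As a minimal NumPy sketch (not an official implementation), the Frobenius-norm penalty can be added to the cost, and the corresponding extra term to the gradients, as shown below; the `parameters` dict layout ("W1", "b1", ..., "WL", "bL") is an assumption for illustration.

```python
import numpy as np

def l2_regularized_cost(cross_entropy_cost, parameters, lambd, m):
    """Add the Frobenius-norm penalty (lambda / 2m) * sum_l ||W[l]||_F^2
    to an already-computed cross-entropy cost."""
    L = len(parameters) // 2  # parameters holds W1, b1, ..., WL, bL (assumed layout)
    penalty = sum(np.sum(np.square(parameters["W" + str(l)]))
                  for l in range(1, L + 1))
    return cross_entropy_cost + (lambd / (2 * m)) * penalty

def l2_regularized_grad(dW, W, lambd, m):
    """The backprop gradient for W[l] picks up an extra (lambda / m) * W[l]
    term, which shrinks the weights a little on every update (hence the
    name "weight decay")."""
    return dW + (lambd / m) * W
```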
Dropout Regularization
In this type of regularization, we randomly "turn off" some of the neurons in each layer on every training iteration. Turning off a neuron removes its incoming and outgoing edges. This makes the network smaller on each iteration, which has a regularizing effect.
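Below is a minimal sketch of the most common implementation, inverted dropout, applied to one layer's activations during training (dropout is not applied at test time); `keep_prob`, the probability of keeping each neuron, is a hyperparameter, and the function itself is an illustrative assumption rather than code from the notes.

```python
import numpy as np

def dropout_forward(a, keep_prob=0.8):
    """Inverted dropout on the activations `a` of one layer during training.

    Each neuron is kept with probability keep_prob; the surviving
    activations are scaled by 1 / keep_prob so the expected value of
    the layer's output is unchanged."""
    mask = np.random.rand(*a.shape) < keep_prob   # which neurons stay "on"
    a = a * mask                                  # turn the other neurons off
    a = a / keep_prob                             # rescale (the "inverted" part)
    return a, mask
```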
Other Regularization Methods
In addition to L2 and dropout regularization, there are other methods to reduce/prevent overfitting. Some of them include:
Data Augmentation: This refers to artificially increasing the amount of training data by applying transformations such as mirroring, random cropping, or small distortions to the existing training images. The additional training data helps the model generalize better and significantly reduces overfitting.
Early Stopping: This means halting training at the point where the dev (validation) set error starts to increase (i.e., when dev set accuracy starts to decrease), even if the training error is still going down; see the sketch after this list.
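Here is a minimal sketch of an early-stopping loop; `train_step` and `dev_error` are hypothetical callables standing in for a real training loop, and the `patience` parameter (waiting a few evaluations before stopping) is a common practical variant of the simpler "stop as soon as dev error rises" rule described above.

```python
def train_with_early_stopping(train_step, dev_error, max_iters=10000, patience=5):
    """Stop training once the dev set error has not improved for
    `patience` consecutive evaluations."""
    best_error = float("inf")
    bad_evals = 0
    for _ in range(max_iters):
        train_step()                  # one iteration of training
        err = dev_error()             # current error on the dev set
        if err < best_error:
            best_error = err          # dev error improved: keep going
            bad_evals = 0
        else:
            bad_evals += 1            # dev error got worse
            if bad_evals >= patience:
                break                 # dev error stopped improving: stop early
    return best_error
```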
Optimizing Your Learning Problem
These are a few ways to optimize and speed up the training process.
Normalizing Inputs
One way of speeding up training is to normalize the training data. To normalize the training data, do the following:

$$x := \frac{x - \mu}{\sigma}$$

where $\mu = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}$ and $\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} \left(x^{(i)} - \mu\right)^2$ (both computed element-wise, per feature).
Normalizing the inputs gives each feature zero mean and unit variance, so all input features end up on a similar scale. This makes the cost function more symmetric (less elongated), which lets gradient descent take larger steps and converge faster.
Note: Even the test set must be normalized using the same $\mu$ and $\sigma$ values computed from the training set!
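A minimal NumPy sketch of this normalization, assuming the column-per-example convention ($X$ of shape $(n_x, m)$, which is an assumption for illustration); note how the test set reuses the training set's $\mu$ and $\sigma$.

```python
import numpy as np

def normalize_train_test(X_train, X_test):
    """Normalize features to zero mean and unit variance.

    X_train and X_test are assumed to have shape (n_features, m_examples).
    The mean and standard deviation are computed on the training set only
    and then reused for the test set."""
    mu = np.mean(X_train, axis=1, keepdims=True)
    sigma = np.std(X_train, axis=1, keepdims=True) + 1e-8  # avoid division by zero
    X_train_norm = (X_train - mu) / sigma
    X_test_norm = (X_test - mu) / sigma   # same mu and sigma as the training set
    return X_train_norm, X_test_norm, mu, sigma
```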
Vanishing/Exploding Gradients
Sometimes, while training a very deep network, the derivatives/gradients can become extremely small (vanishing) or extremely large (exploding). This makes training slow or unstable. To partially solve this issue, we must carefully initialize the weights of the network.
We could initialize the weights of layer $l$ as follows (see the sketch after this list):

$$W^{[l]} = \text{np.random.randn}\left(n^{[l]}, n^{[l-1]}\right) \times \sqrt{\frac{c}{n^{[l-1]}}}$$

c = 1 if the tanh activation function is being used (called Xavier initialization)
c = 2 if the ReLU activation function is being used (called He initialization)
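A minimal NumPy sketch of this initialization; `layer_dims` and the parameter-dict layout are illustrative assumptions.

```python
import numpy as np

def initialize_weights(layer_dims, activation="relu"):
    """Initialize W[l] with variance scaled by the fan-in n[l-1].

    layer_dims = [n_x, n_1, ..., n_L]. Uses c = 2 (He initialization)
    for ReLU and c = 1 (Xavier initialization) for tanh."""
    c = 2.0 if activation == "relu" else 1.0
    parameters = {}
    for l in range(1, len(layer_dims)):
        parameters["W" + str(l)] = (np.random.randn(layer_dims[l], layer_dims[l - 1])
                                    * np.sqrt(c / layer_dims[l - 1]))
        parameters["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return parameters
```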