# Hyperparameter Tuning, Batch Normalization and Programming Frameworks

## Hyperparameter Tuning

As discussed earlier, hyperparameters are the settings that govern how a deep network's parameters (its weights and biases) are learned.

It is, therefore, important to set the right values for these hyperparameters. Doing so can be a time-consuming process.

Some hyperparameters include $$\alpha$$ (the learning rate), $$\beta$$ (for the momentum algorithm), the number of layers, the number of hidden units, the mini-batch size, the learning rate decay, and $$(\beta\_1, \beta\_2, \epsilon)$$ for Adam optimization.

In traditional ML, we had far fewer hyperparameters, which made grid search feasible. In Deep Learning, however, we have many more hyperparameters and should instead sample random combinations of values. A *coarse-to-fine* approach may be employed: first search over random values, then narrow the search to the region where the better values lie.
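As a rough illustration, here is a minimal sketch of random search in NumPy; the ranges, number of trials, and hyperparameter choices below are purely illustrative, and the training call itself is elided:

```
import numpy as np

# Try a handful of random hyperparameter combinations instead of a grid
# (the ranges below are illustrative, not recommendations)
for trial in range(25):
    alpha = 10 ** (-4 * np.random.rand())       # learning rate, sampled on a log scale (see below)
    n_hidden = np.random.randint(50, 101)       # hidden units per layer: 50 to 100
    batch_size = 2 ** np.random.randint(5, 10)  # mini-batch size: 32 to 512
    # ...train a model with (alpha, n_hidden, batch_size) and record its metric...
```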

Note that we must choose random hyperparameter values on an appropriate scale; for a quantity like the learning rate, a log scale is more suitable than a linear one.

For example, to get a random value between 0.0001 and 1, i.e. in $$\[10^{-4}, 10^0]$$, sampled uniformly on a log scale, do:

```
import numpy as np

r = -4 * np.random.rand()  # random exponent, uniform in [-4, 0]
x = 10**r                  # random value in [10^-4, 10^0], uniform on a log scale
```

If computational resources are limited, we may have to babysit a single model, tuning its hyperparameters over several hours or days. With sufficient resources, we can instead train models with different hyperparameter settings in parallel and choose the one that works best.

## Batch Normalization

It was earlier discussed that normalizing the inputs could speed up training.

Batch normalization extends this idea by normalizing the z values of each hidden layer (the values that pass through an activation function and become the input to the next layer). This, too, speeds up training.

Given some intermediate values $$z^{(1)}, z^{(2)}, z^{(3)}, ..., z^{(m)}$$:

$$\mu = \frac{1}{m}\sum\_{i=1}^{m}z^{(i)}$$

$$\sigma^2 = \frac{1}{m}\sum\_{i=1}^{m}(z^{(i)}-\mu)^2$$

$$z^{(i)}\_{norm} = \frac{z^{(i)}-\mu}{\sqrt{\sigma^2 + \epsilon}}$$

$$\tilde{z}^{(i)} = \gamma z^{(i)}\_{norm} + \beta$$

where $$\gamma, \beta$$ are learnable parameters. Normalization alone would force every layer's z values to have zero mean and unit variance; $$\gamma$$ and $$\beta$$ let the network learn whatever mean and variance work best for $$\tilde{z}$$.

(We use $$\epsilon$$ to avoid division by 0).
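Putting the four equations together, here is a minimal NumPy sketch of the batch-norm forward pass for one layer; the variable names and the $$\epsilon$$ value are illustrative:

```
import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-8):
    """Batch-norm forward pass for one layer; Z has shape (units, batch size)."""
    mu = np.mean(Z, axis=1, keepdims=True)  # per-unit mean over the mini-batch
    var = np.var(Z, axis=1, keepdims=True)  # per-unit variance
    Z_norm = (Z - mu) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * Z_norm + beta            # learnable rescale and shift
```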

Note that batch normalization is usually applied on mini-batches. When we use it, we can eliminate the bias $$b$$ for each layer: the mean subtraction cancels any constant added to z, and $$\beta$$ plays the role of the bias instead. During backpropagation we must also compute $$d\gamma, d\beta$$ and update $$\gamma, \beta$$ along with the W values for each layer.

While testing, since we have only one example at a time, we can't compute a mean and variance over a mini-batch. Instead, we use $$\mu$$ and $$\sigma^2$$ estimated with an exponentially weighted average across the mini-batches seen during training.
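A minimal sketch of how those running averages might be maintained and then used at test time; the momentum value 0.9, the layer size, and the function names are illustrative assumptions:

```
import numpy as np

n_units, eps = 4, 1e-8               # illustrative layer size and constant
running_mu = np.zeros((n_units, 1))
running_var = np.ones((n_units, 1))

def update_running_stats(mu, var, momentum=0.9):
    """Blend each mini-batch's statistics into the running averages during training."""
    global running_mu, running_var
    running_mu = momentum * running_mu + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var

def batchnorm_test(z, gamma, beta):
    """Normalize a single test example using the stored training statistics."""
    z_norm = (z - running_mu) / np.sqrt(running_var + eps)
    return gamma * z_norm + beta
```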

## Softmax Regression

Logistic Regression is used for binary classification, whereas Softmax Regression generalizes it to multi-class classification.

Say we have C class labels. Softmax Regression must output C probabilities, one for each class.

So, in the last layer, we use the *softmax activation function*, which is as follows:

First calculate $$t = e^{z^{\[L]}}$$ (element-wise).

Then, $$a^{\[L]}\_i=\frac{t\_i}{\sum\_{j=1}^{C} t\_j}$$

The class with the highest $$a$$ value, i.e. the highest probability, is the predicted class.
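A minimal NumPy sketch of the softmax activation; subtracting the maximum before exponentiating is a standard numerical-stability trick not mentioned above:

```
import numpy as np

def softmax(z):
    """Softmax activation for the output layer; z has one entry per class."""
    t = np.exp(z - np.max(z))  # subtracting the max improves numerical stability
    return t / np.sum(t)

a = softmax(np.array([2.0, 1.0, 0.1]))
print(a.sum(), np.argmax(a))  # probabilities sum to 1; class 0 has the highest probability
```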

For Softmax Regression, we have the following loss and cost functions:

$$L(\hat{y}, y) = -\sum\_{i=1}^{C}y\_i \log \hat{y}\_i$$

$$J(w^{\[1]}, b^{\[1]},...) = \frac{1}{m}\sum\_{i=1}^{m}L(\hat{y}^{(i)}, y^{(i)})$$
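A minimal sketch of the loss for a single example, assuming a one-hot label; the small constant inside the log is an illustrative guard against $$\log(0)$$:

```
import numpy as np

def cross_entropy_loss(y_hat, y):
    """L(y_hat, y) = -sum_i y_i * log(y_hat_i), for a one-hot label y."""
    return -np.sum(y * np.log(y_hat + 1e-12))  # small constant guards against log(0)

y = np.array([0.0, 1.0, 0.0])        # true class is class 1 (one-hot encoded)
y_hat = np.array([0.1, 0.8, 0.1])    # example softmax output
print(cross_entropy_loss(y_hat, y))  # ~0.223; the cost J averages this over all m examples
```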

## Programming Frameworks

There are several deep learning frameworks that make it easier to apply deep learning, without having to implement everything from scratch. Some of them include:

* Caffe/Caffe2
* TensorFlow
* Torch
* Keras
* Theano
* CNTK
* DL4J
* Lasagne
* mxnet
* PaddlePaddle
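As a small taste of what these frameworks provide, here is a minimal sketch that minimizes the illustrative cost $$J(w) = w^2 - 10w + 25 = (w-5)^2$$, assuming TensorFlow 2.x; the framework computes the gradients automatically:

```
import tensorflow as tf

w = tf.Variable(0.0)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

# Gradient descent on J(w) = (w - 5)^2; no hand-derived gradients needed
for _ in range(100):
    with tf.GradientTape() as tape:
        cost = w ** 2 - 10.0 * w + 25.0
    grads = tape.gradient(cost, [w])
    optimizer.apply_gradients(zip(grads, [w]))

print(w.numpy())  # approaches 5.0, the minimizer of the cost
```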

