Choosing the Learning Rate α
If α is sufficiently small, J(θ) will decrease on every iteration. This is how we know for sure that gradient descent is working.
However, if gradient descent is taking too long to converge, α may be too small.
On the other hand, if α is too large, we may overshoot the minimum, causing the cost function to increase instead of decrease.
So, to choose an efficient learning rate, plot J(θ) on the Y-axis against the number of iterations on the X-axis for different values of α (say 0.01, 0.03, 0.1, 0.3, 1, i.e. increasing roughly threefold) and see which value of α makes gradient descent converge the quickest, as in the sketch below.
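A minimal sketch of this procedure, assuming a toy one-feature linear regression problem (the data, the `cost` and `gradient_descent` helpers, and the specific α values tried are illustrative, not from the original notes): it records J(θ) at each iteration for several learning rates and plots the curves so the fastest-converging α can be read off the graph.

```python
import numpy as np
import matplotlib.pyplot as plt

# Toy data: y is roughly 2*x + 1 with a little noise (assumed example data)
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = 2 * x + 1 + 0.1 * rng.standard_normal(50)
X = np.column_stack([np.ones_like(x), x])   # add an intercept column

def cost(theta):
    """Mean squared error cost J(theta)."""
    residual = X @ theta - y
    return (residual @ residual) / (2 * len(y))

def gradient_descent(alpha, iterations=100):
    """Run batch gradient descent and return the cost at each iteration."""
    theta = np.zeros(2)
    history = []
    for _ in range(iterations):
        grad = X.T @ (X @ theta - y) / len(y)   # gradient of J(theta)
        theta -= alpha * grad                   # gradient descent update
        history.append(cost(theta))
    return history

# Try learning rates increasing roughly threefold, as suggested above
for alpha in [0.01, 0.03, 0.1, 0.3, 1.0]:
    plt.plot(gradient_descent(alpha), label=f"alpha = {alpha}")

plt.xlabel("# of iterations")
plt.ylabel("J(theta)")
plt.legend()
plt.show()
```

On the resulting plot, a curve that decreases on every iteration indicates gradient descent is working for that α; a curve that flattens out very slowly suggests α is too small, and one that rises or oscillates suggests α is too large.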