It is one of the most important hyper-parameter for Long Short Term Models, always accounting for more than two third of the variance, which makes it very important to set it correctly to achieve good performance. Figure showing the importance and its impact on the error and the time, as observed from a research papers has been attached below to illustrate and emphasize its importance. [1]

Key Takeaway : Need to find the Goldilocks zone for the learning rate

  • Too small learning rate , longer training time and worst error rate (by getting stuck at local minima)
  • Too large learning rate, smaller training time but poor error rate.
  • “The analysis of hyperparameter interactions revealed that even the highest measured interaction (between learning rate and network size) is quite small. This implies that the hyperparameters can be tuned independently. In particular, the learning rate can be calibrated first using a fairly small network, thus saving a lot of experimentation time.” [1]

Note : If you want to see all this in action, I have attached the excel sheet herewith, where you can change the learning rate and see the “Not convergence  / Divergence”, “Learning rate vs training time” , “Learning rate vs error” in action. All the formulae has been already added in the excel sheet, so that when u just  change the   learning rate, you can see all the changes in action.
FILE : Click to download  below.  machine learning plots in excel, data mining

Learning Rate vs Error vs Training Time.jpg












Figure Showing the impact of the learning rate on the error and the training time

The following diagram 2 should also, be able to emphasize on its importance.


Figure. Showing importance of learning rate on accuracy [2]


Too small learning rate
Learning rate = 0.5, Number of  computation steps = Large (11)
learning rate too small.JPG  Cost function at 0.5

Too Large Learning rate: Sub -Optimal Convergence
Learning rate = 5, Number of  computation steps = 2 , Sub-optimal convergence at 3.0
Cost function at 7

Too Large Learning rate : Also possible divergence
diverging problem graphHere  we can easily see that, because of high learning rate, the parameters are being toggle from negative to positive back and froth e.g θ0 toggles from 0.5 to -0.8 to 0.7 to -0.8 to 0.9 to -1.0. All in the hope that it will converge, but as we can see clearly the problem instead of converging is diverging.

divergence due to high learnign rate.JPG