Optimizer: Below we briefly discuss some of the basic optimizers available to us and then explain our choice of “Adam” over the others.

  1. SGD: Stochastic Gradient Descent (SGD), also known as incremental gradient descent, iteratively updates the parameters to minimize (or maximize) an objective function. One of its major problems is that, when the objective function is not convex or pseudo-convex, it is almost sure to converge to a local minimum rather than the global one. Because SGD also has trouble navigating ravines, where it oscillates across the slopes while making only slow, hesitant progress towards the bottom, we will not be using this optimizer. [14]
  2. Nesterov accelerated gradient: Nesterov accelerated gradient (NAG) is easiest to understand with an example. Imagine a ball that rolls down a hill, blindly following the slope. It would be better to have a smarter ball that has a notion of where it is going and slows down before the hill slopes up again. Nesterov accelerated gradient gives the momentum term this kind of prescience by evaluating the gradient at the approximate future position of the parameters rather than at the current one. [14]
  3. Adagrad: While Nesterov accelerated gradient lets us adapt our updates to the slope and so speed up SGD, it would be even better to adapt the updates to each individual parameter, i.e. to perform larger updates for parameters associated with infrequent features and smaller updates for those associated with frequent features. This is what Adagrad achieves. It in turn increases the speed, scalability and robustness of SGD and can be used to train large-scale neural nets. It was found to be especially suitable for sparse data and was used at Google to train networks that learned to recognize cats in YouTube videos. [14] One of the main benefits of Adagrad is that it eliminates the need to manually tune the learning rate. Its main disadvantage is the accumulation of squared gradients in the denominator: the sum keeps growing during training and makes the learning rate infinitesimally small, to the point where no additional knowledge can be acquired. [14]
  4. AdaDelta: The AdaDelta optimizer is an extension of Adagrad that aims to solve the problem of the infinitesimally small learning rate. Instead of accumulating all past squared gradients, it restricts the window of accumulated past gradients to some fixed size. With AdaDelta we do not even need to set a default learning rate, since it has been eliminated from the update rule. [14]
  5. RMSProp: RMSProp and AdaDelta were developed independently of each other to resolve Adagrad’s diminishing learning rate problem. Unlike AdaDelta, however, RMSProp requires us to specify the decay term gamma (γ) and the learning rate (η); its developer, Geoffrey Hinton, suggests setting these to 0.9 and 0.001 respectively. [14]
  6. Adam: Adam (Adaptive Moment Estimation) is another method that computes an adaptive learning rate for each parameter. Its developers show that it works well in practice and compares favorably against other adaptive learning algorithms. They also propose default values for the Adam parameters: Beta1 = 0.9, Beta2 = 0.999 and Epsilon = 10^-8. [14]
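
The difference between Adagrad, RMSProp and Adam described above comes down to how each one scales the gradient of an individual parameter. The following is a minimal sketch in plain Python, not the implementation used in this work. The function names, the learning rates for Adagrad (0.01) and Adam (0.001), and the placement of epsilon outside the square root are assumptions made for illustration; gamma, Beta1, Beta2 and Epsilon use the defaults quoted in the list.

```python
import math

def adagrad_step(params, grads, sq_sum, lr=0.01, eps=1e-8):
    """Adagrad: scale by the root of the running SUM of squared gradients."""
    for i, g in enumerate(grads):
        sq_sum[i] += g * g  # never decreases, so the step keeps shrinking
        params[i] -= lr * g / (math.sqrt(sq_sum[i]) + eps)

def rmsprop_step(params, grads, sq_avg, lr=0.001, gamma=0.9, eps=1e-8):
    """RMSProp: scale by a decaying AVERAGE of squared gradients instead."""
    for i, g in enumerate(grads):
        sq_avg[i] = gamma * sq_avg[i] + (1.0 - gamma) * g * g
        params[i] -= lr * g / (math.sqrt(sq_avg[i]) + eps)

def adam_step(params, grads, m, v, t, lr=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: decaying averages of the gradient (m) and squared gradient (v),
    each bias-corrected using the 1-based step counter t."""
    for i, g in enumerate(grads):
        m[i] = beta1 * m[i] + (1.0 - beta1) * g
        v[i] = beta2 * v[i] + (1.0 - beta2) * g * g
        m_hat = m[i] / (1.0 - beta1 ** t)
        v_hat = v[i] / (1.0 - beta2 ** t)
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)

def step_sizes(n_steps, grad=1.0):
    """Size of every update when each gradient equals `grad` (toy setting)."""
    sizes = {"adagrad": [], "rmsprop": [], "adam": []}
    pa, sa = [0.0], [0.0]
    pr, sr = [0.0], [0.0]
    pm, m, v = [0.0], [0.0], [0.0]
    for t in range(1, n_steps + 1):
        old = (pa[0], pr[0], pm[0])
        adagrad_step(pa, [grad], sa)
        rmsprop_step(pr, [grad], sr)
        adam_step(pm, [grad], m, v, t)
        sizes["adagrad"].append(old[0] - pa[0])
        sizes["rmsprop"].append(old[1] - pr[0])
        sizes["adam"].append(old[2] - pm[0])
    return sizes

sizes = step_sizes(100)
for name, values in sizes.items():
    print(name, "first step:", values[0], "last step:", values[-1])
```

With a constant gradient, the Adagrad step shrinks on every iteration because its denominator only grows, which is the diminishing learning rate problem noted in item 3. The RMSProp and Adam steps settle near their learning rate because old squared gradients are forgotten.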


To summarize, RMSProp, AdaDelta and Adam are very similar algorithms, and since Adam was found to slightly outperform RMSProp, Adam is generally regarded as the best overall choice. [14]

The convergence behavior of the different optimizers can also be seen in the figure below.




Figure: The optimizers on the loss surface [1]
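
The ravine behavior of plain gradient descent and the look-ahead step of Nesterov accelerated gradient can also be tried out on a toy loss surface. The sketch below assumes a hypothetical elongated quadratic f(x, y) = 0.5 * (x^2 + 25 * y^2) as the ravine, uses the exact gradient rather than a stochastic one, and picks the step sizes and the momentum value 0.9 purely for illustration. It is not the surface shown in the figure, and the outcome depends on these settings.

```python
def loss(p):
    """Toy ravine: shallow along x, steep along y (hypothetical surface)."""
    x, y = p
    return 0.5 * (x * x + 25.0 * y * y)

def grad(p):
    """Exact gradient of the toy ravine."""
    x, y = p
    return (x, 25.0 * y)

def run_gd(start, lr, steps):
    """Plain gradient descent; returns the list of visited points."""
    path = [start]
    p = start
    for _ in range(steps):
        g = grad(p)
        p = (p[0] - lr * g[0], p[1] - lr * g[1])
        path.append(p)
    return path

def run_nesterov(start, lr, steps, mu=0.9):
    """Nesterov accelerated gradient; returns the list of visited points."""
    path = [start]
    p = start
    v = (0.0, 0.0)
    for _ in range(steps):
        # Evaluate the gradient where the momentum is about to carry us,
        # not at the current position.
        ahead = (p[0] - mu * v[0], p[1] - mu * v[1])
        g = grad(ahead)
        v = (mu * v[0] + lr * g[0], mu * v[1] + lr * g[1])
        p = (p[0] - v[0], p[1] - v[1])
        path.append(p)
    return path

start = (10.0, 1.0)

# A larger step makes plain descent jump from one wall of the ravine
# to the other: the y coordinate changes sign on every step.
zigzag = run_gd(start, lr=0.07, steps=3)
print("y coordinates:", [p[1] for p in zigzag])

# Same step size and step budget for both methods.
gd_end = run_gd(start, lr=0.03, steps=50)[-1]
nag_end = run_nesterov(start, lr=0.03, steps=50)[-1]
print("plain descent loss:", loss(gd_end))
print("Nesterov loss:", loss(nag_end))
```

Recording the full path rather than only the final point makes it possible to plot the trajectories over the contours of the surface, in the style of the figure above.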

To learn more about stochastic gradient descent, see “Cost function: Stochastic Gradient Descent”.