Importance:

Optimisers play a crucial role in improving the accuracy of a model.

Many optimiser variants are available. We briefly discuss several of them along with their pros and cons.

Variants

1. SGD (Stochastic Gradient Descent): SGD, also known as incremental gradient descent, iteratively searches for a minimum of the error (or a maximum of an objective) by repeatedly moving the parameters a small step along the gradient.

Drawback: When the objective function is not convex or pseudoconvex, SGD is almost sure to converge to a local minimum rather than the global one. It also has trouble navigating ravines, where it oscillates across the slopes while making only hesitant, slow progress towards the bottom. For these reasons we will not be using this optimizer. [1]
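The ravine behaviour described above can be illustrated with a minimal sketch. The test function f(x, y) = 0.5 * (x^2 + 50 * y^2), the learning rate of 0.035 and the starting point are all assumptions chosen for illustration, and the exact gradient stands in for the noisy mini-batch estimate that SGD would use in practice.

```python
def grad(x, y):
    """Gradient of the ravine-shaped f(x, y) = 0.5 * (x**2 + 50 * y**2)."""
    return x, 50.0 * y


def sgd_path(x, y, lr=0.035, steps=5):
    """Apply the plain update w <- w - lr * grad(w) and record every point."""
    path = [(x, y)]
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
        path.append((x, y))
    return path


# Illustrative starting point: far along the flat x direction.
path = sgd_path(10.0, 1.0)
for x, y in path:
    print(f"x = {x:8.4f}   y = {y:8.4f}")
```

In this sketch the steep coordinate y flips sign on every step (the oscillation across the slopes), while the flat coordinate x shrinks by only a few percent per step (the slow progress along the bottom).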

2. Nesterov accelerated gradient: Nesterov accelerated gradient (NAG) is easiest to understand with an example. Imagine a ball that rolls down a hill, blindly following the slope. It would be better to have a smarter ball that has a notion of where it is going and slows down before the hill slopes up again. NAG gives the momentum term this kind of prescience by evaluating the gradient at the approximate future position of the parameters rather than at the current one. [1]
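As a rough sketch of the look-ahead idea, the update can be written as v <- gamma * v + lr * grad(w - gamma * v), followed by w <- w - v. The test function, the momentum value gamma = 0.9, the learning rate and the starting point below are illustrative assumptions, not values taken from this report.

```python
def grad(x, y):
    """Gradient of the illustrative bowl f(x, y) = 0.5 * (x**2 + 50 * y**2)."""
    return x, 50.0 * y


def nag_minimise(x, y, lr=0.01, gamma=0.9, steps=300):
    """Nesterov accelerated gradient: look ahead first, then update the velocity."""
    vx, vy = 0.0, 0.0
    for _ in range(steps):
        # Evaluate the gradient at the look-ahead point, not at the current one.
        gx, gy = grad(x - gamma * vx, y - gamma * vy)
        vx = gamma * vx + lr * gx
        vy = gamma * vy + lr * gy
        x, y = x - vx, y - vy
    return x, y


x_final, y_final = nag_minimise(10.0, 1.0)
print(x_final, y_final)
```

Evaluating the gradient at the look-ahead point is what lets the "ball" brake before the slope rises again. Setting the look-ahead offset to zero would turn this sketch into classical momentum.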

Adagrad: One of the main benefits of Adagrad is that it eliminates the need to manually tune the learning rate. Its main disadvantage is that it accumulates the squared gradients in the denominator. This sum keeps growing during training and eventually makes the learning rate infinitesimally small, at which point the algorithm can no longer acquire additional knowledge. [1]
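The shrinking learning rate can be seen in a small sketch. It assumes the commonly stated Adagrad update, in which the step is lr / sqrt(G + eps) times the gradient and G is the running sum of squared gradients. The objective f(w) = w^2, the base rate lr = 1.0 and the starting point are illustrative choices.

```python
import math


def adagrad_rates(w=5.0, lr=1.0, eps=1e-8, steps=50):
    """Minimise f(w) = w**2 with Adagrad and record the effective learning rate."""
    grad_sq_sum = 0.0  # G: accumulated squared gradients, never decreases
    rates = []
    for _ in range(steps):
        g = 2.0 * w                               # gradient of w**2
        grad_sq_sum += g * g
        rate = lr / math.sqrt(grad_sq_sum + eps)  # effective learning rate
        rates.append(rate)
        w -= rate * g
    return w, rates


w_final, rates = adagrad_rates()
print(f"first rate = {rates[0]:.4f}, last rate = {rates[-1]:.4f}, w = {w_final:.4f}")
```

Because the accumulated sum can only grow, the recorded rates can only fall, which is the diminishing learning rate problem described above.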

5. RMSProp: RMSProp and AdaDelta were developed independently to resolve Adagrad’s diminishing learning rate problem. Unlike AdaDelta, however, RMSProp requires us to specify the decay term gamma (γ) and the learning rate (η). Its developer, Hinton, suggests setting them to 0.9 and 0.001 respectively. [1]
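A minimal sketch of one RMSProp step follows, using the suggested gamma = 0.9 and learning rate 0.001. It assumes the usual formulation in which a decaying average of squared gradients replaces Adagrad's ever-growing sum. The function name `rmsprop_step`, the value of eps and the toy objective f(w) = w^2 are illustrative assumptions.

```python
import math


def rmsprop_step(w, g, avg_sq, lr=0.001, gamma=0.9, eps=1e-8):
    """One RMSProp update; returns the new parameter and the new running average."""
    # Exponentially decaying average of squared gradients.
    avg_sq = gamma * avg_sq + (1.0 - gamma) * g * g
    w = w - lr * g / math.sqrt(avg_sq + eps)
    return w, avg_sq


# Toy run: minimise f(w) = w**2, whose gradient is 2 * w.
w, avg_sq = 1.0, 0.0
for _ in range(500):
    w, avg_sq = rmsprop_step(w, 2.0 * w, avg_sq)
print(w)
```

Since old squared gradients decay away, the denominator tracks the recent gradient magnitude instead of growing without bound, so the step size does not collapse the way it does in Adagrad.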

6. Adam: Adam is another method that computes an adaptive learning rate for each parameter. Its developers show that it works well in practice and compares favorably against other adaptive learning algorithms. They also propose default values for its parameters: Beta1 = 0.9, Beta2 = 0.999 and Epsilon = 10^-8. [14]
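The sketch below shows one Adam step with the default Beta1, Beta2 and Epsilon listed above. It assumes the standard formulation with bias-corrected first and second moment estimates. The learning rate of 0.001, the function name `adam_step` and the toy objective f(w) = w^2 are assumptions made for illustration.

```python
import math


def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at time step t (t starts at 1); returns (w, m, v)."""
    m = beta1 * m + (1.0 - beta1) * g       # decaying average of gradients
    v = beta2 * v + (1.0 - beta2) * g * g   # decaying average of squared gradients
    m_hat = m / (1.0 - beta1 ** t)          # bias correction for the zero start
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# Toy run: minimise f(w) = w**2, whose gradient is 2 * w.
w, m, v = 2.0, 0.0, 0.0
for t in range(1, 1001):
    w, m, v = adam_step(w, 2.0 * w, m, v, t)
print(w)
```

With the bias correction in place, the very first step has a magnitude close to the learning rate regardless of how large the gradient is, which is one reason the defaults tend to need little tuning.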

Figure: Optimisers on the loss surface [1]

CONCLUSION: To summarize, RMSProp, AdaDelta and Adam are very similar algorithms, and since Adam was found to slightly outperform RMSProp, Adam is generally chosen as the best overall choice. [1]

Reference