Optimizer: Below we briefly discuss some of the basic optimizers available to us and then explain our choice of “Adam” over the others.

1. SGD: Stochastic Gradient Descent (SGD), also known as incremental gradient descent, iteratively updates the parameters to minimize the objective (error) function. One of its major problems is that, when the objective function is not convex or pseudo-convex, it is almost sure to converge to a local minimum rather than the global one. SGD also has trouble navigating ravines, i.e. regions where the surface curves much more steeply in one dimension than in another: it oscillates across the slopes while making only hesitant, slow progress towards the bottom. For these reasons we will not be using this optimizer. [14]
2. Nesterov accelerated gradient: Nesterov Accelerated Gradient (NAG) is easiest to understand through an example. Imagine a ball that rolls down a hill, blindly following the slope. It would be better to have a smarter ball that knows where it is going and slows down before the hill slopes up again. NAG gives the momentum term this kind of foresight by evaluating the gradient at the approximate future position of the parameters rather than at their current position. [14]
5. RMSProp: RMSProp and AdaDelta were developed independently to resolve Adagrad’s diminishing learning rate problem. Unlike AdaDelta, however, RMSProp requires us to specify both the decay rate gamma (γ) and the learning rate eta (η); Hinton, who proposed RMSProp, suggests setting them to 0.9 and 0.001 respectively. [14]
6. Adam: Adam (Adaptive Moment Estimation) is another method that computes an adaptive learning rate for each parameter. Its developers show that it works well in practice and compares favorably against other adaptive learning-rate algorithms. They also propose default values for its parameters: Beta1 = 0.9, Beta2 = 0.999 and Epsilon = 10^-8. [14]
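
The update rules described in the list above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function names, the SGD and Nesterov learning rate of 0.01, the momentum value of 0.9 and Adam's learning rate of 0.001 are assumptions made for the example, while gamma = 0.9, eta = 0.001, Beta1 = 0.9, Beta2 = 0.999 and Epsilon = 10^-8 are the suggested values quoted above.

```python
import numpy as np

def sgd_step(w, grad, lr=0.01):
    # Plain SGD: move against the gradient (lr is an assumed value).
    return w - lr * grad

def nesterov_step(w, v, grad_fn, lr=0.01, momentum=0.9):
    # Nesterov: evaluate the gradient at the look-ahead position
    # (where the momentum term is about to carry the parameters).
    lookahead = w - momentum * v
    v = momentum * v + lr * grad_fn(lookahead)
    return w - v, v

def rmsprop_step(w, sq_avg, grad, lr=0.001, gamma=0.9, eps=1e-8):
    # RMSProp: divide the step by a decaying average of squared gradients.
    sq_avg = gamma * sq_avg + (1.0 - gamma) * grad ** 2
    w = w - lr * grad / (np.sqrt(sq_avg) + eps)
    return w, sq_avg

def adam_step(w, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam: decaying averages of the gradient (m) and squared gradient (v),
    # each corrected for its bias towards zero at early steps (t starts at 1).
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Hypothetical usage on f(w) = w[0]**2 + w[1]**2, whose gradient is 2 * w.
w = np.array([1.0, -2.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
w, m, v = adam_step(w, m, v, grad=2.0 * w, t=1)
```

Each function takes the optimizer state as arguments and returns the updated state, so the per-parameter scaling in RMSProp and Adam is visible as the element-wise division by the square root of the squared-gradient average.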

To summarize, RMSProp, AdaDelta and Adam are very similar algorithms, and since Adam was found to slightly outperform RMSProp, Adam is generally regarded as the best overall choice. [14]

We can also compare the convergence behaviour of the different optimizers in the figure below.

Figure: The optimizers on the loss surface [1]
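
A comparison of this kind can be tried out on a toy problem. The sketch below is not the experiment behind the figure: the ravine-shaped quadratic loss, its curvatures, the starting point, the step counts and both learning rates are assumptions chosen only to make the behaviour easy to inspect. It records the path of plain SGD and of Adam (with the default Beta1, Beta2 and Epsilon quoted earlier) so that the two trajectories can be printed or plotted.

```python
import numpy as np

# Assumed toy loss: shallow along x, steep along y (a "ravine").
A, B = 1.0, 50.0

def loss(w):
    return 0.5 * (A * w[0] ** 2 + B * w[1] ** 2)

def grad(w):
    return np.array([A * w[0], B * w[1]])

def run_sgd(w0, lr=0.03, steps=100):
    # Plain gradient steps; returns every visited point, shape (steps + 1, 2).
    w = np.array(w0, dtype=float)
    path = [w.copy()]
    for _ in range(steps):
        w = w - lr * grad(w)
        path.append(w.copy())
    return np.array(path)

def run_adam(w0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=100):
    # Adam steps with bias-corrected moment estimates; lr is an assumed value.
    w = np.array(w0, dtype=float)
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    path = [w.copy()]
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g ** 2
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
        path.append(w.copy())
    return np.array(path)

start = (-4.0, 1.0)  # hypothetical starting point
sgd_path = run_sgd(start)
adam_path = run_adam(start)

print("SGD  y over the first steps:", sgd_path[:4, 1])
print("SGD  final loss:", loss(sgd_path[-1]))
print("Adam final loss:", loss(adam_path[-1]))
```

With these settings the SGD path changes sign in the steep y direction at every step while x shrinks only slowly, which is the ravine behaviour described for SGD above. The final losses depend heavily on the assumed learning rates, so the numbers should not be read as a ranking of the two optimizers.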