Batch size determines how many examples the model looks at before making a weight update [15]. A larger batch gives a less noisy gradient estimate, which lets the optimizer take bigger step sizes, make faster progress, and therefore learn the final model sooner. Large batches are also computationally attractive for deep learning on GPUs, at least until memory is exhausted, because they allow easy parallelism across processors and machines. However, this benefit cannot be scaled up indefinitely: the step size itself cannot exceed 1/L, where L is the Lipschitz constant of the gradient (the smoothness constraint), so beyond a certain point larger batches stop translating into faster training [16].
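
To make these two roles concrete, here is a minimal sketch (not taken from the cited sources) of mini-batch gradient descent on a least-squares problem in NumPy. The batch size controls how many examples feed each weight update, while the 1/L smoothness bound constrains the step size; the objective, data, and variable names are all illustrative assumptions.

```python
# Minimal sketch: mini-batch gradient descent on a least-squares objective.
# The batch size sets how many examples contribute to each weight update,
# while the 1/L smoothness bound is a constraint on the step size.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))          # 10,000 examples, 5 features
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=10_000)

# For f(w) = (1/n) * ||Xw - y||^2 the gradient is Lipschitz with
# L = 2 * lambda_max(X^T X) / n, so a safe step size is eta <= 1/L.
L = 2.0 * np.linalg.eigvalsh(X.T @ X).max() / len(X)
eta = 1.0 / L

def train(batch_size, epochs=5, eta=eta):
    w = np.zeros(5)
    for _ in range(epochs):
        perm = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            idx = perm[start:start + batch_size]              # one mini-batch
            grad = 2.0 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= eta * grad                                   # one weight update
    return w

for bs in (10, 1000):
    w = train(bs)
    print(f"batch_size={bs:5d}  final loss={np.mean((X @ w - y) ** 2):.4f}")
```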
While selecting the optimal batch size, the efficiency of training and the noisiness of the gradient estimate must both be taken into consideration [16]. For example, suppose we have 10,000 training examples. With a batch size of 1,000, the gradient is computed over 1,000 examples at a time; that gradient is then used to decide whether each weight should be increased or decreased, and by how much, in order to reduce the output error. With a batch size of 10, each update sees only 10 examples, so the model must make 1,000 updates per pass over the data instead of 10. Although the cost of computing one gradient is roughly linear in the batch size, those 100 times more updates do not come 100 times cheaper in practice: each small-batch update carries fixed overhead and makes poorer use of GPU parallelism, so in wall-clock terms the smaller-batch model ends up slower to train, up to about 100 times slower in the limit where a batch of 10 takes as long to process as a batch of 1,000 [16].
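
The bookkeeping behind that comparison is simple enough to spell out. The short snippet below (purely illustrative, not from [16]) just prints how many weight updates one pass over the 10,000 examples yields for each batch size, and how many examples inform each gradient estimate.

```python
# Updates per epoch for the 10,000-example illustration above.
n_examples = 10_000
for batch_size in (1000, 10):
    updates_per_epoch = n_examples // batch_size
    print(f"batch_size={batch_size:5d}: {updates_per_epoch:4d} updates per epoch, "
          f"{batch_size} examples per gradient estimate")
# batch_size= 1000:   10 updates per epoch, 1000 examples per gradient estimate
# batch_size=   10: 1000 updates per epoch, 10 examples per gradient estimate
```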

On the noisiness side, however, the batch-10 model computes each gradient from only 10 examples, and a 10-example gradient is a lot noisier than one averaged over 1,000 examples. Although noise is usually considered harmful, in some cases it can actually help. For example, if the loss surface contains numerous valleys, a large-batch model may get stuck in the first valley it falls into, whereas with a smaller batch size the gradient noise may be enough to push the model out of a shallow valley (a shallow minimum) so it can continue toward a deeper, possibly global, minimum. In practice, moderate mini-batches (10-500) are generally used, combined with a decaying learning rate, which guarantees long-run convergence while retaining the ability to jump out of shallow minima [16]. Increasing the mini-batch size has also typically been found to decrease the rate of convergence [17].
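
The sketch below illustrates that practice under assumed settings: the batch size of 128, the initial rate, and the 1/t-style decay schedule are arbitrary choices, not values from [16], and the least-squares objective (which has only one minimum) is just a stand-in, so the block shows the training-loop pattern rather than the escape-from-shallow-minima behavior itself.

```python
# Sketch: a moderate mini-batch combined with a decaying learning rate.
# The decay shrinks the step size over the course of training so the run
# settles down after the early, noisier updates.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10_000, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=10_000)

w = np.zeros(5)
batch_size = 128            # a "moderate" mini-batch, near the 10-500 range
eta0, decay = 0.05, 0.01    # illustrative initial rate and decay constant

step = 0
for epoch in range(10):
    perm = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        grad = 2.0 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
        eta = eta0 / (1.0 + decay * step)   # 1/t-style learning-rate decay
        w -= eta * grad
        step += 1

print("final loss:", np.mean((X @ w - y) ** 2))
```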
