- Depending on the update frequency of the model parameters, there are three variants of gradient descent: full-batch, stochastic, and mini-batch gradient descent.
- In full-batch gradient descent, all the training examples are taken into account to make a single update of the model parameters.
- The update trajectory is rather smooth because the direction of every single step is the mean gradient over all training examples.
- Without parallelization, the computational cost of each parameter update iteration is O(n), with n being the total number of training examples.
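As a concrete illustration, here is a minimal NumPy sketch of full-batch gradient descent on a least-squares objective; the names (X, y, theta, eta) and the squared-error loss are assumptions made for this example, not part of any particular library.

```python
import numpy as np

def full_batch_gd(X, y, eta=0.1, num_iters=100):
    """Each iteration averages the gradient over all n examples (O(n) per update)."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(num_iters):
        # Mean gradient of the squared-error loss over the entire training set
        grad = X.T @ (X @ theta - y) / n
        theta -= eta * grad
    return theta
```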
- Full-batch gradient descent is not particularly data efficient when the training examples are very similar, since redundant examples add little information to the gradient while still adding to the cost of computing it.
- In stochastic gradient descent, the model parameters are updated with just one example at a time.
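A minimal sketch of this one-example-at-a-time update, assuming the same least-squares setup as above (X, y, eta are illustrative names):

```python
import numpy as np

def sgd(X, y, eta=0.01, num_epochs=10):
    """Update the parameters with one randomly chosen example at a time."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(num_epochs):
        for i in np.random.permutation(n):
            # Gradient of the squared error on the single example i
            grad = (X[i] @ theta - y[i]) * X[i]
            theta -= eta * grad
    return theta
```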
- Stochastic gradient descent is not particularly computationally efficient, since CPUs and GPUs cannot exploit the full power of vectorization.
- The stochastic gradient is an unbiased estimate of the full-batch gradient.
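In symbols, if ℓ_i(θ) is the loss on example i and L(θ) is the average loss over all n training examples, then sampling an index i uniformly at random gives:

$$
\mathbb{E}_{i \sim \mathrm{Unif}\{1,\dots,n\}}\big[\nabla_\theta \ell_i(\theta)\big] = \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta \ell_i(\theta) = \nabla_\theta L(\theta)
$$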
- The trajectory of the parameters fluctuates over the training examples, due to the stochastic nature of the gradient.
- The noisy learning process could make it hard for the algorithm to settle down on a good solution / critical point if the learning rate parameter η is not properly adjusted.
- A time-dependent learning rate η(t) adds flexibility in controlling the convergence of an optimization algorithm. If η decays too quickly, we stop making progress in optimization; if η decays too slowly, we may end up wasting too much time letting the model jump around.
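One common family of schedules is inverse-time decay; the sketch below is a generic example, with eta0 and decay_rate as hypothetical hyperparameter names:

```python
def inverse_time_decay(eta0, decay_rate, t):
    """Time-dependent learning rate: eta(t) = eta0 / (1 + decay_rate * t)."""
    return eta0 / (1.0 + decay_rate * t)

# With eta0 = 0.1 and decay_rate = 0.1:
#   eta(0)   = 0.1
#   eta(100) ≈ 0.009   (the step size shrinks as training progresses)
```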
- In mini-batch gradient descent, we randomly split the training examples into small batches, and use each mini-batch to estimate the update direction of the model parameters.
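A minimal NumPy sketch of this shuffle-and-split scheme, again assuming a least-squares loss (X, y, batch_size, eta are illustrative names):

```python
import numpy as np

def minibatch_gd(X, y, eta=0.05, batch_size=32, num_epochs=10):
    """Shuffle the data each epoch, split it into mini-batches, and update once per batch."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(num_epochs):
        perm = np.random.permutation(n)
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # The mean gradient over the mini-batch estimates the full-batch direction
            grad = Xb.T @ (Xb @ theta - yb) / len(idx)
            theta -= eta * grad
    return theta
```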
- Mini-batch gradient descent strikes a balance between the robustness of full-batch gradient descent and the noisiness of stochastic gradient descent, with the added benefit of a vectorized implementation for faster computation.
- Batch-size is a hyperparameter you can tune in the training algorithm. It is usually set to a power of 2 that fits in the memory of the GPU or CPU, such as 32, 64, 128, or 256.
- A smaller batch-size makes training more data efficient, at the cost of larger stochastic noise in the gradient estimates. A larger batch-size results in a more stable learning process.
- A smaller batch-size typically leads to better generalization performance. The noisy oscillation makes the optimization algorithm more likely to settle down in a wide, flat region of the loss landscape, rather than in a narrow, sharp one.