-
Batch normalization (BatchNorm) is a technique applied to a network that standardizes the outputs of individual hidden layers over each mini-batch. This procedure makes a network faster and more stable to train.
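Concretely, for a mini-batch of activations $\mathcal{B} = \{x_1, \ldots, x_m\}$, BatchNorm computes (following Ioffe & Szegedy 2015):

$$\mu_\mathcal{B} = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_\mathcal{B}^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_i - \mu_\mathcal{B}\right)^2$$

$$\hat{x}_i = \frac{x_i - \mu_\mathcal{B}}{\sqrt{\sigma_\mathcal{B}^2 + \epsilon}}, \qquad y_i = \gamma\,\hat{x}_i + \beta$$

where $\gamma$ (scale) and $\beta$ (shift) are learnable parameters and $\epsilon$ is a small constant for numerical stability. At inference time, running averages of $\mu_\mathcal{B}$ and $\sigma_\mathcal{B}^2$ collected during training are used instead of per-batch statistics.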
-
Intuition
- When training a network, we usually preprocess the raw data by transforming the input features to similar scales (e.g. standardization or rescaling to [-1, 1]) before feeding them into our model. This step has been shown to dramatically improve modeling results.
- Similarly, the outputs of intermediate layers may take values with widely varying magnitudes. Normalizing these intermediate outputs before passing them to the next layer keeps the activation distributions more stable across different inputs, making the network easier to train; the original BatchNorm paper attributes this to a reduction of the so-called “internal covariate shift” (ICS). A small sanity check of this normalization is sketched below.
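A quick way to see the standardization at work is to feed some made-up activations through PyTorch's `nn.BatchNorm1d` (the shapes and values here are purely illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Made-up intermediate activations: 128 samples, 16 features, deliberately
# given a large mean and scale to mimic "widely varying magnitudes".
x = 5.0 + 3.0 * torch.randn(128, 16)

bn = nn.BatchNorm1d(16)
bn.train()  # training mode: normalize with the statistics of this batch

with torch.no_grad():
    y = bn(x)

# Each feature now has mean ~0 and standard deviation ~1 across the batch.
print("before:", x.mean().item(), x.std().item())  # ~5.0, ~3.0
print("after: ", y.mean().item(), y.std().item())  # ~0.0, ~1.0
```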
-
Note: The exact reason why BatchNorm works is still under debate. Several works have argued that the reduction in internal covariate shift (ICS) is not the primary mechanism behind BatchNorm's effectiveness.
- Santurkar et al. 2018 claimed that "BatchNorm makes the optimization landscape significantly smoother. This smoothness induces a more predictive and stable behavior of the gradients, allowing for faster training."
- However, Awais et al. 2021 showed that the reduction of ICS still remains an important factor for explaining BatchNorm's performance.
-
When applying batch normalization, the choice of batch size may matter even more than it does without it: the normalization statistics are estimated from each mini-batch, so larger batches give more accurate estimates and make batch normalization more effective.
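A minimal sketch of why this is so, using random synthetic data: the standard deviation of the per-batch mean estimate shrinks roughly as $1/\sqrt{\text{batch size}}$, so small batches hand BatchNorm noisy statistics.

```python
import torch

torch.manual_seed(0)
data = torch.randn(100_000)  # one scalar feature with true mean 0, std 1

# BatchNorm estimates the mean/variance from each mini-batch; smaller
# batches give noisier estimates (std of the mean ~ 1/sqrt(batch_size)).
for batch_size in (4, 32, 256):
    idx = torch.randint(0, len(data), (1000, batch_size))
    batch_means = data[idx].mean(dim=1)  # mean estimate from each of 1000 batches
    print(f"batch_size={batch_size:3d}  "
          f"std of batch-mean estimates: {batch_means.std().item():.3f}")
```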
-
Code demo: training a LeNet with vs. without BatchNorm on the FashionMNIST dataset.
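A minimal sketch of such a demo in PyTorch is given below. The architecture follows the classic LeNet-5 adapted to 28×28 FashionMNIST inputs; the hyperparameters (SGD with lr 0.1, 5 epochs, batch size 256) are illustrative choices, not prescribed by the source.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def lenet(use_batchnorm: bool) -> nn.Sequential:
    """LeNet-5-style network; BatchNorm is inserted after each conv/linear layer."""
    def bn2d(c): return nn.BatchNorm2d(c) if use_batchnorm else nn.Identity()
    def bn1d(c): return nn.BatchNorm1d(c) if use_batchnorm else nn.Identity()
    return nn.Sequential(
        nn.Conv2d(1, 6, kernel_size=5, padding=2), bn2d(6), nn.Sigmoid(),
        nn.AvgPool2d(2),                       # 28x28 -> 14x14
        nn.Conv2d(6, 16, kernel_size=5), bn2d(16), nn.Sigmoid(),
        nn.AvgPool2d(2),                       # 10x10 -> 5x5
        nn.Flatten(),
        nn.Linear(16 * 5 * 5, 120), bn1d(120), nn.Sigmoid(),
        nn.Linear(120, 84), bn1d(84), nn.Sigmoid(),
        nn.Linear(84, 10),
    )

def train(model, loader, epochs=5, lr=0.1):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
    return loss.item()

train_set = datasets.FashionMNIST(
    root="data", train=True, download=True, transform=transforms.ToTensor())
loader = DataLoader(train_set, batch_size=256, shuffle=True)

for use_bn in (False, True):
    torch.manual_seed(0)
    final_loss = train(lenet(use_bn), loader)
    print(f"BatchNorm={use_bn}: final training loss {final_loss:.3f}")
```

With sigmoid activations, the plain LeNet tends to train slowly at first, while the BatchNorm variant typically reaches a noticeably lower loss within the same number of epochs.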
-
Ioffe, S. & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv:1502.03167