Dropout: A Simple Way to Prevent Neural Networks from Overfitting -- Srivastava et al. 2014
-
During training, randomly drop hidden units (along with their connections) from the network, each with some probability p, independently on each training iteration.
-
Debiases each layer by dividing the retained activations by the fraction of nodes kept, 1 - p.
For a given layer with dropout probability p, each intermediate activation h is replaced by a random variable h' as follows:
h' = 0 with probability p,
h' = h / (1 - p) with probability 1 - p.
By design, the expectation remains unchanged: E[h'] = p * 0 + (1 - p) * h / (1 - p) = h.
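A minimal NumPy sketch of this scheme (often called "inverted dropout"); the function and toy data below are illustrative, not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p, training=True):
    """Zero each activation with probability p and divide the
    survivors by 1 - p, so E[h'] = h over the random mask."""
    if not training or p == 0.0:
        return h
    mask = rng.random(h.shape) >= p        # keep with probability 1 - p
    return np.where(mask, h / (1.0 - p), 0.0)

h = np.ones(100_000)
print(dropout(h, p=0.5).mean())  # ~1.0: the expectation is preserved
```
-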
Can apply higher dropout probability to layers with more hidden units.
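For instance, a hypothetical PyTorch stack where the wider layer gets the more aggressive rate (the sizes and rates here are illustrative, not tuned values from the paper):

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 2048), nn.ReLU(), nn.Dropout(p=0.5),  # wide layer: higher drop rate
    nn.Linear(2048, 256), nn.ReLU(), nn.Dropout(p=0.2),  # narrower layer: milder drop rate
    nn.Linear(256, 10),
)
```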
-
At test or validation time, turn off dropout. (We don't want to add noise to predictions.)
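In PyTorch, for example, this is the train/eval toggle (the tiny model below is purely illustrative):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(8, 2))
x = torch.randn(1, 8)

model.train()   # dropout active: a fresh mask is sampled on every forward pass
model.eval()    # dropout disabled: nn.Dropout becomes the identity
with torch.no_grad():
    assert torch.equal(model(x), model(x))  # deterministic: no noise at prediction time
```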
-
Dropout breaks co-adaptation among neurons.
Co-adaptation means that some neurons are highly dependent on others. If one of those neurons receives bad inputs, the dependent neurons can be affected as well.
-
Model weights are encouraged to spread out across many hidden units, rather than depending too heavily on a small number of potentially spurious associations.
-
Dropout prevents the network from relying too heavily on a small set of neurons with large weights to generate outputs. In other words, it encourages smaller, more evenly distributed weights, an effect similar to that of L2 regularization.
-
Figure: Test error for different architectures with and without dropout (Fig. 4 of Srivastava et al. 2014).
-
Hidden layers trained with dropout tend to capture more meaningful features.