0.2 Feedforward networks
An approach to speeding up learning is to exploit parallel computation. In particular, methods for training networks through asynchronous gradient updates have been developed for use on both single machines and distributed systems. By keeping a canonical set of parameters that are read by and updated in an asynchronous fashion by multiple copies of a single network, computation can be efficiently distributed over both processing cores in a single CPU, and across CPUs in a cluster of machines.
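As a rough illustration of asynchronous updates, the sketch below lets several worker threads read and update one shared parameter without locking. The toy problem (fitting y = 3x with a single weight), the function name `async_sgd`, the learning rate, and the worker count are all assumptions made for this example; Python threads stand in for the cores or machines mentioned above and do not give true parallel speedup in CPython.

```python
import random
import threading

def async_sgd(num_workers=4, steps_per_worker=2000, lr=0.05):
    # Canonical parameters shared by all workers (a single weight here).
    params = [0.0]

    def worker(seed):
        rng = random.Random(seed)          # each worker draws its own data
        for _ in range(steps_per_worker):
            x = rng.uniform(-1.0, 1.0)
            y = 3.0 * x                    # toy target: y = 3x (assumption)
            w = params[0]                  # read the (possibly stale) parameter
            grad = 2.0 * (w * x - y) * x   # derivative of (w*x - y)**2 w.r.t. w
            params[0] -= lr * grad         # lock-free write to the shared copy

    threads = [threading.Thread(target=worker, args=(s,)) for s in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return params[0]

w = async_sgd()
print(w)  # should end up close to 3.0
```

Because no lock is taken, a worker may compute its gradient from a slightly outdated value or overwrite another worker's update. On this small convex problem that does not prevent convergence, which is the property asynchronous methods rely on.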
To avoid recomputation, we can think of back-propagation as a table-filling algorithm that takes advantage of storing intermediate results. Each node in the graph has a corresponding slot in a table to store the gradient for that node. By filling in these table entries in order, back-propagation avoids repeating many common subexpressions. This table-filling strategy is sometimes called dynamic programming.
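The table-filling idea can be sketched on a tiny hand-written graph. The graph, the node names, and the function `backprop_table` below are hypothetical and chosen only for illustration; the point is that each node gets exactly one gradient slot, filled once in reverse topological order.

```python
def backprop_table(x_value):
    # Graph in topological order: (node, operation, input nodes).
    graph = [
        ("a", "mul", ("x", "x")),   # a = x * x
        ("b", "add", ("a", "x")),   # b = a + x
        ("f", "mul", ("a", "b")),   # f = a * b = x**4 + x**3
    ]

    # Forward pass: fill the table of values.
    value = {"x": x_value}
    for node, op, (u, v) in graph:
        value[node] = value[u] * value[v] if op == "mul" else value[u] + value[v]

    # Backward pass: one table slot per node, holding d f / d node.
    grad = {name: 0.0 for name in value}
    grad["f"] = 1.0
    for node, op, (u, v) in reversed(graph):
        g = grad[node]              # final, since all consumers were visited first
        if op == "mul":
            grad[u] += g * value[v]
            grad[v] += g * value[u]
        else:
            grad[u] += g
            grad[v] += g
    return value["f"], grad

f, grad = backprop_table(2.0)
print(f, grad["x"])  # f = 24.0, df/dx = 4*x**3 + 3*x**2 = 44.0
```

The slot for `a` is reused by both of its consumers instead of being re-derived along each path, which is where the saving over naive repeated application of the chain rule comes from.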
The basic idea of the back propagation method of learning is to combine a nonlinear perceptron-like system capable of making decisions with the objective error function of LMS and gradient descent.
We will not bother with the mathematics here, since it is presented elsewhere in detail.
Suffice it to say that with an appropriate choice of nonlinear function we can perform the differentiation and derive the back propagation learning rule.
The application of the back propagation rule, then, involves two phases:
During the first phase the input is presented and propagated forward through the network to compute the output value for each unit.
This output is then compared with the target, resulting in an error term (conventionally written δ) for each output unit.
The second phase involves a backward pass through the network (analogous to the initial forward pass) during which the δ term is computed for each unit in the network. This second, backward pass allows the δ term to be computed recursively, from the output units back toward the input.
Once these two phases are complete, we can compute, for each weight, the product of the δ term associated with the unit it projects to times the activation of the unit it projects from.
Henceforth we will call this product the weight error derivative since it is proportional to (minus) the derivative of the error with respect to the weight. These weight error derivatives can then be used to compute actual weight changes on a pattern-by-pattern basis, or they may be accumulated over the ensemble of patterns.
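The two phases can be sketched for a small network with one hidden layer. This is a minimal example under stated assumptions: sigmoid units, no bias terms, a squared-error objective E = 0.5 * sum((target - output)**2), and hypothetical names (`weight_error_derivatives`, `w_hidden`, `w_out`) and weight values that do not come from the text.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def weight_error_derivatives(x, target, w_hidden, w_out):
    # Phase 1: forward pass computes the activation of every unit.
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    out = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_out]

    # Compare output with target: one error term per output unit.
    delta_out = [(t - o) * o * (1.0 - o) for t, o in zip(target, out)]

    # Phase 2: backward pass computes the error term of each hidden unit
    # recursively from the terms of the units it projects to.
    delta_hidden = [
        h * (1.0 - h) * sum(d * w_out[k][j] for k, d in enumerate(delta_out))
        for j, h in enumerate(hidden)
    ]

    # Weight error derivative: (term of receiving unit) * (activation of sending unit).
    wed_out = [[d * h for h in hidden] for d in delta_out]
    wed_hidden = [[d * xi for xi in x] for d in delta_hidden]
    return out, wed_hidden, wed_out

out, wed_hidden, wed_out = weight_error_derivatives(
    x=[0.5, -1.0], target=[1.0],
    w_hidden=[[0.1, -0.2], [0.4, 0.3]], w_out=[[0.7, -0.5]],
)
```

With this sign convention each entry equals minus the derivative of E with respect to the corresponding weight, so a pattern-by-pattern update adds a small multiple of the entry to that weight, while batch learning sums the entries over all patterns first.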
The deep learning community has been somewhat isolated from the broader computer science community and has largely developed its own cultural attitudes concerning how to perform differentiation. More generally, the field of automatic differentiation is concerned with how to compute derivatives algorithmically.
The back-propagation algorithm described here is only one approach to automatic differentiation. It is a special case of a broader class of techniques called reverse mode accumulation. Back-propagation is therefore neither the only way nor the optimal way of computing the gradient, but it is a practical method that continues to serve the deep learning community well. In the future, differentiation technology for neural networks may improve as deep learning practitioners become more aware of advances in the broader field of automatic differentiation.
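For contrast with reverse mode, here is a minimal sketch of forward mode accumulation using dual numbers, another technique from the automatic differentiation literature. The `Dual` class and the test function `f` are hypothetical and support only addition and multiplication.

```python
class Dual:
    """A value paired with its derivative with respect to one chosen input."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        # Product rule, applied as the computation moves forward.
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

def f(x):
    a = x * x
    return a * (a + x)       # f(x) = x**4 + x**3

y = f(Dual(2.0, 1.0))        # seed the input with dx/dx = 1
print(y.value, y.deriv)      # 24.0 and f'(2) = 4*8 + 3*4 = 44.0
```

Forward mode needs one pass per input variable, whereas reverse mode needs one pass per output. That is why reverse mode suits neural network training, where a single scalar loss depends on many parameters.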
A feedforward neural network is an artificial neural network wherein connections between the nodes do not form a cycle. As such, it is different from recurrent neural networks. In this network, the information moves in only one direction, forward, from the input nodes, through the hidden nodes (if any) and to the output nodes. There are no cycles or loops in the network.
Feedforward networks can be seen as efficient nonlinear function approximators based on using gradient descent to minimize the error in a function approximation.
The chain rule that underlies the back-propagation algorithm was invented in the seventeenth century. Calculus and algebra have long been used to solve optimization problems in closed form, but gradient descent was not introduced as a technique for iteratively approximating the solution to optimization problems until the nineteenth century.
Inspired by Hebb in the 1940s, these function approximation techniques were used to motivate machine learning models such as the perceptron. However, the earliest models were based on linear models.
Learning nonlinear functions required the development of a multilayer perceptron and a means of computing the gradient through such a model. Efficient applications of the chain rule based on dynamic programming began to appear in the 1960s and 1970s, mostly for control applications but also for sensitivity analysis. Werbos (1981) proposed applying these techniques to training artificial neural networks. The idea was finally developed in practice after being independently rediscovered in different ways (LeCun, 1985; Parker, 1985; Rumelhart et al., 1986a). The book Parallel Distributed Processing presented the results of some of the first successful experiments with back-propagation in a chapter (Rumelhart et al., 1986b) that contributed greatly to the popularization of back-propagation and initiated a very active period of research in multilayer neural networks.
The core ideas behind modern feedforward networks have not changed substantially since the 1980s. The same back-propagation algorithm and the same approaches to gradient descent are still in use.
A small number of algorithmic changes have also improved the performance of neural networks noticeably. One of these algorithmic changes was the replacement of mean squared error with the cross-entropy family of loss functions. Mean squared error was popular in the 1980s and 1990s but was gradually replaced by cross-entropy losses and the principle of maximum likelihood as ideas spread between the statistics community and the machine learning community. The use of cross-entropy losses greatly improved the performance of models with sigmoid and softmax outputs, which had previously suffered from saturation and slow learning when using the mean squared error loss.
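The saturation problem can be seen by comparing the gradient of each loss with respect to the pre-activation z of a single sigmoid output. The sketch below assumes a binary target y and the loss definitions given in the comments; the function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_mse(z, y):
    # d/dz of 0.5 * (sigmoid(z) - y)**2
    p = sigmoid(z)
    return (p - y) * p * (1.0 - p)

def grad_cross_entropy(z, y):
    # d/dz of -(y*log(p) + (1 - y)*log(1 - p)) with p = sigmoid(z)
    return sigmoid(z) - y

# A confidently wrong unit: output saturated near 0 while the target is 1.
z, y = -10.0, 1.0
print(grad_mse(z, y))            # tiny magnitude: learning stalls
print(grad_cross_entropy(z, y))  # magnitude close to 1: strong corrective signal
```

The extra factor p * (1 - p) in the mean squared error gradient vanishes when the sigmoid saturates. The logarithm in the cross-entropy loss cancels that factor, so the gradient stays large exactly when the prediction is badly wrong.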
The other major algorithmic change that has greatly improved the performance of feedforward networks was the replacement of sigmoid hidden units with piecewise linear hidden units, such as rectified linear units. As of the early 2000s, rectified linear units were avoided because of a somewhat superstitious belief that activation functions with nondifferentiable points must be avoided. This began to change in about 2009. Jarrett et al. (2009) observed that "using a rectifying nonlinearity is the single most important factor in improving the performance of a recognition system," among several different factors of neural network architecture design.
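A rectified linear unit and the gradient typically used for it can be sketched in a few lines. The choice to return 0 at the nondifferentiable point z = 0 is an assumption of this example; it is one common convention, not something the text specifies.

```python
def relu(z):
    # Two linear pieces: zero for z <= 0, identity for z > 0.
    return z if z > 0.0 else 0.0

def relu_grad(z):
    # The derivative does not exist at z == 0; this sketch returns 0 there.
    return 1.0 if z > 0.0 else 0.0

for z in (-2.0, 0.0, 3.0):
    print(z, relu(z), relu_grad(z))
```

In practice the input is almost never exactly zero, so the single nondifferentiable point rarely matters for gradient-based training.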
Today, gradient-based learning in feedforward networks is used as a tool to develop probabilistic models. Rather than being viewed as an unreliable technology that must be supported by other techniques, it has been viewed since 2012 as a powerful technology that can be applied to many other machine learning tasks.
In 2006, the community used unsupervised learning to support supervised learning, and now, ironically, it is more common to use supervised learning to support unsupervised learning.
Dropout is a powerful method of regularizing a broad family of models. To a first approximation, dropout can be thought of as a method of making bagging practical for ensembles of very many large neural networks.
Bagging involves training multiple models and evaluating multiple models on each test example.
Dropout provides an inexpensive approximation to training and evaluating a bagged ensemble of exponentially many neural networks.
Specifically, to train with dropout we use a minibatch-based learning algorithm that makes small steps, such as stochastic gradient descent.
Each time we load an example into a minibatch, we randomly sample a different binary mask to apply to all the input and hidden units in the network.
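The mask sampling step can be sketched as follows. The function `dropout_forward`, the `keep_prob` value of 0.5, and the toy activations are assumptions made for illustration; the text does not give specific probabilities.

```python
import random

def dropout_forward(units, keep_prob, rng):
    # Sample an independent binary mask entry for each unit, then multiply.
    mask = [1 if rng.random() < keep_prob else 0 for _ in units]
    return [u * m for u, m in zip(units, mask)], mask

rng = random.Random(0)
minibatch = [[0.2, 1.5, -0.7, 0.9], [1.1, -0.3, 0.4, 2.0]]

# A different mask is drawn for each example loaded into the minibatch.
dropped = [dropout_forward(example, keep_prob=0.5, rng=rng) for example in minibatch]
for units, mask in dropped:
    print(mask, units)
```

Each distinct mask selects a different sub-network that shares weights with all the others, which is what connects this procedure to training an ensemble.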
So far we have described dropout purely as a means of performing efficient, approximate bagging.
There is another view of dropout that goes further than this.
Because the mask changes from example to example, each hidden unit must perform well regardless of which other hidden units are present. Dropout thus regularizes each hidden unit to be not merely a good feature but a feature that is good in many contexts.
Warde-Farley et al. (2014) compared dropout training to training of large ensembles and concluded that dropout offers additional improvements to generalization error beyond those obtained by ensembles of independent models.