Representational bottlenecks in deep learning
In a Sequential model, each successive representation layer is built on top of the previous one, which means it only has access to the information contained in the activations of that previous layer. If any one layer is too small (for example, its features are too low-dimensional), then the model is constrained by how much information can be crammed into the activations of this layer: that layer becomes a representational bottleneck.
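As a toy NumPy illustration of the bottleneck (my own sketch, using purely linear layers for simplicity): once activations pass through a 2-unit layer, the end-to-end map has rank at most 2, so no downstream layer, however large, can restore the dropped dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

# A 10-d input squeezed through a 2-unit linear "bottleneck" layer,
# then expanded back to 10 dimensions by a downstream layer.
W1 = rng.normal(size=(2, 10))   # bottleneck layer: 10 -> 2
W2 = rng.normal(size=(10, 2))   # downstream layer: 2 -> 10

combined = W2 @ W1              # the end-to-end linear map, shape (10, 10)

# The composed map has rank at most 2: eight dimensions of the input
# are irrecoverably lost at the bottleneck.
print(np.linalg.matrix_rank(combined))  # 2
```

Nonlinear activations don't change the basic picture: whatever information the 2-unit layer fails to encode is gone for good.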
We can grasp this with a signal-processing analogy: if we have an audio-processing pipeline that consists of a series of operations, each of which takes as input the output of the previous operation, then if one operation crops our signal to a low-frequency range (for example, 0–15 kHz), the operations downstream will never be able to recover the dropped frequencies. Any loss of information is permanent. Residual connections, by reinjecting earlier information downstream, partially solve this issue for deep-learning models.
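As a minimal sketch of the mechanism (a toy NumPy stand-in, not the Keras API), a residual connection simply adds a layer's unmodified input back to its output, so downstream layers see the earlier information intact alongside the transformed signal:

```python
import numpy as np

def layer(x, w):
    """A toy dense layer with ReLU; stands in for any transformation."""
    return np.maximum(0.0, x @ w)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))           # a batch of 16-d activations
w = rng.normal(size=(16, 16)) * 0.1    # illustrative small weights

y_plain = layer(x, w)        # downstream only sees the transformed signal
y_residual = layer(x, w) + x # residual: earlier information reinjected as-is
```

Note that the addition requires the layer's output to have the same shape as its input; when shapes differ, a linear projection of `x` is typically added instead.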
Backpropagation, the master algorithm used to train deep neural networks, works by propagating a feedback signal from the output loss down to earlier layers. If this feedback signal has to be propagated through a deep stack of layers, the signal may become tenuous or even be lost entirely, rendering the network untrainable. This issue is known as vanishing gradients. The problem occurs both with deep networks and with recurrent networks over very long sequences; in both cases, a feedback signal must be propagated through a long series of operations. We're already familiar with how the LSTM layer addresses this problem in recurrent networks: it introduces a carry track that propagates information parallel to the main processing track. Residual connections work in a similar way in feedforward deep networks, but they're even simpler: they introduce a purely linear information carry track parallel to the main layer stack, thus helping to propagate gradients through arbitrarily deep stacks of layers.
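A scalar caricature of why the carry track helps (the numbers below are illustrative assumptions, not measurements): if each layer's transformation contributes a local derivative d < 1, a plain stack multiplies those small factors together, while the identity path of a residual connection turns each factor into 1 + d, which can never vanish.

```python
depth = 50   # number of stacked layers (illustrative)
d = 0.25     # assumed local derivative of each layer's transformation

grad_plain = d ** depth           # plain stack: product of small factors
grad_residual = (1 + d) ** depth  # with a linear carry track: each factor is 1 + d

print(grad_plain)     # vanishes to numerical dust (~1e-30)
print(grad_residual)  # stays well above zero
```

The derivative of x + f(x) with respect to x is 1 + f'(x), so even when f'(x) is tiny, the feedback signal reaching earlier layers survives.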
The vanishing gradients problem is one example of the unstable behavior we may encounter when training a deep neural network. It describes the situation in which a deep multilayer feedforward network or a recurrent neural network is unable to propagate useful gradient information from the output end of the model back to the layers near the input end. The result is the general inability of models with many layers to learn on a given dataset, or their tendency to prematurely converge to a poor solution.
Many fixes and workarounds have been proposed and investigated, including alternate weight initialization schemes, unsupervised pre-training, layer-wise training, and variations on gradient descent. Perhaps the most common change is the use of the rectified linear activation function (ReLU), which has become the new default, replacing the hyperbolic tangent activation function that was the default through the late 1990s and 2000s.
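A quick numeric comparison (my own illustration) shows why this swap matters for gradient flow: tanh saturates for large activations, squashing its local derivative toward zero, while ReLU passes a derivative of exactly 1 for any positive input.

```python
import numpy as np

def tanh_grad(x):
    # derivative of tanh: 1 - tanh(x)^2, which saturates for |x| >> 1
    return 1.0 - np.tanh(x) ** 2

def relu_grad(x):
    # derivative of ReLU: 1 for positive inputs, 0 otherwise
    return float(x > 0)

x = 3.0
print(tanh_grad(x))  # ~0.0099: the feedback signal is squashed
print(relu_grad(x))  # 1.0: the gradient passes through unchanged
```

Multiplied across many layers, those near-zero tanh derivatives are exactly what drives the vanishing-gradients behavior described above.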