### Introduction

In a Sequential model, each successive representation layer is made on top of the previous one, which suggests it only has access to the knowledge contained within the activation of the previous layer. If one layer is just too small (for example, its features are too low-dimensional), then the model is going to be constrained by what proportion of information is often crammed into the activations of this layer.

### Description

We can grasp this idea with a signal-processing analogy: if we have an audio-processing pipeline that consists of a series of operations, each of which takes as input the output of the previous operation, then if one operation crops our signal to a low-frequency range (for example, 0–15 kHz), the operations downstream will never be ready to recover the dropped frequencies. Any loss of data is permanent. Residual connections, by reinjecting earlier information downstream, partially solve this issue for deep-learning models.

### Vanishing gradients

Backpropagation, the master algorithm wont to train deep neural networks, works by propagating a feedback signal from the output loss right down to earlier layers. If this feedback signal has got to be propagated through a deep stack of layers, the signal may become tenuous or maybe be lost entirely, rendering the network untrainable. This issue is understood as vanishing gradients. This problem occurs both with deep networks and with recurrent networks over very long sequences—in both cases, a feedback signal must be propagated through an extended series of operations. We’re already conversant in the answer that the LSTM layer uses to deal with this problem in recurrent networks: it introduces a carry track that propagates information parallel to the most processing track. Residual connections add an identical way in feedforward deep networks, but they’re even simpler: they introduce purely linear information-carrying track parallel to the most layer stack, thus helping to propagate gradients through arbitrarily deep stacks of layers.

The vanishing gradients problem is one example of unstable behavior that you simply may encounter when training a deep neural network. It describes things where a deep multilayer feed-forward network or a recurrent neural network is unable to propagate useful gradient information from the output end of the model back to the layers near the input end of the model. The result’s the overall inability of models with many layers to find out on a given dataset, or for models with many layers to prematurely converge to a poor solution.

Many fixes and workarounds are proposed and investigated, like alternate weight initialization schemes, unsupervised pre-training, layer-wise training, and variations on gradient descent. Perhaps the foremost common change is that the use of the rectified linear activation function that has become the new default, rather than the hyperbolic tangent activation function that was the default through the late 1990s and 2000s.