# Shortcut Connection

**Aliases**

Residual Connection, Skip Connection, Passthrough Connection, Identity Parametrization

**Intent**

Create a path to a higher layer to preserve information

**Motivation**

A strict layer hierarchy may be too restrictive or hide information that needs to be surfaced to higher layers.

**Sketch**

*This section provides alternative descriptions of the pattern in the form of an illustration or alternative formal expression. By looking at the sketch a reader may quickly understand the essence of the pattern.*

**Discussion**

The passthrough connection was first highlighted by the Residual Network (ResNet), which used it to achieve the highest accuracy in the 2015 ImageNet competition (ILSVRC 2015). This led to networks that were much deeper than their predecessors. Their distinguishing characteristic was connections from lower layers to higher layers that bypass one or more layers. Residual Networks and other networks that use this technique have been shown to achieve higher prediction accuracy. This suggests that deep networks need not maintain a strict hierarchy in which information is always truncated as it flows from a lower to a higher layer (i.e. upstream). Rather, letting information bypass layers has a clear benefit: providing more pathways for information flow appears to give networks greater flexibility to model the domain accurately.
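
The bypass described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the ResNet implementation: the helper names (`linear`, `relu`, `residual_block`) and the choice of a two-layer transform are assumptions made for the example.

```python
def linear(weights, x):
    """Multiply a weight matrix (a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def relu(x):
    """Element-wise rectifier."""
    return [max(0.0, v) for v in x]


def residual_block(x, w1, w2):
    """Return x + F(x), where F is linear -> ReLU -> linear.

    The `x +` term is the shortcut: the input bypasses both layers.
    """
    fx = linear(w2, relu(linear(w1, x)))
    return [xi + fi for xi, fi in zip(x, fx)]


x = [1.0, -2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]

# With all-zero weights F(x) is zero, so the block passes x through unchanged.
print(residual_block(x, zeros, zeros))
```

Because the input is added back unchanged, the bypassed layers only need to model a correction to `x`, and a block whose weights are zero behaves as the identity.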

**Known Uses**

*Here we review several projects or papers that have used this pattern.*

https://arxiv.org/pdf/1606.06582v1.pdf Augmenting Supervised Neural Networks with Unsupervised Objectives for Large-scale Image Classification

We proposed a simple and effective way to incorporate unsupervised objectives into large-scale classification network learning by augmenting the existing network with reconstructive decoding pathways. Using the resultant autoencoder for image reconstruction, we demonstrated the ability of preserving input information by intermediate representation as an important property of modern deep neural networks trained for large-scale image classification.

Residual

Highway

RNN/LSTM

**Related Patterns**

*In this section we describe in a diagram how this pattern is conceptually related to other patterns. The relationships may be precise or fuzzy, so we provide further explanation of the nature of each relationship. We also describe other patterns that may not be conceptually related but work well in combination with this pattern.*

*Relationship to Canonical Patterns*

*Relationship to other Patterns*

**Further Reading**

*We provide here some additional external material that will help in exploring this pattern in more detail.*

**References**

*To aid in reading, we include sources that are referenced in the text in the pattern.*

http://arxiv.org/pdf/1603.03116v2.pdf LOW-RANK PASSTHROUGH NEURAL NETWORKS

Passthrough networks can be defined as networks where the state transition function f has a special form such that, at each step t, the state vector x(t) (or a sub-vector x̂(t)) is propagated to the next step modified only by some (nearly) linear, element-wise transformations.
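
One way to picture such an element-wise state update is a gated mix of the old state and a proposed new state. The sketch below assumes Highway-style coupled gates, where the carry gate is one minus the transform gate. That coupling, and the names `passthrough_step`, `proposal` and `gate_logits`, are assumptions for illustration and not the paper's notation.

```python
import math


def sigmoid(v):
    """Logistic function, maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def passthrough_step(state, proposal, gate_logits):
    """One passthrough update: new = t * proposal + (1 - t) * state, element-wise.

    `proposal` stands in for whatever nonlinear function computes the
    candidate state; here it is simply supplied by the caller.
    """
    new_state = []
    for s, p, g in zip(state, proposal, gate_logits):
        t = sigmoid(g)  # transform gate in (0, 1)
        new_state.append(t * p + (1.0 - t) * s)
    return new_state


state = [2.0, -1.0]
proposal = [4.0, 5.0]

# Gate logit 0 gives t = 0.5, an equal mix of old state and proposal.
print(passthrough_step(state, proposal, [0.0, 0.0]))

# A strongly negative logit closes the gate, so the old state is carried over.
print(passthrough_step(state, proposal, [-50.0, -50.0]))
```

When the gate is closed the state is copied forward almost untouched, which is the property that lets information and gradients survive many steps.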

https://arxiv.org/abs/1610.10087v1 Tensor Switching Networks

The TS network copies its entire input vector to different locations in an expanded representation, with the location determined by its hidden unit activity. In this way, even a simple linear readout from the TS representation can implement a highly expressive deep-network-like function. The TS network hence avoids the vanishing gradient problem by construction, at the cost of larger representation size. We develop several methods to train the TS network, including equivalent kernels for infinitely wide and deep TS networks, a one-pass linear learning algorithm, and two backpropagation-inspired representation learning algorithms. Our experimental results demonstrate that the TS network is indeed more expressive and consistently learns faster than standard ReLU networks.

Figure caption: (Left) A single-hidden-layer standard (i.e. Scalar Switching) ReLU network. (Right) A single-hidden-layer Tensor Switching ReLU network, where each hidden unit conveys a vector of activities: inactive units (top-most unit) convey a vector of zeros while active units (bottom two units) convey a copy of their input.
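
The difference between the two kinds of unit can be sketched as follows. This is a toy illustration of the forward pass only, under the assumption that a unit is "active" when its pre-activation is positive. The function names and example weights are hypothetical.

```python
def pre_activations(weights, x):
    """Weighted sum of the input for every hidden unit."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def relu_layer(weights, x):
    """Standard (scalar switching) ReLU layer: each unit conveys one scalar."""
    return [max(0.0, a) for a in pre_activations(weights, x)]


def ts_layer(weights, x):
    """Tensor switching layer: an active unit conveys a copy of x,
    an inactive unit conveys a zero vector of the same length."""
    return [list(x) if a > 0 else [0.0] * len(x)
            for a in pre_activations(weights, x)]


x = [2.0, 3.0]
weights = [[-1.0, 0.0],   # inactive for this input
           [1.0, 0.0],    # active
           [0.0, 1.0]]    # active

print(relu_layer(weights, x))  # one scalar per hidden unit
print(ts_layer(weights, x))    # one input-sized vector per hidden unit
```

The expanded representation has (number of hidden units) x (input size) entries, which is the larger representation size the abstract mentions as the cost of the approach.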

http://openreview.net/pdf?id=SJZAb5cel A JOINT MANY-TASK MODEL: GROWING A NEURAL NETWORK FOR MULTIPLE NLP TASKS

We introduce such a joint many-task model together with a strategy for successively growing its depth to solve increasingly complex tasks. All layers include shortcut connections to both word representations and lower-level task predictions. We use a simple regularization term to allow for optimizing all model weights to improve one task’s loss without exhibiting catastrophic interference of the other tasks.

These results clearly show the importance of the shortcut connections in our JMT model; in particular, the semantic tasks in the higher layers strongly rely on the shortcut connections. That is, simply stacking the LSTM layers is not sufficient to handle a variety of NLP tasks in a single model.

https://arxiv.org/abs/1611.01186 Demystifying ResNet

We provide a theoretical explanation for the superb performance of ResNet via the study of deep linear networks and some nonlinear variants. We show that with or without nonlinearities, by adding shortcuts that have depth two, the condition number of the Hessian of the loss function at the zero initial point is depth-invariant, which makes training very deep models no more difficult than shallow ones. Shortcuts of higher depth result in an extremely flat (high-order) stationary point initially, from which the optimization algorithm is hard to escape. The 1-shortcut, however, is essentially equivalent to no shortcuts. Extensive experiments are provided accompanying our theoretical results. We show that **initializing the network to small weights with 2-shortcuts achieves significantly better results than random Gaussian (Xavier) initialization, orthogonal initialization, and shortcuts of deeper depth**, from various perspectives ranging from final loss, learning dynamics and stability, to the behavior of the Hessian along the learning process.

Figure caption: Equivalents of two extremes of n-shortcut linear networks. 1-shortcut linear networks are equivalent to linear networks with identity initialization, while skip-all shortcuts will only change the effective dataset outputs.

https://openreview.net/pdf?id=ryxB0Rtxx IDENTITY MATTERS IN DEEP LEARNING

An emerging design principle in deep learning is that each layer of a deep artificial neural network should be able to easily express the identity transformation. This idea not only motivated various normalization techniques, such as batch normalization, but was also key to the immense success of residual networks.

In this work, we put the principle of identity parameterization on a more solid theoretical footing alongside further empirical progress. We first give a strikingly simple proof that arbitrarily deep linear residual networks have no spurious local optima. The same result for feed-forward networks in their standard parameterization is substantially more delicate. Second, we show that residual networks with ReLu activations have universal finite-sample expressivity in the sense that the network can represent any function of its sample provided that the model has more parameters than the sample size.

https://www.semanticscholar.org/paper/Deep-Pyramidal-Residual-Networks-Han-Kim/5bdf07c9897ca70788fff61dec56178a2bd0c29c Deep Pyramidal Residual Networks

In this research, instead of using downsampling to achieve a sharp increase at each residual unit, we gradually increase the feature map dimension at all the units to involve as many locations as possible. This is discussed in depth together with our new insights as it has proven to be an effective design to improve the generalization ability. Furthermore, we propose a novel residual unit capable of further improving the classification accuracy with our new network architecture.

https://arxiv.org/pdf/1611.06612v1.pdf RefineNet: Multi-Path Refinement Networks with Identity Mappings for High-Resolution Semantic Segmentation

Repeated subsampling operations like pooling or convolution striding in deep CNNs lead to a significant decrease in the initial image resolution. Here, we present RefineNet, a generic multi-path refinement network that explicitly exploits all the information available along the down-sampling process to enable high-resolution prediction using long-range residual connections. In this way, the deeper layers that capture high-level semantic features can be directly refined using fine-grained features from earlier convolutions. The individual components of RefineNet employ residual connections following the identity mapping mindset, which allows for effective end-to-end training. Further, we introduce chained residual pooling, which captures rich background context in an efficient manner. We carry out comprehensive experiments and set new state-of-the-art results on seven public datasets. In particular, we achieve an intersection-over-union score of 83.4 on the challenging PASCAL VOC 2012 dataset, which is the best reported result to date.

https://www.semanticscholar.org/paper/Densely-Connected-Convolutional-Networks-Huang-Liu/b73ea8c4cab76ad02fa06568c6f289b6fe038e27 Densely Connected Convolutional Networks

We introduce the Dense Convolutional Network (DenseNet), where each layer is directly connected to every other layer in a feed-forward fashion. For each layer, the feature maps of all preceding layers are treated as separate inputs whereas its own feature maps are passed on as inputs to all subsequent layers. Our proposed connectivity pattern has several compelling advantages: it alleviates the vanishing gradient problem and strengthens feature propagation; despite the increase in connections, it encourages feature reuse and leads to a substantial reduction of parameters; its models tend to generalize surprisingly well. We evaluate our proposed architecture on five highly competitive object recognition benchmark tasks. The DenseNet obtains significant improvements over the state-of-the-art on all five of them (e.g., yielding 3.74% test error on CIFAR-10, 19.25% on CIFAR-100 and 1.59% on SVHN).
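
The connectivity pattern can be sketched with plain lists standing in for feature maps. This is an illustration of the wiring only: `toy_layer` is a hypothetical stand-in for a real convolutional layer, and `dense_block` is a name chosen for this example.

```python
def dense_block(x, layers):
    """Feed each layer the concatenation of the input and all earlier outputs."""
    features = [x]
    for layer in layers:
        concatenated = [v for f in features for v in f]
        features.append(layer(concatenated))
    # The block output is again the concatenation of everything produced so far.
    return [v for f in features for v in f]


def toy_layer(inputs):
    """Hypothetical layer that emits a single feature: the sum of its inputs."""
    return [sum(inputs)]


# Layer 1 sees [1, 2]; layer 2 sees [1, 2, 3]; layer 3 sees [1, 2, 3, 6].
print(dense_block([1.0, 2.0], [toy_layer, toy_layer, toy_layer]))
# -> [1.0, 2.0, 3.0, 6.0, 12.0]
```

Each layer adds only a few new features while still seeing everything computed before it, which is how the design can reuse features and keep the parameter count low.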

https://openreview.net/pdf?id=r1Ue8Hcxg NEURAL ARCHITECTURE SEARCH WITH REINFORCEMENT LEARNING

In our framework, if one layer has many input layers then all input layers are concatenated in the depth dimension. Skip connections can cause "compilation failures" where one layer is not compatible with another layer, or one layer may not have any input or output. To circumvent these issues, we employ three simple techniques. First, if a layer is not connected to any input layer then the image is used as the input layer. Second, at the final layer we take all layer outputs that have not been connected and concatenate them before sending this final hidden state to the classifier. Lastly, if input layers to be concatenated have different sizes, we pad the small layers with zeros so that the concatenated layers have the same sizes.
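
The third technique, zero-padding before concatenation, might look like the sketch below. It makes several simplifying assumptions that the quoted text does not specify: each layer output is a list of one-dimensional channels, padding is appended at the end of each channel, and the name `pad_and_concat` is hypothetical.

```python
def pad_and_concat(layers):
    """Pad every channel with trailing zeros to the longest channel length,
    then concatenate all channels along the depth (channel) dimension."""
    target = max(len(channel) for layer in layers for channel in layer)
    merged = []
    for layer in layers:
        for channel in layer:
            merged.append(list(channel) + [0.0] * (target - len(channel)))
    return merged


wide = [[1.0, 2.0, 3.0, 4.0]]        # one channel of length 4
narrow = [[5.0, 6.0], [7.0, 8.0]]    # two channels of length 2

for channel in pad_and_concat([wide, narrow]):
    print(channel)
```

After padding, all three channels have length 4 and can be stacked into a single input for the next layer.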

https://www.semanticscholar.org/paper/U-Net-Convolutional-Networks-for-Biomedical-Image-Ronneberger-Fischer/07045f87709d0b7b998794e9fa912c0aba912281 U-Net: Convolutional Networks for Biomedical Image Segmentation

https://arxiv.org/abs/1612.06851 Beyond Skip Connections: Top-Down Modulation for Object Detection

https://arxiv.org/abs/1702.08591v1 The Shattered Gradients Problem: If resnets are the answer, then what is the question?

A long-standing obstacle to progress in deep learning is the problem of vanishing and exploding gradients. The problem has largely been overcome through the introduction of carefully constructed initializations and batch normalization. Nevertheless, architectures incorporating skip-connections such as resnets perform much better than standard feedforward architectures despite well-chosen initialization and batch normalization. In this paper, we identify the shattered gradients problem. Specifically, we show that the correlation between gradients in standard feedforward networks decays exponentially with depth, resulting in gradients that resemble white noise. In contrast, the gradients in architectures with skip-connections are far more resistant to shattering, decaying sublinearly. Detailed empirical evidence is presented in support of the analysis, on both fully-connected networks and convnets. Finally, we present a new "looks linear" (LL) initialization that prevents shattering. Preliminary experiments show the new initialization allows training very deep networks without the addition of skip-connections.

https://arxiv.org/pdf/1703.06408v1.pdf Multilevel Context Representation for Improving Object Recognition

This paper postulates that the use of context closer to the high-level layers provides the scale and translation invariance and works better than using the top layer only.

Also, it is shown that at almost no additional cost, the relative error rates of the original networks decrease by up to 2%. This fact makes the extended networks a very well suited choice for usage in production environments. The quantitative evaluation signifies that the new approach could be, at inference time, 144 times more efficient than the current approaches while maintaining comparable performance.

Unlike most CNNs, including AlexNet and GoogLeNet, the proposed networks feed the classification part of the network with information not only from the highest-level convolutional layer, but with information from the two highest-level convolutional layers. We call the enhanced versions of these networks AlexNet++ and GoogLeNet++.

https://arxiv.org/abs/1705.07485 https://github.com/xgastaldi/shake-shake Shake-Shake regularization

The method introduced in this paper aims at helping deep learning practitioners faced with an overfit problem. The idea is to replace, in a multi-branch network, the standard summation of parallel branches with a stochastic affine combination. Applied to 3-branch residual networks, shake-shake regularization improves on the best single shot published results on CIFAR-10 and CIFAR-100 by reaching test errors of 2.86% and 15.85%. Experiments on architectures without skip connections or Batch Normalization show encouraging results and open the door to a large set of applications. Code is available at https://github.com/xgastaldi/shake-shake.
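
A stochastic affine combination of two residual branches can be sketched as below. This covers the forward pass only and is not the linked implementation: the function name `shake_combine`, the uniform sampling of `alpha`, and the use of the expected value 0.5 at evaluation time are assumptions made for the example.

```python
import random


def shake_combine(skip, branch_a, branch_b, training, rng=random):
    """Combine a skip connection with two branch outputs.

    Training: skip + alpha * a + (1 - alpha) * b, with alpha drawn from [0, 1).
    Evaluation: alpha is fixed at 0.5, the mean of the training distribution.
    """
    alpha = rng.random() if training else 0.5
    return [s + alpha * a + (1.0 - alpha) * b
            for s, a, b in zip(skip, branch_a, branch_b)]


skip, branch_a, branch_b = [1.0], [2.0], [4.0]

# Evaluation mode is deterministic: 1 + 0.5 * 2 + 0.5 * 4.
print(shake_combine(skip, branch_a, branch_b, training=False))

# Training mode gives a different convex mix of the branches on each call.
print(shake_combine(skip, branch_a, branch_b, training=True, rng=random.Random(0)))
```

A plain residual sum would add both branches with fixed weights. Randomizing the weights during training perturbs the branch contributions while the skip path itself stays untouched.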

https://arxiv.org/abs/1706.05744 Learning Hierarchical Information Flow with Recurrent Neural Modules

We propose a deep learning model inspired by neocortical communication via the thalamus. Our model consists of recurrent neural modules that send features via a routing center, endowing the modules with the flexibility to share features over multiple time steps. We show that our model learns to route information hierarchically, processing input data by a chain of modules. We observe common architectures, such as feed forward neural networks and skip connections, emerging as special cases of our architecture, while novel connectivity patterns are learned for the text8 compression task. We demonstrate that our model outperforms standard recurrent neural networks on three sequential benchmarks.