Residual Connection / Layer

Intent

Train a network to approximate the difference from the original input rather than a complete mapping.
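The intent can be sketched in a few lines of NumPy. This is a minimal illustration rather than the block from any particular paper: the two-layer residual function, the layer sizes, and the names `residual_block`, `w1` and `w2` are assumptions made for the example.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, w1, w2):
    """Return x + F(x), where F is a small two-layer transformation.

    The weights only model the residual F(x) = H(x) - x,
    not the complete mapping H(x).
    """
    residual = relu(x @ w1) @ w2   # F(x)
    return x + residual            # identity shortcut

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))

# With zero weights F(x) = 0, so the block is exactly the identity mapping.
y_identity = residual_block(x, np.zeros((8, 16)), np.zeros((16, 8)))

# With small random weights the output is a small correction to the input.
w1 = 0.01 * rng.normal(size=(8, 16))
w2 = 0.01 * rng.normal(size=(16, 8))
y_small = residual_block(x, w1, w2)
```

Because the shortcut carries `x` unchanged, a block that has learned nothing still passes its input through, which is why stacking many such blocks does not have to degrade the signal.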

Problem

We would like to create extremely deep networks to improve the accuracy of their predictions.

Structure

**Discussion**

Residual Networks are important because (1) they have shown superior performance on ImageNet and (2) they have shown that extremely deep neural networks can be trained. The first result indicates the value of passthrough network elements. The second also has ramifications for recurrent networks, because RNNs are implicitly deep.

The first concept to understand is the notion of hierarchical composition in Deep Learning. Deep Learning achieves higher expressiveness through a hierarchy of layers. It is assumed that each layer of the network creates higher and higher abstractions. For example, in an image-processing neural network, the bottom layer may recognize simple lines. The next layer recognizes compositions of these lines, and higher layers begin to recognize much higher-level features such as eyes and noses.

The problem with strict hierarchical composition is the strong assumption that each layer needs only the information from the layer immediately below it. However, a layer may need information not only from the previous layer but also from many of the other layers it is stacked on top of. To minimize the loss of information from the lower layers, we add passthrough routing so that layers receive more detailed information rather than just abstract information.

Experimentally, several residual layers together seem to perform the kind of recognition that a single layer performs in a more conventional network. The intuition for why a residual network works better is that it preserves information across layers. This is similar to the intuition for why ReLU works so well: unlike the sigmoid and tanh activation functions, which saturate, ReLU has a linear regime that preserves more information.

References

https://arxiv.org/abs/1512.03385 Deep Residual Learning for Image Recognition

Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers.

https://arxiv.org/abs/1604.03640 Bridging the Gaps Between Residual Learning, Recurrent Neural Networks and Visual Cortex

We begin with the observation that a shallow RNN is exactly equivalent to a very deep ResNet with weight sharing among the layers.

A ResNet can be reformulated into a recurrent form that is almost identical to a conventional RNN.
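The equivalence can be checked numerically on a toy case. The sketch below assumes a simple update `h + tanh(h @ w)` and feeds the input only as the initial state; both choices, and the function names, are illustrative and are not taken from the paper.

```python
import numpy as np

def resnet_forward(x, layer_weights):
    """A stack of residual layers, one weight matrix per layer."""
    h = x
    for w in layer_weights:
        h = h + np.tanh(h @ w)
    return h

def rnn_forward(h0, w, steps):
    """A single recurrent layer applied for `steps` time steps."""
    h = h0
    for _ in range(steps):
        h = h + np.tanh(h @ w)
    return h

rng = np.random.default_rng(0)
w = 0.1 * rng.normal(size=(6, 6))
x = rng.normal(size=(3, 6))

depth = 10
deep_output = resnet_forward(x, [w] * depth)   # 10 layers sharing one weight matrix
recurrent_output = rnn_forward(x, w, depth)    # 1 layer unrolled for 10 steps
```

Under these assumptions the two outputs coincide: sharing weights across depth turns the layer index into a time index.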

http://arxiv.org/abs/1605.06431 Residual networks are not single ultra-deep networks, but instead comprise implicit ensembles of exponentially many networks. Likewise, we introduce the notion of multiplicity, which captures the size of the implicit ensemble.

Residual networks behave just like ensembles at test time.

The implicit ensembles mostly consist of networks that are each individually relatively shallow.
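For purely linear residual blocks the path decomposition is exact and easy to verify. The following sketch assumes blocks of the form `h + W h`; real residual blocks are nonlinear, so this illustrates the counting argument and is not a reproduction of the paper's experiments.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
dim, n_blocks = 5, 3
weights = [0.5 * rng.normal(size=(dim, dim)) for _ in range(n_blocks)]
x = rng.normal(size=dim)

# Standard forward pass: h <- h + W h for each block.
h = x
for w in weights:
    h = h + w @ h

# Unraveled view: every block is either skipped (0) or used (1),
# which gives 2**n_blocks paths whose outputs are summed.
paths = list(product([0, 1], repeat=n_blocks))
path_sum = np.zeros(dim)
for choice in paths:
    p = x
    for use_block, w in zip(choice, weights):
        if use_block:
            p = w @ p
    path_sum += p

# Number of blocks each path actually passes through.
path_lengths = [sum(choice) for choice in paths]
```

With 3 blocks there are 8 paths, and most of them pass through only 1 or 2 blocks, which is the sense in which the implicit ensemble members are relatively shallow.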

http://arxiv.org/pdf/1606.05262v1.pdf Convolutional Residual Memory Networks

http://kaiminghe.com/icml16tutorial/icml2016_tutorial_deep_residual_networks_kaiminghe.pdf

https://github.com/KaimingHe/deep-residual-networks

http://arxiv.org/abs/1602.07261v2 Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning

Here we give clear empirical evidence that training with **residual connections accelerates the training of Inception networks significantly.** There is also some evidence of residual Inception networks outperforming similarly expensive Inception networks without residual connections by a thin margin. We also present several new streamlined architectures for both residual and non-residual Inception networks. These variations improve the single-frame recognition performance on the ILSVRC 2012 classification task significantly. We further demonstrate how proper activation scaling stabilizes the training of very wide residual Inception networks.

We found that scaling down the residuals before adding them to the previous layer activation seemed to stabilize the training. In general we picked some scaling factors between 0.1 and 0.3 to scale the residuals before their being added to the accumulated layer activations.
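A minimal sketch of the scaling step described above, assuming a single ReLU layer as the residual branch. The names `residual_branch` and `scaled_residual_block` and the default `scale=0.2` are illustrative; the quote only states that factors between 0.1 and 0.3 were used.

```python
import numpy as np

def residual_branch(x, w):
    """Stand-in for the residual function; a single ReLU layer here."""
    return np.maximum(x @ w, 0.0)

def scaled_residual_block(x, w, scale=0.2):
    """Add a scaled-down residual to the incoming activation."""
    return x + scale * residual_branch(x, w)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 32))
w = rng.normal(size=(32, 32)) / np.sqrt(32)

y_unscaled = scaled_residual_block(x, w, scale=1.0)
y_scaled = scaled_residual_block(x, w, scale=0.2)

# The scaled block changes the activation by one fifth as much as the unscaled one.
update_ratio = np.linalg.norm(y_scaled - x) / np.linalg.norm(y_unscaled - x)
```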

https://arxiv.org/pdf/1609.05672v1.pdf Multi-Residual Networks

The multi-residual networks increase the number of residual functions in the residual blocks. This is shown to improve the accuracy of the residual network when the network is deeper than a threshold.

A residual block (left) versus a multi-residual block (right).
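In code, the difference between the two blocks is only how many residual functions are added to the shortcut. This sketch assumes two-layer residual functions and hypothetical helper names; it is not the architecture from the paper.

```python
import numpy as np

def make_residual_function(w1, w2):
    """Build one residual function F_i(x) = relu(x w1) w2."""
    def f(x):
        return np.maximum(x @ w1, 0.0) @ w2
    return f

def multi_residual_block(x, functions):
    """y = x + F_1(x) + ... + F_k(x); k = 1 gives an ordinary residual block."""
    out = x.copy()
    for f in functions:
        out = out + f(x)
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8))
f1 = make_residual_function(0.1 * rng.normal(size=(8, 8)), 0.1 * rng.normal(size=(8, 8)))
f2 = make_residual_function(0.1 * rng.normal(size=(8, 8)), 0.1 * rng.normal(size=(8, 8)))

y_single = multi_residual_block(x, [f1])       # residual block
y_multi = multi_residual_block(x, [f1, f2])    # multi-residual block with k = 2
```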

http://openreview.net/pdf?id=Sk8csP5ex The Loss Surface of Residual Networks: Ensembles & the Role of Batch Normalization

Ensembles are a powerful model for ResNets, which unravels some of the key questions that have surrounded ResNets since their introduction. Here, we show that ResNets display a dynamic ensemble behavior, which explains the ease of training such networks even at very large depths, while still maintaining the advantage of depth. As far as we know, the dynamic behavior of the effective capacity is unlike anything documented in the deep learning literature. Surprisingly, the dynamic mechanism typically takes place within the outer multiplicative factor of the batch normalization module.

https://blogs.princeton.edu/imabandit/2016/11/13/geometry-of-linearized-neural-networks/

https://arxiv.org/abs/1611.01186 Demystifying ResNet

We provide a theoretical explanation for the superb performance of ResNet via the study of deep linear networks and some nonlinear variants. We show that with or without nonlinearities, by adding shortcuts that have depth two, the condition number of the Hessian of the loss function at the zero initial point is depth-invariant, which makes training very deep models no more difficult than shallow ones. Shortcuts of higher depth result in an extremely flat (high-order) stationary point initially, from which the optimization algorithm is hard to escape. The 1-shortcut, however, is essentially equivalent to no shortcuts. Extensive experiments are provided accompanying our theoretical results. We show that initializing the network to small weights with 2-shortcuts achieves significantly better results than random Gaussian (Xavier) initialization, orthogonal initialization, and shortcuts of deeper depth, from various perspectives ranging from final loss, learning dynamics and stability, to the behavior of the Hessian along the learning process.

Equivalents of two extremes of n-shortcut linear networks. 1-shortcut linear networks are equivalent to linear networks with identity initialization, while skip-all shortcuts will only change the effective dataset outputs.
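The 1-shortcut case in the caption can be verified directly: a linear layer with a shortcut around it computes `h + h W = h (I + W)`, so it is a plain linear layer whose weight is `I + W`. The sketch below uses hypothetical function names and small random weights to check this.

```python
import numpy as np

def one_shortcut_linear(x, weights):
    """Linear network with an identity shortcut around every single layer."""
    h = x
    for w in weights:
        h = h + h @ w
    return h

def plain_linear(x, weights):
    """Linear network without shortcuts."""
    h = x
    for w in weights:
        h = h @ w
    return h

rng = np.random.default_rng(0)
dim, depth = 4, 6
weights = [0.1 * rng.normal(size=(dim, dim)) for _ in range(depth)]
x = rng.normal(size=(3, dim))

with_shortcuts = one_shortcut_linear(x, weights)
shifted = plain_linear(x, [np.eye(dim) + w for w in weights])

# At zero initialization the shortcut network is the identity map,
# which is what a plain network with identity-initialized weights computes.
zero_weights = [np.zeros((dim, dim)) for _ in range(depth)]
at_zero_init = one_shortcut_linear(x, zero_weights)
```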

https://arxiv.org/abs/1605.07146v2 Wide Residual Networks

We demonstrate that even a simple 16-layer-deep wide residual network outperforms in accuracy and efficiency all previous deep residual networks, including thousand-layer-deep networks, achieving new state-of-the-art results on CIFAR, SVHN, COCO, and significant improvements on ImageNet.

https://arxiv.org/abs/1611.05431 Aggregated Residual Transformations for Deep Neural Networks

We present a simple, highly modularized network architecture for image classification. Our network is constructed by repeating a building block that aggregates a set of transformations with the same topology. Our simple design results in a homogeneous, multi-branch architecture that has only a few hyper-parameters to set. This strategy exposes a new dimension, which we call “cardinality” (the size of the set of transformations), as an essential factor in addition to the dimensions of depth and width. On the ImageNet-1K dataset, we empirically show that even under the restricted condition of maintaining complexity, increasing cardinality is able to improve classification accuracy. Moreover, increasing cardinality is more effective than going deeper or wider when we increase the capacity. Our models, codenamed ResNeXt, are the foundations of our entry to the ILSVRC 2016 classification task in which we secured 2nd place. We further investigate ResNeXt on an ImageNet-5K set and the COCO detection set, also showing better results than its ResNet counterpart.
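The aggregation can be sketched as a residual block whose residual is the sum of several branches of identical shape. The example assumes three-layer fully connected branches (reduce, transform, expand) and made-up sizes. It also checks that the branch sum equals one wide layer whose middle weight matrix is block-diagonal, the fully connected analogue of a grouped convolution.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def aggregated_block(x, branches):
    """y = x + sum_i T_i(x), where every T_i has the same topology."""
    out = x.copy()
    for w_in, w_mid, w_out in branches:
        out = out + relu(relu(x @ w_in) @ w_mid) @ w_out
    return out

rng = np.random.default_rng(0)
dim, width, cardinality = 16, 4, 8   # hypothetical sizes
branches = [
    (0.1 * rng.normal(size=(dim, width)),
     0.1 * rng.normal(size=(width, width)),
     0.1 * rng.normal(size=(width, dim)))
    for _ in range(cardinality)
]
x = rng.normal(size=(2, dim))
y = aggregated_block(x, branches)

# Equivalent grouped form: concatenate the first and last layers and
# place the per-branch middle layers on the diagonal of one big matrix.
w_in_all = np.concatenate([b[0] for b in branches], axis=1)
w_out_all = np.concatenate([b[2] for b in branches], axis=0)
w_mid_grouped = np.zeros((cardinality * width, cardinality * width))
for i, (_, w_mid, _) in enumerate(branches):
    s = slice(i * width, (i + 1) * width)
    w_mid_grouped[s, s] = w_mid
y_grouped = x + relu(relu(x @ w_in_all) @ w_mid_grouped) @ w_out_all
```

In this form cardinality is the number of diagonal blocks, which can be varied independently of the depth and width of the network.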

https://arxiv.org/pdf/1612.07771v1.pdf Highway and Residual Networks learn Unrolled Iterative Estimation

While depth of representation has been posited as a primary reason for their success, there are indications that these architectures defy a popular view of deep learning as a hierarchical computation of increasingly abstract features at each layer.

In this report, we argue that this view is incomplete and does not adequately explain several recent findings. We propose an alternative viewpoint based on unrolled iterative estimation—a group of successive layers iteratively refine their estimates of the same features instead of computing an entirely new representation. We demonstrate that this viewpoint directly leads to the construction of Highway and Residual networks. Finally we provide preliminary experiments to discuss the similarities and differences between the two architectures.

This paper offers a new perspective on Highway and Residual networks as performing unrolled iterative estimation. As an extension of the popular representation view, it stands in contrast to the optimization perspective from which these architectures have originally been introduced. According to the new view, successive layers (within a stage) cooperate to compute a single level of representation. Therefore, the first layer already computes a rough estimate of that representation, which is then iteratively refined by the successive layers. Unlike layers in a conventional neural network, which each compute a new representation, these layers therefore preserve feature identity.

... unified theory from which these architectures can be understood as two approaches to the same problem. This view further provides a framework from which to understand several surprising recent findings like resilience to lesioning, benefits of layer dropout, and the mild negative effects of layer reshuffling. Together with the derivations these results serve as compelling evidence for the validity of our new perspective.

We found non-gated identity skip-connections to perform significantly worse, and offered a possible explanation: If the task requires dynamically replacing individual features, then the use of gating is beneficial.
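The two layer types differ only in how the input is combined with the new features. The sketch below uses the standard highway formulation with a coupled carry gate, `T(x) * H(x) + (1 - T(x)) * x`; the tanh feature function, the sizes and the bias values are assumptions chosen to make the gating behaviour visible.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, w_h, w_t, b_t):
    """Gated skip connection: the gate decides per feature how much to replace."""
    h = np.tanh(x @ w_h)          # candidate features H(x)
    t = sigmoid(x @ w_t + b_t)    # transform gate T(x), values in (0, 1)
    return t * h + (1.0 - t) * x

def residual_layer(x, w_h):
    """Non-gated identity skip connection: new features are always added."""
    return x + np.tanh(x @ w_h)

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 5))
w_h = 0.1 * rng.normal(size=(5, 5))
w_t = 0.1 * rng.normal(size=(5, 5))

carried = highway_layer(x, w_h, w_t, b_t=-50.0)    # gate closed: input is carried through
replaced = highway_layer(x, w_h, w_t, b_t=50.0)    # gate open: features are replaced by H(x)
added = residual_layer(x, w_h)                     # residual: x is kept and H(x) is added
```

A residual layer can only add to a feature, whereas an open highway gate discards the old value, which is the dynamic replacement the quote refers to.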

https://arxiv.org/abs/1702.08591v1 The Shattered Gradients Problem: If resnets are the answer, then what is the question?

A long-standing obstacle to progress in deep learning is the problem of vanishing and exploding gradients. The problem has largely been overcome through the introduction of carefully constructed initializations and batch normalization. Nevertheless, architectures incorporating skip-connections such as resnets perform much better than standard feedforward architectures despite well-chosen initialization and batch normalization. In this paper, we identify the shattered gradients problem. Specifically, we show that the correlation between gradients in standard feedforward networks decays exponentially with depth resulting in gradients that resemble white noise. In contrast, the gradients in architectures with skip-connections are far more resistant to shattering, decaying sublinearly. Detailed empirical evidence is presented in support of the analysis, on both fully-connected networks and convnets. Finally, we present a new “looks linear” (LL) initialization that prevents shattering. Preliminary experiments show the new initialization allows to train very deep networks without the addition of skip-connections.
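The idea behind a "looks linear" initialization can be sketched with concatenated rectifiers and mirrored weights. The construction below is an assumption about how such an initialization can be built, since the abstract does not spell it out: each layer applies `[relu(x), relu(-x)]` and starts with weights `[W; -W]`, so the rectifiers cancel and the whole network is a linear map at initialization.

```python
import numpy as np

def crelu(x):
    """Concatenated rectifier: relu(x) and relu(-x) stacked along the feature axis."""
    return np.concatenate([np.maximum(x, 0.0), np.maximum(-x, 0.0)], axis=-1)

def looks_linear_weights(w):
    """Mirror w so that crelu(x) @ [w; -w] equals x @ w."""
    return np.concatenate([w, -w], axis=0)

def ll_network(x, base_weights):
    """Forward pass of a CReLU network with mirrored initial weights."""
    h = x
    for w in base_weights:
        h = crelu(h) @ looks_linear_weights(w)
    return h

rng = np.random.default_rng(0)
dim, depth = 6, 20
base_weights = [rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(depth)]
x = rng.normal(size=(4, dim))

nonlinear_output = ll_network(x, base_weights)

# Reference: the same base weights applied as a purely linear network.
linear_output = x
for w in base_weights:
    linear_output = linear_output @ w
```

The network still contains rectifiers, so it becomes nonlinear once training moves the two halves of each weight matrix apart.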

https://arxiv.org/abs/1703.06846v1 Boosting Dilated Convolutional Networks with Mixed Tensor Decompositions

In this paper we study the expressive efficiency brought forth by the architectural feature of connectivity, motivated by the observation that nearly all state of the art networks these days employ elaborate connection schemes, running layers in parallel while splitting and merging them in various ways. A formal treatment of this question would shed light on the effectiveness of modern connectivity schemes, and in addition, could provide new tools for network design. We focus on dilated convolutional networks, a family of deep models gaining increased attention, underlying state of the art architectures like Google's WaveNet and ByteNet. By introducing and studying the concept of mixed tensor decompositions, we prove that interconnecting dilated convolutional networks can lead to expressive efficiency. In particular, we show that a single connection between intermediate layers can already lead to an almost quadratic gap, which in large-scale settings typically makes the difference between a model that is practical and one that is not.

https://arxiv.org/pdf/1710.10348v1.pdf Multi-level Residual Networks from Dynamical Systems View

https://arxiv.org/pdf/1709.01507.pdf Squeeze-and-Excitation Networks

In this work, we focus on channels and propose a novel architectural unit, which we term the “Squeeze-and-Excitation”(SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels.
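A minimal sketch of the recalibration on a single feature map, assuming global average pooling for the squeeze and two small fully connected layers for the excitation. Biases are omitted, and the sizes and reduction ratio are made up for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_block(features, w_reduce, w_expand):
    """Recalibrate a (channels, height, width) feature map channel by channel."""
    squeezed = features.mean(axis=(1, 2))            # squeeze: one number per channel
    hidden = np.maximum(w_reduce @ squeezed, 0.0)    # excitation: reduce and apply ReLU
    gates = sigmoid(w_expand @ hidden)               # excitation: expand to gates in (0, 1)
    return features * gates[:, None, None], gates    # rescale every channel by its gate

rng = np.random.default_rng(0)
channels, reduction = 8, 4                           # hypothetical sizes
features = rng.normal(size=(channels, 5, 5))
w_reduce = 0.1 * rng.normal(size=(channels // reduction, channels))
w_expand = 0.1 * rng.normal(size=(channels, channels // reduction))

recalibrated, gates = se_block(features, w_reduce, w_expand)
```

Each gate depends on the pooled statistics of all channels, which is how the block models interdependencies between channels. When the unit is added to a residual network, it is typically applied to the residual branch before the addition with the shortcut.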

https://arxiv.org/pdf/1711.07971.pdf Non-local Neural Networks

In this paper, we present non-local operations as a generic family of building blocks for capturing long-range dependencies. Inspired by the classical non-local means method [4] in computer vision, our non-local operation computes the response at a position as a weighted sum of the features at all positions. This building block can be plugged into many computer vision architectures.
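The weighted sum over all positions can be sketched as follows. The example assumes a softmax over dot-product similarities as the weighting and wraps the operation in a residual connection, which is what lets it be plugged into an existing network. The projection names and sizes are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def non_local_block(x, w_theta, w_phi, w_g, w_out):
    """x has shape (positions, channels); every position attends to all positions."""
    theta = x @ w_theta
    phi = x @ w_phi
    g = x @ w_g
    weights = softmax(theta @ phi.T, axis=-1)   # (positions, positions), rows sum to 1
    y = weights @ g                             # response: weighted sum of features
    return x + y @ w_out, weights               # residual connection around the operation

rng = np.random.default_rng(0)
positions, channels, inner = 10, 8, 4           # hypothetical sizes
x = rng.normal(size=(positions, channels))
w_theta = 0.1 * rng.normal(size=(channels, inner))
w_phi = 0.1 * rng.normal(size=(channels, inner))
w_g = 0.1 * rng.normal(size=(channels, inner))
w_out = 0.1 * rng.normal(size=(inner, channels))

z, weights = non_local_block(x, w_theta, w_phi, w_g, w_out)

# With a zero output projection the block reduces to the identity mapping.
z_identity, _ = non_local_block(x, w_theta, w_phi, w_g, np.zeros((inner, channels)))
```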