**Name** Network in Network

**Intent**

Reduce computational cost by lowering the dimensionality of the representation inside the layer.

**Motivation**

How can we lower the computational costs of a layer?

**Structure**

<Diagram>

**Discussion**

**Known Uses**

**Related Patterns**

<Diagram>

**References**

Lin, M., Chen, Q., Yan, S.: Network in Network. In: International Conference on Learning Representations (2014)

Low-dimensional Embeddings. Lin et al. proposed a method to reduce the dimensionality of convolutional feature maps. By using relatively cheap 1×1 convolutional layers (i.e. layers comprising d filters of size 1×1×c, where d < c), they learn to map feature maps into lower-dimensional spaces, i.e. to new feature maps with fewer channels. Subsequent spatial filters operating on this lower-dimensional input space require significantly less computation. This method is used in most state-of-the-art image-classification networks to reduce computation.
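As a hedged illustration of the idea above (the shapes, filter counts, and cost accounting here are my own toy choices, not taken from the paper), a 1×1 convolution with d filters is just a per-pixel linear map across the channel axis, and placing it before a k×k spatial convolution shrinks the multiply-accumulate cost:

```python
import numpy as np

# A 1x1 convolution is a per-pixel linear map across channels:
# it maps an (H, W, c) feature map to (H, W, d) with d < c.
rng = np.random.default_rng(0)

H, W, c, d = 8, 8, 64, 16              # illustrative sizes; d < c
feature_map = rng.standard_normal((H, W, c))
filters = rng.standard_normal((c, d))  # d filters, each of size 1x1xc

# Convolving with d 1x1xc filters == matmul over the channel axis.
embedded = feature_map @ filters       # shape (H, W, d)
assert embedded.shape == (H, W, d)

# Cost of a subsequent kxk spatial conv scales with its input channels:
k, out_ch = 3, 64
macs_without = H * W * k * k * c * out_ch                  # 3x3 on c channels
macs_with = H * W * c * d + H * W * k * k * d * out_ch     # 1x1 reduce + 3x3
print(macs_without, macs_with)
```

With these toy numbers the 1×1 reduction cuts the multiply-accumulate count by roughly a factor of 3.6, which is the whole point of the embedding.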

http://arxiv.org/abs/1312.4400 Network In Network

The conventional convolutional layer uses linear filters followed by a nonlinear activation function to scan the input. Instead, we build micro neural networks with more complex structures to abstract the data within the receptive field.
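A minimal sketch of such a micro-network, applied to the channel vector at a single spatial position (the layer widths and weights here are illustrative assumptions; in NiN this "mlpconv" slides over every receptive field, which over channels is equivalent to stacked 1×1 convolutions):

```python
import numpy as np

# Hedged sketch: a small MLP abstracting the data within one receptive
# field, i.e. one spatial site's channel vector. Sizes are made up.
rng = np.random.default_rng(4)
c, h1, h2 = 32, 16, 10                 # input channels, two MLP widths

patch = rng.standard_normal(c)         # channel vector at one position
W1 = rng.standard_normal((h1, c)) * 0.1
W2 = rng.standard_normal((h2, h1)) * 0.1

def relu(x):
    return np.maximum(x, 0.0)

# Micro neural network: a nonlinear MLP replaces the single linear filter.
out = relu(W2 @ relu(W1 @ patch))      # shape (h2,)
```

Sliding this same MLP across all spatial positions with shared weights is exactly what makes it implementable as a convolution followed by 1×1 convolutions.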

http://arxiv.org/pdf/1603.08029v1.pdf ResNet in ResNet

http://arxiv.org/abs/1603.06759v1 Conv. in Conv. Owing to the power of MLPs and of 1×1 convolutions in the spatial domain, NiN has a stronger capacity for feature representation and hence achieves better recognition rates.

The architecture of CiC-1D. (a) Directly showing the role of MLP-010. (b) The kernels and their constraints for implementing CiC-1D. (c) The architecture and main steps of CiC-1D.

http://people.cs.uchicago.edu/~larsson/fractalnet/

https://www.nervanasys.com/winograd-2/

http://arxiv.org/abs/1605.07648v1 FractalNet: Ultra-Deep Neural Networks without Residuals

Repeated application of a single expansion rule generates an extremely deep network whose structural layout is precisely a truncated fractal. Such a network contains interacting subpaths of different lengths, but does not include any pass-through connections: every internal signal is transformed by a filter and nonlinearity before being seen by subsequent layers.
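The expansion rule can be sketched as follows. This is a hedged toy version on plain vectors: the recursion and the element-wise-mean join follow the paper, while the stand-in `layer` transform (playing the role of convolution plus nonlinearity) and the tiny shapes are my own assumptions:

```python
import numpy as np

# FractalNet expansion rule, simplified:
#   f_1(z) = layer(z)
#   f_{C+1}(z) = join( layer(z), f_C(f_C(z)) ),  join = element-wise mean
rng = np.random.default_rng(1)
W = rng.standard_normal((4, 4)) * 0.1

def layer(z):
    # Stand-in for conv + nonlinearity: every signal is transformed,
    # so there are no pass-through (identity) connections.
    return np.maximum(W @ z, 0.0)

def fractal(z, C):
    if C == 1:
        return layer(z)
    short = layer(z)                          # shortest subpath: 1 layer
    long = fractal(fractal(z, C - 1), C - 1)  # two copies of f_{C-1}
    return (short + long) / 2.0               # join the subpaths

z = rng.standard_normal(4)
out = fractal(z, 3)   # longest path has depth 2**(3-1) = 4 layers
```

Note how a single rule yields interacting subpaths of lengths 1 through 2^(C-1), with no residual/identity shortcuts anywhere.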

This property stands in stark contrast to the current approach of explicitly structuring very deep networks so that training is a residual learning problem.

A fractal design achieves an error rate of 22.85% on CIFAR-100, matching the state-of-the-art held by residual networks.

Fractal networks exhibit intriguing properties beyond their high performance. They can be regarded as a computationally efficient implicit union of subnetworks of every depth.

FractalNet demonstrates that path length is fundamental for training ultra-deep neural networks; residuals are incidental. Key is the shared characteristic of FractalNet and ResNet: large nominal network depth, but effectively shorter paths for gradient propagation during training. Fractal architectures are arguably the simplest means of satisfying this requirement, and match or exceed ResNet’s experimental performance. They are resistant to being too deep; extra depth may slow training, but does not impair accuracy.

With drop-path, regularization of extremely deep fractal networks is intuitive and effective. Drop-path doubles as a method of enforcing latency/accuracy tradeoffs within fractal networks, for applications where fast answers have utility.
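A hedged sketch of local drop-path at a fractal join (the function name, drop probability, and "keep at least one path" recovery are illustrative; the paper also describes a global variant that samples a single column):

```python
import numpy as np

# Local drop-path: each input to a join is dropped with some probability,
# but at least one must survive; the join averages only the survivors.
rng = np.random.default_rng(2)

def drop_path_join(paths, drop_prob=0.5, training=True):
    paths = np.stack(paths)                    # (n_paths, features)
    if not training:
        return paths.mean(axis=0)              # inference: plain mean join
    keep = rng.random(len(paths)) >= drop_prob
    if not keep.any():
        keep[rng.integers(len(paths))] = True  # ensure one surviving path
    return paths[keep].mean(axis=0)

a, b = np.ones(3), 3 * np.ones(3)
print(drop_path_join([a, b], training=False))  # [2. 2. 2.]
```

Dropping all but the shallowest paths at inference time is what enables the latency/accuracy tradeoff mentioned above: a shallow subnetwork answers fast, the full fractal answers accurately.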

Our analysis connects the emergent internal behavior of fractal networks with phenomena built into other designs. Their substructure is similar to hand-designed modules used as building blocks in some convolutional networks. Their training evolution may emulate deep supervision and student-teacher learning.

http://iamaaditya.github.io/2016/03/one-by-one-convolution/

http://openreview.net/pdf?id=ByZvfijeg HIGHER ORDER RECURRENT NEURAL NETWORKS

In this paper, we have proposed new structures for recurrent neural networks, called higher order RNNs (HORNNs). In these structures, we use more memory units to keep track of more preceding RNN states, which are all fed along various feedback paths to the hidden layer to generate the feedback signals. In this way, the model is better able to capture long-term dependencies in sequential data. Moreover, we have proposed several types of pooling functions to calibrate the multiple feedback paths.
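A hedged sketch of one HORNN step under those definitions (the weight shapes, the choice of mean pooling over the feedback paths, and all names are illustrative assumptions; the paper explores several pooling functions):

```python
import numpy as np

# One step of a higher-order RNN: the new hidden state sees the K most
# recent hidden states via K feedback paths, pooled here by a mean.
rng = np.random.default_rng(3)
n_in, n_hid, K = 5, 8, 3

W_in = rng.standard_normal((n_hid, n_in)) * 0.1
W_fb = [rng.standard_normal((n_hid, n_hid)) * 0.1 for _ in range(K)]

def hornn_step(x, prev_states):
    # prev_states: the K most recent hidden states, newest first.
    feedback = np.mean([W @ h for W, h in zip(W_fb, prev_states)], axis=0)
    return np.tanh(W_in @ x + feedback)

states = [np.zeros(n_hid) for _ in range(K)]
for t in range(4):                       # run a short input sequence
    x_t = rng.standard_normal(n_in)
    h_t = hornn_step(x_t, states)
    states = [h_t] + states[:-1]         # shift the state history
```

With K = 1 this collapses to a vanilla RNN; the extra feedback paths are what give the hidden layer direct access to states further back in time.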