Hierarchical Composition

Alias Mulitlayer Architecture, Deep Architecture


Learn a representation where higher level abstractions are built recursively from lower level abstractions.


How can we build networks with multiple layers to improve expressivity and generalization?

A Deep Neural Network is capable of learning a Distributed Representation across multiple layers. However, how does it happen that one layer is an abstraction of the layer below it?




A network can be composed with multiple non-linear layers forming a hierarchy from the input to the output classification.

In Yoshua Bengio's paper “Deep Learning of Representation”, he defines a Deep Learning algorithm as follows:

A representation learning algorithm discovers explanatory factors or features. A deep learning algorithm is a particular kind of representation learning procedure that discovers multiple levels of representation, with higher-level features representing more abstract aspects of the data.

Multiple layers of a deep neural network leads to multiple representations for each layer. In a Convolution Network where the output of each layers is pooled or averaged out, the consequence of which leads to different representations with different scales. Where the ConvNet has more fine grained representations at the lowest layers, however at each higher level there is a new representation that is created that corresponds to features that exist at a broader scale than the layer below.

There are several research papers that show that deeper networks can capture representations that cannot be captures by more shallow networks.


In this paper we are able to demonstrate empirically that shallow models can, at least in principle, learn more accurate functions without a large increase in the number of parameters. The algorithm we use to do this—training the shallow model to mimic a more accurate deep model, however, is awkward. It depends on the availability of either a large unlabeled dataset (to reduce the gap between teacher and mimic model) or a teacher model of very high accuracy, or both. Developing algorithms to train shallow models of high accuracy directly from the original data without going through the intermediate teacher model would, if possible, be a significant contribution.

There is research that shows the FCN can in fact be squashed however CNN cannot.

Known Uses

Related Patterns


Relationship to Canonical Patterns

  • Irreversibility and Merge in combination link adjacent layers.
  • Dissipative Adaptation provides the evolutionary framework to explain the spontaneous formation of cliques of classifiers within a layer or across a layer (with passthrough routing).
  • Ensembles are connected to each other in a hierarchical manner and enhancing through composition of features of a lower layer.
  • Modularity is present in that trained lower layers can be reused in other contexts.

Cited by these patterns:

References The Power of Depth for Feedforward Neural Networks

We show that there is a simple (approximately radial) function on R d , expressible by a small 3-layer feedforward neural networks, which cannot be approximated by any 2-layer network, to more than a certain constant accuracy, unless its width is exponential in the dimension. The result holds for virtually all known activation functions, including rectified linear units, sigmoids and thresholds, and formally demonstrates that depth – even if increased by 1 – can be exponentially more valuable than width for standard feedforward neural networks. Moreover, compared to related results in the context of Boolean functions, our result requires fewer assumptions, and the proof techniques and construction are very different Benefits of depth in neural networks

1. Functions with few oscillations poorly approximate functions with many oscillations. 2. Functions computed by networks with few layers must have few oscillations. 3. Functions computed by networks with many layers can have many oscillations. On the Depth of Deep Neural Networks: A Theoretical View

We show that deeper networks tend to have larger representation power (measured by Betti numbers based complexity) than shallower networks in multi-class setting. Degrees of Freedom in Deep Neural Networks

Further, we observe that for fixed number of parameters, deeper networks have less degrees of freedom exhibiting a regularization-by-depth.

Degrees of freedom estimates for models trained on Synthetic data, MNIST and CIFAR-10 under different regularizations. The lines represent the degrees of freedom estimate.

This note provides a family of classification problems, indexed by a positive integer k, where all shallow networks with fewer than exponentially (in k) many nodes exhibit error at least 1/6, whereas a deep network with 2 nodes in each of 2k layers achieves zero error, as does a recurrent network with 3 distinct nodes iterated k times. The proof is elementary, and the networks are standard feedforward networks with ReLU (Rectified Linear Unit) nonlinearities. Deep Learning Layer-wise Learning of Feature Hierarchies

Hierarchical neural networks for object recognition have a long history. In recent years, novel methods for incrementally learning a hierarchy of features from unlabeled inputs were proposed as good starting point for supervised training. These deep learning methods— together with the advances of parallel computers—made it possible to successfully attack problems that were not practical before, in terms of depth and input size. In this article, we introduce the reader to the basic concepts of deep learning, discuss selected methods in detail, and present application examples from computer vision and speech recognition. Deep Learning and the Renormalization Group An exact mapping between the Variational Renormalization Group and Deep Learning WHY DOES DEEP LEARNING WORK? - A PERSPECTIVE FROM GROUP THEORY

The intuition naturally extends to a many-layer scenario. Each trainee hidden layer ends up finding a feature with a big stabilizer. But at depth > 1, the inputs no longer inhabit the same space as the training samples, and this is where the next magic happens. A “simple” feature over this new space, actually corresponds to a more complex shape in the space of input samples. This process recurs into deeper layers, and thus emerge higher-order representations.

In that sense the orbit-stabilizer principle is a re-normalizable theory - it allows for the exact same coarse graining operation at every layer - namely, keeping only minimal orbit shapes and then passing them as new parameters for the next layer - and the theory remains unchanged at every scale.

A deep network constructs higher order moduli spaces. It learns figures having increasingly smaller symmetry groups; which corresponds to increasing complexity. A Group Theoretic Perspective on Unsupervised Deep Learning

All experiments to date suggest that a neural network is likely to learn the edges. But why? To answer this, imagine that the space of the autoencoders (viewed as transformations of the input) form a group. A batch of learning iterations stops whenever a stabilizer is found. Roughly speaking, if the search is a Markov chain (or a guided chain such as MCMC), then the bigger a stabilizer, the earlier it will be hit. The group structure implies that this big stabilizer corresponds to a small orbit. Now intuition suggests that the simpler a feature, the smaller is its orbit. For example, a line-segment generates many fewer possible shapes under linear deformations than a flower-like shape. An autoencoder then should learn these simpler features first, which falls in line with most experiments (see Lee et al. (2009)).

The intuition naturally extends to a many-layer scenario. Each hidden layer finding a feature with a big stabilizer. But beyond the first level, the inputs no longer inhabit the same space as the training samples. A “simple” feature over this new space actually corresponds to a more complex shape in the space of input samples. This process repeats as the number of layers increases. In effect, each layer learns “edge-like features” with respect to the previous layer, and from these locally simple representations we obtain the learned higher-order representation. On the Number of Linear Regions of Deep Neural Networks

The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network’s depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions.

(a) Space folding of 2-D Euclidean space along the two coordinate axes. (b) An illustration of how the top-level partitioning (on the right) is replicated to the original input space (left). © Identification of regions across the layers of a deep model.

Space folding of 2-D space in a non-trivial way. Note how the folding can potentially identify symmetries in the boundary that it needs to learn.

On the Expressive Power of Deep Learning: A Tensor Analysis

The three architectural elements of locality, sharing and pooling, which have facilitated the great success of convolutional networks, are all lacking in existing theoretical studies of depth efficiency.

We use sum nodes to implement convolutions (locality with sharing), and product nodes to realize pooling.

The models we arrive at may be viewed as convolutional networks with product pooling and linear point-wise activation. They are attractive on three accounts. First, as discussed in app. E, convolutional arithmetic circuits are equivalent to SimNets, a new deep learning architecture that has recently demonstrated promising empirical results on various image recognition benchmarks (Cohen et al. (2016)). Second, as we show in sec. 3, convolutional arithmetic circuits are realizations of hierarchical tensor decompositions (see Hackbusch (2012)), opening the door to various mathematical and algorithmic tools for their analysis and implementation. Third, the depth efficiency of convolutional arithmetic circuits, which we analyze in sec. 4, was shown in the subsequent work of Cohen and Shashua (2016) to be superior to the depth efficiency of the popular convolutional rectifier networks, namely convolutional networks with rectified linear (ReLU) activation and max or average pooling. Very Deep Convolutional Networks for Natural Language Processing

What makes convolutional networks appropriate for image processing is that the visual world is compositional. Pixels combine to form shapes, contours and profiles. With ConvNets, higher level features are derived from lower level features to form a hierarchical representation. The convolution filters of a ConvNet create patterns of increasing scale and complexity: first a contour, then a shape, a face, etc. An image and a small text have similar properties. Texts are also compositional for many languages. Characters combine to form n-grams, stems, words, phrase, sentences etc. These similar properties make the comparison between computer vision and natural language processing very profitable and we believe future research should invest into making text processing models deeper. Our work is a first attempt towards this goal. Do Deep Nets Really Need to be Deep?

Our success in training shallow neural nets to mimic deeper models suggests that there probably exist better algorithms for training shallow feed-forward nets than those currently available.

Learning hierarchical category structure in deep neural networks

By analyzing the covariance structure of hierarchical probabilistic models, showing that progressive differentiation is a general feature of learning hierarchically structured training data in deep neural networks.

First three modes of the singular value decomposition of a toy dataset. Left: The learning environment is specified by an input-output correlation matrix. Right: The SVD decomposes Σ31 into modes that link a set of coherently covarying items (object analyzer vectors in the rows of V T ) to a set of coherently covarying properties (feature synthesizer vectors in the columns of U). The overall strength of this link is given by the singular values lying along the diagonal of S. In this toy example, mode 1 distinguishes plants from animals; mode 2 birds from fish; and mode 3 flowers from trees.

the number of units in the shallow network has to grow exponentially, compared to a linear growth in the deep network, so as to represent the same functions. The deep version thus offers a much more compact representation of the same functions. Critical Behavior from Deep Dynamics: A Hidden Dimension in Natural Language

How can we know when machines are bad or good? The old answer is to compute the loss function. The new answer is to also compute the mutual information as a function of separation, which can immediately show how well the model is doing at capturing correlations on different scales. Do semantic parts emerge in Convolutional Neural Networks?

We have analyzed the emergence of semantic parts in CNNs. We have investigated whether the network’s filters learn to respond to semantic parts. We have associated filter stimuli with ground-truth part bounding-boxes in order to perform a quantitative evaluation for different layers, network architectures and supervision levels. Despite promoting this emergence by providing favorable settings and multiple assists, we found that only 34 out of 123 semantic parts in PASCAL-Part dataset [5] emerge in AlexNet [6] fine tuned for object detection [7].

Discriminativness of PASCAL-Parts for the classification of their objects. The vertical axis indicates the difference δ between the classification score for the original image, and the score for the image after blacking out the part. We report averages over all images in an object class (higher values mean more discriminative). The Fundamental Problem with the Building Block Hypothesis The Fundamental Problem with the Building Block Hypothesis

Will building blocks go the way of luminiferous aether? Time will tell. At this point what we can say is that certain parallels between the two are unmistakeable. Both are the result of extrapolation—of the mechanical basis of wave propagation in the second case, and of the purported process underlying human innovation in the first—and both require believers to embrace, whether consciously or otherwise, assumptions so strong that they seem almost magical. Deep vs. Shallow Networks: an Approximation Theory Perspective

We propose a new definition of relative dimension to encapsulate different notions of sparsity of a function class that can possibly be exploited by deep networks but not by shallow ones to drastically reduce the complexity required for approximation and learning. Why does deep and cheap learning work so well?

binary tree hierarchical network in 8 variables, which approximates well functions of the form (2.2). Each of the nodes consists of n units and computes the ridge function ∑n i=1 aiσ(〈vi,x〉 + ti), with vi,x ∈ R2, ai, ti ∈ R. Similar to the shallow network such a hierarchical network can approximate any continuous function; the text proves how it approximates compositional functions better than a shallow network. Shift invariance may additionally hold implying that the weights in each layer are the same. The inset at the top right shows a network similar to ResNets: our results on binary trees apply to this case as well with obvious changes in the constants

By studying the sufficient statistics of the generative process, we showed that the inference problem requires approximating a compositional function of the form f1 ◦ f2 ◦ f2 ◦ · · · that optimally distills out the information of interest from irrelevant noise in a hierarchical process that mirrors the generative process. Although such compositional functions can be efficiently implemented by a deep neural network as long as their individual steps can, it is generally not possible to retain the efficiency while flattening the network. We extend existing “no-flattening” theorems [10–12] by showing that efficient flattening is impossible even for many important cases involving linear networks. Why Deep Neural Networks?

In this paper, we show that, for a large class of piecewise smooth functions, the number of neurons needed by a shallow network to approximate a function is exponentially larger than the corresponding number of neurons needed by a deep network for a given degree of function approximation.

An n-layer neural network structure for finding the binary expansion of a number in [0, 1].

The implementation of polynomial function Why and When Can Deep – but Not Shallow – Networks Avoid the Curse of Dimensionality

There are at least three main sets of theory questions about Deep Neural Networks. The first set of questions is about the power of the architecture – which classes of functions can it approximate and learn well? The second set of questions is about the learning process: why is SGD (Stochastic Gradient Descent) so unreasonably efficient, at least in appearance? Do good local minima exist in deep networks? Are they easier to find in deep rather than in shallow networks? The third set of questions is about generalization properties of deep learning, since it appears that deep networks suffer much less from overfitting than classical shallow networks.

The paper reviews an emerging body of theoretical results on deep learning including the conditions under which it can be exponentially better than shallow learning. Deep convolutional networks represent an important special case of these conditions, though weight sharing is not the main reason for their exponential advantage. Explanation of a few key theorems is provided together with new results, open problems and conjectures.

The top graphs are associated to functions; each of the bottom diagrams depicts the network approximating the function above. In a) a shallow universal network in 8 variables and N units approximates a generic function of 8 variables f(x1, · · · , x8). Inset b) shows a binary tree hierarchical network at the bottom in n = 8 variables, which approximates well functions of the form f(x1, · · · , x8) = h3(h21(h11(x1, x2), h12(x3, x4)), h22(h13(x5, x6), h14(x7, x8))) as represented by the binary graph above. Each of the n− 1 nodes in the graph of the function becomes, in the approximating network below, a set of Q ReLU units (smooth or not) with Q = N n−1 computing the ridge function ∑Q i=1 ai(〈vi,x〉+ ti)+, with vi,x ∈ R2, ai, ti ∈ R. Each term in the ridge function corresponds to a unit in the node and to a “channel” (in deep networks language). In a binary tree with n inputs, there are log2n levels and a total of n− 1 nodes. Similar to the shallow network, a hierarchical network is universal, that is, it can approximate any continuous function; the text proves that it approximates a compositional functions exponentially better than a shallow network. No invariance – that is weight sharing – is assumed here. Notice that the key property that makes deep nets exponentially better than shallow for compositional functions is the locality of the constituent functions that is their low dimensionality. Weight sharing corresponds to all constituent functions at one level to be the same (h11 = h12 etc.). Inset c) shows a different mechanism that can be exploited by the deep network at the bottom to reduce the curse of dimensionality in the compositional function at the top: leveraging different degrees of smoothness of the constituent functions, see Theorem 5 in the text. HIERARCHICAL COMPOSITIONAL FEATURE LEARNING

We introduce the hierarchical compositional network (HCN), a directed generative model able to discover and disentangle, without supervision, the building blocks of a set of binary images. The building blocks are binary features defined hierarchically as a composition of some of the features in the layer immediately below, arranged in a particular manner. At a high level, HCN is similar to a sigmoid belief network with pooling. Inference and learning in HCN are very challenging and existing variational approximations do not work satisfactorily. A main contribution of this work is to show that both can be addressed using max-product message passing (MPMP) with a particular schedule (no EM required). Also, using MPMP as an inference engine for HCN makes new tasks simple: adding supervision information, classifying images, or performing in painting all correspond to clamping some variables of the model to their known values and running MPMP on the rest. When used for classification, fast inference with HCN has exactly the same functional form as a convolutional neural network (CNN) with linear activations and binary weights. However, HCN’s features are qualitatively very different. Convergent Learning: Do different neural networks learn the same representations?

This initial investigation reveals a few previously unknown properties of neural networks, and we argue that future research into the question of convergent learning will yield many more.

Our main findings include: 1. Some features are learned reliably in multiple networks, yet other features are not consistently learned. 2. Units learn to span low-dimensional subspaces and, while these subspaces are common to multiple networks, the specific basis vectors learned are not. 3. The representation codes are a mix between a local (single unit) code and slightly, but not fully, distributed codes across multiple units. 4. The average activation values of neurons vary considerably within a network, yet the mean activation values across different networks converge to an almost identical distribution. Survey of Expressivity in Deep Neural Networks

The paper motivates and develops three natural measures of expressiveness, which all display an exponential dependence on the depth of the network. In fact, all of these measures are related to a fourth quantity, trajectory length. This quantity grows exponentially in the depth of the network, and is responsible for the depth sensitivity observed. These results translate to consequences for networks during and after training. They suggest that parameters earlier in a network have greater influence on its expressive power – in particular, given a layer, its influence on expressivity is determined by the remaining depth of the network after that layer.

In more detail, the measures of expressivity are:

Transitions: Counting neuron transitions is introduced indirectly via linear regions in [9], and provides a tractable method to estimate the non-linearity of the computed function.

Activation Patterns: Transitions of a single neuron can be extended to the outputs of all neurons in all layers, leading to the (global) definition of a network activation pattern, also a measure of non-linearity. Network activation patterns directly show how the network partitions input space (into convex polytopes), through connections to the theory of hyperplane arrangements, Figure 1.

Dichotomies: The heterogeneity of a generic class of functions from a particular architecture is also measured, by counting the number of dichotomies seen for a fixed set of inputs. This measure is ‘statistically dual’ to sweeping input in some cases. On the Expressive Power of Deep Learning: A Tensor Analysis

In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community. On the Expressive Power of Deep Learning: A Tensor Analysis On the ability of neural nets to express distributions

We take a first cut at explaining the expressivity of multilayer nets by giving a sufficient criterion for a function to be approximable by a neural network with n hidden layers. A key ingredient is Barron's Theorem \cite{Barron1993}, which gives a Fourier criterion for approximability of a function by a neural network with 1 hidden layer. We show that a composition of n functions which satisfy certain Fourier conditions (“Barron functions”) can be approximated by a n+1-layer neural network. For probability distributions, this translates into a criterion for a probability distribution to be approximable in Wasserstein distance—a natural metric on probability distributions—by a neural network applied to a fixed base distribution (e.g., multivariate gaussian). Building up recent lower bound work, we also give an example function that shows that composition of Barron functions is more expressive than Barron functions alone. Understanding and Improving Convolutional Neural Networks via Concatenated Rectified Linear Units

pecifically, we first examine existing CNN models and observe an intriguing property that the filters in the lower layers form pairs (i.e., filters with opposite phase). Inspired by our observation, we propose a novel, simple yet effective activation scheme called concatenated ReLU (CRelu) and theoretically analyze its reconstruction property in CNNs. We integrate CRelu into several state-of-the-art CNN architectures and demonstrate improvement in their recognition performance on CIFAR-10/100 and ImageNet datasets with fewer trainable parameters. Learning across scales - A multiscale method for Convolution Neural Networks

In this work we explore the connection between Convolution Neural Networks, partial differential equations, multigrid/multiscale methods and and optimal control. We show that convolution neural networks can be represented as a discretization of nonlinear partial differential equations, and that the learning process can be interpreted as a control problem where we attempt to estimate the coefficients of the differential equations from data. This interpretation allows us to generate an efficient multilevel/multiscale process (sometimes referred to as image pyramid) and algorithms for the learning problem that can substantially reduce the cost of learning. Furthermore, this interpretation allows us to use the coefficients that are calculated on high resolution images for low resolution images directly. The power of deeper networks for expressing natural functions Exponential expressivity in deep neural networks through transient chaos

We combine Riemannian geometry with the mean field theory of high dimensional chaos to study the nature of signal propagation in generic, deep neural networks with random weights. Our results reveal an order-to-chaos expressivity phase transition, with networks in the chaotic phase computing nonlinear functions whose global curvature grows exponentially with depth but not width. We prove this generic class of deep random functions cannot be efficiently computed by any shallow network, going beyond prior work restricted to the analysis of single functions. Moreover, we formalize and quantitatively demonstrate the long conjectured idea that deep networks can disentangle highly curved manifolds in input space into flat manifolds in hidden space. Our theoretical analysis of the expressive power of deep networks broadly applies to arbitrary nonlinearities, and provides a quantitative underpinning for previously abstract notions about the geometry of deep functions. Hierarchical Adversarially Learned Inference

We show that the features learned by our model in an unsupervised way outperform the best handcrafted features Information Scaling Law of Deep Neural Networks

With the typical DNNs named Convolutional Arithmetic Circuits (ConvACs), the complex DNNs can be converted into mathematical formula. Thus, we can use rigorous mathematical theory especially the information theory to analyse the complicated DNNs. In this paper, we propose a novel information scaling law scheme that can interpret the network’s inner organization by information theory. First, we show the informational interpretation of the activation function. Secondly, we prove that the information entropy increases when the information is transmitted through the ConvACs. Finally, we propose the information scaling law of ConvACs through making a reasonable assumption. Deep Learning Scaling is Predictable, Empirically Latent Space Policies for Hierarchical Reinforcement Learning

First, each layer in the hierarchy can be trained with exactly the same algorithm. Second, by using an invertible mapping from latent variables to actions, each layer becomes invertible, which means that the higher layer can always perfectly invert any behavior of the lower layer. This makes it possible to train lower layers on heuristic shaping rewards, while higher layers can still optimize task-specific rewards with good asymptotic performance. Finally, our method has a natural interpretation as an iterative procedure for constructing graphical models that gradually simplify the task dynamics. Boosting Dilated Convolutional Networks with Mixed Tensor Decompositions

We focus on dilated convolutional networks, a family of deep models delivering state of the art performance in sequence processing tasks. By introducing and analyzing the concept of mixed tensor decompositions, we prove that interconnecting dilated convolutional networks can lead to expressive efficiency. In particular, we show that even a single connection between intermediate layers can already lead to an almost quadratic gap, which in large-scale settings typically makes the difference between a model that is practical and one that is not Automatically Composing Representation Transformations as a Means for Generalization Compositional GAN: Learning Conditional Image Composition Hierarchical Imitation and Reinforcement Learning

We propose an algorithmic framework, called hierarchical guidance, that leverages the hierarchical structure of the underlying problem to integrate different modes of expert interaction. Our framework can incorporate different combinations of imitation learning (IL) and reinforcement learning (RL) at different levels, leading to dramatic reductions in both expert effort and cost of exploration. Representation Learning with Contrastive Predictive Coding