Canonical Patterns

“Dirac discovered the correct laws for relativity quantum mechanics simply by guessing the equation. The method of guessing the equation seems to be a pretty effective way of guessing new laws. This shows again that mathematics is a deep way of expressing nature, and any attempt to express nature in philosophical principles, or in seat-of-the-pants mechanical feelings, is not an efficient way.” - Richard Feynman

Canonical patterns are conceptual patterns that are conspicuously present in all deep learning implementations. You will find that every other pattern relates to at least one of these canonical patterns. It is conceivable that, in some foreseeable future when a mature theoretical framework for DNNs is formulated, these canonical patterns will serve as its fundamental building blocks.

Canonical patterns also provide a sense of the scope and applicability of DNNs. In any complex system, it is critical to understand its capabilities as well as to be aware of its limitations. One benefit of the pattern language approach is that every pattern explores in depth both the strengths and weaknesses of an approach. Research papers, in contrast, tend to be biased toward promoting their novelty, and this bias places blinders on our understanding of these complex systems. Addressing limitations and weaknesses should therefore be a mandatory element of any discussion of complex solutions. Furthermore, an understanding of limitations helps us avoid implementation pitfalls.

The diagram above highlights the various inter-related patterns that form the backbone of any DL network. At its core are three generalized operators that are essential components: a network is composed of layers of Similarity, Irreversible and Merge operators. We could alternatively call them Projection, Selection and Collection operators. These three operators work in conjunction to provide the essential processing capabilities of a network. To illustrate this with a conventional neural network, the Similarity operator is the sum of products of the inputs with the weights, the Irreversible operator is the non-linear activation function, and the Merge operator is the routing of outputs into a subsequent layer of the network. Each operator has an intrinsic characteristic that is essential in creating a trainable network capable of generalized classification. Furthermore, all three operators are necessary.
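
To make this concrete, here is a minimal Python sketch of a two-branch layer decomposed into these three operators. The function names (similarity, irreversible, merge) and the dimensions are our own choices for illustration; they are not part of any particular framework.

    import numpy as np

    rng = np.random.default_rng(0)

    def similarity(x, W):
        # Projection: sum of products of the inputs with a set of weight vectors.
        return x @ W

    def irreversible(z):
        # Selection: a threshold non-linearity (ReLU here) that discards
        # everything below zero; the discarded values cannot be recovered.
        return np.maximum(z, 0.0)

    def merge(*branches):
        # Collection: route several outputs into one input for the next layer.
        return np.concatenate(branches, axis=-1)

    x = rng.normal(size=4)                 # input features
    Wa = rng.normal(size=(4, 3))           # weights of two first-layer branches
    Wb = rng.normal(size=(4, 3))
    W2 = rng.normal(size=(6, 2))           # second-layer weights

    ha = irreversible(similarity(x, Wa))   # one layer = Similarity + Irreversible
    hb = irreversible(similarity(x, Wb))
    y = similarity(merge(ha, hb), W2)      # Merge routes the outputs into the next layer
    print(y)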

We can give a simple argument that these three operators support universal computation: we can in fact construct a programmable boolean NOR gate from them. A NOR gate is universal in that any other boolean expression can be constructed by tying NOR gates together. We will show that if you tie together a Similarity operator with a Threshold operator, you obtain a programmable NOR gate.

From the boolean perspective, the Similarity or Projection operator is a collection of XNOR gates. As you can see from the truth table, an XNOR gate individually acts like a boolean template-matching element, and a collection of XNOR gates acts like a projection operator across a multi-dimensional boolean vector. Now, if you tie one input of an XNOR gate to an external input and the other to an internal weight of the network, you arrive at a programmable inverter. That is, if the network weight is true, the XNOR is simply the identity; when the network weight is false, the XNOR inverts the input. So the weight effectively programs whether the inverter is active.

The Threshold operator can be represented by a gate whose configuration is a number indicating how many of its inputs have to be true for the gate to fire. For a two-input gate configured with a threshold of 2, this is equivalent to an AND gate; in other words, the output is true only when both inputs are true. You can then see that if you tie an XNOR gate to each input of the AND gate, you get a gate that, by De Morgan's rule, is equivalent to a NOR gate. This is a simple argument that collections of Similarity and Threshold operators can represent any boolean function.
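
The construction above can be written out directly as a small Python sketch. The function and parameter names are ours; the logic follows the argument in the text: each input is XNOR'd against an internal weight, and the matches are passed through a threshold gate.

    def xnor(a, b):
        # Boolean template match: true exactly when the input equals the "weight".
        return a == b

    def threshold(inputs, k):
        # Fires when at least k of the inputs are true; with k == len(inputs)
        # this behaves like an AND gate.
        return sum(inputs) >= k

    def programmable_gate(x1, x2, w1, w2):
        # Each input is compared against an internal weight by an XNOR gate,
        # then the matches are combined by a threshold (AND) gate.
        return threshold([xnor(x1, w1), xnor(x2, w2)], k=2)

    # With both weights set to False, each XNOR acts as an inverter, so the gate
    # computes (NOT x1) AND (NOT x2), which by De Morgan's rule is NOR(x1, x2).
    for x1 in (False, True):
        for x2 in (False, True):
            print(x1, x2, programmable_gate(x1, x2, w1=False, w2=False))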

You can also see the universal programmability of such a collection if you have a Merge component that ties every output of one layer to every input of the next layer. Neural networks, however, are not boolean circuits; rather, they are universal approximators. As we shall see in the Random Projections section, we can approximate multi-dimensional vectors with random projections into a lower dimension.
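
As a quick illustration of that last claim (with arbitrarily chosen dimensions, not taken from the text), projecting high-dimensional vectors through a random matrix approximately preserves their pairwise distances:

    import numpy as np

    rng = np.random.default_rng(0)

    d, k, n = 1000, 50, 20                     # original dim, projected dim, number of points
    X = rng.normal(size=(n, d))                # high-dimensional vectors
    R = rng.normal(size=(d, k)) / np.sqrt(k)   # random projection matrix
    Y = X @ R                                  # lower-dimensional approximations

    # Pairwise distances are roughly preserved by the projection.
    d_orig = np.linalg.norm(X[0] - X[1])
    d_proj = np.linalg.norm(Y[0] - Y[1])
    print(d_orig, d_proj, d_proj / d_orig)

The ratio of the projected to the original distance stays close to 1, which is the Johnson-Lindenstrauss flavour of the claim.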

At the core of any machine learning algorithm is the concept of a Distance Measure. As we shall discover, there are many ways to define a Distance Measure. Entropy is a kind of distance: it is an aggregate measure of the entire model with respect to its predicted response. Entropy is derivable from the KL divergence, which is a measure of the divergence between probability distributions. Note that a divergence is a weaker form of distance for which the triangle inequality does not apply.
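
For reference, the standard definitions (not specific to this text) relating KL divergence, entropy and cross-entropy are:

    D_{\mathrm{KL}}(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)} = H(p, q) - H(p),
    \qquad H(p) = -\sum_x p(x) \log p(x),
    \qquad H(p, q) = -\sum_x p(x) \log q(x).

The KL divergence is also asymmetric in general, which is another reason it is a divergence rather than a true distance.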

One unique trait of neural networks is the use of an Activation function. The original motivation for the activation function was to provide a non-linearity, so that a layered network could not be equivalent to a single-layer linear network. However, with the discovery of the effectiveness of the piecewise-linear ReLU activation function, this non-linearity requirement needs to be challenged. The key characteristic of an Activation function appears to be that it performs a Threshold operation: similarity measures below a certain threshold are ignored by the function. This function actually loses information. In fact, we can think of it as an operator that discards information that is irrelevant (i.e. features that are invariant) to the local and global fitness function. We shall see that the generalization of the activation function is in fact an invariant-removal function (sometimes known more elegantly as symmetry breaking).
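
To make the information-loss point concrete, here is a toy example (our own, not from the text) in which two distinct pre-activation vectors collapse onto the same output after the threshold:

    import numpy as np

    def relu(z):
        return np.maximum(z, 0.0)

    a = np.array([-3.0, 0.5, -0.1, 2.0])
    b = np.array([-7.0, 0.5, -9.0, 2.0])

    # Two different sets of similarity scores collapse onto the same output:
    # everything below the threshold is discarded and cannot be recovered.
    print(relu(a))   # [0.  0.5 0.  2. ]
    print(relu(b))   # [0.  0.5 0.  2. ]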

It is always important to remember that neural networks are classifiers, or pattern-matching machines. So it should not come as a surprise that the key operators are one that measures similarity and one that performs a selection. The resolving capability of a neural network comes from the fact that it is built from classifiers composed of classifiers.

Self-replication is a consequence of Dissipative Adaptation, a mechanism in physics that accounts for the emergence of self-replicating structures. In the context of neural networks, the self-replicating structures take the form of model parameters. Training a neural network is an external force that drives the network into an equilibrium state that minimizes the Entropy, or, said another way, improves the accuracy of the Model with respect to its predictions. Dissipative Adaptation does appear to work within a layer and explains the self-organizing structure of a layer.

The ability to form abstractions requires an additional operator (i.e. the Merge operator) that is able to combine features in an input space into new features in an output space. There is no explicit Merge operator in a classic neural network. However, the Merge operator is found in most DL frameworks in the slicing and dicing of outputs to prepare them for the next layer. Its effect is to route all outputs of a layer into the layer above. The Merge operator in combination with the Threshold operator is essential for building more expressive deep networks. The evolution of a layer, however, is constrained from above and below: it is constrained to evolve only within what is expressible in the features present in its input, and it is driven to evolve in reaction to the training data and its effect on the fitness function and regularization constraints.
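
A toy numpy sketch of this kind of routing (the particular slicing scheme here is arbitrary and chosen purely for illustration):

    import numpy as np

    rng = np.random.default_rng(0)

    h = rng.normal(size=(8, 16))     # outputs of one layer: batch of 8, 16 features

    # "Slicing and dicing": split the outputs into two feature groups, transform
    # each group, then merge them back into the input of the next layer.
    left, right = h[:, :8], h[:, 8:]
    merged = np.concatenate([left * 2.0, right], axis=1)

    W_next = rng.normal(size=(16, 4))
    next_input = merged @ W_next     # routed into the layer above
    print(next_input.shape)          # (8, 4)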

A network builds Distributed Models in which knowledge is diffused across the neurons in a layer as well as across the layers of a deep learning hierarchy. This diffusion of knowledge makes Model Interpretability difficult. However, this issue does not prevent a network from performing classification. In high-dimensional data there exists a sparse basis that can be discovered, such that a lower number of dimensions is required for interpretability.

Distributed Models are able to perform accurate classification as a consequence of involving an Ensemble of classifiers. The predictions of each Ensemble member require that the output results be Scale Invariant, and Scale Invariance can be achieved by normalization. Random Projections are also an example of an Ensemble, albeit a collection of simple classifiers working in concert. What you then begin to realize is that neural networks are in fact self-similar in construction: not only can you have layers on top of layers, you can have entire networks inside each layer. Ensembles generalize better when there is high variability in their classification capability; Random Projections do well when the projection hyper-planes are orthogonal to one another. The signature of Randomness is so pervasive in ML research that we shall devote a pattern to discussing its implications. One other kind of Distributed Model is the Associative Memory; there are some theories on this subject.
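
A minimal sketch of the Ensemble idea under stated assumptions (synthetic data, arbitrary sizes, all names ours): each weak member classifies by the sign of a single random projection, and the ensemble averages their votes.

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic two-class data: class means separated along a random direction.
    d, n = 50, 200
    direction = rng.normal(size=d)
    labels = rng.integers(0, 2, size=n)            # 0/1 class labels
    signs = labels * 2 - 1                         # -1/+1 targets
    X = rng.normal(size=(n, d)) + np.outer(signs, direction)

    def make_member():
        # A very weak classifier: the sign of a single random projection,
        # oriented so that it agrees with the labels more often than not.
        w = rng.normal(size=d)
        if np.mean(np.sign(X @ w) == signs) < 0.5:
            w = -w
        return w

    members = [make_member() for _ in range(25)]

    # Each member votes; the ensemble averages the votes and takes the sign.
    ensemble_vote = np.sign(np.mean([np.sign(X @ w) for w in members], axis=0))
    print(np.mean(ensemble_vote == signs))         # ensemble training accuracy

Averaging many weak but varied members typically yields a stronger combined vote than any single member, which is the point the paragraph makes about variability.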

Finally, there is the open question of what a Distributed Model can predict. Neural networks are able to predict the cause of an observation given the observed effect; this is known as anti-causal reasoning. For example, a network does very well at recognizing handwritten digits: the image of a handwritten number is the effect of a person's intention to write a number that is comprehensible, without any intent of obfuscation. However, if one were to try to predict the effect given a cause, a network may have trouble performing the task. There is a class of systems that are Computationally Irreducible, meaning that the behavior of systems in this class cannot be predicted without performing an approximately accurate simulation.

Each of the ideas introduced here will be presented in pattern form. A suggested reading order is provided here; following this order, however, is not necessary, and following the graph above may be more illuminating.

Open Questions

  • Scale Invariance

References

http://research.microsoft.com/apps/video/default.aspx?id=206977 Causes and Counterfactuals: Concepts, Principles and Tools.

https://en.wikipedia.org/wiki/Attractor#Strange_attractor

https://en.wikipedia.org/wiki/The_Emperor%27s_New_Mind

https://en.wikipedia.org/wiki/Shadows_of_the_Mind

http://arxiv.org/abs/1606.05336v1 On the expressive power of deep neural networks

http://arxiv.org/pdf/1509.08627v1.pdf Semantics, Representations and Grammars for Deep Learning

http://arxiv.org/pdf/1102.2468v1.pdf Algorithmic Randomness as Foundation of Inductive Reasoning and Artificial Intelligence

http://arxiv.org/pdf/1608.08225v1.pdf Why does deep and cheap learning work so well?

“We conjecture that approximating a multiplication gate x1 x2 ··· xn will require exponentially many neurons in n using non-pathological activation functions, whereas we have shown that allowing for log2 n layers allows us to use only ∼ 4n neurons.”

http://arxiv.org/abs/1605.06444v2 Unreasonable Effectiveness of Learning Neural Networks: From Accessible States and Robust Ensembles to Basic Algorithmic Schemes

http://www.asimovinstitute.org/neural-network-zoo/

https://arxiv.org/pdf/1611.02420v1.pdf Meaning = Information + Evolution

http://math.mit.edu/~freer/papers/PhysRevLett_110-168702.pdf Causal Entropic Forces

http://threeplusone.com/Still2012.pdf Thermodynamics of Prediction

https://arxiv.org/pdf/1702.07800.pdf On the Origin of Deep Learning

© 2016 Copyright - Carlos E. Perez

Carlos Perez, 2016/07/01 13:42
Add Regularization, HJB Eq and Forgetting into the diagram.
Carlos E. Perez, 2016/07/01 19:36
If you assume that the original data is in a sparse basis, then I don't think you'll have information loss.

The activation function is a partition/selection operator that slices up the measure (the result of the dot product in the neuron) and lets through only what exceeds a certain threshold. ReLU, incidentally, is piecewise linear. The activation function does not need to be non-linear (that was the age-old argument that you needed non-linearity for a universal approximator); it just needs to act like a fitness function, essentially throwing away anything under a threshold. That's why a neuron is actually self-similar to the entire network.

The feature space of the input is not the same as the feature space of the output. It simply cannot be, since you run multiple projections to create the output feature space, and everything is scrambled up by the time you reach it. However, it is not completely random: by the method of random projections, it actually preserves the local similarity of the original input feature space! So the Models are approximately length-preserving, although the angles between the input and output features have changed.

I don't think the weight matrices are orthogonal, though. If they were, you would have some kind of rotation in multi-dimensional space. They are, rather, like shearing tensors that deform the original space. That's where the deformation is happening: in the weight matrix, not in the activation function. The activation function is just an information-loss function. This is indeed strange: you gain information by throwing out information... that, intuitively, is what we know as abstraction.