Dissipative Adaptation

“You start with a random clump of atoms, and if you shine light on it for long enough, it should not be so surprising that you get a plant.” - Jeremy England


The phenomenon of dissipative adaptation explains the trainability of neural networks.


How do we set up the conditions such that a predictive model can spontaneously organize?




In classical optimization terminology, an optimization problem converges when it discovers a solution at the global extremum of the objective function. Deep learning systems differ from classical optimization in that they also require generalization. Generalization is facilitated by some form of regularization, either in the objective function or in the structure of the neural network. How, then, is a system trained via stochastic gradient descent (SGD) able to evolve models that reach convergence?

From the perspective of deep learning practitioners, stochastic gradient descent works surprisingly well. This is unintuitive for many experts in the optimization field: high-dimensional problems are expected to be non-convex and therefore extremely hard to optimize. A method as simplistic as SGD should not be effective in the high-complexity, high-dimensional spaces that deep learning networks inhabit.
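As a toy illustration (a hypothetical quartic loss, not an example from the text), the following sketch runs SGD with noisy gradient estimates on a non-convex function; despite the noise and non-convexity, every coordinate settles into one of the minima:

```python
import numpy as np

# Toy sketch of SGD on a non-convex loss (hypothetical example, not from
# the text): minimize f(w) = sum(w^4 - 3*w^2) with noisy gradient estimates.
rng = np.random.default_rng(0)

def grad(w):
    # Exact per-coordinate gradient of w^4 - 3*w^2.
    return 4 * w**3 - 6 * w

w = rng.normal(size=10)   # random initial parameters ("a random clump")
lr = 0.01
for _ in range(2000):
    noise = rng.normal(scale=0.5, size=w.shape)  # stand-in for minibatch noise
    w -= lr * (grad(w) + noise)

# Each coordinate settles near one of the two minima at w = +/- sqrt(1.5).
print(np.round(w, 2))
```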

Experimental evidence has shown that in high-dimensional spaces, a critical point of the loss surface is far more likely to be a saddle point than a local minimum. A saddle point gifts the optimization process with many more directions along which to escape and move forward. This argument explains why large networks do not appear to get stuck in non-optimal states very often.
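This can be illustrated with a random-matrix sketch (an assumed model, not the experiments referenced in the text): if the Hessian at a critical point is modeled as a random symmetric matrix, the fraction of draws in which all eigenvalues are positive, i.e. the fraction of true local minima, collapses as the dimension grows:

```python
import numpy as np

# Random-matrix illustration (assumed model, not from the text): treat the
# Hessian at a critical point as a random symmetric matrix and estimate how
# often all of its eigenvalues are positive, i.e. how often the critical
# point is a true local minimum rather than a saddle.
rng = np.random.default_rng(1)

def is_minimum(dim):
    a = rng.normal(size=(dim, dim))
    hessian = (a + a.T) / 2                    # symmetric "Hessian"
    return bool(np.all(np.linalg.eigvalsh(hessian) > 0))

fracs = {}
for dim in (2, 10):
    fracs[dim] = np.mean([is_minimum(dim) for _ in range(2000)])

# The fraction of minima collapses with dimension; almost every critical
# point in the higher-dimensional case is a saddle.
print(fracs)
```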

The phenomenon of Dissipative Adaptation is a mechanism found in dynamical systems that may explain how and why deep learning systems converge into stable attractor basins. Dissipative Adaptation provides an explanation for why self-replicating structures arise in physical systems. It describes the dynamics of a system in contact with a thermal reservoir while an external energy source also acts on the system. In such a system, different configurations are not equally able to absorb energy from that external source. Absorbing energy from the external source allows a configuration to traverse activation barriers too high to jump rapidly by thermal fluctuations alone. If energy is dissipated after a jump, then that energy is no longer available for the system to reversibly jump back to where it came from. Even though any given change in configuration is random, the most likely configuration (as a consequence of this irreversibility) is the one that aligns most efficiently with the absorption and dissipation of external energy.
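A minimal sketch of this mechanism, under an assumed two-state toy model (not from the text): thermal fluctuations alone rarely cross the barrier, but a drive that pays most of the barrier on the forward jump, with the absorbed energy dissipated on arrival, leaves the system overwhelmingly in the drive-aligned configuration:

```python
import numpy as np

# Toy two-state sketch of dissipative adaptation (assumed model, not from
# the text). The jump A -> B requires crossing an activation barrier; the
# external drive supplies most of the energy for the forward jump, and that
# energy is dissipated on arrival, so the reverse jump must rely on thermal
# fluctuations alone.
rng = np.random.default_rng(2)
barrier, kT, drive = 6.0, 1.0, 5.0

p_driven = np.exp(-(barrier - drive) / kT)   # forward jump, drive-assisted
p_thermal = np.exp(-barrier / kT)            # reverse jump, thermal only

state, time_in_b = 0, 0
for _ in range(10_000):
    if state == 0 and rng.random() < p_driven:
        state = 1          # absorbed work spent crossing; then dissipated
    elif state == 1 and rng.random() < p_thermal:
        state = 0          # rare reverse fluctuation
    time_in_b += state

# The system spends nearly all its time in the drive-aligned configuration.
print(time_in_b / 10_000)
```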

Artificial neural networks differ from physical systems, however, in that there is no notion of conservation of energy. The relevant measure is instead the relative entropy (or, alternatively, the fitness function): a measure of similarity between observation and prediction. The self-similarity of a neural network implies that at every unit, down to the most basic neuron, there is a function that computes a similarity between observation and prediction. The analogue of the external energy source in the neural network context is the stream of external observations. Through training, the network is subjected to perturbations that drive it toward minimizing entropy. This propagates down to every neuron, such that the neurons that are aligned with the perturbations are the ones likely to remain aligned. A neuron's activation function is equivalent to the energy barrier: an irreversible operation once the entropy moves to a lower state. With the passage of training, the memory of these less erasable changes accumulates, and the system increasingly adopts a model that is best adapted to the training data. So, starting from a random model, the neural network evolves into a model that is adapted to the stochastic observations (i.e., SGD) under which it is trained.
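The relative-entropy measure mentioned above can be sketched directly (illustrative distributions, not from the text): the KL divergence between the observed distribution and a prediction plays the role of energy, and configurations aligned with the driving data score lower:

```python
import numpy as np

# Sketch of the "relative entropy in place of energy" analogy (illustrative
# distributions, not from the text): the KL divergence between observation
# and prediction is the quantity that training drives downward.
def kl_divergence(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

observed = [0.7, 0.2, 0.1]     # empirical distribution (the external drive)
aligned = [0.6, 0.3, 0.1]      # configuration aligned with the drive
misaligned = [0.1, 0.2, 0.7]   # configuration misaligned with the drive

# The aligned configuration sits at lower "energy" (lower relative entropy).
print(kl_divergence(observed, aligned), kl_divergence(observed, misaligned))
```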

Perhaps the continuity requirements of back propagation, independent of stochasticity, ensure that there are no large local transitions in model changes: only significant cumulative events can achieve a persistent change in structure. Backpropagation ensures that the random changes due to training are not purely random, but rather constrained to changes that preserve continuity. Each individual neuron adjusts its weights in the direction of minimum entropy. Each neuron evolves to its local minimum and is unable to extricate itself unless a sufficient accumulation of signals exists in the training data. As more and more neurons arrive at minima, larger collective cliques of neurons are formed. These cliques become more difficult to break up. Only coordinated signals are able to break up cliques, and the larger these cliques become, the more signals will need to be in synchronization.

Dissipative Adaptation explains trainability; however, it does not explain generalization or expressivity.

Known Uses

Related Patterns


Relationship to Canonical Patterns

  • Similarity yields the measure of alignment between the driving force (i.e. training data) and the system (i.e. model)
  • Irreversibility enforces irreversible transitions between layers.
  • Entropy takes the place of energy in terms of an information model rather than a physical system.
  • Geometry expresses the evolution of the model rather than the evolution of a physical system.
  • Ensembles that align with the training data fluctuations will become more prevalent.
  • Distributed Representation: how does dissipative adaptation lead to a distributed representation?
  • Regularization constrains the information model from reaching certain configurations.
  • Mutual Information: how does the criterion of mutual information influence dissipative adaptation? Does mutual information constrain model cliques?
  • Hierarchical Abstraction leads similarly to an irreversible configuration.
  • Self Similarity is promoted through dissipative adaptation in that structures that best align with training data are most favored.
  • Modularity is present in that the evolutionary mechanism is made available by linking information models in a way to be subject to the same driving force.
  • Anti-causality (need more analysis)
  • Propagator drives the evolution of the high dimensional system.
  • Computational Irreducibility surfaces the question as to whether shortcuts in evolution are possible through this mechanism.
  • Forgetting would imply that the state of the system is able to revert an otherwise irreversible change.

References

Entropy production fluctuation theorem and the nonequilibrium work relation for free energy differences

Gavin Crooks mathematically described microscopic irreversibility. Crooks showed that a small open system driven by an external source of energy could change in an irreversible way, as long as it dissipates its energy as it changes.

In a collection of assembling particles that is allowed to reach thermal equilibrium, the energy of a given microscopic arrangement and the probability of observing the system in that arrangement obey a simple exponential relationship known as the Boltzmann distribution. Once the same thermally fluctuating particles are driven away from equilibrium by forces that do work on the system over time, however, it becomes significantly more challenging to relate the likelihood of a given outcome to familiar thermodynamic quantities. Nonetheless, it has long been appreciated that developing a sound and general understanding of the thermodynamics of such non-equilibrium scenarios could ultimately enable us to control and imitate the marvellous successes that living things achieve in driven self-assembly. I propose that they imply a general thermodynamic mechanism for self-organization via dissipation of absorbed work that may be applicable in a broad class of driven many-body systems.
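The exponential relationship referred to here is easy to state concretely (the energies and temperature below are illustrative, with units chosen so that kT = 1):

```python
import numpy as np

# The Boltzmann distribution referred to above: at thermal equilibrium the
# probability of a microscopic arrangement with energy E is proportional to
# exp(-E / kT). Energies and temperature here are illustrative (kT = 1).
energies = np.array([0.0, 1.0, 2.0])
kT = 1.0

weights = np.exp(-energies / kT)
probs = weights / weights.sum()   # normalize by the partition function Z

# Each unit of extra energy makes an arrangement a factor of e less likely.
print(probs)
```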

Irreversibility and heat dissipation go hand in hand.

Statistical Physics of Adaptation

1977 Nobel Prize in Chemistry (dissipative structures)

Quite generally it is possible in principle to distinguish between two types of structures: equilibrium structures, which can exist as isolated systems (for example crystals), and dissipative structures, which can only exist in symbiosis with their surroundings. Dissipative structures display two types of behaviour: close to equilibrium their order tends to be destroyed but far from equilibrium order can be maintained and new structures be formed.

The probability for order to arise from disorder is infinitesimal according to the laws of chance. The formation of ordered, dissipative systems demonstrates, however, that it is possible to create order from disorder. The description of these structures has led to many fundamental discoveries and applications in diverse fields of human endeavour, not only in chemistry. In the last few years applications in biology have been dominating, but the theory of dissipative structures has also been used to describe phenomena in social systems.

Self-Assembled Circuit Patterns

Theory of Connectivity: Nature and Nurture of Cell Assemblies and Cognitive Computation

Information theory, predictability, and the emergence of complex life