This is an old revision of the document!


Mutual Information


This identifies the pattern and should be representative of the concept that it describes. The name should be a noun that should be easily usable within a sentence. We would like the pattern to be easily referenceable in conversation between practitioners.


Describes in a single concise sentence the meaning of the pattern.


This section describes the reason why this pattern is needed in practice. Other pattern languages indicate this as the Problem. In our pattern language, we express this in a question or several questions and then we provide further explanation behind the question.


This section provides alternative descriptions of the pattern in the form of an illustration or alternative formal expression. By looking at the sketch a reader may quickly understand the essence of the pattern.


One key distinction with neural networks and classical statistical methods is that the requirement of independent and identically distributed random variables (i.i.d.) are not a strict requirement. In fact, neural networks try to entangle features as information flows up the layers of a network. A consequence of the merge operator is that entanglement is encouraged to create higher abstractions. This entanglement alludes to the importance of measuring the mutual information between variables.

Other machine learning algorithms are designed to disentangle features into an orthogonal basis. A sparse orthogonal basis would be nice however it is not always achievable.

Known Uses

Here we review several projects or papers that have used this pattern.

Related Patterns In this section we describe in a diagram how this pattern is conceptually related to other patterns. The relationships may be as precise or may be fuzzy, so we provide further explanation into the nature of the relationship. We also describe other patterns may not be conceptually related but work well in combination with this pattern.

Relationship to Canonical Patterns

Relationship to other Patterns

Further Reading

We provide here some additional external material that will help in exploring this pattern in more detail.


Estimating mutual information in high dimensions via classification error

InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets

InfoGAN is a generative adversarial network that also maximizes the mutual information between a small subset of the latent variables and the observation. We derive a lower bound to the mutual information objective that can be optimized efficiently, and show that our training procedure can be interpreted as a variation of the Wake-Sleep algorithm. Information Theoretic-Learning Auto-Encoder

Information-theoretic learning (ITL) is a field at the intersection of machine learning and information theory which encompasses a family of algorithms that compute and optimize information theoretic descriptors such as entropy, divergence, and mutual information. ITL objectives are computed directly from samples (non-parametrically) using Parzen windowing and Renyi’s entropy. Critical Behavior from Deep Dynamics: A Hidden Dimension in Natural Language

How can we know when machines are bad or good? The old answer is to compute the loss function. The new answer is to also compute the mutual information as a function of separation, which can immediately show how well the model is doing at capturing correlations on different scales. A NOTE ON THE EVALUATION OF GENERATIVE MODELS

An evaluation based on samples is biased towards models which overfit and therefore a poor indicator of a good density model in a log-likelihood sense, which favors models with large entropy. Conversely, a high likelihood does not guarantee visually pleasing samples.

We therefore argue Parzen window estimates should be avoided for evaluating generative models, unless the application specifically requires such a loss function. In this case, we have shown that a k-means based model can perform better than the true density. To summarize, our results demonstrate that for generative models there is no one-fits-all loss function but a proper assessment of model performance is only possible in the the context of an application. Mutual Information and Diverse Decoding Improve Neural Machine Translation

We introduce an alternative objective function for neural MT that maximizes the mutual information between the source and target sentences, modeling the bi-directional dependency of sources and targets. Neural Word Embedding as Implicit Matrix Factorization

We analyze skip-gram with negative-sampling (SGNS), a word embedding method introduced by Mikolov et al., and show that it is implicitly factorizing a word-context matrix, whose cells are the pointwise mutual information (PMI) of the respective word and context pairs, shifted by a global constant. Analysis of k-Nearest Neighbor Distances with Application to Entropy Estimation The Mutual Information in Random Linear Estimation

A particularly influential approach to this question has been through the use of the heuristic replica method of statistical physics [8], which allows to compute non rigorously the mutual information (MI) and the associated theoretically achievable minimal-mean-square error (MMSE). The replica method typically predicts the optimal performance through the solution of non-linear equations, which interestingly coincide in many cases with the predictions for the performance of a message-passing belief-propagation type algorithm. In this context the algorithm is usually called approximate message-passing. Information-theoretical label embeddings for large-scale image classification

We present a method for training multi-label, massively multi-class image classification models, that is faster and more accurate than supervision via a sigmoid cross-entropy loss (logistic regression). Our method consists in embedding high-dimensional sparse labels onto a lower-dimensional dense sphere of unit-normed vectors, and treating the classification problem as a cosine proximity regression problem on this sphere. We test our method on a dataset of 300 million high-resolution images with 17,000 labels, where it yields considerably faster convergence, as well as a 7% higher mean average precision compared to logistic regression. GANs, mutual information, and possibly algorithm selection?

To summarize, GAN can be viewed as an augmented generative model which is trained by minimizing mutual information. Task Specific Adversarial Cost Function

We propose an alternative adversarial cost function which allows easy tuning of the model for either task. Our task specific cost function is evaluated on a dataset of hand-written characters in the following tasks: Generation, retrieval and one-shot learning.

For a data distribution P (x) = ∑ i δ(x− xi) for i ∈ {1, 2, 3} a model, Q(x) with finite capacity may be fit by minimising either A) KL[Q‖P ] which captures one region of high density well, but ignores others, or B) KL[P‖Q] which captures all regions of high density while also assigning non-zero values to regions of low density. The IM Algorithm : A variational approach to Information Maximization STOCHASTIC NEURAL NETWORKS FOR HIERARCHICAL REINFORCEMENT LEARNING

INFORMATION-THEORETIC REGULARIZATION Although SNNs have sufficient expressiveness in representing multi-modal policies, there is nothing in the optimization that prevent them from converging onto a single mode. Although we have not observed this worst case scenario in our experiments, we do find that sometimes different (discrete) latent variables correspond to the same or very similar skills. It is desirable to have direct control over the diversity of skills that will be learned. To achieve this, we introduce an information theoretic regularizer, inspired by recent success of similar objectives in encouraging interpretable representation learning in GANs (Chen et al., 2016).

We design this objective specifically for mobile robots, although in principle it can be also generalized to other contexts. Concretely, we add an additional reward bonus, proportional to the mutual information between the latent variable and the state it is currently in. We only measure the mutual information with respect to a relevant subset in the state. For a mobile robot, we choose this to be the coordinates of its center of mass. We further partition the coordinate space into discrete grids, and map the continuous-valued coordinate into the closest grid. This way, calculation of empirical mutual information only requires maintaining counts of the latent variables in each grid. Mathematically, let X be a random variable denoting the grid in which the agent is currently situated, and let Z be a random variable for the latent variable. Our additional reward bonus is αHI(Z; X) = αH(H(Z) − H(Z|X)), where H denotes the entropy function. In our case, H(Z) is constant since the distribution of the latent variable is fixed, and hence maximizing mutual information is equivalent to minimizing the conditional entropy. Therefore, another interpretation of this bonus is that given where the robot is, it should be easy to infer what skill the robot is currently performing. UNDERSTANDING INTERMEDIATE LAYERS USING LINEAR CLASSIFIER PROBES

We define a new notion of information that depends on our ability to classify features of a given layer with an optimal linear classifier. Then we have a conceptual tool to ask new questions and to get potentially interesting answers.

The conceptual framework that we propose is one where the intuitive notion of information is equivalent with immediate suitability for a linear classifier (instead of being related to entropy). Information Dropout: learning optimal representations through noise

We introduce Information Dropout, a generalization of dropout that is motivated by the Information Bottleneck principle and highlights the way in which injecting noise in the activations can help in learning optimal representations of the data. Information Dropout is rooted in information theoretic principles, it includes as special cases several existing dropout methods, like Gaussian Dropout and Variational Dropout, and, unlike classical dropout, it can learn and build representations that are invariant to nuisances of the data, like occlusions and clutter. When the task is the reconstruction of the input, we show that the information dropout method yields a variational autoencoder as a special case, thus providing a link between representation learning, information theory and variational inference. Our experiments validate the theoretical intuitions behind our method, and we find that information dropout achieves a comparable or better generalization performance than binary dropout, especially on smaller models, since it can automatically adapt the noise to the structure of the network, as well as to the test sample.

The main contribution of our work is to establish how two seemingly different areas of research, namely the study of noise and dropout methods to prevent overfitting, and the study of optimal representations, can be linked through the Information Bottleneck principle. Mutual information in random Boolean models of regulatory networks

A key insight from our study is that the maximization of average pairwise mutual information is achieved in RBNs by allowing long chains of effectively single-input nodes to emerge from the background of frozen nodes and nodes with multiple unfrozen inputs.

Maximization of pairwise mutual information may be a sensible proxy for maximization of fitness within an ensemble of evolutionarily accessible networks: we suggest that systems based on high-I networks can orchestrate complex, timed behaviors, possibly allowing robust performance of a wide spectrum of tasks. The Information Dynamics of Phase Transitions in Random Boolean Networks

We use a recently published framework to characterize the distributed computation in terms of its underlying information dynamics: information storage, information transfer and information modification. We find maximizations in information storage and coherent information transfer on either side of the critical point, allowing us to explain the phase transition in RBNs in terms of the intrinsic distributed computations they are undertaking. Chain Rule for Conditional Mutual Information Discovering objects and their relations from entangled scene representations

In this work, we introduce relation networks (RNs) - a general purpose neural network architecture for object-relation reasoning. We show that RNs are capable of learning object relations from scene description data. Furthermore, we show that RNs can act as a bottleneck that induces the factorization of objects from entangled scene description inputs, and from distributed deep representations of scene images provided by a variational autoencoder. The model can also be used in conjunction with differentiable memory mechanisms for implicit relation discovery in one-shot learning tasks. Our results suggest that relation networks are a potentially powerful architecture for solving a variety of problems that require object relation reasoning. Trimming the Independent Fat: Sufficient Statistics, Mutual Information, and Predictability from Effective Channel States

As an important corollary, we consider the case where one variable is a stochastic process’ past and the other its future and the present is viewed as a memoryful channel. In this case, the mutual information is the channel transmission rate between the channel’s effective states. That is, the past-future mutual information (the excess entropy) is the amount of information about the future that can be predicted using the past. Translating our result about minimal sufficient statistics, this is equivalent to the mutual information between the forward- and reversetime causal states of computational mechanics. Multivariate Dependence Beyond Shannon Information

Accurately determining dependency structure is critical to discovering a system’s causal organization. We recently showed that the transfer entropy fails in a key aspect of this—measuring information flow—due to its conflation of dyadic and polyadic relationships. We extend this observation to demonstrate that this is true of all such Shannon information measures when used to analyze multivariate dependencies. This has broad implications, particularly when employing information to express the organization and mechanisms embedded in complex systems, including the burgeoning efforts to combine complex network theory with information theory. Here, we do not suggest that any aspect of information theory is wrong. Rather, the vast majority of its informational measures are simply inadequate for determining the meaningful dependency structure within joint probability distributions. Therefore, such information measures are inadequate for discovering intrinsic causal relations. We close by demonstrating that such distributions exist across an arbitrary set of variables. Deep Coevolutionary Network: Embedding User and Item Features for Recommendation

As users interact with di‚erent items over time, user and item features can inƒuence each other, evolve and co-evolve over time. Œe compatibility of user and item’s feature further inƒuence the future interaction between users and items.

To address these limitations, we propose a novel deep coevolutionary network model (DeepCoevolve), for learning user and item features based on their interaction graph. DeepCoevolve use recurrent neural network (RNN) over evolving networks to de€ne the intensity function in point processes, which allows the model to capture complex mutual inƒuence between users and items, and the feature evolution over time. We also develop an ecient procedure for training the model parameters, and show that the learned models lead to signi€cant improvements in recommendation and activity prediction compared to previous state-of-the-arts parametric models Opening the Black Box of Deep Neural Networks via Information

Despite their great success, there is still no comprehensive theoretical understanding of learning with Deep Neural Networks (DNNs) or their inner organization. Previous work [Tishby & Zaslavsky (2015)] proposed to analyze DNNs in the Information Plane; i.e., the plane of the Mutual Information values that each layer preserves on the input and output variables. They suggested that the goal of the network is to optimize the Information Bottleneck (IB) tradeoff between compression and prediction, successively, for each layer. In this work we follow up on this idea and demonstrate the effectiveness of the InformationPlane visualization of DNNs. We first show that the stochastic gradient descent (SGD) epochs have two distinct phases: fast empirical error minimization followed by slow representation compression, for each layer. We then argue that the DNN layers end up very close to the IB theoretical bound, and present a new theoretical argument for the computational benefit of the hidden layers.

The stochasticity of SGD methods is usually motivated as a way of escaping local minima of the training error. In this paper we give it a new, perhaps much more important role: it generates highly efficient internal representations through compression by diffusion.

We also argue that SGD seems an overkill during the diffusion phase, which consumes most of the training epochs, and that much simpler optimization algorithms, such as Monte-Carlo relaxations [German & German (1988)], can be more efficient.

To conclude, it seems fair to say, based on our experiments and analysis, that Deep Learning with DNN are in essence learning algorithms that effectively find efficient representations that are approximate minimal sufficient statistics in the IB sense. On the Limits of Learning Representations with Label-Based Supervision

Will the representations learned from these generative methods ever rival the quality of those from their supervised competitors? In this work, we argue in the affirmative, that from an information theoretic perspective, generative models have greater potential for representation learning. Based on several experimentally validated assumptions, we show that supervised learning is upper bounded in its capacity for representation learning in ways that certain generative models, such as Generative Adversarial Networks (GANs) are not. Hierarchical mutual information for the comparison of hierarchical community structures in complex networks

a generalization of the traditional mutual information, and allows to compare hierarchical partitions and hierarchical community structures. The {\it normalized} version of the hierarchical mutual information should behave analogously to the traditional normalized mutual information. Here, the correct behavior of the hierarchical mutual information is corroborated on an extensive battery of numerical experiments. The experiments are performed on artificial hierarchies, and on the hierarchical community structure of artificial and empirical networks. Furthermore, the experiments illustrate some of the practical applications of the hierarchical mutual information. Namely, the comparison of different community detection methods, and the study of the consistency, robustness and temporal evolution of the hierarchical modular structure of networks. Subregular Complexity and Deep Learning

This paper presents experiments illustrating how formal language theory can shed light on deep learning. We train naive Long Short-Term Memory (LSTM) Recurrent Neural Networks (RNNs) on six formal languages drawn from the Strictly Local (SL) and Strictly Piecewise (SP) classes. These classes are relevant to computational linguistics and among the simplest in a mathematically well-understood hierarchy of subregular classes. SL and SP classes encode local and long-distance dependencies, respectively. The results show four of the six languages were learned remarkably well, but overfitting arguably occurred with the simplest SL language and undergeneralization with the most complex SP pattern. On the Emergence of Invariance and Disentangling in Deep Representations

Using classical notions of statistical decision and information theory, we show that invariance in a deep neural network is equivalent to minimality of the representation it computes, and can be achieved by stacking layers and injecting noise in the computation, under realistic and empirically validated assumptions. We use an Information Decomposition of the empirical loss to show that overfitting can be reduced by limiting the information content stored in the weights. We then present a sharp inequality that relates the information content in the weights – which are a representation of the training set and inferred by generic optimization agnostic of invariance and disentanglement – and the minimality and total correlation of the activation functions, which are a representation of the test datum. This allows us to tackle recent puzzles concerning the generalization properties of deep networks and their relation to the geometry of the optimization residual. A Simple Language Model based on PMI Matrix Approximations

In this study, we introduce a new approach for learning language models by training them to estimate word-context pointwise mutual information (PMI), and then deriving the desired conditional probabilities from PMI at test time. Specifically, we show that with minor modifications to word2vec's algorithm, we get principled language models that are closely related to the well-established Noise Contrastive Estimation (NCE) based language models. A compelling aspect of our approach is that our models are trained with the same simple negative sampling objective function that is commonly used in word2vec to learn word embeddings. General AI Challenge Round One: Gradual Learning Trimming the Independent Fat: Sufficient Statistics, Mutual Information, and Predictability from Effective Channel States Learning Discrete Representations via Information Maximizing Self-Augmented Training

We encourage the predicted representations of augmented data points to be close to those of the original data points in an end-to-end fashion.