This is an old revision of the document!

search?q=canonical&btnI=lucky Edit:


Aliases Mixture of Experts


A more accurate network can be created using a combination of existing individually trained networks.


How can we achieve higher predictive accuracy by combining independently trained networks?


<Diagram that shows multiple subnetworks combined into a single network with an ensemble function>


One of the inescapable notions in deep learning is that we should always treat a networks behavior as emerging from collective behavior. Behavior here as a fractal property in it because we really care about collective behavior that is influenced by collective behavior. This intermixing of behavior leads to complexity that surprisingly leads towards more accurate predictability. How does a collective of equally ignorant neurons combine to become a collective with less ignorance? Why do mixtures of experts lead to better generalization than each individual expert?

It has been consistently observed that in machine learning practice that the combination of multiple classification algorithms can lead to improved prediction accuracy. The reasons for why this method works does not yet have a satisfying explanation. However, the greater the divergence between the different predicting network, the better the results. There ways that we can vary a predicting neural network. The first way is to employ a different components, specifically different layers and number of neurons. Another way is to train a network with different feature vectors.

There are several strategies on how to combine the results of multiple sub-networks.

The Ensemble Method in other machine learning methods and is not exclusive to Deep Learning.

Known Uses

The most well known use of the Ensemble Method is in IBM Watson DeepQA Jeopardy system. Wherein Watson was trained using an ensemble of expert systems. The system originally had difficulty winning against average Jeopardy players, however by training itself in games, it eventually was able to learn the correct biases to apply for its many expert systems.

Using an ensemble of matching, normalization, and coreference resolution algorithms, Watson identifies equivalent and related hypotheses (for example, Abraham Lincoln and Honest Abe) and then enables custom merging per feature to combine scores.”

For more intelligent ranking, however, ranking and confidence estimation may be separated into two phases. In both phases sets of scores may be grouped according to their domain (for example type matching, passage scoring, and so on.) and intermediate models trained using ground truths and methods specific for that task. Using these intermediate models, the system produces an ensemble of intermediate scores. Motivated by hierarchical techniques such as mixture of experts (Jacobs et al. 1991) and stacked generalization (Wolpert 1992), a metalearner is trained over this ensemble. This approach allows for iteratively enhancing the system with more sophisticated and deeper hierarchical models while retaining flexibility for robustness and experimentation as scorers are modified and added to the system.

Netflix Prize - The BigChaos Solution to the Netflix Grand Prize “The other main driving force in the competition was the ensemble idea. The ensemble idea was part of the competition from the beginning and evolved over time. In the beginning, we used different models with different parametrization and a linear blending. The models were trained individually and the meta parameters got optimized to reduce the RMSE of the individual model. The linear blend was replaced by a nonlinear one, a neural network. This was basically the solution for the progress prize 2008, a ensemble of independently trained and tuned predictors, and a neural network for the blending. In fall 2008, we realized that training and optimizing the predictors individually is not optimal. Best blending results are achieved when the whole ensemble has the right tradeoff between diversity and accuracy. So we started to train the predictors sequentially and stopped the training when the blending improvement was best. Also the meta parameters were tuned, to achieve best blending performance. The next step in the evolution of the ensemble idea was to replace the single neural network blend by an ensemble of blends. In order to maximize diversity within the blending ensemble, the blends used different subsets of predictors and different blending methods.”

Related Patterns

Relationship to Canonical Patterns

  • Distributed Representation may also be a consequence of implicit ensembles created via the drop out method.
  • Self Similarity is a reflection that neural network are classifiers built using classifiers.
  • Modularity better generalization can be achieved by aggregating the predictions of a collection of ensembles.

Geoffrey Hinton's DropOut method is a consequence of the Ensemble Method. There is an alternative argument that DropOut can be explained as a Bayesian approximation (see: ).

Cited in these patterns:

References - •Bagging works because some underlying learning algorithms are unstable: slightly different inputs leads to very different outputs. If you can take advantage of this instability by running multiple instances, it can be shown that the reduced instability leads to lower error. If you want to understand why, the original bagging paper(…) has a section called “why bagging works” •Boosting works because of the focus on better defining the “decision edge”. By reweighting examples near the margin (the positive and negative examples) you get a reduced error A survey of multiple classifier systems as hybrid systems An analysis of diversity measures Dynamic selection of classifiers-A comprehensive review 7.11 Bagging and Other Ensemble Methods Bagging (short for bootstrap aggregating) is a technique for reducing generalization error by combining several models (Breiman, 1994). The idea is to train several different models separately, then have all of the models vote on the output for test examples. This is an example of a general strategy in machine learning called model averaging. Techniques employing this strategy are known as ensemble methods.

Jacobs, R.; Jordan, M. I.; Nowlan. S. J.; and Hinton, G. E. 1991. Adaptive Mixtures of Local Experts. Neural Computation 3(1): 79-–87.

Wolpert, D. H. 1992. Stacked Generalization. Neural Networks 5(2): 241–259. Unreasonable Effectiveness of Learning Neural Nets: Accessible States and Robust Ensembles

Define a cost-function given by a sum of a finite number of replicas of the original cost-function, with a constraint centering the replicas around a driving assignment. To illustrate this, we derive several powerful new algorithms, ranging from Markov Chains to message passing to gradient descent processes, where the algorithms target the robust dense states, resulting in substantial improvements in performance. Exclusivity Regularized Machine

A measurement of diversity, termed as exclusivity. With the designed exclusivity, we further propose an ensemble model, namely Exclusivity Regularized Machine (ERM), to jointly suppress the training error of ensemble and enhance the diversity between bases.

Although the diversity has no formal definition so far, the thing in common among studied measurements is that the diversity enforced in a pairwise form between members strikes a good balance between complexity and effectiveness. The evidence includes Q-statistics measure, correlation coefficient measure, disagreement measure, double-fault measure, k-statistic measure and mutual angular measure. Horizontal and Vertical Ensemble with Deep Representation for Classification Ensemble Robustness of Deep Learning Algorithms A stochastic learning algorithm can generalize well as long as its sensitiveness to adversarial perturbation is bounded in average, or equivalently, the performance variance of the algorithm is small. A Probabilistic Theory of Deep Learning Mathematical proof of the effectiveness of Ensembles Predictive Coarse-Graining

Product of experts (PoE) is a machine learning technique. It models a probability distribution by combining the output from several simpler distributions. It was proposed by Geoff Hinton, along with an algorithm for training the parameters of such a system.

The core idea is to combine several probability distributions (“experts”) by multiplying their density functions—making the PoE classification similar to an “and” operation. This allows each expert to make decisions on the basis of a few dimensions without having to cover the full dimensionality of a problem.

This is related to (but quite different from) a mixture model, where several probability distributions are combined via an “or” operation, which is a weighted sum of their density functions. Ensemble Robustness of Deep Learning Algorithms A Framework for the Cooperation of Learning Algorithms Unreasonable Effectiveness of Learning Neural Networks: From Accessible States and Robust Ensembles to Basic Algorithmic Schemes

The intuitive interpretation of the method is also quite straightforward: a set of coupled systems is less likely to get trapped in narrow minima, and will instead be attracted to wide regions of good (and mostly equivalent) configurations, thus naturally implementing a kind of robustness to details of the configurations. OUTRAGEOUSLY LARGE NEURAL NETWORKS: THE SPARSELY-GATED MIXTURE-OF-EXPERTS LAYER

The capacity of a neural network to absorb information is limited by its number of parameters. In this work, we present a new kind of layer, the Sparsely-Gated Mixture-of-Experts (MoE), which can be used to effectively increase model capacity with only a modest increase in computation. This layer consists of up to tens of thousands of feed-forward sub-networks (experts) containing a total of up to tens of billions of parameters.

We find that increasing the model capacity well beyond the conventional approaches can lead to significant improvements on learning quality while incurring comparable or faster training time to the baseline models. SNAPSHOT ENSEMBLES: TRAIN 1, GET M FOR FREE

Ensembles of neural networks are known to give far more robust and accurate predictions compared to any of its individual networks. However, training multiple deep networks for model averaging is computationally expensive. In this paper, we propose a method to obtain the seemingly contradictory goal to obtain ensembles of multiple neural network at no additional training cost. We achieve this goal by letting a single neural network converge into several local minima along its optimization path and save the model parameters. To obtain repeated rapid convergence we leverage recent work on cyclic learning rate schedules. The resulting technique, which we refer to as Snapshot Ensembling, is surprisingly simple, yet effective. We show in a series of experiments that our approach is compatible with diverse network architectures and learning tasks. It consistently yields significantly lower error rates than state-of-the-art single models at no additional training cost, and almost matches the results of (far more expensive) independently trained network ensembles. On Cifar-10 and Cifar-100 our DenseNet Snapshot Ensembles obtain error rates of 3.4% and 17.4% respectively. Tensorial Mixture Models

However, when the priors tensor is decomposed it gives rise to an arithmetic circuit which in turn transforms the TMM into a Convolutional Arithmetic Circuit (ConvAC). A ConvAC corresponds to a shallow (single hidden layer) network when the priors tensor is decomposed by a CP (sum of rank-1) approach and corresponds to a deep network when the decomposition follows the Hierarchical Tucker (HT) model. DYNAMIC PARTITION MODELS

We present a new approach for learning compact and intuitive distributed representations with binary encoding. Rather than summing up expert votes as in products of experts, we employ for each variable the opinion of the most reliable expert. Data points are hence explained through a partitioning of the variables into expert supports. The partitions are dynamically adapted based on which experts are active. During the learning phase we adopt a smoothed version of this model that uses separate mixtures for each data dimension. In our experiments we achieve accurate reconstructions of high-dimensional data points with at most a dozen experts. Mimicking Ensemble Learning with Deep Branched Networks

This paper proposes a branched residual network for image classification. It is known that high-level features of deep neural network are more representative than lower-level features. By sharing the low-level features, the network can allocate more memory to high-level features. The upper layers of our proposed network are branched, so that it mimics the ensemble learning. By mimicking ensemble learning with single network, we have achieved better performance on ImageNet classification task. The Relative Performance of Ensemble Methods with Deep Convolutional Neural Networks for Image Classification

In this work, we investigated multiple widely used ensemble methods, including unweighted averaging, majority voting, the Bayes Optimal Classifier, and the (discrete) Super Learner, for image recognition tasks, with deep neural networks as candidate algorithms. ME R-CNN: Multi-Expert Region-based CNN for Object Detection

The proposed approach, referred to as multi-expert Region-based CNN (ME R-CNN), consists of three experts each responsible for objects with particular shapes: horizontally elongated, square-like, and vertically elongated. Each expert is a network with multiple fully connected layers and all the experts are preceded by a shared network which consists of multiple convolutional layers. Learning Two-Branch Neural Networks for Image-Text Matching Tasks

This paper investigates two-branch neural networks for image-text matching tasks. We propose two different network structures that produce different output representations. The first one, referred to as an embedding network, learns an explicit shared latent embedding space with a maximum-margin ranking loss and novel neighborhood constraints. The second one, referred to as a similarity network, fuses the two branches via element-wise product and is trained with regression loss to directly predict a similarity score. Extensive experiments show that our two-branch networks achieve high accuracies for phrase localization on the Flickr30K Entities dataset and for bi-directional image-sentence retrieval on Flickr30K and MSCOCO datasets. Snapshot Ensembles: Train 1, get M for free

Ensembles of neural networks are known to be much more robust and accurate than individual networks. However, training multiple deep networks for model averaging is computationally expensive. In this paper, we propose a method to obtain the seemingly contradictory goal of ensembling multiple neural networks at no additional training cost. We achieve this goal by training a single neural network, converging to several local minima along its optimization path and saving the model parameters. To obtain repeated rapid convergence, we leverage recent work on cyclic learning rate schedules. The resulting technique, which we refer to as Snapshot Ensembling, is simple, yet surprisingly effective. Geoffrey Hinton: “A Computational Principle that Explains Sex, the Brain, and Sparse Coding” PolyNet: A Pursuit of Structural Diversity in Very Deep Networks Coupled Ensembles of Neural Networks

We propose a reconfiguration of the model parameters into several parallel branches at the global network level, with each branch being a standalone CNN. We show that this arrangement is an efficient way to significantly reduce the number of parameters without losing performance or to significantly improve the performance with the same level of performance. The use of branches brings an additional form of regularization. In addition to the split into parallel branches, we propose a tighter coupling of these branches by placing the “fuse (averaging) layer” before the Log-Likelihood and SoftMax layers during training. A Scalable Bootstrap for Massive Data COUPLED ENSEMBLES OF NEURAL NETWORKS DISTRIBUTIONAL POLICY GRADIENTS BREAKING THE SOFTMAX BOTTLENECK: A HIGH-RANK RNN LANGUAGE MODEL

We formulate language modeling as a matrix factorization problem, and show that the expressiveness of Softmax-based models (including the majority of neural language models) is limited by a Softmax bottleneck. Given that natural language is highly context-dependent, this further implies that in practice Softmax with distributed word embeddings does not have enough capacity to model natural language.

Specifically, we introduce discrete latent variables into a recurrent language model, and formulate the next-token probability distribution as a Mixture of Softmaxes (MoS). Mixture of Softmaxes is more expressive than Softmax and other surrogates considered in prior work. Moreover, we show that MoS learns matrices that have much larger normalized singular values and thus much higher rank than Softmax and other baselines on real-world datasets.