# Cooperative Learning

**Aliases** Joint Training

**Intent**

Jointly train a deep network with a conventional machine learning algorithm.

**Motivation**

How can we leverage a deep network's generalization capabilities to augment a machine learning algorithm?

**Sketch**

<Diagram>

**Discussion**

We would like to combine the generalization capabilities of a deep neural network with a conventional machine learning algorithm. This works when the machine learning algorithm is itself trained via gradient descent: we can then treat it as a cooperative branch alongside the deep learning network. We set up the fitness function to involve terms from the deep neural network as well as terms from the machine learning algorithm, so that both branches are updated from the same training signal. For example, if we pair the neural network with a linear method, the joint prediction that the fitness function (for instance, logistic loss) is evaluated on is:

$$ \sigma(w^T_{lin}[x, \phi(x)] + w^T_{deep} a^{(l_f)} + b) $$

where $w_{lin}$ and $w_{deep}$ are the weights of the linear and the deep learning models respectively, $a^{(l_f)}$ is the activation of the deep network's final hidden layer, $b$ is a bias term, and $\sigma$ is the sigmoid function. $\phi(x)$ refers to some combination of the inputs. In Google's Wide and Deep Learning method, $\phi(x)$ is the set of cross-product transformations of the features $x$.
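The joint prediction above can be sketched in a few lines of plain Python. This is a minimal illustration, not the Wide & Deep implementation: the class name `WideAndDeep`, the single ReLU hidden layer, the pairwise-product choice for $\phi(x)$, plain SGD for both branches, and the XOR toy data are all assumptions made for the example.

```python
import math
import random

def cross_features(x):
    """phi(x): pairwise products x_i * x_j for i < j (one simple cross-product choice)."""
    return [x[i] * x[j] for i in range(len(x)) for j in range(i + 1, len(x))]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class WideAndDeep:
    """Hypothetical toy model: a linear (wide) branch and a one-hidden-layer deep branch."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        n_wide = n_in + n_in * (n_in - 1) // 2          # size of [x, phi(x)]
        self.w_lin = [0.0] * n_wide
        self.W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w_deep = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b = 0.0

    def forward(self, x):
        wide_in = list(x) + cross_features(x)           # [x, phi(x)]
        pre = [sum(w * xi for w, xi in zip(row, x)) + b
               for row, b in zip(self.W1, self.b1)]
        a = [max(0.0, p) for p in pre]                  # final hidden activations a^(l_f)
        z = (sum(w * v for w, v in zip(self.w_lin, wide_in))
             + sum(w * v for w, v in zip(self.w_deep, a))
             + self.b)
        return sigmoid(z), wide_in, pre, a

    def train_step(self, x, y, lr=0.3):
        """One SGD step on the log loss; both branches share the same error signal."""
        p, wide_in, pre, a = self.forward(x)
        dz = p - y                                      # d(log loss) / d(logit)
        # Backpropagate into the hidden layer first, using w_deep before it changes.
        for j in range(len(a)):
            if pre[j] > 0:                              # ReLU passes gradient only when active
                da = dz * self.w_deep[j]
                for i in range(len(x)):
                    self.W1[j][i] -= lr * da * x[i]
                self.b1[j] -= lr * da
        for k in range(len(wide_in)):                   # wide (linear) branch
            self.w_lin[k] -= lr * dz * wide_in[k]
        for j in range(len(a)):                         # deep branch output weights
            self.w_deep[j] -= lr * dz * a[j]
        self.b -= lr * dz
        return -(y * math.log(p + 1e-12) + (1 - y) * math.log(1 - p + 1e-12))

# Toy data (an assumption for illustration): XOR, which needs the x1*x2 interaction.
data = [([0.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 0.0], 1), ([1.0, 1.0], 0)]
model = WideAndDeep(n_in=2, n_hidden=4)
losses = [sum(model.train_step(x, y) for x, y in data) / len(data) for _ in range(3000)]
predictions = [round(model.forward(x)[0]) for x, _ in data]
```

The point of the sketch is that `dz` is computed once from the combined logit and then drives the updates of `w_lin`, `w_deep`, and the hidden layer together, which is what distinguishes joint training from training the two models separately and averaging them.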

**Known Uses**

Wide and Deep Learning

**Related Patterns**

<Diagram>

**References**

http://arxiv.org/abs/1606.07792 Wide & Deep Learning for Recommender Systems

Generalized linear models with nonlinear feature transformations are widely used for large-scale regression and classification problems with sparse inputs. Memorization of feature interactions through a wide set of cross-product feature transformations are effective and interpretable, while generalization requires more feature engineering effort. With less feature engineering, deep neural networks can generalize better to unseen feature combinations through low-dimensional dense embeddings learned for the sparse features. However, deep neural networks with embeddings can over-generalize and recommend less relevant items when the user-item interactions are sparse and high-rank. In this paper, we present Wide & Deep learning—jointly trained wide linear models and deep neural networks—to combine the benefits of memorization and generalization for recommender systems.

http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/37013.pdf Follow-the-Regularized-Leader and Mirror Descent: Equivalence Theorems and L1 Regularization

http://arxiv.org/abs/1607.02397v1 Enlightening Deep Neural Networks with Knowledge of Confounding Factors

We incorporate information on prominent auxiliary explanatory factors of the data population into existing architectures as secondary objective/loss blocks that take inputs from hidden layers during training.

http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41473.pdf DeViSE: A Deep Visual-Semantic Embedding Model

One remedy is to leverage data from other sources – such as text data – both to train visual models and to constrain their predictions. In this paper we present a new deep visual-semantic embedding model trained to identify visual objects using both labeled image data as well as semantic information gleaned from unannotated text.

http://papers.nips.cc/paper/308-a-framework-for-the-cooperation-of-learning-algorithms.pdf A Framework for the Cooperation of Learning Algorithms

We introduce a framework for training architectures composed of several modules. This framework, which uses a statistical formulation of learning systems, provides a unique formalism for describing many classical connectionist algorithms as well as complex systems where several algorithms interact. It allows to design hybrid systems which combine the advantages of connectionist algorithms as well as other learning algorithms.

http://arxiv.org/pdf/1607.00122v1.pdf Less-forgetting Learning in Deep Neural Networks

Our learning method uses the trained weights of the source network as the initial weights of the target network and minimizes two loss functions simultaneously.

http://arxiv.org/pdf/1509.03185v1.pdf Use it or Lose it: Selective Memory and Forgetting in a Perpetual Learning Machine

For each iteration of PSGD, a random class is chosen and from this input the recall DNN is used to synthesise the respective training image (from memory). This recalled training image is then used with the random class to train both networks for a single step of backprop SGD.

https://arxiv.org/abs/1604.01252 Comparative Deep Learning of Hybrid Representations for Image Recommendations

We design a dual-net deep network, in which the two sub-networks map input images and preferences of users into a same latent semantic space, and then the distances between images and users in the latent space are calculated to make decisions. We further propose a comparative deep learning (CDL) method to train the deep network, using a pair of images compared against one user to learn the pattern of their relative distances. The CDL embraces much more training data than naive deep learning, and thus achieves superior performance than the latter, with no cost of increasing network complexity.

https://arxiv.org/abs/1605.06676 Learning to Communicate with Deep Multi-Agent Reinforcement Learning

We consider the problem of multiple agents sensing and acting in environments with the goal of maximising their shared utility. In these environments, agents must learn communication protocols in order to share information that is needed to solve the tasks. By embracing deep neural networks, we are able to demonstrate end-to-end learning of protocols in complex environments inspired by communication riddles and multi-agent computer vision problems with partial observability. We propose two approaches for learning in these domains: Reinforced Inter-Agent Learning (RIAL) and Differentiable Inter-Agent Learning (DIAL). The former uses deep Q-learning, while the latter exploits the fact that, during learning, agents can backpropagate error derivatives through (noisy) communication channels. Hence, this approach uses centralised learning but decentralised execution. Our experiments introduce new environments for studying the learning of communication protocols and present a set of engineering innovations that are essential for success in these domains.

https://www.reddit.com/r/MachineLearning/comments/57ec9z/discussion_is_my_understanding_of_double/

In the original double Q-learning algorithm there are two action-value functions, and we update one of these for each sampled transition. More precisely, we update one value function, say Q1, towards the sum of the immediate reward and the value of the next state. To determine the value of the next state, we first find the best action according to Q1, but then we use the second value function, Q2, to determine the value of this action. Similarly, and symmetrically, when we update Q2 we use Q2 to determine the best action in the next state but we use Q1 to estimate the value of this action. The goal is to decorrelate the selection of the best action from the evaluation of this action. You don't need two symmetrically updated value functions to do this. In our follow-up work on Double DQN ( https://arxiv.org/abs/1509.06461 ) we instead used a slow moving copy to evaluate the best action according to the main Q network. This turns out to decorrelate the estimates sufficiently as well.
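The update described in the quote above can be sketched for the tabular case as follows. This is an illustrative sketch only: the function name `double_q_update`, the list-of-lists tables, the coin flip used to choose which table to update, and the default step size and discount are assumptions, and it shows the original two-table variant rather than Double DQN.

```python
import random

def double_q_update(q1, q2, s, a, r, s_next, done,
                    alpha=0.1, gamma=0.99, rng=random):
    """Apply one double Q-learning update for the transition (s, a, r, s_next).

    q1 and q2 are tables indexed as q[state][action]. One table is picked at
    random to be updated; it selects the best next action, while the other
    table supplies the value of that action.
    """
    if rng.random() < 0.5:
        learner, evaluator = q1, q2
    else:
        learner, evaluator = q2, q1

    if done:
        target = r                                      # no bootstrap at episode end
    else:
        actions = range(len(learner[s_next]))
        best = max(actions, key=lambda act: learner[s_next][act])   # selection
        target = r + gamma * evaluator[s_next][best]                # evaluation

    learner[s][a] += alpha * (target - learner[s][a])
```

Because selection and evaluation use different tables, an action that one table overestimates by chance is scored by an estimate that does not share the same error, which is the decorrelation the quote refers to.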

https://arxiv.org/pdf/1610.05182.pdf Learning and Transfer of Modulated Locomotor Controllers

A high-frequency, low-level “spinal” network with access to proprioceptive sensors learns sensorimotor primitives by training on simple tasks. This pre-trained module is fixed and connected to a low-frequency, high-level “cortical” network, with access to all sensors, which drives behavior by modulating the inputs to the spinal network.

Our design encourages the low-level controller to focus on the specifics of reactive motor control, while a high-level controller directs behavior towards the task goal by communicating a modulatory signal.

We believe that the general idea of reusing learned behavioral primitives is important, and the design principles we have followed represent possible steps towards this goal. Our hierarchical design with information hiding has enabled the construction of low-level motor behaviors that are sheltered from task-specific information, enabling their reuse.

https://arxiv.org/abs/1610.10099v1 Neural Machine Translation in Linear Time

The ByteNet is a stack of two dilated convolutional neural networks, one to encode the source sequence and one to decode the target sequence.

https://arxiv.org/pdf/1611.09816v1.pdf Bounding the performance of co-adaptive learning over a countable space.

Co-adaptation is a special form of on-line learning where an algorithm A must assist an unknown algorithm B to perform some task. This is a general framework and has applications in recommendation systems, search, education, and much more. Here we will study the co-adaptive learning problem in the online, closed-loop setting. We will prove that, with high probability, co-adaptive learning is guaranteed to outperform learning with a fixed decoder as long as a particular condition is met.

https://arxiv.org/pdf/1609.09408.pdf Cooperative Training of Descriptor and Generator Networks

This paper studies the cooperative training of two probabilistic models of signals such as images. Both models are parametrized by convolutional neural networks (ConvNets). The first network is a descriptor network, which is an exponential family model or an energy-based model, whose feature statistics or energy function are defined by a bottom-up ConvNet, which maps the observed signal to the feature statistics. The second network is a generator network, which is a non-linear version of factor analysis. It is defined by a top-down ConvNet, which maps the latent factors to the observed signal. The maximum likelihood training algorithms of both the descriptor net and the generator net are in the form of alternating back-propagation, and both algorithms involve Langevin sampling. We observe that the two training algorithms can cooperate with each other by jumpstarting each other’s Langevin sampling, and they can be naturally and seamlessly interwoven into a CoopNets algorithm that can train both nets simultaneously.

https://arxiv.org/pdf/1609.03675v4.pdf Deep Coevolutionary Network: Embedding User and Item Features for Recommendation

As users interact with different items over time, user and item features can influence each other, evolve and co-evolve over time. The compatibility of user and item’s feature further influence the future interaction between users and items.

To address these limitations, we propose a novel deep coevolutionary network model (DeepCoevolve), for learning user and item features based on their interaction graph. DeepCoevolve use recurrent neural network (RNN) over evolving networks to define the intensity function in point processes, which allows the model to capture complex mutual influence between users and items, and the feature evolution over time. We also develop an efficient procedure for training the model parameters, and show that the learned models lead to significant improvements in recommendation and activity prediction compared to previous state-of-the-arts parametric models.

https://my.memo.ai/external/vJ-iBYK4TjS8aAZMj8LP Deep & Cross Network for Ad Click Predictions (Google 2017)

https://arxiv.org/pdf/1804.03782.pdf CoT: Cooperative Training for Generative Modeling

We proposed Cooperative Training, a powerful unbiased, low-variance, computationally efficient algorithm inspired by coordinate descent algorithms like GAN and Expectation Maximization. Models trained via CoT show promising results in many sequential data modeling tasks. In this paper, we propose Cooperative Training (CoT), an efficient, low-variance, bias-free algorithm for training likelihood-based models by directly optimizing a well-estimated Jensen-Shannon divergence. CoT coordinately trains a generative module G, and an auxiliary predictive module M, called mediator, for guiding G in a cooperative fashion.