# Gated Unit

**Name** Multiplicative Integration

**References**

http://arxiv.org/abs/1409.1259 On the Properties of Neural Machine Translation: Encoder-Decoder Approaches

http://arxiv.org/abs/1606.06630v1 On Multiplicative Integration with Recurrent Neural Networks

Multiplicative Integration can be viewed as a general way of combining information flows from two different sources. In particular, [29] proposed the ladder network that achieves promising results on semi-supervised learning. In their model, they combine the lateral connections and the backward connections via the “combinator” function by a Hadamard product.
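As a minimal sketch of the idea above, the following NumPy snippet contrasts the usual additive combination of two information flows with a Hadamard-product combination inside a single recurrent step. The function names, shapes, and the choice of `tanh` are illustrative assumptions, and only the simplest form (no extra scaling or bias vectors) is shown.

```python
import numpy as np

def additive_step(x, h, W, U, b):
    """Vanilla recurrent step: the input flow and the hidden flow are summed."""
    return np.tanh(W @ x + U @ h + b)

def multiplicative_step(x, h, W, U, b):
    """Multiplicative Integration step: the two flows are combined
    with an element-wise (Hadamard) product instead of a sum."""
    return np.tanh((W @ x) * (U @ h) + b)

# Hypothetical sizes and random parameters, for illustration only.
rng = np.random.default_rng(0)
n_in, n_hid = 4, 3
W = rng.normal(size=(n_hid, n_in))
U = rng.normal(size=(n_hid, n_hid))
b = np.zeros(n_hid)
x = rng.normal(size=n_in)
h = rng.normal(size=n_hid)

h_add = additive_step(x, h, W, U, b)
h_mul = multiplicative_step(x, h, W, U, b)
```

In the multiplicative form each flow scales the other, so a zero hidden state silences the input contribution entirely, which is not the case for the additive form.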

http://arxiv.org/pdf/1606.05328v2.pdf Conditional Image Generation with PixelCNN Decoders

In our new architecture, we use two stacks of CNNs to deal with “blind spots” in the receptive field, which limited the original PixelCNN. Additionally, we use a gating mechanism which improves performance and convergence speed.
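The gating mechanism mentioned above is commonly written as a `tanh` feature path multiplied element-wise by a sigmoid gate path. The sketch below illustrates that form only; it replaces the masked convolutions of the actual model with plain matrix products, and all names and shapes are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_activation(x, W_f, W_g):
    """Gated activation: tanh(feature path) * sigmoid(gate path).

    Dense projections stand in for the masked convolutions here
    (a simplifying assumption, not the published architecture).
    """
    feature = np.tanh(x @ W_f)   # candidate features in (-1, 1)
    gate = sigmoid(x @ W_g)      # per-unit gate in (0, 1)
    return feature * gate

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))            # 5 positions, 8 input channels
W_f = rng.normal(size=(8, 16)) * 0.1   # feature weights (hypothetical)
W_g = rng.normal(size=(8, 16)) * 0.1   # gate weights (hypothetical)
y = gated_activation(x, W_f, W_g)
```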

http://arxiv.org/abs/1406.1078v3 Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation

http://arxiv.org/abs/1607.08378v1 Gated Siamese Convolutional Neural Network Architecture for Human Re-Identification

In this paper, we propose a gating function to selectively emphasize such fine common local patterns by comparing the mid-level features across pairs of images.

http://arxiv.org/pdf/1412.3555v1.pdf Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling (Gated Recurrent Unit)

https://arxiv.org/abs/1610.00527 Video Pixel Networks

https://openreview.net/pdf?id=Hyvw0L9el Generating Interpretable Images with Controllable Structure

We proposed a new extension of PixelCNN that can accommodate both unstructured text and spatially-structured constraints for image synthesis. Our proposed model and the recent Generative Adversarial What-Where Networks both can condition on text and keypoints for image synthesis. However, these two approaches have complementary strengths. Given enough data GANs can quickly learn to generate high-resolution and sharp samples, and are fast enough at inference time for use in interactive applications (Zhu et al., 2016). Our model, since it is an extension of the autoregressive PixelCNN, can directly learn via maximum likelihood. It is very simple, fast and robust to train, and provides principled and meaningful progress benchmarks in terms of likelihood. We advanced the idea of conditioning on segmentations to improve both control and interpretability of the image samples. A possible direction for future work is to learn generative models of segmentation masks to guide subsequent image sampling. Finally, our results have demonstrated the ability of our model to perform controlled combinatorial image generation via manipulation of the input text and spatial constraints.

https://arxiv.org/abs/1611.05013v1 PixelVAE: A Latent Variable Model for Natural Images

Natural image modeling is a landmark challenge of unsupervised learning. Variational Autoencoders (VAEs) learn a useful latent representation and model global structure well but have difficulty capturing small details. PixelCNN models details very well, but lacks a latent code and is difficult to scale for capturing large structures. We present PixelVAE, a VAE model with an autoregressive decoder based on PixelCNN. Our model requires very few expensive autoregressive layers compared to PixelCNN and learns latent codes that are more compressed than a standard VAE while still capturing most non-trivial structure. Finally, we extend our model to a hierarchy of latent variables at different scales. Our model achieves state-of-the-art performance on binarized MNIST, competitive performance on 64×64 ImageNet, and high-quality samples on the LSUN bedrooms dataset.

https://arxiv.org/abs/1601.06759v3 Pixel Recurrent Neural Networks

Modeling the distribution of natural images is a landmark problem in unsupervised learning. This task requires an image model that is at once expressive, tractable and scalable. We present a deep neural network that sequentially predicts the pixels in an image along the two spatial dimensions. Our method models the discrete probability of the raw pixel values and encodes the complete set of dependencies in the image. Architectural novelties include fast two-dimensional recurrent layers and an effective use of residual connections in deep recurrent networks. We achieve log-likelihood scores on natural images that are considerably better than the previous state of the art. Our main results also provide benchmarks on the diverse ImageNet dataset. Samples generated from the model appear crisp, varied and globally coherent.

https://arxiv.org/abs/1511.01844v3 A note on the evaluation of generative models

Probabilistic generative models can be used for compression, denoising, inpainting, texture synthesis, semi-supervised learning, unsupervised feature learning, and other tasks. Given this wide range of applications, it is not surprising that a lot of heterogeneity exists in the way these models are formulated, trained, and evaluated. As a consequence, direct comparison between models is often difficult. This article reviews mostly known but often underappreciated properties relating to the evaluation and interpretation of generative models with a focus on image models. In particular, we show that three of the currently most commonly used criteria—average log-likelihood, Parzen window estimates, and visual fidelity of samples—are largely independent of each other when the data is high-dimensional. **Good performance with respect to one criterion therefore need not imply good performance with respect to the other criteria. Our results show that extrapolation from one criterion to another is not warranted and generative models need to be evaluated directly with respect to the application(s) they were intended for.** In addition, we provide examples demonstrating that **Parzen window estimates should generally be avoided.**

https://arxiv.org/abs/1610.10099v1 Neural Machine Translation in Linear Time

We present a neural architecture for sequence processing. The ByteNet is a stack of two dilated convolutional neural networks, one to encode the source sequence and one to decode the target sequence, where the target network unfolds dynamically to generate variable length outputs. The ByteNet has two core properties: it runs in time that is linear in the length of the sequences and it preserves the sequences' temporal resolution. The ByteNet decoder attains state-of-the-art performance on character-level language modelling and outperforms the previous best results obtained with recurrent neural networks. The ByteNet also achieves a performance on raw character-level machine translation that approaches that of the best neural translation models that run in quadratic time. The implicit structure learnt by the ByteNet mirrors the expected alignments between the sequences.

https://openreview.net/pdf?id=BJrFC6ceg PixelCNN++: A PixelCNN Implementation with Discretized Logistic Mixture Likelihood and Other Modifications

PixelCNNs are a recently proposed class of powerful generative models with tractable likelihood. Here we discuss our implementation of PixelCNNs which we make available at https://github.com/openai/pixel-cnn. Our implementation contains a number of modifications to the original model that both simplify its structure and improve its performance. 1) We use a discretized logistic mixture likelihood on the pixels, rather than a 256-way softmax, which we find to speed up training. 2) We condition on whole pixels, rather than R/G/B sub-pixels, simplifying the model structure. 3) We use downsampling to efficiently capture structure at multiple resolutions. 4) We introduce additional short-cut connections to further speed up optimization. 5) We regularize the model using dropout. Finally, we present state-of-the-art log likelihood results on CIFAR-10 to demonstrate the usefulness of these modifications.

PixelCNN represents the current state-of-the-art in generative modeling when evaluated in terms of log-likelihood. Besides being used for modeling images, the PixelCNN model was recently extended to model audio (van den Oord et al., 2016a), video (Kalchbrenner et al., 2016b) and text (Kalchbrenner et al., 2016a).

http://www.dtic.upf.edu/~mblaauw/MdM_NIPS_seminar/

https://arxiv.org/abs/1505.00387v2 Highway Networks

The architecture is characterized by the use of gating units which learn to regulate the flow of information through a network. Highway networks with hundreds of layers can be trained directly using stochastic gradient descent and with a variety of activation functions, opening up the possibility of studying extremely deep and efficient architectures.
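A minimal sketch of one highway layer, assuming the common formulation in which a sigmoid "transform gate" `t` blends a nonlinear transform of the input with the untouched input (the "carry" path). The parameter names and the `tanh` transform are illustrative choices; input and output widths must match so the carry path can be added.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_h, b_h, W_t, b_t):
    """y = H(x) * T(x) + x * (1 - T(x)), with T a learned sigmoid gate."""
    h = np.tanh(x @ W_h + b_h)    # candidate transform H(x)
    t = sigmoid(x @ W_t + b_t)    # transform gate T(x) in (0, 1)
    return h * t + x * (1.0 - t)  # gate regulates transform vs. carry

rng = np.random.default_rng(0)
d = 6
x = rng.normal(size=(4, d))
W_h = rng.normal(size=(d, d)) * 0.1
b_h = np.zeros(d)
W_t = rng.normal(size=(d, d)) * 0.1
b_t = np.full(d, -2.0)  # negative bias: start close to the identity (carry) behaviour

y = highway_layer(x, W_h, b_h, W_t, b_t)
```

With a strongly negative gate bias the layer passes its input through almost unchanged, which is one intuition for why very deep stacks of such layers remain trainable.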

https://arxiv.org/abs/1612.08083v1 Language Modeling with Gated Convolutional Networks

The predominant approach to language modeling to date is based on recurrent neural networks. In this paper we present a convolutional approach to language modeling. We introduce a novel gating mechanism that eases gradient propagation and which performs better than the LSTM-style gating of (Oord et al., 2016) despite being simpler. We achieve a new state of the art on WikiText-103 as well as a new best single-GPU result on the Google Billion Word benchmark. In settings where latency is important, our model achieves an order of magnitude speed-up compared to a recurrent baseline since computation can be parallelized over time. To our knowledge, this is the first time a non-recurrent approach outperforms strong recurrent models on these tasks.
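The gating mechanism described here is usually called a gated linear unit: a linear convolution output multiplied element-wise by a sigmoid gate computed from a second convolution. The sketch below assumes that form and uses a small hand-written causal 1-D convolution; the helper names, kernel layout, and sizes are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def causal_conv1d(x, kernel, bias):
    """x: (T, d_in); kernel: (k, d_in, d_out).
    Output at time t only sees x[t-k+1 .. t] (left zero-padding)."""
    k = kernel.shape[0]
    padded = np.concatenate([np.zeros((k - 1, x.shape[1])), x], axis=0)
    out = np.stack([
        sum(padded[t + j] @ kernel[j] for j in range(k))
        for t in range(x.shape[0])
    ])
    return out + bias

def gated_linear_unit(x, W, b, V, c):
    """(x * W + b) multiplied element-wise by sigmoid(x * V + c).
    The linear path has no squashing nonlinearity."""
    return causal_conv1d(x, W, b) * sigmoid(causal_conv1d(x, V, c))

rng = np.random.default_rng(0)
T, d_in, d_out, k = 6, 4, 5, 3
x = rng.normal(size=(T, d_in))
W = rng.normal(size=(k, d_in, d_out)) * 0.1
V = rng.normal(size=(k, d_in, d_out)) * 0.1
b = np.zeros(d_out)
c = np.zeros(d_out)

y = gated_linear_unit(x, W, b, V, c)
```

Because every time step is computed from a fixed window of inputs rather than from the previous output, all positions can be evaluated in parallel.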

https://arxiv.org/abs/1703.00381v1 The Statistical Recurrent Unit

Sophisticated gated recurrent neural network architectures like LSTMs and GRUs have been shown to be highly effective in a myriad of applications. We develop an un-gated unit, the statistical recurrent unit (SRU), that is able to learn long term dependencies in data by only keeping moving averages of statistics. The SRU's architecture is simple, un-gated, and contains a comparable number of parameters to LSTMs; yet, SRUs perform favorably to more sophisticated LSTM and GRU alternatives, often outperforming one or both in various tasks. We show the efficacy of SRUs as compared to LSTMs and GRUs in an unbiased manner by optimizing respective architectures' hyperparameters in a Bayesian optimization scheme for both synthetic and real-world tasks.
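To make "keeping moving averages of statistics" concrete, here is a heavily simplified sketch: per-step statistics are computed with a ReLU projection and summarized by exponential moving averages at several decay rates. It omits the recurrent feedback and output layers of the actual SRU; the function name, parameter names, and decay values are assumptions.

```python
import numpy as np

def moving_average_summary(xs, W, b, alphas):
    """Summarize a sequence by exponential moving averages of ReLU statistics.

    xs: (T, d_in) sequence; W: (d_in, d_stat); alphas: decay rates in [0, 1).
    Returns an array of shape (len(alphas), d_stat), one average per scale.
    Simplification: statistics depend on the current input only.
    """
    alphas = np.asarray(alphas, dtype=float)[:, None]   # (A, 1) for broadcasting
    mu = np.zeros((alphas.shape[0], W.shape[1]))
    for x in xs:
        phi = np.maximum(x @ W + b, 0.0)                # per-step statistics
        mu = alphas * mu + (1.0 - alphas) * phi         # un-gated moving average
    return mu

rng = np.random.default_rng(0)
xs = rng.normal(size=(10, 4))
W = rng.normal(size=(4, 3))
b = np.zeros(3)
mu = moving_average_summary(xs, W, b, alphas=[0.0, 0.5, 0.9])
```

Small decay rates track recent inputs while rates close to 1 retain information from far back in the sequence, which is how a unit without gates can still carry long-term dependencies.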

https://arxiv.org/abs/1704.00509v1 Truncating Wide Networks using Binary Tree Architectures

https://www.youtube.com/watch?v=ZSDrM-tuOiA Multiplicative Integration

https://arxiv.org/abs/1706.07230 Gated-Attention Architectures for Task-Oriented Language Grounding

To perform tasks specified by natural language instructions, autonomous agents need to extract semantically meaningful representations of language and map it to visual elements and actions in the environment. This problem is called task-oriented language grounding. We propose an end-to-end trainable neural architecture for task-oriented language grounding in 3D environments which assumes no prior linguistic or perceptual knowledge and requires only raw pixels from the environment and the natural language instruction as input. The proposed model combines the image and text representations using a Gated-Attention mechanism and learns a policy to execute the natural language instruction using standard reinforcement and imitation learning methods. We show the effectiveness of the proposed model on unseen instructions as well as unseen maps, both quantitatively and qualitatively. We also introduce a novel environment based on a 3D game engine to simulate the challenges of task-oriented language grounding over a rich set of instructions and environment states.
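A minimal sketch of one way to combine image and text representations by gating, assuming the instruction embedding is projected to one sigmoid value per image feature map and multiplied channel-wise into the convolutional features. The shapes, names, and the single linear projection are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_attention(image_feats, instr_emb, W, b):
    """image_feats: (C, H, W) feature maps; instr_emb: (d_text,).
    Returns feature maps scaled channel-wise by a text-derived gate."""
    gate = sigmoid(instr_emb @ W + b)          # (C,), one gate per feature map
    return image_feats * gate[:, None, None]   # broadcast over spatial dims

rng = np.random.default_rng(0)
C, H, Wd, d_text = 8, 5, 5, 16
image_feats = rng.normal(size=(C, H, Wd))
instr_emb = rng.normal(size=d_text)
W_gate = rng.normal(size=(d_text, C)) * 0.1    # hypothetical projection
b_gate = np.zeros(C)

fused = gated_attention(image_feats, instr_emb, W_gate, b_gate)
```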

https://arxiv.org/abs/1711.02448 Cortical microcircuits as gated-recurrent neural networks

We introduce a recurrent neural network in which information is gated through inhibitory cells that are subtractive (subLSTM). We propose a natural mapping of subLSTMs onto known canonical excitatory-inhibitory cortical microcircuits.
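The following sketch shows what subtractive gating could look like in a single recurrent step: the input and output gates are subtracted from the signal rather than multiplied into it, while the forget gate stays multiplicative. The exact update equations here are an assumption based on the description above, not a verified transcription of the paper, and the parameter dictionary is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sublstm_step(x, h_prev, c_prev, params):
    """One step with subtractive input/output gating (assumed form)."""
    v = np.concatenate([x, h_prev])
    z = sigmoid(params["Wz"] @ v + params["bz"])   # candidate input
    i = sigmoid(params["Wi"] @ v + params["bi"])   # input gate (inhibitory)
    f = sigmoid(params["Wf"] @ v + params["bf"])   # forget gate (multiplicative)
    o = sigmoid(params["Wo"] @ v + params["bo"])   # output gate (inhibitory)
    c = f * c_prev + z - i                         # gate subtracts from the input
    h = sigmoid(c) - o                             # gate subtracts from the output
    return h, c

n_in, n_hid = 4, 3
zero_params = {
    name: np.zeros((n_hid, n_in + n_hid)) for name in ("Wz", "Wi", "Wf", "Wo")
}
zero_params.update({name: np.zeros(n_hid) for name in ("bz", "bi", "bf", "bo")})

h, c = sublstm_step(np.ones(n_in), np.zeros(n_hid), np.zeros(n_hid), zero_params)
```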

https://arxiv.org/abs/1712.01897 Online Learning with Gated Linear Networks

Rather than relying on non-linear transfer functions, our method gains representational power by the use of data conditioning. We state under general conditions a learnable capacity theorem that shows this approach can in principle learn any bounded Borel-measurable function on a compact subset of Euclidean space; the result is stronger than many universality results for connectionist architectures because we provide both the model and the learning procedure for which convergence is guaranteed.

https://arxiv.org/abs/1802.01569v1 Alleviating catastrophic forgetting using context-dependent gating and synaptic stabilization

Here, drawing inspiration from algorithms that are believed to be implemented in vivo, we propose a complementary method: adding a context-dependent gating signal, such that only sparse, mostly non-overlapping patterns of units are active for any one task. This method is easy to implement, requires little computational overhead, and allows ANNs to maintain high performance across large numbers of sequentially presented tasks when combined with weight stabilization.
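A minimal sketch of context-dependent gating, assuming each task is assigned a fixed random binary mask over the hidden units so that only a sparse subset is active for that task. The helper names, the keep fraction, and the single ReLU layer are illustrative assumptions; weight stabilization is not shown.

```python
import numpy as np

def make_task_gates(n_tasks, n_hidden, keep_frac, rng):
    """One fixed binary mask per task; roughly keep_frac of units stay active."""
    return (rng.random((n_tasks, n_hidden)) < keep_frac).astype(float)

def gated_hidden(x, W, b, gates, task_id):
    """ReLU hidden layer whose units are switched on or off by the task's mask."""
    return np.maximum(x @ W + b, 0.0) * gates[task_id]

rng = np.random.default_rng(0)
n_tasks, n_in, n_hidden = 5, 10, 50
gates = make_task_gates(n_tasks, n_hidden, keep_frac=0.2, rng=rng)
W = rng.normal(size=(n_in, n_hidden))
b = np.zeros(n_hidden)
x = rng.normal(size=(3, n_in))

h_task0 = gated_hidden(x, W, b, gates, task_id=0)
h_task1 = gated_hidden(x, W, b, gates, task_id=1)
```

Because the masks are sparse and drawn independently, different tasks mostly use different units, which limits how much learning one task can overwrite another.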

https://github.com/tensorflow/tensor2tensor

http://homepages.inf.ed.ac.uk/tkomura/dog.pdf Mode-Adaptive Neural Networks for Quadruped Motion Control

The system is composed of the motion prediction network and the gating network. At each frame, the motion prediction network computes the character state in the current frame given the state in the previous frame and the user-provided control signals. The gating network dynamically updates the weights of the motion prediction network by selecting and blending what we call the expert weights, each of which specializes in a particular movement. Due to the increased flexibility, the system can learn consistent expert weights across a wide range of non-periodic/periodic actions, from unstructured motion capture data, in an end-to-end fashion. In addition, the users are released from performing complex labeling of phases in different gaits. We show that this architecture is suitable for encoding the multimodality of quadruped locomotion and synthesizing responsive motion in real-time.
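The blending of expert weights by a gating network can be sketched as follows, assuming the gating network outputs softmax coefficients and the prediction network is a single linear layer whose weight matrix is the coefficient-weighted sum of the experts. All names, sizes, and the one-layer simplification are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def blend_experts(expert_weights, coeffs):
    """expert_weights: (K, d_in, d_out); coeffs: (K,). Returns (d_in, d_out)."""
    return np.tensordot(coeffs, expert_weights, axes=1)

def mode_adaptive_step(state, gating_input, W_gate, expert_weights):
    """Gating network picks blending coefficients; the prediction layer
    then runs with the blended weights."""
    coeffs = softmax(gating_input @ W_gate)
    W = blend_experts(expert_weights, coeffs)
    return state @ W, coeffs

rng = np.random.default_rng(0)
K, d_state, d_gate = 4, 6, 3
experts = rng.normal(size=(K, d_state, d_state))
W_gate = rng.normal(size=(d_gate, K))
state = rng.normal(size=d_state)
gating_input = rng.normal(size=d_gate)

next_state, coeffs = mode_adaptive_step(state, gating_input, W_gate, experts)
```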

https://www.microsoft.com/en-us/research/wp-content/uploads/2017/05/r-net.pdf R-NET: Machine Reading Comprehension with Self-Matching Networks