Also: Conditional Computation

https://arxiv.org/abs/1308.3432 Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

Stochastic neurons and hard non-linearities can be useful for a number of reasons in deep learning models, but in many cases they pose a challenging problem: how to estimate the gradient of a loss function with respect to the input of such stochastic or non-smooth neurons, i.e., can we “back-propagate” through these stochastic neurons? We examine this question, existing approaches, and compare four families of solutions, applicable in different settings. One of them is the minimum variance unbiased gradient estimator for stochastic binary neurons (a special case of the REINFORCE algorithm). A second approach, introduced here, decomposes the operation of a binary stochastic neuron into a stochastic binary part and a smooth differentiable part, which approximates the expected effect of the pure stochastic binary neuron to first order. A third approach involves the injection of additive or multiplicative noise in a computational graph that is otherwise differentiable. A fourth approach heuristically copies the gradient with respect to the stochastic output directly as an estimator of the gradient with respect to the sigmoid argument (we call this the straight-through estimator).
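A minimal numpy sketch of the fourth approach (the straight-through estimator): the forward pass samples hard binary units, while the backward pass simply copies the gradient from the hard output to the sigmoid argument. Function names are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def binary_stochastic_forward(logits):
    """Forward pass: sample hard binary units h ~ Bernoulli(sigmoid(logits))."""
    p = 1.0 / (1.0 + np.exp(-logits))
    h = (rng.random(p.shape) < p).astype(np.float64)
    return h, p

def straight_through_backward(grad_h):
    """Straight-through estimator: the gradient w.r.t. the hard binary
    output is copied unchanged as the gradient w.r.t. the sigmoid
    argument, skipping the non-differentiable sampling step."""
    return grad_h

logits = np.array([-2.0, 0.0, 3.0])
h, p = binary_stochastic_forward(logits)
grad_logits = straight_through_backward(np.array([0.1, -0.5, 0.2]))
```

In an autodiff framework the same effect is usually obtained with a stop-gradient trick, e.g. `h = p + stop_gradient(sample - p)`.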

https://openreview.net/pdf?id=S1jE5L5gl The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables

The concrete distribution is a new family of distributions with closed form densities and a simple reparameterization. Whenever a discrete stochastic node of a computation graph can be refactored into a one-hot bit representation that is treated continuously, concrete stochastic nodes can be used with automatic differentiation to produce low-variance biased gradients of objectives (including objectives that depend on the log-probability of latent stochastic nodes) on the corresponding discrete graph. We demonstrate the effectiveness of concrete relaxations on density estimation and structured prediction tasks using neural networks.
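The Concrete (Gumbel-Softmax) reparameterization can be sketched in a few lines of numpy; the function name is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def concrete_sample(logits, temperature):
    """Reparameterized sample from a Concrete (Gumbel-Softmax) distribution:
    add Gumbel noise to the logits and push the result through a tempered
    softmax. As temperature -> 0 the sample approaches a one-hot vector;
    at higher temperatures it stays smooth, giving low-variance but
    biased gradients."""
    gumbel = -np.log(-np.log(rng.random(logits.shape)))
    y = (logits + gumbel) / temperature
    y = y - y.max()  # numerical stabilization of the softmax
    e = np.exp(y)
    return e / e.sum()

sample = concrete_sample(np.array([1.0, 2.0, 3.0]), temperature=0.5)
```

Because the sample is a deterministic, differentiable function of the logits given the noise, gradients flow through it with ordinary backpropagation.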

https://openreview.net/pdf?id=ryMxXPFex Discrete Variational Autoencoders

We present a novel method to train a class of probabilistic models with discrete latent variables using the variational autoencoder framework, including backpropagation through the discrete latent variables. The associated class of probabilistic models comprises an undirected discrete component and a directed hierarchical continuous component. The discrete component captures the distribution over the disconnected smooth manifolds induced by the continuous component.

https://arxiv.org/abs/1612.00563 Self-critical Sequence Training for Image Captioning

Recently it has been shown that policy-gradient methods for reinforcement learning can be utilized to train deep end-to-end systems directly on non-differentiable metrics for the task at hand. In this paper we consider the problem of optimizing image captioning systems using reinforcement learning, and show that by carefully optimizing our systems using the test metrics of the MSCOCO task, significant gains in performance can be realized. Our systems are built using a new optimization approach that we call self-critical sequence training (SCST). SCST is a form of the popular REINFORCE algorithm that, rather than estimating a “baseline” to normalize the rewards and reduce variance, utilizes the output of its own test-time inference algorithm to normalize the rewards it experiences. Using this approach, estimating the reward signal (as actor-critic methods must do) and estimating normalization (as REINFORCE algorithms typically do) is avoided, while at the same time harmonizing the model with respect to its test-time inference procedure.
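The core of SCST is using the reward of the model's own greedy (test-time) decode as the baseline. A toy numpy sketch of the surrogate loss, with illustrative names:

```python
import numpy as np

def scst_loss(sample_logprobs, sample_reward, greedy_reward):
    """Self-critical sequence training surrogate loss (sketch).

    The reward of the greedily decoded output serves as the baseline,
    so sampled sequences that beat test-time inference are reinforced
    and those that fall short are suppressed. The gradient w.r.t. the
    sampled log-probabilities is -(sample_reward - greedy_reward)."""
    advantage = sample_reward - greedy_reward
    return -advantage * np.sum(sample_logprobs)

# A sampled caption scoring 0.8 vs. a greedy baseline of 0.5 gets a
# positive advantage, so minimizing the loss raises its log-probability.
loss = scst_loss(np.array([-0.5, -1.2]), sample_reward=0.8, greedy_reward=0.5)
```

No learned critic is needed: the baseline is just one extra greedy decode per training example.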

http://r2rt.com/binary-stochastic-neurons-in-tensorflow.html

https://arxiv.org/pdf/1710.11573.pdf DEEP LEARNING AS A MIXED CONVEX-COMBINATORIAL OPTIMIZATION PROBLEM

https://arxiv.org/abs/1711.00937 **Neural Discrete Representation Learning**

https://arxiv.org/pdf/1704.00648.pdf Soft-to-Hard Vector Quantization for End-to-End Learning Compressible Representations

https://openreview.net/pdf?id=BJRZzFlRb COMPRESSING WORD EMBEDDINGS VIA DEEP COMPOSITIONAL CODE LEARNING

https://arxiv.org/pdf/1711.03067.pdf Learning K-way D-dimensional Discrete Code For Compact Embedding Representations

https://arxiv.org/pdf/1802.04223.pdf SparseMAP: Differentiable Sparse Structured Inference

https://arxiv.org/abs/1803.03382v3 Fast Decoding in Sequence Models using Discrete Latent Variables

We present a method to extend sequence models using discrete latent variables that makes decoding much more parallelizable. We first auto-encode the target sequence into a shorter sequence of discrete latent variables, which at inference time is generated autoregressively, and finally decode the output sequence from this shorter latent sequence in parallel. To this end, we introduce a novel method for constructing a sequence of discrete latent variables and compare it with previously introduced methods.

https://arxiv.org/abs/1801.09797v1 Discrete Autoencoders for Sequence Models

We propose to improve the representation in sequence models by augmenting current approaches with an autoencoder that is forced to compress the sequence through an intermediate discrete latent space. In order to propagate gradients through this discrete representation we introduce an improved semantic hashing technique. We show that this technique performs well on a newly proposed quantitative efficiency measure. We also analyze latent codes produced by the model showing how they correspond to words and phrases. Finally, we present an application of the autoencoder-augmented model to generating diverse translations.
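A hedged sketch of the forward discretization used in the semantic-hashing line of work: noise is injected before a saturating sigmoid, and the result is binarized. Function names and the noise scale are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def saturating_sigmoid(x):
    """max(0, min(1, 1.2*sigmoid(x) - 0.1)): unlike the plain sigmoid,
    this saturates exactly at 0 and 1, so codes can become fully discrete."""
    s = 1.0 / (1.0 + np.exp(-x))
    return np.clip(1.2 * s - 0.1, 0.0, 1.0)

def semantic_hash(x, noise_std=1.0):
    """Add Gaussian noise before the saturating sigmoid, then binarize
    at 0.5. During training, gradients flow through the continuous
    branch (straight-through style); only the forward discretization
    is shown here."""
    v = saturating_sigmoid(x + noise_std * rng.standard_normal(x.shape))
    return (v > 0.5).astype(np.float64)

bits = semantic_hash(np.array([-3.0, 0.2, 4.0]))
```

The injected noise forces the encoder to push activations away from the decision boundary, which makes the eventual hard binarization nearly lossless.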

https://arxiv.org/abs/1803.05252v2 Algebraic Machine Learning

Here we propose a different approach to learning and generalization that is parameter-free, fully discrete and that does not use function minimization. We use the training data to find an algebraic representation with minimal size and maximal freedom, explicitly expressed as a product of irreducible components. This algebraic representation is shown to directly generalize, giving high accuracy in test data, more so the smaller the representation. We prove that the number of generalizing representations can be very large and the algebra only needs to find one. We also derive and test a relationship between compression and error rate. We give results for a simple problem solved step by step, hand-written character recognition, and the Queens Completion problem as an example of unsupervised learning. As an alternative to statistical learning, algebraic learning may offer advantages in combining bottom-up and top-down information, formal concept derivation from data and large-scale parallelization.

https://arxiv.org/abs/1701.07879v4 A Radically New Theory of how the Brain Represents and Computes with Probabilities

The theory, Sparsey, was introduced 20+ years ago as a canonical cortical circuit/algorithm model achieving efficient sequence learning/recognition, but not elaborated as an alternative to PPC theories. Here, we show that: a) the active SDR simultaneously represents both the most similar/likely input and the entire (coarsely-ranked) similarity likelihood/distribution over all stored inputs (hypotheses); and b) given an input, the SDR code selection algorithm, which underlies both learning and inference, updates both the most likely hypothesis and the entire likelihood distribution (cf. belief update) with a number of steps that remains constant as the number of stored items increases.

https://arxiv.org/abs/1705.07177 Model-Based Planning with Discrete and Continuous Actions

https://arxiv.org/pdf/1804.01508v5.pdf The Tsetlin Machine – A Game Theoretic Bandit Driven Approach to Optimal Pattern Recognition with Propositional Logic

https://github.com/cair/TsetlinMachine

https://arxiv.org/pdf/1804.08597v1.pdf Towards Symbolic Reinforcement Learning with Common Sense

In this paper, we propose a novel extension of DSRL, which we call Symbolic Reinforcement Learning with Common Sense (SRL+CS), offering a better balance between generalization and specialization, inspired by principles of common sense when assigning rewards and aggregating Q-values. Experiments reported in this paper show that SRL+CS learns consistently faster than Q-learning and DSRL, while also achieving higher accuracy.

https://arxiv.org/abs/1803.01299 An Optimal Control Approach to Deep Learning and Applications to Discrete-Weight Neural Networks

Deep learning is formulated as a discrete-time optimal control problem. This allows one to characterize necessary conditions for optimality and develop training algorithms that do not rely on gradients with respect to the trainable parameters. In particular, we introduce the discrete-time method of successive approximations (MSA), which is based on Pontryagin's maximum principle, for training neural networks.

https://arxiv.org/abs/1805.11063 Theory and Experiments on Vector Quantized Autoencoders

https://arxiv.org/abs/1806.01363v1 Playing Atari with Six Neurons

State representations are generated by a novel algorithm based on Vector Quantization and Sparse Coding, trained online along with the network, and capable of growing its dictionary size over time. We also introduce new techniques allowing both the neural network and the evolution strategy to cope with varying dimensions. This enables networks of only 6 to 18 neurons to learn to play a selection of Atari games with performance comparable—and occasionally superior—to state-of-the-art techniques using evolution strategies on deep networks two orders of magnitude larger.
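The vector-quantization step at the heart of such state representations is just a nearest-neighbour lookup against a codebook. A minimal sketch with an illustrative function name:

```python
import numpy as np

def vq_encode(x, codebook):
    """Nearest-neighbour vector quantization: return the index of the
    codebook row closest to x in Euclidean distance. The discrete index
    (or its one-hot encoding) then serves as the compact state
    representation fed to the small policy network."""
    distances = ((codebook - x) ** 2).sum(axis=1)
    return int(np.argmin(distances))

codebook = np.array([[0.0, 0.0],
                     [1.0, 1.0],
                     [5.0, 5.0]])
idx = vq_encode(np.array([0.9, 1.2]), codebook)  # → 1
```

Growing the dictionary over time, as in the paper, amounts to appending new rows to the codebook when no existing entry is close enough to the observation.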

https://arxiv.org/abs/1808.09111v1 Unsupervised Learning of Syntactic Structure with Invertible Neural Projections

In this work, we propose a novel generative model that jointly learns discrete syntactic structure and continuous word representations in an unsupervised fashion by cascading an invertible neural network with a structured generative prior. We show that the invertibility condition allows for efficient exact inference and marginal likelihood computation in our model as long as the prior is well-behaved.

https://arxiv.org/abs/1901.00409v1 Discrete Neural Processes

In this work we develop methods for efficient amortized approximate Bayesian inference over discrete combinatorial spaces, with applications to random permutations, probabilistic clustering (such as Dirichlet process mixture models) and random communities (such as stochastic block models). The approach is based on mapping distributed, symmetry-invariant representations of discrete arrangements into conditional probabilities. The resulting algorithms parallelize easily, yield iid samples from the approximate posteriors, and can easily be applied to both conjugate and non-conjugate models, as training only requires samples from the generative model.