# Probabilistic Graph Model Integration

https://arxiv.org/pdf/1602.06822.pdf Understanding Visual Concepts with Continuation Learning

We introduce a neural network architecture and a learning algorithm to produce factorized symbolic representations. We propose to learn these concepts by observing consecutive frames, letting all the components of the hidden representation except a small discrete set (gating units) be predicted from the previous frame, and letting the factors of variation in the next frame be represented entirely by these discrete gated units (corresponding to symbolic representations). We demonstrate the efficacy of our approach on datasets of faces undergoing 3D transformations and Atari 2600 games.

The gated model. Each frame encoder produces a representation from its input. The gating head examines both these representations, then picks one component from the encoding of time t to pass through the gate. All other components of the hidden representation are from the encoding of time t−1. As a result, each frame encoder predicts what it can about the next frame and encodes the “unpredictable” parts of the frame into one component.
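The gating step described above can be sketched in a few lines of plain Python. This is only an illustration: the encodings are hand-written lists, and `pick_index` is a hypothetical stand-in for the learned gating head (here it picks the component that changed most), not the paper's network.

```python
def gate(prev_code, curr_code, index):
    """Copy the t-1 encoding, letting one component from the t encoding through."""
    if len(prev_code) != len(curr_code):
        raise ValueError("encodings must have the same length")
    gated = list(prev_code)
    gated[index] = curr_code[index]
    return gated


def pick_index(prev_code, curr_code):
    """Hypothetical gating head: choose the component with the largest change."""
    diffs = [abs(c - p) for p, c in zip(prev_code, curr_code)]
    return diffs.index(max(diffs))


h_prev = [0.2, -1.0, 0.7, 0.0]   # encoding of frame t-1 (made-up values)
h_curr = [0.2, -1.1, 2.5, 0.1]   # encoding of frame t (made-up values)
k = pick_index(h_prev, h_curr)
h_gated = gate(h_prev, h_curr, k)
print(k, h_gated)  # 2 [0.2, -1.0, 2.5, 0.0]
```

Only component 2 comes from frame t; everything else is carried over from frame t−1, which is what forces the "unpredictable" change into a single unit.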

http://arxiv.org/pdf/1603.08575v2.pdf Attend, Infer, Repeat: Fast Scene Understanding with Generative Models

We presented several principled models that not only learn to count, locate, classify and reconstruct the elements of a scene, but do so in a fraction of a second at test-time. The main ingredients are (a) building in meaning using appropriately structured models, (b) amortized inference that is attentive, iterative and variable-length, and (c) end-to-end learning. Learning is most successful when the variance of the gradients is low and the likelihood is well suited to the data.

AIR in practice: Left: The assumed generative model. Middle: AIR inference for this model. The contents of the grey box are input to the decoder. Right: Interaction between the inference and generation networks at every time-step. In our experiments the relationship between $x^i_{att}$ and $y^i_{att}$ is modeled by a VAE, however any generative model of patches could be used (even, e.g., DRAW).

https://arxiv.org/abs/1610.05735 Deep Amortized Inference for Probabilistic Programs

To alleviate this problem, one could try to learn from past inferences, so that future inferences run faster. This strategy is known as amortized inference; it has recently been applied to Bayesian networks and deep generative models. This paper proposes a system for amortized inference in PPLs. In our system, amortization comes in the form of a parameterized guide program. Guide programs have similar structure to the original program, but can have richer data flow, including neural network components. These networks can be optimized so that the guide approximately samples from the posterior distribution defined by the original program. We present a flexible interface for defining guide programs and a stochastic gradient-based scheme for optimizing guide parameters, as well as some preliminary results on automatically deriving guide programs. We explore in detail the common machine learning pattern in which a 'local' model is specified by 'global' random values and used to generate independent observed data points; this gives rise to amortized local inference supporting global model learning.
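A toy illustration of the amortization idea, under loud assumptions: the model (z ~ N(0, 1), x | z ~ N(z, 1)), the linear-Gaussian guide q(z | x) = N(a·x, s2), and the closed-form fit on simulated pairs are all chosen here for brevity. This is not the paper's guide-program interface or its stochastic-gradient scheme; it only shows a guide being trained once and then reused for any observation.

```python
import random


def simulate(n, rng):
    """Draw (z, x) pairs from the toy model z ~ N(0, 1), x | z ~ N(z, 1)."""
    pairs = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        x = rng.gauss(z, 1.0)
        pairs.append((z, x))
    return pairs


def fit_guide(pairs):
    """Maximum-likelihood fit of q(z | x) = N(a * x, s2) on simulated pairs."""
    a = sum(z * x for z, x in pairs) / sum(x * x for _, x in pairs)
    s2 = sum((z - a * x) ** 2 for z, x in pairs) / len(pairs)
    return a, s2


rng = random.Random(0)
a, s2 = fit_guide(simulate(20000, rng))


def guide(x):
    """Amortized posterior approximation: mean and variance of z given x."""
    return a * x, s2


print(guide(1.0))
```

For this model the exact posterior is N(x / 2, 1 / 2), so the fitted `a` and `s2` should both land near 0.5; after fitting, inference for a new `x` costs a single multiplication.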

https://blog.acolyer.org/2016/10/12/towards-deep-symbolic-reinforcement-learning/ Towards deep symbolic reinforcement learning

http://openreview.net/pdf?id=HJtN5K9gx Structured Interpretation of Deep Generative Models

We demonstrate our framework’s flexibility in expressing a variety of different models, and evaluate its ability to learn disentangled representation from little supervision, and use this ability to perform classification and demonstrate generative ability on different datasets.

http://openreview.net/pdf?id=Hy6b4Pqee Deep Probabilistic Programming

We propose Edward, a new Turing-complete probabilistic programming language which builds on two compositional representations: random variables and inference. We show how to integrate our language into existing computational graph frameworks such as TensorFlow; this provides significant speedups over existing probabilistic systems. We also show how Edward makes it easy to fit the same model using a variety of composable inference methods, ranging from point estimation, to variational inference, to MCMC. By treating inference as a first class citizen, on a par with modeling, we show that probabilistic programming can be as computationally efficient and flexible as traditional deep learning. For flexibility, we show how to reuse the modeling representation within inference to design variational auto-encoders and generative adversarial networks. For efficiency, we show that our implementation of Hamiltonian Monte Carlo is 35x faster than hand-optimized software such as Stan.

https://arxiv.org/abs/1110.5667 Inducing Probabilistic Programs by Bayesian Program Merging

This report outlines an approach to learning generative models from data. We express models as probabilistic programs, which allows us to capture abstract patterns within the examples. By choosing our language for programs to be an extension of the algebraic data type of the examples, we can begin with a program that generates all and only the examples. We then introduce greater abstraction, and hence generalization, incrementally to the extent that it improves the posterior probability of the examples given the program. Motivated by previous approaches to model merging and program induction, we search for such explanatory abstractions using program transformations. We consider two types of transformation: Abstraction merges common subexpressions within a program into new functions (a form of anti-unification). Deargumentation simplifies functions by reducing the number of arguments. We demonstrate that this approach finds key patterns in the domain of nested lists, including parameterized sub-functions and stochastic recursion.

https://arxiv.org/pdf/1407.2646.pdf Learning Probabilistic Programs

https://arxiv.org/abs/1603.06277 Composing graphical models with neural networks for structured representations and fast inference https://www.youtube.com/watch?v=btr1poCYIzw

https://arxiv.org/abs/1707.03389v2 SCAN: Learning Abstract Hierarchical Compositional Visual Concepts

This paper describes SCAN (Symbol-Concept Association Network), a new framework for learning such concepts in the visual domain. We first use the previously published beta-VAE (Higgins et al., 2017a) architecture to learn a disentangled representation of the latent structure of the visual world, before training SCAN to extract abstract concepts grounded in such disentangled visual primitives through fast symbol association. Our approach requires very few pairings between symbols and images and makes no assumptions about the choice of symbol representations. Once trained, SCAN is capable of multimodal bi-directional inference, generating a diverse set of image samples from symbolic descriptions and vice versa. It also allows for traversal and manipulation of the implicit hierarchy of compositional visual concepts through symbolic instructions and learnt logical recombination operations. Such manipulations enable SCAN to invent and learn novel visual concepts through recombination of the few learnt concepts.

https://arxiv.org/pdf/1708.06438v1.pdf Sum-Product Graphical Models

This paper introduces a new probabilistic architecture called Sum-Product Graphical Model (SPGM). SPGMs combine traits from Sum-Product Networks (SPNs) and Graphical Models (GMs): Like SPNs, SPGMs always enable tractable inference using a class of models that incorporate context specific independence. Like GMs, SPGMs provide a high-level model interpretation in terms of conditional independence assumptions and corresponding factorizations. Thus, the new architecture represents a class of probability distributions that combines, for the first time, the semantics of graphical models with the evaluation efficiency of SPNs. We also propose a novel algorithm for learning both the structure and the parameters of SPGMs. A comparative empirical evaluation demonstrates competitive performances of our approach in density estimation.
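To make "tractable inference" concrete, here is a minimal sketch of a plain sum-product network (not an SPGM) over two binary variables. The structure and all weights are made up for illustration. Marginalizing a variable amounts to letting its leaves return 1, so a marginal costs one bottom-up pass, the same as a joint probability.

```python
def leaf(var, p_true):
    """Bernoulli leaf; returns 1.0 when the variable is unobserved (summed out)."""
    def evaluate(evidence):
        if var not in evidence:
            return 1.0
        return p_true if evidence[var] else 1.0 - p_true
    return evaluate


def product(*children):
    """Product node over children with disjoint variable scopes."""
    def evaluate(evidence):
        result = 1.0
        for child in children:
            result *= child(evidence)
        return result
    return evaluate


def weighted_sum(weighted_children):
    """Sum node: a mixture over children that share the same scope."""
    def evaluate(evidence):
        return sum(w * child(evidence) for w, child in weighted_children)
    return evaluate


# Illustrative network: a two-component mixture of factorized distributions.
spn = weighted_sum([
    (0.6, product(leaf("x1", 0.9), leaf("x2", 0.2))),
    (0.4, product(leaf("x1", 0.1), leaf("x2", 0.7))),
])

joint = spn({"x1": True, "x2": False})   # P(x1=1, x2=0)
marginal = spn({"x1": True})             # P(x1=1), x2 summed out in one pass
print(joint, marginal)
```

The marginal agrees with summing the joint over both values of `x2`, but it is computed without enumerating them.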

https://arxiv.org/pdf/1711.09268v1.pdf Generalizing Hamiltonian Monte Carlo with Neural Networks

In this work, we presented a general method to train expressive MCMC kernels parameterized with deep neural networks. Given a target distribution p, analytically known up to a constant, our method provides a fast-mixing sampler, able to efficiently explore the state space.
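For context, the sketch below is standard HMC with a leapfrog integrator, i.e. the baseline kernel that this line of work generalizes, not the learned kernel from the paper. The one-dimensional standard-normal target, step size, and trajectory length are arbitrary choices for illustration.

```python
import math
import random


def leapfrog(q, p, grad_u, step, n_steps):
    """Leapfrog integration of Hamiltonian dynamics with unit mass."""
    p = p - 0.5 * step * grad_u(q)          # initial half step for momentum
    for i in range(n_steps):
        q = q + step * p                    # full step for position
        if i < n_steps - 1:
            p = p - step * grad_u(q)        # full step for momentum
    p = p - 0.5 * step * grad_u(q)          # final half step for momentum
    return q, p


def hmc_step(q, u, grad_u, step, n_steps, rng):
    """One HMC transition: resample momentum, integrate, accept or reject."""
    p0 = rng.gauss(0.0, 1.0)
    q_new, p_new = leapfrog(q, p0, grad_u, step, n_steps)
    h_old = u(q) + 0.5 * p0 * p0
    h_new = u(q_new) + 0.5 * p_new * p_new
    if rng.random() < math.exp(min(0.0, h_old - h_new)):
        return q_new
    return q


def u(q):
    """Negative log density of a standard normal, up to a constant."""
    return 0.5 * q * q


def grad_u(q):
    return q


rng = random.Random(0)
q = 3.0
samples = []
for i in range(5100):
    q = hmc_step(q, u, grad_u, step=0.2, n_steps=10, rng=rng)
    if i >= 100:                            # discard a short burn-in
        samples.append(q)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(round(mean, 2), round(var, 2))
```

The sample mean and variance should come out close to 0 and 1. The step size and number of leapfrog steps are exactly the hand-tuned pieces that a learned kernel would try to replace.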

https://arxiv.org/abs/1801.10247 FastGCN: Fast Learning with Graph Convolutional Networks via Importance Sampling

FastGCN not only is efficient for training but also generalizes well for inference. We show a comprehensive set of experiments to demonstrate its effectiveness compared with GCN and related models. In particular, training is orders of magnitude more efficient while predictions remain comparably accurate.
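A rough sketch of the importance-sampling idea in the title, with assumptions stated up front: the 3-node normalized adjacency matrix and scalar node features are made up, and the proposal proportional to squared column norms is taken from recollection of the paper rather than from the excerpt above. The point is only that a few sampled nodes, reweighted by 1/q, give an unbiased estimate of the full neighborhood aggregation.

```python
import random


def column_norm_proposal(adj):
    """Proposal q(u) proportional to the squared norm of column u of adj."""
    n = len(adj)
    sq = [sum(adj[v][u] ** 2 for v in range(n)) for u in range(n)]
    total = sum(sq)
    return [s / total for s in sq]


def sampled_aggregate(adj, feats, q, sampled):
    """Importance-weighted estimate of sum_u adj[v][u] * feats[u] for each v."""
    t = len(sampled)
    return [sum(adj[v][u] * feats[u] / q[u] for u in sampled) / t
            for v in range(len(adj))]


# Made-up normalized adjacency (rows sum to 1) and scalar node features.
adj = [[0.5, 0.5, 0.0],
       [0.25, 0.5, 0.25],
       [0.0, 0.5, 0.5]]
feats = [1.0, 2.0, 4.0]

q = column_norm_proposal(adj)
rng = random.Random(0)
sampled = rng.choices(range(len(adj)), weights=q, k=2)   # sample 2 of 3 nodes
estimate = sampled_aggregate(adj, feats, q, sampled)
exact = [sum(adj[v][u] * feats[u] for u in range(len(adj)))
         for v in range(len(adj))]
print(estimate, exact)
```

Any single draw can be far from `exact`; the estimator is correct on average, and the cost per layer depends on the number of sampled nodes rather than on the size of the graph.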

https://arxiv.org/abs/1802.05098 DiCE: The Infinitely Differentiable Monte-Carlo Estimator

To address all these shortcomings in a unified way, we introduce DiCE, which provides a single objective that can be differentiated repeatedly, generating correct gradient estimators of any order in SCGs. Unlike SL, DiCE relies on automatic differentiation for performing the requisite graph manipulations. We verify the correctness of DiCE both through a proof and through numerical evaluation of the DiCE gradient estimates. We also use DiCE to propose and evaluate a novel approach for multi-agent learning.
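For reference, the "single objective" is built from the paper's MagicBox operator. The notation below is recalled from the paper, not from the excerpt above, so treat it as an assumption: $\perp$ denotes stop-gradient, $\mathcal{W}$ is a set of stochastic nodes, $\mathcal{C}$ is the set of cost nodes, and $\mathcal{W}_c$ is the set of stochastic nodes that influence cost $c$.

$$
\square(\mathcal{W}) = \exp\bigl(\tau - \perp(\tau)\bigr), \qquad \tau = \sum_{w \in \mathcal{W}} \log p(w; \theta)
$$

$$
\mathcal{L}_{\square} = \sum_{c \in \mathcal{C}} \square(\mathcal{W}_c)\, c
$$

In the forward pass $\square(\mathcal{W})$ evaluates to 1, so $\mathcal{L}_{\square}$ has the same value as the plain sum of costs. Under differentiation, $\nabla_\theta \square(\mathcal{W}) = \square(\mathcal{W}) \sum_{w \in \mathcal{W}} \nabla_\theta \log p(w; \theta)$, which reproduces the score-function term, and because the result still contains $\square(\mathcal{W})$ the same rule applies again at every higher order.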

https://arxiv.org/abs/1804.03429v1 Graphical Generative Adversarial Networks

We propose Graphical Generative Adversarial Networks (Graphical-GAN) to model structured data. Graphical-GAN conjoins the power of Bayesian networks on compactly representing the dependency structures among random variables and that of generative adversarial networks on learning expressive dependency functions. We introduce a structured recognition model to infer the posterior distribution of latent variables given observations. We propose two alternative divergence minimization approaches to learn the generative model and recognition model jointly. The first one treats all variables as a whole, while the second one utilizes the structural information by checking the individual local factors defined by the generative model and works better in practice. Finally, we present two important instances of Graphical-GAN, i.e. Gaussian Mixture GAN (GMGAN) and State Space GAN (SSGAN), which can successfully learn the discrete and temporal structures on visual datasets, respectively.

https://arxiv.org/abs/1804.09720 JUNIPR: a Framework for Unsupervised Machine Learning in Particle Physics

https://arxiv.org/pdf/1612.09305.pdf On Extended Admissible Procedures and Their Nonstandard Bayes Risk

Using methods from mathematical logic and nonstandard analysis, we introduce the class of nonstandard Bayes decision procedures, namely those whose Bayes risk with respect to some prior is within an infinitesimal of the optimal Bayes risk.

https://arxiv.org/abs/1808.03253v1 Counterfactual Normalization: Proactively Addressing Dataset Shift and Improving Reliability Using Causal Mechanisms

https://sites.google.com/view/solar-iclips SOLAR: Deep Structured Latent Representations for Model-Based Reinforcement Learning

Model-based reinforcement learning (RL) methods can be broadly categorized as global model methods, which depend on learning models that provide sensible predictions in a wide range of states, or local model methods, which iteratively refit simple models that are used for policy improvement. While predicting future states that will result from the current actions is difficult, local model methods only attempt to understand system dynamics in the neighborhood of the current policy, making it possible to produce local improvements without ever learning to predict accurately far into the future. The main idea in this paper is that we can learn representations that make it easy to retrospectively infer simple dynamics given the data from the current policy, thus enabling local models to be used for policy learning in complex systems. To that end, we focus on learning representations with probabilistic graphical model (PGM) structure, which allows us to devise an efficient local model method that infers dynamics from real-world rollouts with the PGM as a global prior. We compare our method to other model-based and model-free RL methods on a suite of robotics tasks, including manipulation tasks on a real Sawyer robotic arm directly from camera images.

https://arxiv.org/abs/1808.08485v1 Deep Probabilistic Logic: A Unifying Framework for Indirect Supervision

In this paper, we propose deep probabilistic logic (DPL) as a general framework for indirect supervision, by composing probabilistic logic with deep learning. DPL models label decisions as latent variables, represents prior knowledge on their relations using weighted first-order logical formulas, and alternates between learning a deep neural network for the end task and refining uncertain formula weights for indirect supervision, using variational EM. This framework subsumes prior indirect supervision methods as special cases, and enables novel combination via infusion of rich domain and linguistic knowledge.

http://hanover.azurewebsites.net/

https://arxiv.org/abs/1610.05735v1 Deep Amortized Inference for Probabilistic Programs https://arxiv.org/abs/1603.06143

https://arxiv.org/pdf/1811.02091.pdf Simple, Distributed, and Accelerated Probabilistic Programming https://github.com/google-research/google-research/tree/master/simple_probabilistic_programming

We describe a simple, low-level approach for embedding probabilistic programming in a deep learning ecosystem. In particular, we distill probabilistic programming down to a single abstraction: the random variable. Our lightweight implementation in TensorFlow enables numerous applications: a model-parallel variational auto-encoder (VAE) with 2nd-generation tensor processing units (TPUv2s); a data-parallel autoregressive model (Image Transformer) with TPUv2s; and multi-GPU No-U-Turn Sampler (NUTS). For both a state-of-the-art VAE on 64×64 ImageNet and Image Transformer on 256×256 CelebA-HQ, our approach achieves an optimal linear speedup from 1 to 256 TPUv2 chips. With NUTS, we see a 100x speedup on GPUs over Stan and 37x over PyMC3.