# Variational Autoencoder

**Intent** Guide a network to construct latent features that are sampled from a learned probability distribution.

**Motivation** How can we leverage sampled probability distributions in our networks while keeping them trainable?

**Structure**

<Diagram>

**Discussion**

The key here is to have a layer whose activations are sampled from a probability distribution, with the sampling expressed as a differentiable function of the layer's parameters so that back-propagation can flow through it.
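A minimal NumPy sketch of this differentiable sampling layer, using the reparameterization trick: the random draw is moved into an exogenous noise variable, so the sample is a deterministic function of the encoder outputs. Names such as `reparameterize` are illustrative, not from any particular library.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    """Sample z ~ N(mu, diag(sigma^2)) as a deterministic function of
    (mu, log_var) plus exogenous noise eps ~ N(0, I). Gradients w.r.t.
    mu and log_var then pass straight through this expression."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

# Encoder outputs for a batch of 4 examples with a 2-d latent space
mu = np.zeros((4, 2))
log_var = np.zeros((4, 2))        # log_var = 0, i.e. unit variance
z = reparameterize(mu, log_var)   # z is back-propagation friendly
```

Had we instead sampled `z` directly from a distribution whose parameters are `mu` and `log_var`, the sampling node itself would be non-differentiable, which is exactly the problem this layer avoids.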

**Known Uses**

**Related Patterns**

<Diagram>

**References**

http://www.deeplearningbook.org/contents/generative_models.html Deep Learning, Chapter 20: Deep Generative Models

The variational autoencoder or VAE (Kingma, 2013; Rezende et al., 2014) is a directed model that uses learned approximate inference and can be trained purely with gradient-based methods.

http://arxiv.org/abs/1312.6114 Auto-Encoding Variational Bayes

We introduce a stochastic variational inference and learning algorithm that scales to large datasets and, under some mild differentiability conditions, even works in the intractable case. Our contribution is two-fold. First, we show that a reparameterization of the variational lower bound yields a lower bound estimator that can be straightforwardly optimized using standard stochastic gradient methods. Second, we show that for i.i.d. datasets with continuous latent variables per datapoint, posterior inference can be made especially efficient by fitting an approximate inference model (also called a recognition model) to the intractable posterior using the proposed lower bound estimator.
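The variational lower bound in this paper has an analytic KL term when both the approximate posterior and the prior are diagonal Gaussians. A small NumPy sketch of that closed form (the function name is illustrative):

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent
    dimensions. This is the analytic KL term of the Gaussian VAE lower
    bound: -0.5 * sum(1 + log_var - mu^2 - exp(log_var))."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=-1)

mu = np.zeros((1, 2))
log_var = np.zeros((1, 2))
kl = gaussian_kl(mu, log_var)  # 0 when the posterior equals the prior
```

During training this KL term is added to the (sampled) reconstruction log-likelihood to form the full lower bound estimator.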

http://arxiv.org/abs/1606.02185v1 Towards a Neural Statistician

An efficient learner is one who reuses what they already know to tackle a new problem. For a machine learner, this means understanding the similarities amongst datasets. In order to do this, one must take seriously the idea of working with datasets, rather than datapoints, as the key objects to model. Towards this goal, we demonstrate an extension of a variational autoencoder that can learn a method for computing representations, or statistics, of datasets in an unsupervised fashion. The network is trained to produce statistics that encapsulate a generative model for each dataset. Hence the network enables efficient learning from new datasets for both unsupervised and supervised tasks. We show that we are able to learn statistics that can be used for: clustering datasets, transferring generative models to new datasets, selecting representative samples of datasets and classifying previously unseen classes.

http://arxiv.org/pdf/1606.05908v1.pdf Tutorial on Variational Autoencoders

A training-time variational autoencoder implemented as a feedforward neural network, where P(X|z) is Gaussian. Left is without the “reparameterization trick”, and right is with it. Red shows sampling operations that are non-differentiable. Blue shows loss layers. The feedforward behavior of these networks is identical, but backpropagation can be applied only to the right network.

The testing-time variational “autoencoder,” which allows us to generate new samples. The “encoder” pathway is simply discarded. Left: a training-time conditional variational autoencoder implemented as a feedforward neural network, following the same notation as Figure 4. Right: the same model at test time, when we want to sample from P(Y|X).

http://arxiv.org/pdf/1511.01844v2.pdf A NOTE ON THE EVALUATION OF GENERATIVE MODELS

An evaluation based on samples is biased towards models which overfit and therefore a poor indicator of a good density model in a log-likelihood sense, which favors models with large entropy. Conversely, a high likelihood does not guarantee visually pleasing samples.

We therefore argue Parzen window estimates should be avoided for evaluating generative models, unless the application specifically requires such a loss function. In this case, we have shown that a k-means based model can perform better than the true density. To summarize, our results demonstrate that for generative models there is no one-fits-all loss function but a proper assessment of model performance is only possible in the context of an application.

http://arxiv.org/abs/1509.00519 Importance Weighted Autoencoders

We present the importance weighted autoencoder (IWAE), a generative model with the same architecture as the VAE, but which uses a strictly tighter log-likelihood lower bound derived from importance weighting. In the IWAE, the recognition network uses multiple samples to approximate the posterior, giving it increased flexibility to model complex posteriors which do not fit the VAE modeling assumptions. We show empirically that IWAEs learn richer latent space representations than VAEs, leading to improved test log-likelihood on density estimation benchmarks.

http://arxiv.org/pdf/1607.05690v1.pdf Stochastic Backpropagation through Mixture Density Distributions

The ability to backpropagate stochastic gradients through continuous latent distributions has been crucial to the emergence of variational autoencoders. The key ingredient is an unbiased and low-variance way of estimating gradients with respect to distribution parameters from gradients evaluated at distribution samples. The “reparameterization trick” [6] provides a class of transforms yielding such estimators for many continuous distributions, including the Gaussian and other members of the location-scale family. However the trick does not readily extend to mixture density models, due to the difficulty of reparameterizing the discrete distribution over mixture weights.

https://en.wikipedia.org/wiki/Distribution_(mathematics)

https://arxiv.org/abs/1606.04934 Improving Variational Inference with Inverse Autoregressive Flow

We find that by inverting autoregressive networks we can obtain equally powerful data transformations that can often be computed in parallel. We show that such data transformations, inverse autoregressive flows (IAF), can be used to transform a simple distribution over the latent variables into a much more flexible distribution, while still allowing us to compute the resulting variables' probability density function. The method is simple to implement, can be made arbitrarily flexible, and (in contrast with previous work) is naturally applicable to latent variables that are organized in multidimensional tensors, such as 2D grids or time series. The method is applied to a novel deep architecture of variational auto-encoders.

http://blog.fastforwardlabs.com/post/148842796218/introducing-variational-autoencoders-in-prose-and

https://openreview.net/pdf?id=B1M8JF9xx ON THE QUANTITATIVE ANALYSIS OF DECODER-BASED GENERATIVE MODELS

We propose to use Annealed Importance Sampling for evaluating log-likelihoods for decoder-based models and validate its accuracy using bidirectional Monte Carlo. Using this technique, we analyze the performance of decoder-based models, the effectiveness of existing log-likelihood estimators, the degree of overfitting, and the degree to which these models miss important modes of the data distribution.

https://arxiv.org/pdf/1603.02514v3.pdf Variational Autoencoder for Semi-supervised Text Classification

A novel optimization method is proposed, which estimates the gradient of the unlabeled objective function by sampling, along with two variance reduction techniques.

https://arxiv.org/pdf/1511.06349v2.pdf GENERATING SENTENCES FROM A CONTINUOUS SPACE

This factorization allows it to explicitly model holistic properties of sentences such as style, topic, and high-level syntactic features. Samples from the prior over these sentence representations remarkably produce diverse and well-formed sentences through simple deterministic decoding. By examining paths through this latent space, we are able to generate coherent novel sentences that interpolate between known sentences. We present techniques for solving the difficult learning problem presented by this model, demonstrate strong performance in the imputation of missing tokens, and explore many interesting properties of the latent sentence space.

https://www.semanticscholar.org/paper/Generating-Sentences-from-a-Continuous-Space-Bowman-Vilnis/3d1427961edccf8940a360d203e44539db58a60f Generating Sentences from a Continuous Space

https://arxiv.org/pdf/1604.08772v1.pdf Towards Conceptual Compression

We introduce a simple recurrent variational autoencoder architecture that significantly improves image modeling. The system represents the state-of-the-art in latent variable models for both the ImageNet and Omniglot datasets. We show that it naturally separates global conceptual information from lower level details, thus addressing one of the fundamentally desired properties of unsupervised learning. Furthermore, the possibility of restricting ourselves to storing only global information about an image allows us to achieve high quality ‘conceptual compression’.

http://int8.io/variational-autoencoder-in-tensorflow/

https://arxiv.org/abs/1702.08658v1 Towards Deeper Understanding of Variational Autoencoding Models

We propose a new family of optimization criteria for variational auto-encoding models, generalizing the standard evidence lower bound. We provide conditions under which they recover the data distribution and learn latent features, and formally show that common issues such as blurry samples and uninformative latent features arise when these conditions are not met. Based on these new insights, we propose a new sequential VAE model that can generate sharp samples on the LSUN image dataset based on pixel-wise reconstruction loss, and propose an optimization criterion that encourages unsupervised learning of informative latent features.

https://arxiv.org/pdf/1704.05155v1.pdf Stein Variational Autoencoder

A new method for learning variational autoencoders is developed, based on an application of Stein's operator. The framework represents the encoder as a deep nonlinear function through which samples from a simple distribution are fed. One need not make parametric assumptions about the form of the encoder distribution, and performance is further enhanced by integrating the proposed encoder with importance sampling. Example results are demonstrated across multiple unsupervised and semi-supervised problems, including semi-supervised analysis of the ImageNet data, demonstrating the scalability of the model to large datasets.

https://arxiv.org/abs/1804.00891 Hyperspherical Variational Auto-Encoders

Although the default choice of a Gaussian distribution for both the prior and posterior represents a mathematically convenient distribution often leading to competitive results, we show that this parameterization fails to model data with a latent hyperspherical structure. To address this issue we propose using a von Mises-Fisher (vMF) distribution instead, leading to a hyperspherical latent space. Through a series of experiments we show how such a hyperspherical VAE, or S-VAE, is more suitable for capturing data with a hyperspherical latent structure, while outperforming a normal, N-VAE, in low dimensions on other data types. https://github.com/nicola-decao/s-vae

With the S-VAE we set an important first step in the exploration of hyperspherical latent representations for variational auto-encoders. Through various experiments, we have shown that S-VAEs have a clear advantage over N-VAEs for data residing on a known hyperspherical manifold, and are competitive or surpass N-VAEs for data with a non-obvious hyperspherical latent representation in lower dimensions. Specifically, we demonstrated S-VAEs improve separability in semi-supervised classification and that they are able to improve results on state-of-the-art link prediction models on citation graphs, by merely changing the prior and posterior distributions as a simple drop-in replacement.

https://arxiv.org/pdf/1804.02476.pdf Associative Compression Networks

This paper introduces Associative Compression Networks (ACNs), a new framework for variational autoencoding with neural networks. The system differs from existing variational autoencoders in that the prior distribution used to model each code is conditioned on a similar code from the dataset. In compression terms this equates to sequentially transmitting the data using an ordering determined by proximity in latent space. As the prior need only account for local, rather than global variations in the latent space, the coding cost is greatly reduced, leading to rich, informative codes, even when autoregressive decoders are used.

https://arxiv.org/pdf/1803.03764.pdf Variance Networks: When Expectation Does Not Meet Your Expectations

Each weight of a variance layer follows a zero-mean distribution and is only parameterized by its variance. We show that such layers can learn surprisingly well, can serve as an efficient exploration tool in reinforcement learning tasks and provide a decent defense against adversarial attacks.
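A hypothetical NumPy sketch of such a variance layer's forward pass: each weight is drawn from a zero-mean Gaussian whose only learnable parameter is its (log-)variance, sampled with the same reparameterization used in VAEs. The function name and shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def variance_layer(x, log_var_w):
    """Stochastic linear layer whose weights have zero mean and are
    parameterized only by their variance: w = sqrt(var) * eps, so
    E[w] = 0 and only log_var_w receives gradient updates."""
    eps = rng.standard_normal(log_var_w.shape)
    w = np.exp(0.5 * log_var_w) * eps
    return x @ w

x = np.ones((1, 3))
log_var_w = np.zeros((3, 2))       # unit-variance weights
y = variance_layer(x, log_var_w)   # output differs across forward passes
```

Because the expected weight is zero, all information the layer carries lives in the noise scale, which is what makes the result in the paper surprising.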

https://arxiv.org/abs/1807.08919v1 The Variational Homoencoder: Learning to learn high capacity generative models from few examples

We use the VHE framework to learn a hierarchical PixelCNN on the Omniglot dataset, which outperforms all existing models on test set likelihood and achieves strong performance on one-shot generation and classification tasks. We additionally validate the VHE on natural images from the YouTube Faces database. Finally, we develop extensions of the model that apply to richer dataset structures such as factorial and hierarchical categories. https://github.com/insperatum/vhe

https://arxiv.org/pdf/1808.10805.pdf Spherical Latent Spaces for Stable Variational Autoencoders

An analysis of the properties of our vMF representations shows that they learn richer and more nuanced structures in their latent representations than their Gaussian counterparts.