https://arxiv.org/pdf/1609.04767v1.pdf Transport-based analysis, modeling, and learning from signal and data distributions

The geometric characteristics of transport-related metrics have inspired new kinds of algorithms for interpreting the meaning of data distributions. Here we provide an overview of the mathematical underpinnings of mass transport-related methods, including numerical implementation, as well as a review, with demonstrations, of several applications.
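
A minimal concrete example of such a metric, assuming NumPy and SciPy are available: the 1-Wasserstein (earth mover's) distance between two one-dimensional empirical distributions.

```python
# Sketch: 1-Wasserstein (earth mover's) distance between two 1-D samples.
# The data are synthetic; scipy.stats.wasserstein_distance does the work.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, scale=1.0, size=1000)   # samples from N(0, 1)
b = rng.normal(loc=2.0, scale=1.0, size=1000)   # samples from N(2, 1)

# In 1-D the optimal transport cost reduces to the area between the two
# empirical CDFs, which is exactly what SciPy computes here.
print(wasserstein_distance(a, b))  # close to 2.0, the shift between the means
```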

https://arxiv.org/pdf/1610.06447v1.pdf Regularized Optimal Transport and the Rot Mover’s Distance

https://arxiv.org/pdf/1611.07573.pdf Relaxed Earth Mover’s Distances for Chain- and Tree-connected Spaces and their use as a Loss Function in Deep Learning

The Earth Mover's Distance (EMD) computes the optimal cost of transforming one distribution into another, given a known transport metric between them. In deep learning, the EMD loss allows us to embed information during training about the structure of the output space, such as hierarchical or semantic relations, which helps achieve better output smoothness and generalization. However, EMD is computationally expensive, and solving EMD optimization problems usually requires complex techniques like lasso. These properties limit the applicability of EMD-based approaches in large-scale machine learning.
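
To make the cost concrete, here is a hedged sketch of the exact EMD between two small histograms, written as the underlying linear program with SciPy; the bin values, the |i - j| ground cost, and the use of linprog are illustrative assumptions, but they show why the exact problem scales poorly (n*m variables, n+m constraints).

```python
# Sketch: exact EMD between two discrete distributions as a linear program.
import numpy as np
from scipy.optimize import linprog

p = np.array([0.4, 0.4, 0.2])   # source histogram
q = np.array([0.1, 0.3, 0.6])   # target histogram
C = np.abs(np.subtract.outer(np.arange(3), np.arange(3))).astype(float)  # ground cost |i - j|

n, m = len(p), len(q)
A_eq, b_eq = [], []
for i in range(n):               # row sums of the plan must equal p
    row = np.zeros((n, m)); row[i, :] = 1.0
    A_eq.append(row.ravel()); b_eq.append(p[i])
for j in range(m):               # column sums of the plan must equal q
    col = np.zeros((n, m)); col[:, j] = 1.0
    A_eq.append(col.ravel()); b_eq.append(q[j])

# Minimize <C, T> over transport plans T >= 0 with the marginal constraints.
res = linprog(C.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
              bounds=(0, None), method="highs")
print("EMD =", res.fun)          # 0.7 for this example
```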

http://faculty.virginia.edu/rohde/transport/publications.html

https://arxiv.org/pdf/1611.05916v3.pdf Squared Earth Mover’s Distance-based Loss for Training Deep Neural Networks

The squared EMD loss uses the predicted probabilities of all classes and penalizes mispredictions accordingly. In experiments, we evaluate the squared EMD loss on datasets with ordered classes, such as age estimation and image aesthetic judgment. We also generalize the squared EMD loss to classification datasets with unordered classes, such as ImageNet. Our results show that the squared EMD loss allows networks to achieve lower errors than the standard cross-entropy loss, and results in state-of-the-art performance on two age estimation datasets and one image aesthetic judgment dataset.
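
For ordered classes with unit ground distance between neighbouring classes, the squared EMD has a simple closed form: the sum of squared differences between the cumulative distributions of the prediction and the target. A minimal NumPy sketch (the example data are made up):

```python
# Sketch of a squared-EMD loss for ordered classes, using the closed form
# for 1-D histograms: sum of squared differences between the CDFs.
import numpy as np

def squared_emd_loss(pred_probs, target_probs):
    """pred_probs, target_probs: arrays of shape (batch, num_classes) whose
    rows sum to 1 and whose classes are ordered (e.g. ages)."""
    cdf_pred = np.cumsum(pred_probs, axis=1)
    cdf_target = np.cumsum(target_probs, axis=1)
    return np.mean(np.sum((cdf_pred - cdf_target) ** 2, axis=1))

# A prediction concentrated one class away from the true class is penalized
# less than one concentrated three classes away.
target = np.array([[0, 0, 1, 0, 0]], dtype=float)
near   = np.array([[0, 0.8, 0.2, 0, 0]], dtype=float)
far    = np.array([[0.8, 0.2, 0, 0, 0]], dtype=float)
print(squared_emd_loss(near, target), "<", squared_emd_loss(far, target))
```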

https://arxiv.org/pdf/1710.10044.pdf Distributional Reinforcement Learning with Quantile Regression

The importance of the distribution over returns in reinforcement learning has been (re)discovered and highlighted many times by now. In Bellemare, Dabney, and Munos (2017) the idea was taken a step further, and argued to be a central part of approximate reinforcement learning. However, the paper left open the question of whether there exists an algorithm which could bridge the gap between Wasserstein-metric theory and practical concerns.
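
The answer proposed in the paper is quantile regression over the return distribution. As a hedged illustration of the core ingredient, here is the plain quantile (pinball) loss in NumPy; the paper actually uses a Huber-smoothed variant inside a DQN-style agent, which this toy snippet does not attempt to reproduce.

```python
# Sketch of the quantile (pinball) loss underlying quantile-regression
# approaches to distributional RL. All data below are synthetic stand-ins.
import numpy as np

def quantile_loss(predicted_quantiles, target_samples, taus):
    """predicted_quantiles: shape (N,), estimates of return quantiles at levels taus.
    target_samples: shape (M,), samples of the target return distribution.
    taus: shape (N,), quantile levels in (0, 1)."""
    # Pairwise errors between every target sample and every quantile estimate.
    u = target_samples[None, :] - predicted_quantiles[:, None]          # (N, M)
    # Asymmetric weighting: the minimizer of this loss is the tau-th quantile
    # of the target distribution, which is what makes the fit distributional.
    loss = np.where(u > 0, taus[:, None] * u, (taus[:, None] - 1.0) * u)
    return loss.mean()

taus = (np.arange(1, 5) - 0.5) / 4.0                    # 4 quantile midpoints
theta = np.zeros(4)                                      # current quantile estimates
targets = np.random.default_rng(0).normal(1.0, 1.0, 256) # toy target returns
print(quantile_loss(theta, targets, taus))
```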

https://openreview.net/forum?id=ry-TW-WAb Improving GANs Using Optimal Transport

We present Optimal Transport GAN (OT-GAN), a variant of generative adversarial nets minimizing a new metric measuring the distance between the generator distribution and the data distribution. This metric, which we call mini-batch energy distance, combines optimal transport in primal form with an energy distance defined in an adversarially learned feature space, resulting in a highly discriminative distance function with unbiased mini-batch gradients. Experimentally we show OT-GAN to be highly stable when trained with large mini-batches, and we present state-of-the-art results on several popular benchmark problems for image generation.
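
A rough sketch of the mini-batch energy distance, assuming a fixed feature space (the paper learns it adversarially), a cosine ground cost, and plain Sinkhorn iterations for the entropy-regularized primal OT term; batch size, regularization strength, and the Gaussian stand-in "features" are all illustrative assumptions.

```python
# Sketch: mini-batch energy distance built from entropy-regularized OT costs.
import numpy as np

def cosine_cost(X, Y):
    # Pairwise cosine distances between rows of X and rows of Y.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return 1.0 - Xn @ Yn.T

def sinkhorn_cost(X, Y, eps=0.1, n_iter=200):
    # Entropy-regularized primal OT cost between two equally weighted mini-batches.
    C = cosine_cost(X, Y)
    K = np.exp(-C / eps)
    a = np.full(len(X), 1.0 / len(X))
    b = np.full(len(Y), 1.0 / len(Y))
    u = np.ones_like(a)
    for _ in range(n_iter):          # Sinkhorn fixed-point updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]  # approximate transport plan
    return np.sum(P * C)

def minibatch_energy_distance(X, Xp, Y, Yp):
    # X, Xp: two independent mini-batches from one distribution;
    # Y, Yp: two independent mini-batches from the other.
    return (2.0 * sinkhorn_cost(X, Y)
            - sinkhorn_cost(X, Xp)
            - sinkhorn_cost(Y, Yp))

rng = np.random.default_rng(0)
real = lambda: rng.normal(0.0, 1.0, size=(64, 16))   # stand-in for data features
fake = lambda: rng.normal(0.5, 1.0, size=(64, 16))   # stand-in for generator features
print(minibatch_energy_distance(fake(), fake(), real(), real()))
```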

https://openreview.net/forum?id=HkL7n1-0b Wasserstein Auto-Encoders github.com/tolstikhin/wae

We propose a new auto-encoder based on the Wasserstein distance, which improves on the sampling properties of the VAE.
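
A minimal sketch of the regularizer that distinguishes the MMD variant of the WAE from a VAE, assuming an RBF kernel and omitting the encoder/decoder networks: instead of a per-sample KL term, the aggregate distribution of encoded points is pushed toward the prior.

```python
# Sketch: MMD penalty between encoded codes and prior samples (WAE-style).
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-d2 / (2.0 * sigma**2))

def mmd_penalty(z_encoded, z_prior, sigma=1.0):
    """Biased MMD^2 estimate between encoded codes and prior samples."""
    k_xx = rbf_kernel(z_encoded, z_encoded, sigma)
    k_yy = rbf_kernel(z_prior, z_prior, sigma)
    k_xy = rbf_kernel(z_encoded, z_prior, sigma)
    return k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean()

rng = np.random.default_rng(0)
codes = rng.normal(0.3, 1.2, (128, 8))   # stand-in for encoder outputs
prior = rng.normal(0.0, 1.0, (128, 8))   # samples from the latent prior
print(mmd_penalty(codes, prior))         # total loss = reconstruction + lambda * this
```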

https://arxiv.org/abs/1803.00250v1 Wasserstein Distance Measure Machines

Our experimental results show that this Wasserstein distance embedding performs better than kernel mean embeddings, and that the Wasserstein distance is far more tractable to compute than pairwise Kullback-Leibler divergences between empirical distributions.
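
A toy sketch of the idea, assuming one-dimensional point clouds and SciPy: each empirical distribution is mapped to the vector of its 1-Wasserstein distances to a few reference distributions, giving a fixed-length embedding that any standard classifier can consume. The references, sample sizes, and Gaussian toy data are assumptions for illustration.

```python
# Sketch: embed empirical distributions by Wasserstein distances to references.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
references = [rng.normal(mu, 1.0, 200) for mu in (-2.0, 0.0, 2.0)]   # template distributions

def embed(sample):
    """Map an empirical distribution (1-D sample) to R^len(references)."""
    return np.array([wasserstein_distance(sample, ref) for ref in references])

bags = [rng.normal(rng.uniform(-3, 3), 1.0, 150) for _ in range(5)]  # toy "bag" data
X = np.stack([embed(bag) for bag in bags])
print(X.shape)   # (5, 3): one fixed-length feature vector per distribution
```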

https://arxiv.org/abs/1803.05573v1 Improving GANs Using Optimal Transport (arXiv version of the OT-GAN entry above)

https://arxiv.org/abs/1803.00567v1 Computational Optimal Transport

https://arxiv.org/abs/1808.09663 Wasserstein is all you need

https://arxiv.org/abs/1809.00013v1 Gromov-Wasserstein Alignment of Word Embedding Spaces

Builds on the idea that word embeddings arise from metric recovery algorithms: only the relational (intra-space) geometry is needed to align two embedding spaces.
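
A hedged sketch of Gromov-Wasserstein alignment on a toy problem, assuming the POT (Python Optimal Transport) library is installed; the "two languages" here are just a random point cloud and a rotated copy of it.

```python
# Sketch: Gromov-Wasserstein coupling between two embedding spaces via POT.
import numpy as np
import ot

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 30))                    # "source language" embeddings
Q, _ = np.linalg.qr(rng.normal(size=(30, 30)))   # unknown rotation
Y = X @ Q                                        # "target language" embeddings

C1 = ot.dist(X, X)                               # intra-space distance matrices
C2 = ot.dist(Y, Y)
C1 /= C1.max()
C2 /= C2.max()
p = ot.unif(len(X))
q = ot.unif(len(Y))

T = ot.gromov.gromov_wasserstein(C1, C2, p, q, loss_fun='square_loss')
# Since Y is just a rotation of X, the coupling should be close to diagonal;
# print the fraction of points matched back to themselves (ideally near 1).
print(np.mean(T.argmax(axis=1) == np.arange(len(X))))
```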

https://arxiv.org/abs/1810.03032v1 Constructing Graph Node Embeddings via Discrimination of Similarity Distributions

The problem of unsupervised learning of node embeddings in graphs is one of the important directions in modern network science. In this work we propose a novel framework aimed at finding embeddings by discriminating distributions of similarities (DDoS) between nodes in the graph. The general idea is implemented by maximizing the earth mover's distance between the distributions of decoded similarities of similar and dissimilar nodes. The resulting algorithm generates embeddings which give state-of-the-art performance on the problem of link prediction in real-world graphs.
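
A sketch of the quantity the objective drives apart, assuming dot-product decoding of node embeddings and using the 1-D earth mover's distance from SciPy; the random embeddings and pair lists are placeholders, and training would adjust the embeddings to maximize this separation.

```python
# Sketch: earth mover's distance between similarity scores of "similar"
# (connected) and "dissimilar" (non-connected) node pairs.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 16))                 # toy node embeddings
pos_pairs = rng.integers(0, 100, size=(500, 2))  # stand-ins for edges
neg_pairs = rng.integers(0, 100, size=(500, 2))  # stand-ins for non-edges

sim = lambda pairs: np.sum(emb[pairs[:, 0]] * emb[pairs[:, 1]], axis=1)
pos_sim, neg_sim = sim(pos_pairs), sim(neg_pairs)

# Training would adjust `emb` to maximize this separation; here we only
# evaluate it for fixed embeddings.
print(wasserstein_distance(pos_sim, neg_sim))
```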

https://arxiv.org/abs/1810.11447v1 Scalable Unbalanced Optimal Transport using Generative Adversarial Networks

We formulate unbalanced OT as a problem of simultaneously learning a transport map and a scaling factor that push a source measure to a target measure in a cost-optimal manner. In addition, we propose an algorithm for solving this problem based on stochastic alternating gradient updates, similar in practice to GANs. We also provide theoretical justification for this formulation, showing that it is closely related to an existing static formulation by Liero et al. (2018), and perform numerical experiments demonstrating how this methodology can be applied to population modeling.
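
As a hedged, static illustration of the objective (not the GAN-based solver): for a fixed candidate map T and scaling factor xi on a 1-D toy problem, one can evaluate a weighted transport cost, a penalty on mass creation/destruction, and a fit-to-target term. Here the fit term is a kernel MMD chosen purely for convenience; the paper estimates the corresponding divergence adversarially with a discriminator and updates T and xi by alternating stochastic gradients.

```python
# Sketch: evaluating an unbalanced-OT-style objective for fixed T and xi.
import numpy as np

def rbf(A, B, sigma=1.0):
    # RBF kernel matrix between two 1-D sample vectors.
    d2 = (A[:, None] - B[None, :]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def unbalanced_ot_objective(x, y, T, xi, lam_mass=1.0, lam_fit=20.0):
    # x: source samples, y: target samples, T: candidate transport map,
    # xi: candidate scaling factor (pointwise mass creation/destruction).
    Tx, w = T(x), xi(x)
    transport_cost = np.mean(w * (x - Tx) ** 2)      # E[ xi(x) * c(x, T(x)) ]
    mass_penalty = np.mean(w * np.log(w) - w + 1.0)  # KL-type penalty on the scaling
    # Fit of the scaled push-forward to the target, here a weighted kernel MMD^2
    # (the paper estimates this divergence adversarially with a discriminator).
    wn = w / w.sum()
    fit = (wn @ rbf(Tx, Tx) @ wn + rbf(y, y).mean()
           - 2.0 * wn @ rbf(Tx, y).mean(axis=1))
    return transport_cost + lam_mass * mass_penalty + lam_fit * fit

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 400)    # source measure
y = rng.normal(2.0, 1.0, 300)    # target measure
ones = lambda s: np.ones_like(s)
# A map that actually moves the source onto the target should score lower
# (better) than the identity map under this toy objective.
print(unbalanced_ot_objective(x, y, T=lambda s: s + 2.0, xi=ones))
print(unbalanced_ot_objective(x, y, T=lambda s: s, xi=ones))
```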