Name Neural Embedding (aka Vectorization, Word2Vec, *2Vec)

Intent Analogy through algebra.

Complex features can be projected into lower dimensions while capturing their intrinsic semantics.


Complex features can exist in extremely high dimensions, so classifying them directly can require a prohibitive amount of computational resources.
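
The intent, "analogy through algebra", can be illustrated with a minimal sketch. The vectors below are hand-picked toy values (hypothetical, not taken from any trained model), and the helper names `cosine` and `analogy` are introduced here for illustration only. The idea is that the offset between two embedded items can be added to a third, and the nearest remaining vector is read off as the answer.

```python
import math

# Toy 3-dimensional embeddings, hand-picked for illustration only
# (hypothetical values, not produced by any trained model).
EMBEDDINGS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.5, 0.5, 0.5],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, embeddings):
    """Answer 'a is to b as c is to ?' with the word nearest to b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])]
    # The three query words are excluded, as is common practice.
    candidates = (w for w in embeddings if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(embeddings[w], target))

print(analogy("man", "king", "woman", EMBEDDINGS))
```

With these toy values the query "man is to king as woman is to ?" selects "queen". Real embeddings behave less cleanly, as the analogy evaluation papers listed further down discuss.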




Known Uses

Related Patterns


  • Similarity Operator
  • Random Projections
  • Distributed Model
  • Disentangled Basis
  • Model Interpretability
  • Recurrent Layer
  • Dimensionless Features
  • Categorical Data
  • Sketching
  • Dimensional Reduction
  • Fingerprinting
  • Graph Embedding
  • Reusable Representation
  • Lower Dimensional Visualization



Neural Word Embeddings as Implicit Matrix Factorization. Omer Levy and Yoav Goldberg. NIPS 2014. https://levyomer.files.wordpress.com/2014/09/neural-word-embeddings-as-implicit-matrix-factorization.pdf We analyze skip-gram with negative-sampling (SGNS), a word embedding method introduced by Mikolov et al., and show that it is implicitly factorizing a word-context matrix, whose cells are the pointwise mutual information (PMI) of the respective word and context pairs, shifted by a global constant.
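
To make the factorized object concrete, here is a rough sketch of the shifted PMI matrix computed from a list of (word, context) pairs. The function name `shifted_pmi` and the sparse dictionary layout are assumptions made for this example. The global constant is taken to be log k, where k is the number of negative samples. Pairs that never co-occur have a PMI of minus infinity and are simply left out here.

```python
import math
from collections import Counter

def shifted_pmi(pairs, k=5):
    """Map each observed (word, context) pair to PMI(word, context) - log(k).

    `pairs` is a list of (word, context) tuples; `k` stands for the number
    of negative samples (an assumed default, chosen for illustration).
    """
    pair_counts = Counter(pairs)
    word_counts = Counter(w for w, _ in pairs)
    context_counts = Counter(c for _, c in pairs)
    total = len(pairs)
    return {
        (w, c): math.log(n * total / (word_counts[w] * context_counts[c]))
                - math.log(k)
        for (w, c), n in pair_counts.items()
    }

pairs = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]
print(shifted_pmi(pairs, k=2))
```

In this tiny corpus each pair has PMI log 2, so with k = 2 every shifted cell is zero.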




Text classification is very important in the commercial world; spam or clickbait filtering being perhaps the most ubiquitous example. There are tools that design models for general classification problems (such as Vowpal Wabbit or libSVM), but fastText is exclusively dedicated to text classification. This allows it to be quickly trained on extremely large datasets. We have seen results of models trained on more than 1 billion words in less than 10 minutes using a standard multicore CPU. FastText can also classify a half-million sentences among more than 300,000 categories in less than five minutes. See also: https://arxiv.org/pdf/1607.04606v1.pdf and https://arxiv.org/pdf/1607.01759v2.pdf



http://www.cs.columbia.edu/~blei/papers/RudolphRuizMandtBlei2016.pdf Exponential Family Embeddings

Each type of embedding model defines the context, the exponential family of conditional distributions, and how the latent embedding vectors are shared across data. We infer the embeddings with a scalable algorithm based on stochastic gradient descent.

https://arxiv.org/pdf/1607.04606v1.pdf Enriching Word Vectors with Subword Information

In this paper, we propose a new approach based on the skip-gram model, where each word is represented as a bag of character n-grams. A vector representation is associated to each character n-gram, words being represented as the sum of these representations. Our method is fast, allowing models to be trained on large corpora quickly. We evaluate the obtained word representations on five different languages, on word similarity and analogy tasks.
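
The bag-of-n-grams representation described above can be sketched as follows. The boundary markers `<` and `>` and the default n-gram range of 3 to 6 are assumptions of this example, as are the function names. The sketch also skips n-grams that have no vector, which is a simplification.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of `word` wrapped in boundary markers '<' and '>'."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    ]

def word_vector(word, ngram_vectors, dim):
    """Sum the vectors of the word's n-grams; n-grams without a vector are skipped."""
    total = [0.0] * dim
    for gram in char_ngrams(word):
        for j, x in enumerate(ngram_vectors.get(gram, ())):
            total[j] += x
    return total

print(char_ngrams("where", 3, 3))
```

Because the vector is assembled from subword pieces, a word that never appeared in training can still receive a representation as long as some of its n-grams did.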


We present Submatrix-wise Vector Embedding Learner (Swivel), a method for generating low-dimensional feature embeddings from a feature co-occurrence matrix. Swivel performs approximate factorization of the point-wise mutual information matrix via stochastic gradient descent. It uses a piecewise loss with special handling for unobserved co-occurrences, and thus makes use of all the information in the matrix. While this requires computation proportional to the size of the entire matrix, we make use of vectorized multiplication to process thousands of rows and columns at once to compute millions of predicted values. Furthermore, we partition the matrix into shards in order to parallelize the computation across many nodes. This approach results in more accurate embeddings than can be achieved with methods that consider only observed co-occurrences, and can scale to much larger corpora than can be handled with sampling methods. https://github.com/tensorflow/models/tree/master/swivel

http://jmlr.org/proceedings/papers/v37/kusnerb15.pdf From Word Embeddings To Document Distances


http://www.cl.cam.ac.uk/research/rainbow/projects/shape2vec/ https://github.com/ftasse/Shape2Vec

A neural network is trained to generate shape descriptors that lie close to a vector representation of the shape class, given a vector space of words. This method is easily extendable to range scans, hand-drawn sketches and images. This makes cross-modal retrieval possible, without a need to design different methods depending on the query type. We show that sketch-based shape retrieval using semantic-based descriptors outperforms the state-of-the-art by large margins, and mesh-based retrieval generates results of higher relevance to the query, than current deep shape descriptors.

https://arxiv.org/pdf/1704.04601v1.pdf MUSE: Modularizing Unsupervised Sense Embeddings

This paper proposes to address the word sense ambiguity issue in an unsupervised manner, where word sense representations are learned along a word sense selection mechanism given contexts.

https://arxiv.org/pdf/1704.08012v1.pdf Topically Driven Neural Language Model https://github.com/jhlau/topically-driven-language-model

Language models are typically applied at the sentence level, without access to the broader document context. We present a neural language model that incorporates document context in the form of a topic model-like architecture, thus providing a succinct representation of the broader document context outside of the current sentence. Experiments over a range of datasets demonstrate that our model outperforms a pure sentence-based model in terms of language model perplexity, and leads to topics that are potentially more coherent than those produced by a standard LDA topic model. Our model also has the ability to generate related sentences for a topic, providing another way to interpret topics.

https://arxiv.org/abs/1704.08424 Multimodal Word Distributions

Word embeddings provide point representations of words containing useful semantic information. We introduce multimodal word distributions formed from Gaussian mixtures, for multiple word meanings, entailment, and rich uncertainty information. To learn these distributions, we propose an energy-based max-margin objective. We show that the resulting approach captures uniquely expressive semantic information, and outperforms alternatives, such as word2vec skip-grams, and Gaussian embeddings, on benchmark datasets such as word similarity and entailment. https://github.com/benathi/word2gm

https://arxiv.org/pdf/1703.00993.pdf A Comparative Study of Word Embeddings for Reading Comprehension

https://arxiv.org/abs/1705.04301v1 A Feature Embedding Strategy for High-level CNN representations from Multiple ConvNets

https://arxiv.org/abs/1705.03556v1 Relevance-based Word Embedding

In this paper, we propose two learning models with different objective functions; one learns a relevance distribution over the vocabulary set for each query, and the other classifies each term as belonging to the relevant or non-relevant class for each query.

https://arxiv.org/pdf/1705.04416.pdf Evaluating vector-space models of analogy

We evaluate the parallelogram model of analogy as applied to modern word embeddings, providing a detailed analysis of the extent to which this approach captures human relational similarity judgments in a large benchmark dataset. We find that some semantic relationships are better captured than others. We then provide evidence for deeper limitations of the parallelogram model based on the intrinsic geometric constraints of vector spaces, paralleling classic results for first-order similarity.

https://arxiv.org/abs/1705.10359v1 Neural Embeddings of Graphs in Hyperbolic Space

Recent work has shown that the appropriate isometric space for embedding complex networks is not the flat Euclidean space, but negatively curved, hyperbolic space. We present a new concept that exploits these recent insights and propose learning neural embeddings of graphs in hyperbolic space. We provide experimental evidence that embedding graphs in their natural geometry significantly improves performance on downstream tasks for several real-world public datasets.

https://arxiv.org/abs/1705.08039 Poincaré Embeddings for Learning Hierarchical Representations

Representation learning has become an invaluable approach for learning from symbolic data such as text and graphs. However, while complex symbolic datasets often exhibit a latent hierarchical structure, state-of-the-art methods typically learn embeddings in Euclidean vector spaces, which do not account for this property. For this purpose, we introduce a new approach for learning hierarchical representations of symbolic data by embedding them into hyperbolic space – or more precisely into an n-dimensional Poincaré ball. Due to the underlying hyperbolic geometry, this allows us to learn parsimonious representations of symbolic data by simultaneously capturing hierarchy and similarity. We introduce an efficient algorithm to learn the embeddings based on Riemannian optimization and show experimentally that Poincaré embeddings outperform Euclidean embeddings significantly on data with latent hierarchies, both in terms of representation capacity and in terms of generalization ability.
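
As a small illustration of the geometry involved, the sketch below computes the standard geodesic distance between two points of the Poincaré ball model. The function name `poincare_distance` is an assumption of this example, and the Riemannian optimization used for training is not shown.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball."""
    sq_norm_u = sum(a * a for a in u)
    sq_norm_v = sum(b * b for b in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(
        1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    )

# The same Euclidean gap of 0.1 is much longer near the boundary.
print(poincare_distance([0.0, 0.0], [0.1, 0.0]))
print(poincare_distance([0.85, 0.0], [0.95, 0.0]))
```

Distances grow rapidly toward the boundary of the ball, which is what leaves room for the many leaves of a hierarchy while general concepts sit near the origin.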

https://arxiv.org/pdf/1705.11168v1.pdf Are distributional representations ready for the real world? Evaluating word vectors for grounded perceptual meaning

We find that several standard word representations fail to encode many salient perceptual features of concepts, and show that these deficits correlate with word-word similarity prediction errors. Our analyses provide motivation for grounded and embodied language learning approaches, which may help to remedy these deficits.

https://arxiv.org/pdf/1705.10819v1.pdf Surface Networks

We propose several upgrades to GNNs to leverage extrinsic differential geometry properties of three-dimensional surfaces, increasing its modeling power. In particular, we propose to exploit the Dirac operator, whose spectrum detects principal curvature directions; this is in stark contrast with the classical Laplace operator, which directly measures mean curvature. We coin the resulting model the Surface Network (SN). We demonstrate the efficiency and versatility of SNs on two challenging tasks: temporal prediction of mesh deformations under non-linear dynamics and generative models using a variational autoencoder framework with encoders/decoders given by SNs.

https://arxiv.org/pdf/1705.10900v1.pdf Does the Geometry of Word Embeddings Help Document Classification? A Case Study on Persistent Homology Based Representations

https://arxiv.org/abs/1706.00286 Learning to Compute Word Embeddings on the Fly

Words in natural language follow a Zipfian distribution whereby some words are frequent but most are rare. Learning representations for words in the “long tail” of this distribution requires enormous amounts of data. Representations of rare words trained directly on end-tasks are usually poor, requiring us to pre-train embeddings on external data, or treat all rare words as out-of-vocabulary words with a unique representation. We provide a method for predicting embeddings of rare words on the fly from small amounts of auxiliary data with a network trained against the end task. We show that this improves results against baselines where embeddings are trained on the end task in a reading comprehension task, a recognizing textual entailment task, and in language modelling.

https://arxiv.org/abs/1706.02413 PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space

In this work, we introduce a hierarchical neural network that applies PointNet recursively on a nested partitioning of the input point set. By exploiting metric space distances, our network is able to learn local features with increasing contextual scales.

https://arxiv.org/abs/1706.02496 Context encoders as a simple but powerful extension of word2vec

However, as only a single embedding is learned for every word in the vocabulary, the model fails to optimally represent words with multiple meanings. Additionally, it is not possible to create embeddings for new (out-of-vocabulary) words on the spot. Based on an intuitive interpretation of the continuous bag-of-words (CBOW) word2vec model's negative sampling training objective in terms of predicting context based similarities, we motivate an extension of the model we call context encoders (ConEc). By multiplying the matrix of trained word2vec embeddings with a word's average context vector, out-of-vocabulary (OOV) embeddings and representations for a word with multiple meanings can be created based on the word's local contexts. The benefits of this approach are illustrated by using these word embeddings as features in the CoNLL 2003 named entity recognition (NER) task.
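
The matrix product mentioned above can be sketched roughly as below. This is a simplified reading, not the authors' implementation: the average context vector is taken to be a normalized bag-of-words count over the vocabulary, and the names `context_encoder_embedding`, `vocab`, and `embedding_matrix` are assumptions of this example.

```python
def context_encoder_embedding(context_words, vocab, embedding_matrix):
    """Embed a word from its observed context words.

    Computes W^T x, where W is the (vocabulary x dimension) embedding matrix
    and x is the mean bag-of-words vector of the in-vocabulary context words.
    """
    index = {w: i for i, w in enumerate(vocab)}
    known = [w for w in context_words if w in index]
    if not known:
        raise ValueError("no context word is in the vocabulary")
    avg_context = [0.0] * len(vocab)
    for w in known:
        avg_context[index[w]] += 1.0 / len(known)
    dim = len(embedding_matrix[0])
    return [
        sum(avg_context[i] * embedding_matrix[i][j] for i in range(len(vocab)))
        for j in range(dim)
    ]

vocab = ["cat", "dog", "fish"]
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(context_encoder_embedding(["cat", "dog", "dog", "unknown"], vocab, W))
```

Since only the contexts of a word are needed, the same call works for an out-of-vocabulary word, or for one occurrence of an ambiguous word if only its local context is passed in.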


https://arxiv.org/abs/1707.01793v1 A Simple Approach to Learn Polysemous Word Embeddings

Many NLP applications require disambiguating polysemous words. Existing methods that learn polysemous word vector representations involve first detecting various senses and optimizing the sense-specific embeddings separately, which are invariably more involved than single sense learning methods such as word2vec. Evaluating these methods is also problematic, as rigorous quantitative evaluations in this space are limited, especially when compared with single-sense embeddings. In this paper, we propose a simple method to learn a word representation, given any context. Our method only requires learning the usual single sense representation, and coefficients that can be learnt via a single pass over the data. We propose several new test sets for evaluating word sense induction, relevance detection, and contextual word similarity, significantly supplementing the currently available tests. Results on these and other tests show that while our method is embarrassingly simple, it achieves excellent results when compared to the state of the art models for unsupervised polysemous word representation learning.



https://arxiv.org/abs/1707.02377v1 Efficient Vector Representation for Documents through Corruption

Doc2VecC represents each document as a simple average of word embeddings. It ensures a representation generated as such captures the semantic meanings of the document during learning. A corruption model is included, which introduces a data-dependent regularization that favors informative or rare words while forcing the embeddings of common and non-discriminative ones to be close to zero.
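
A minimal sketch of the averaging step follows. The corruption shown is one plausible reading (drop each token at random and rescale the survivors so the average is unbiased in expectation), and it is an assumption of this example. The paper's exact scheme and training objective are not reproduced, and the names `average_embedding` and `drop_prob` are hypothetical.

```python
import random

def average_embedding(tokens, embeddings, dim, drop_prob=0.0, rng=None):
    """Average the word vectors of `tokens`, optionally with random corruption.

    With drop_prob > 0 (must be below 1), each token is dropped with that
    probability and the remaining ones are scaled by 1 / (1 - drop_prob).
    This corruption scheme is an assumption made for illustration.
    """
    rng = rng or random.Random(0)
    total = [0.0] * dim
    if not tokens:
        return total
    scale = 1.0 / (1.0 - drop_prob)
    for token in tokens:
        if rng.random() < drop_prob:
            continue  # token removed by the corruption
        for j, x in enumerate(embeddings[token]):
            total[j] += scale * x
    return [x / len(tokens) for x in total]

toy = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
print(average_embedding(["a", "b"], toy, dim=2))
```

With `drop_prob=0.0` this reduces to the plain average of word embeddings.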

https://arxiv.org/abs/1705.03556v2 Relevance-based Word Embedding

The primary objective in various IR tasks is to capture relevance instead of term proximity, syntactic, or even semantic similarity. This is the motivation for developing unsupervised relevance-based word embedding models that learn word representations based on query-document relevance information. In this paper, we propose two learning models with different objective functions; one learns a relevance distribution over the vocabulary set for each query, and the other classifies each term as belonging to the relevant or non-relevant class for each query.

https://arxiv.org/abs/1707.04596 DocTag2Vec: An Embedding Based Multi-label Learning Approach for Document Tagging

In this work, we propose a novel yet simple approach called DocTag2Vec to accomplish this task. We substantially extend Word2Vec and Doc2Vec—two popular models for learning distributed representation of words and documents. In DocTag2Vec, we simultaneously learn the representation of words, documents, and tags in a joint vector space during training, and employ the simple k-nearest neighbor search to predict tags for unseen documents. In contrast to previous multi-label learning methods, DocTag2Vec directly deals with raw text instead of provided feature vector, and in addition, enjoys advantages like the learning of tag representation, and the ability of handling newly created tags.

https://github.com/yuvalpinter/mimick Mimicking Word Embeddings using Subword RNNs

https://arxiv.org/abs/1703.05908 Learning Robust Visual-Semantic Embeddings

Taking advantage of the recent success of unsupervised learning in deep neural networks, we propose an end-to-end learning framework that is able to extract more robust multi-modal representations across domains. The proposed method combines representation learning models (i.e., auto-encoders) together with cross-domain learning criteria.



http://www.cs.cornell.edu/~laurejt/papers/sgns-geometry-2017.pdf The strange geometry of skip-gram with negative sampling

Despite their ubiquity, word embeddings trained with skip-gram negative sampling (SGNS) remain poorly understood. We find that vector positions are not simply determined by semantic similarity, but rather occupy a narrow cone, diametrically opposed to the context vectors. We show that this geometric concentration depends on the ratio of positive to negative examples, and that it is neither theoretically nor empirically inherent in related embedding algorithms.

https://arxiv.org/pdf/1706.01967v2.pdf Synergistic Union of Word2Vec and Lexicon for Domain Specific Semantic Similarity


https://arxiv.org/abs/1710.02971v2 Network Embedding as Matrix Factorization: Unifying DeepWalk, LINE, PTE, and node2vec

https://openreview.net/pdf?id=S1xDcSR6W HyBed: Hyperbolic Neural Graph Embedding

https://arxiv.org/abs/1711.08014 The Riemannian Geometry of Deep Generative Models

In this paper we have introduced methods for exploring the Riemannian geometry of manifolds learned by deep generative models. Our experiments show that these models represent real image data with manifolds that have surprisingly little curvature. Consequently, straight lines in the latent space are relatively close to geodesic curves on the manifold. This fact may explain why traversal in the latent space results in visually plausible changes to the generated data: curvilinear distances in the original data metric are roughly preserved.

Also, even for the results presented here, the role of curvature should not be completely discounted: there are still differences between latent distances and geodesic distances that may have more nuanced effects in certain applications.

https://arxiv.org/abs/1801.01884v2 Unsupervised Low-Dimensional Vector Representations for Words, Phrases and Text that are Transparent, Scalable, and produce Similarity Metrics that are Complementary to Neural Embeddings

We present here a simple unsupervised method for representing words, phrases or text as a low dimensional vector, in which the meaning and relative importance of dimensions is transparent to inspection.

https://www.biorxiv.org/content/early/2018/02/01/258665 Connecting conceptual and spatial search via a model of generalization

https://arxiv.org/abs/1802.05365 Deep contextualized word representations

We introduce a new type of deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). Our word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. We show that these representations can be easily added to existing models and significantly improve the state of the art across six challenging NLP problems, including question answering, textual entailment and sentiment analysis. We also present an analysis showing that exposing the deep internals of the pre-trained network is crucial, allowing downstream models to mix different types of semi-supervision signals.

https://arxiv.org/abs/1709.06673v2 Why PairDiff works? – A Mathematical Analysis of Bilinear Relational Compositional Operators for Analogy Detection

https://arxiv.org/abs/1803.04488v1 Concept2vec: Metrics for Evaluating Quality of Embeddings for Ontological Concepts

In this paper, we introduce a framework containing three distinct tasks concerned with the individual aspects of ontological concepts: (i) the categorization aspect, (ii) the hierarchical aspect, and (iii) the relational aspect. Then, in the scope of each task, a number of intrinsic metrics are proposed for evaluating the quality of the embeddings. Furthermore, w.r.t. this framework multiple experimental studies were run to compare the quality of the available embedding models. Employing this framework in future research can reduce misjudgment and provide greater insight about quality comparisons of embeddings for ontological concepts.

https://arxiv.org/abs/1803.05651v1 Word2Bits - Quantized Word Vectors

Word vectors require significant amounts of memory and storage, posing issues to resource limited devices like mobile phones and GPUs. We show that high quality quantized word vectors using 1-2 bits per parameter can be learned by introducing a quantization function into Word2Vec. We furthermore show that training with the quantization function acts as a regularizer. We train word vectors on English Wikipedia (2017) and evaluate them on standard word similarity and analogy tasks and on question answering (SQuAD). Our quantized word vectors not only take 8-16x less space than full precision (32 bit) word vectors but also outperform them on word similarity tasks and question answering.
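
To give a feel for what one bit per parameter means, here is a rough sketch of sign-based quantization with bit packing. The level of 1/3 and the function names are assumptions of this example and should not be read as the paper's exact quantization function. In the paper the quantization is applied during training, which is not shown here.

```python
def quantize_1bit(vector, level=1.0 / 3.0):
    """Map each component to +level or -level by sign (zero counts as positive)."""
    return [level if x >= 0 else -level for x in vector]

def pack_bits(vector):
    """Pack the signs of a vector into one integer, one bit per component."""
    bits = 0
    for i, x in enumerate(vector):
        if x >= 0:
            bits |= 1 << i
    return bits

def unpack_bits(bits, dim, level=1.0 / 3.0):
    """Rebuild the quantized vector from its packed sign bits."""
    return [level if (bits >> i) & 1 else -level for i in range(dim)]

v = [0.2, -0.7, 0.0, -0.1]
packed = pack_bits(v)
print(packed, unpack_bits(packed, len(v)))
```

Only the packed bits need to be stored, and unpacking recovers exactly the quantized vector.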

https://arxiv.org/abs/1804.00891 Hyperspherical Variational Auto-Encoders

But although the default choice of a Gaussian distribution for both the prior and posterior represents a mathematically convenient distribution often leading to competitive results, we show that this parameterization fails to model data with a latent hyperspherical structure. To address this issue we propose using a von Mises-Fisher (vMF) distribution instead, leading to a hyperspherical latent space. Through a series of experiments we show how such a hyperspherical VAE, or S-VAE, is more suitable for capturing data with a hyperspherical latent structure, while outperforming a normal, N-VAE, in low dimensions on other data types. https://github.com/nicola-decao/s-vae

https://arxiv.org/pdf/1803.11175v1.pdf Universal Sentence Encoder

https://arxiv.org/abs/1804.01882v1 Hyperbolic Entailment Cones for Learning Hierarchical Embeddings

We here present a novel method to embed directed acyclic graphs. Following prior work, we first advocate for using hyperbolic spaces which provably model tree-like structures better than Euclidean geometry. Second, we view hierarchical relations as partial orders defined using a family of nested geodesically convex cones. We prove that these entailment cones admit an optimal shape with a closed form expression both in the Euclidean and hyperbolic spaces. Moreover, they canonically define the embedding learning process. Experiments show significant improvements of our method over strong recent baselines both in terms of representational capacity and generalization.

https://arxiv.org/abs/1803.00502v3 PIP Distance: A Unitary-invariant Metric for Understanding Functionality and Dimensionality of Vector Embeddings

With tools from perturbation and stability theory, we provide an upper bound on the PIP loss using the signal spectrum and noise variance, both of which can be readily inferred from data. Our framework sheds light on many empirical phenomena, including the existence of an optimal dimension, and the robustness of embeddings against over-parametrization. The bias-variance tradeoff of PIP loss explicitly answers the fundamental open problem of dimensionality selection for vector embeddings. https://github.com/aaaasssddf/PIP-experiments

In this paper, we introduce a mathematically sound theory for vector embeddings, from a stability point of view. Our theory answers some open questions, in particular: 1. What is an appropriate metric for comparing different vector embeddings? 2. How to select dimensionality for vector embeddings? 3. Why people choose different dimensionalities but they all work well in practice? We present a theoretical analysis for embeddings starting from first principles. We first propose a novel objective, the Pairwise Inner Product (PIP) loss. The PIP loss is closely related to the functionality differences between the embeddings, and a small PIP loss means the two embeddings are close for all practical purposes. We then develop matrix perturbation tools that quantify the objective, for embeddings explicitly or implicitly obtained from matrix factorization. Practical, data-driven upper bounds will also be given. Finally, we conduct extensive empirical studies and validate our theory on real datasets. With this theory, we provide answers to three open questions about vector embeddings, namely the robustness to over-parametrization, forward stability, and dimensionality selection.
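
The PIP loss can be sketched in a few lines. The sketch assumes the loss is the Frobenius norm of the difference between the two pairwise inner product matrices E E^T. The function names are introduced here for illustration, and the perturbation bounds from the paper are not covered.

```python
import math

def pip_matrix(E):
    """Pairwise inner products between all rows (word vectors) of E."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in E] for r1 in E]

def pip_loss(E1, E2):
    """Frobenius norm of the difference between the two PIP matrices."""
    P1, P2 = pip_matrix(E1), pip_matrix(E2)
    return math.sqrt(sum(
        (x - y) ** 2
        for row1, row2 in zip(P1, P2)
        for x, y in zip(row1, row2)
    ))

E = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
E_rotated = [[0.0, 1.0], [-2.0, 0.0], [-1.0, 1.0]]  # E rotated by 90 degrees
E_small = [[1.0], [0.0], [1.0]]                      # a 1-dimensional embedding
print(pip_loss(E, E_rotated), pip_loss(E, E_small))
```

Rotating an embedding leaves all inner products unchanged, so the loss between `E` and `E_rotated` is zero, which illustrates the unitary invariance named in the title. Embeddings of different dimensionality stay comparable because both PIP matrices are vocabulary by vocabulary.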

https://www.arxiv-vanity.com/papers/1803.04488/ Concept2vec: Metrics for Evaluating Quality of Embeddings for Ontological Concepts

Since ontological concepts play a crucial role in knowledge graphs, providing high quality embeddings for them is highly important. In this paper, we introduced a framework containing three distinct tasks concerned with the individual aspects of ontological concepts, (i) the categorization aspect, (ii) the hierarchical aspect, and (iii) the relational aspect. Then, for each task a number of intrinsic metrics were proposed for evaluating the quality of the embeddings. Furthermore, we prepared a suitable data set and ran a series of comparison studies on the popular embedding models for ontological concepts. We encourage the research community to utilize this framework in their future evaluation scenarios on embedding models.

https://arxiv.org/abs/1804.06323 When and Why are Pre-trained Word Embeddings Useful for Neural Machine Translation?

https://arxiv.org/abs/1709.03856v5 StarSpace: Embed All The Things!

https://arxiv.org/abs/1804.07983v1 Context-Attentive Embeddings for Improved Sentence Representations

While one of the first steps in many NLP systems is selecting what embeddings to use, we argue that such a step is better left for neural networks to figure out by themselves. To that end, we introduce a novel, straightforward yet highly effective method for combining multiple types of word embeddings in a single model, leading to state-of-the-art performance within the same model class on a variety of tasks. We subsequently show how the technique can be used to shed new insight into the usage of word embeddings in NLP systems.

We argue that the decision of which word embeddings to use in what setting should be left to the neural network. While people usually pick one type of word embeddings for their NLP systems and then stick with it, we find that a context-attentive approach, where embeddings are selected depending on the context, leads to better results. In addition, we showed that the proposed mechanism leads to better interpretability and insightful linguistic analysis. We showed that the network learns to select different embeddings for different data and different tasks.

https://arxiv.org/abs/1804.09843v1 Hierarchical Density Order Embeddings

https://arxiv.org/abs/1805.09786 Hyperbolic Attention Networks

We introduce hyperbolic attention networks to endow neural networks with enough capacity to match the complexity of data with hierarchical and power-law structure. A few recent approaches have successfully demonstrated the benefits of imposing hyperbolic geometry on the parameters of shallow networks. We extend this line of work by imposing hyperbolic geometry on the activations of neural networks. This allows us to exploit hyperbolic geometry to reason about embeddings produced by deep networks. We achieve this by re-expressing the ubiquitous mechanism of soft attention in terms of operations defined for hyperboloid and Klein models. Our method shows improvements in terms of generalization on neural machine translation, learning on graphs and visual question answering tasks while keeping the neural representations compact.


We introduced Latent Embedding Optimization (LEO), a gradient-based meta-learning technique which uses a parameter generative model in order to capture the diverse range of parameters useful for a distribution over tasks, paving the way for a new state-of-the-art result on the challenging 5-way 1-shot miniImageNet classification problem. LEO is able to achieve this by reducing the effective numbers of adapted parameters by one order of magnitude, while still making use of large models with millions of parameters for feature extraction. This approach leads to a computationally inexpensive optimization-based meta-learner with best in class generalization performance.

https://arxiv.org/abs/1808.06068 SeVeN: Augmenting Word Embeddings with Unsupervised Relation Vectors

https://arxiv.org/abs/1809.01703 Hyperbolic Recommender Systems

Unlike Euclidean spaces, Hyperbolic spaces are intrinsically equipped to handle hierarchical structure, encouraged by its property of exponentially increasing distances away from origin. We propose HyperBPR (Hyperbolic Bayesian Personalized Ranking), a conceptually simple but highly effective model for the task at hand. Our proposed HyperBPR not only outperforms their Euclidean counterparts, but also achieves state-of-the-art performance on multiple benchmark datasets, demonstrating the effectiveness of personalized recommendation in Hyperbolic space.

https://arxiv.org/abs/1809.01498 Skip-gram word embeddings in hyperbolic space

Embeddings of tree-like graphs in hyperbolic space were recently shown to surpass their Euclidean counterparts in performance by a large margin. Inspired by these results, we present an algorithm for learning word embeddings in hyperbolic space from free text. An objective function based on the hyperbolic distance is derived and included in the skip-gram architecture from word2vec. The hyperbolic word embeddings are then evaluated on word similarity and analogy benchmarks. The results demonstrate the potential of hyperbolic word embeddings, particularly in low dimensions, though without clear superiority over their Euclidean counterparts. We further discuss problems in the formulation of the analogy task resulting from the curvature of hyperbolic space.

https://arxiv.org/abs/1311.1539v1 Category-Theoretic Quantitative Compositional Distributional Models of Natural Language Semantics

https://arxiv.org/abs/1705.04416v2 Evaluating vector-space models of analogy

We find that some semantic relationships are better captured than others. We then provide evidence for deeper limitations of the parallelogram model based on the intrinsic geometric constraints of vector spaces, paralleling classic results for first-order similarity.

https://arxiv.org/abs/1810.04882v1 Towards Understanding Linear Word Analogies

https://arxiv.org/pdf/1809.06211.pdf ManifoldNet: A Deep Network Framework for Manifold-valued Data https://github.com/jjbouza/manifold-net

https://openreview.net/forum?id=S1ry6Y1vG Faster Neural Networks Straight from JPEG

https://openreview.net/forum?id=rJg4J3CqFm Learning Entropic Wasserstein Embeddings

We examine empirically the representational capacity of such learned Wasserstein embeddings, showing that they can embed a wide variety of complex metric structures with smaller distortion than an equivalent Euclidean embedding. We also investigate an application to word embedding, demonstrating a unique advantage of Wasserstein embeddings: we can directly visualize the high-dimensional embedding, as it is a probability distribution on a low-dimensional space. This obviates the need for dimensionality reduction techniques such as t-SNE for visualization.