SoftMax Approximation

Known Uses

To reduce latency to under 200 ms, we had to evolve our neural network architecture further. We replaced the softmax layer with a hierarchical softmax layer, which traverses a tree of words instead of a flat list of words and is therefore more efficient.
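The tree traversal can be illustrated with a minimal sketch. It assumes a balanced binary tree over a vocabulary whose size is a power of two, stored in heap layout. The class name `HierarchicalSoftmax`, the random weights, and the tree shape are illustrative assumptions, not the architecture used in the system quoted above. Each internal node makes one left/right decision with a sigmoid, so scoring a word costs about log2(V) dot products instead of V.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class HierarchicalSoftmax:
    """Hypothetical balanced-tree hierarchical softmax (illustration only)."""

    def __init__(self, vocab_size, hidden_dim, seed=0):
        # Assumption: vocabulary size is a power of two so the tree is complete.
        assert vocab_size >= 2 and vocab_size & (vocab_size - 1) == 0
        rng = random.Random(seed)
        self.vocab_size = vocab_size
        self.depth = vocab_size.bit_length() - 1  # log2(vocab_size)
        # Heap layout: internal nodes are 1..V-1, leaves are V..2V-1.
        # Index 0 is unused; each internal node owns one weight vector.
        self.node_weights = [
            [rng.gauss(0.0, 0.1) for _ in range(hidden_dim)]
            for _ in range(vocab_size)
        ]

    def word_prob(self, word_id, hidden):
        """P(word | hidden) as a product of binary decisions on the root-to-leaf path."""
        node = self.vocab_size + word_id  # leaf index for this word
        prob = 1.0
        while node > 1:
            parent = node // 2
            p_right = sigmoid(dot(self.node_weights[parent], hidden))
            # Odd indices are right children, even indices are left children.
            prob *= p_right if node % 2 == 1 else 1.0 - p_right
            node = parent
        return prob
```

Because the two branch probabilities at every internal node sum to one, the leaf probabilities form a valid distribution without any normalization over the full vocabulary. Published variants typically build the tree from data (for example by word frequency or clustering) rather than using a fixed balanced tree.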

References

A Scalable Hierarchical Distributed Language Model

We introduce a fast hierarchical language model along with a simple feature-based algorithm for automatic construction of word trees from the data. We then show that the resulting models can outperform non-hierarchical neural models as well as the best n-gram models.

Distributed Representations of Words and Phrases and their Compositionality

Strategies for Training Large Vocabulary Neural Language Models

Dealing with a large number of classes – Likelihood, Discrimination or Ranking?

In contrast to recently introduced alternative approaches, a simple approximation of the standard maximum likelihood objective provides an easily implementable and competitive method for fast large-class classification.

Efficient softmax approximation for GPUs

Pointer Sentinel Mixture Models

Recent neural network sequence models with softmax classifiers have achieved their best language modeling performance only with very large hidden states and large vocabularies. Even then they struggle to predict rare or unseen words even if the context makes the prediction unambiguous. We introduce the pointer sentinel mixture architecture for neural sequence models which has the ability to either reproduce a word from the recent context or produce a word from a standard softmax classifier. Our pointer sentinel-LSTM model achieves state of the art language modeling performance on the Penn Treebank (70.9 perplexity) while using far fewer parameters than a standard softmax LSTM. In order to evaluate how well language models can exploit longer contexts and deal with more realistic vocabularies and larger corpora we also introduce the freely available WikiText corpus.

Be Careful What You Backpropagate: A Case For Linear Output Activations & Gradient Boosting

In this work, we show that saturating output activation functions, such as the softmax, impede learning on a number of standard classification tasks. Moreover, we present results showing that the utility of softmax does not stem from the normalization, as some have speculated. In fact, the normalization makes things worse. Rather, the advantage is in the exponentiation of error gradients. This exponential gradient boosting is shown to speed up convergence and improve generalization. To this end, we demonstrate faster convergence and better performance on diverse classification tasks: image classification using CIFAR-10 and ImageNet, and semantic segmentation using PASCAL VOC 2012. In the latter case, using the state-of-the-art neural network architecture, the model converged 33% faster with our method (roughly two days of training less) than with the standard softmax activation, and with a slightly better performance to boot.

Taking the consequence of this, by e.g. skipping the normalization term of the softmax, we get significant improvement in our NN training, and at no other cost than a few minutes of coding. The only drawback is the introduction of some new hyper-parameters, α, β, and the target values. However, these have been easy to choose, and we do not expect that a lot of tedious fine-tuning is required in the general case.

Breaking the Softmax Bottleneck: A High-Rank RNN Language Model

We formulate language modeling as a matrix factorization problem, and show that the expressiveness of Softmax-based models (including the majority of neural language models) is limited by a Softmax bottleneck. Given that natural language is highly context-dependent, this further implies that in practice Softmax with distributed word embeddings does not have enough capacity to model natural language.

Specifically, we introduce discrete latent variables into a recurrent language model, and formulate the next-token probability distribution as a Mixture of Softmaxes (MoS). Mixture of Softmaxes is more expressive than Softmax and other surrogates considered in prior work. Moreover, we show that MoS learns matrices that have much larger normalized singular values and thus much higher rank than Softmax and other baselines on real-world datasets.
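The Mixture of Softmaxes idea can be sketched in a few lines. This is a minimal illustration, assuming the per-component context vectors and the mixture scores are passed in directly; in a real model they would be produced by learned projections of the recurrent state. The function names and argument layout are hypothetical. The point it shows is that the components are mixed in probability space, after each softmax, rather than in logit space.

```python
import math


def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_of_softmaxes(component_hiddens, prior_logits, embeddings):
    """Next-token distribution as a weighted sum of K softmaxes.

    component_hiddens: K context vectors of length d (assumed given)
    prior_logits:      K scores, turned into mixture weights by a softmax
    embeddings:        V output word vectors of length d
    """
    weights = softmax(prior_logits)
    mixed = [0.0] * len(embeddings)
    for pi_k, h_k in zip(weights, component_hiddens):
        # Logits of component k: inner product of its context vector with each word vector.
        logits = [sum(h * e for h, e in zip(h_k, emb)) for emb in embeddings]
        for i, p in enumerate(softmax(logits)):
            mixed[i] += pi_k * p  # mix probabilities, not logits
    return mixed
```

With a single component the function reduces to an ordinary softmax. With several components, the log of the mixed probabilities is no longer a linear function of one hidden vector, which is how the mixture avoids the low-rank limit described above.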