Instead of predicting one symbol at a time, our multi-scale model makes predictions over multiple, potentially overlapping multi-symbol tokens. A variation of the byte-pair encoding (BPE) compression algorithm is used to learn the dictionary of tokens that the model is trained with. When applied to language modeling, our model has the flexibility of character-level models while maintaining many of the performance benefits of word-level models. Our experiments show that this model performs better than a regular LSTM on language modeling tasks, especially for smaller models.
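
The dictionary of multi-symbol tokens mentioned above can be learned with byte-pair encoding. The sketch below implements standard BPE merges over a toy corpus in plain Python; the paper uses a variation of BPE, so the corpus, merge count, and function name here are illustrative assumptions rather than the authors' exact procedure.

<code python>
# Minimal sketch of standard byte-pair encoding: repeatedly merge the most
# frequent adjacent pair of symbols into a single multi-symbol token.
from collections import Counter

def learn_bpe(corpus, num_merges):
    # Start from character-level symbols, one tuple per word, weighted by count.
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, replacing the best pair with one merged token.
        new_words = Counter()
        for symbols, freq in words.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    return merges

print(learn_bpe("low lower lowest slow slowly", num_merges=6))
</code>

The printed merge operations, applied in order, define the token dictionary: frequent character pairs are merged first and grow into progressively longer multi-symbol tokens.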

An Analysis of Neural Language Modeling at Multiple Scales (https://arxiv.org/abs/1803.08240)

Many of the leading approaches in language modeling introduce novel, complex, and specialized architectures. We take existing state-of-the-art word-level language models based on LSTMs and QRNNs and extend them to both larger vocabularies and character-level granularity. When properly tuned, LSTMs and QRNNs achieve state-of-the-art results on character-level (Penn Treebank, enwik8) and word-level (WikiText-103) datasets, respectively. Results are obtained in only 12 hours (WikiText-103) to 2 days (enwik8) using a single modern GPU.

By extending an existing state-of-the-art word-level language model based on LSTMs and QRNNs, we show that a well-tuned baseline can achieve state-of-the-art results on both character-level (Penn Treebank, enwik8) and word-level (WikiText-103) datasets without relying on complex or specialized architectures. We additionally perform an empirical investigation of the learning and network dynamics of both LSTM and QRNN cells across different language modeling tasks, highlighting the differences between the learned character- and word-level models. Finally, we present results which shed light on the relative importance of the various hyperparameters in neural language models. On the WikiText-2 dataset, the AWD-QRNN model exhibited higher sensitivity to the hidden-to-hidden weight dropout and input dropout terms and relative insensitivity to the embedding and hidden layer sizes. We hope that this insight will be useful for practitioners intending to tune similar models on new datasets.
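
For readers unfamiliar with the terminology, the "hidden-to-hidden weight dropout" mentioned above is DropConnect-style dropout applied to the recurrent weight matrix itself, while input dropout is applied to the inputs entering the recurrent layer. The PyTorch sketch below illustrates both on a simple RNN cell; the cell, sizes, and rates are illustrative and not the AWD-LSTM/AWD-QRNN implementation.

<code python>
# Illustrative simple RNN cell showing weight dropout (DropConnect on the
# recurrent weights) and input dropout; not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightDropRNNCell(nn.Module):
    def __init__(self, input_size, hidden_size, weight_dropout=0.5, input_dropout=0.4):
        super().__init__()
        self.w_ih = nn.Linear(input_size, hidden_size)
        self.w_hh = nn.Parameter(torch.randn(hidden_size, hidden_size) * 0.05)
        self.weight_dropout = weight_dropout
        self.input_drop = nn.Dropout(input_dropout)

    def forward(self, x_t, h):
        # Input dropout: applied to the embeddings fed into the cell.
        x_t = self.input_drop(x_t)
        # Hidden-to-hidden weight dropout: randomly zero entries of the
        # recurrent weight matrix rather than the hidden activations.
        w_hh = F.dropout(self.w_hh, p=self.weight_dropout, training=self.training)
        return torch.tanh(self.w_ih(x_t) + h @ w_hh.t())

cell = WeightDropRNNCell(input_size=400, hidden_size=1150)
h = torch.zeros(8, 1150)              # batch of 8 hidden states
h = cell(torch.randn(8, 400), h)      # one time step
</code>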

We analyze the relative importance of the hyperparameters defining the model using a Random Forest approach for the word-level task on the smaller WikiText-2 dataset for the AWD-QRNN model. The results show that weight dropout, hidden dropout, and embedding dropout impact performance the most, while the number of layers and the embedding and hidden dimension sizes matter relatively less. Similar results are observed on the PTB word-level dataset.
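
As a sketch of what such an analysis can look like in practice, one can fit a Random Forest regressor that predicts validation perplexity from logged hyperparameter settings and read off its feature importances. The code below is not the paper's; the hyperparameter names match the ones discussed above, but the data is a synthetic placeholder.

<code python>
# Rank hyperparameter importance with a Random Forest (sketch with fake data).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

hyperparams = ["weight_dropout", "hidden_dropout", "embedding_dropout",
               "num_layers", "embedding_size", "hidden_size"]

# Hypothetical experiment log: one row of hyperparameter settings per run,
# paired with the validation perplexity that run achieved.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, len(hyperparams)))
y = 65 + 30 * X[:, 0] + 20 * X[:, 1] + rng.normal(scale=2.0, size=200)

# Fit a forest to predict perplexity from the settings, then use its
# impurity-based feature importances as a proxy for relative influence.
forest = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)
for name, score in sorted(zip(hyperparams, forest.feature_importances_),
                          key=lambda item: -item[1]):
    print(f"{name:>17s}: {score:.3f}")
</code>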