https://arxiv.org/abs/1511.07916 Natural Language Understanding with Distributed Representation

These are lecture notes for the course DS-GA 3001 "Natural Language Understanding with Distributed Representation" at the Center for Data Science, New York University, in Fall 2015. As the name of the course suggests, the notes introduce readers to a neural network based approach to natural language understanding/processing. In order to make them as self-contained as possible, I spend considerable time describing the basics of machine learning and neural networks, and only afterwards introduce how they are used for natural language. On the language front, I focus almost exclusively on language modelling and machine translation, which I personally find most fascinating and most fundamental to natural language understanding.

https://arxiv.org/pdf/1610.00479v2.pdf Nonsymbolic Text Representation

Character-level models are attracting increasing interest. We group them into three classes. (i) Character-level models of words derive a word representation from the character string, but they are symbolic in that they need text segmented into tokens as input. (ii) Bag-of-ngram models discard the order of character ngrams, on the assumption that any relevant order information is already encoded in the ngrams themselves. (iii) End-to-end models learn a separate model on the raw character (or byte) input for each task; these models estimate task-specific parameters, but no representation of text that would be usable across tasks is computed.

We take the view that end-to-end learning without any representation learning is not a good approach for NLP. We propose the first nonsymbolic utilization method that fully represents sequence information, in contrast to utilization methods like bag-of-ngrams that discard sequence information not directly encoded in the character ngrams themselves. We show that our models perform better than prior work on an information extraction task and a text denoising task.
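
To make the contrast concrete, here is a minimal sketch (plain Python, illustrative only) of the order-discarding bag-of-ngrams baseline the paper argues against; the paper's own nonsymbolic representation is not reproduced here.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Overlapping character n-grams of a raw string (no tokenization needed)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def bag_of_ngrams(text, n=3):
    """Order-discarding representation: only n-gram counts survive."""
    return Counter(char_ngrams(text, n))

# Two different strings with identical trigram counts collapse to the same bag,
# which is exactly the sequence information that gets lost.
print(bag_of_ngrams("abcab") == bag_of_ngrams("cabca"))   # True
```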

http://openreview.net/pdf?id=BJC_jUqxe A Self-Attentive Sentence Embedding

This paper proposes a new model for extracting an interpretable sentence embedding by introducing self-attention. Instead of using a vector, we use a 2-D matrix to represent the embedding, with each row of the matrix attending on a different part of the sentence. We also propose a self-attention mechanism and a special regularization term for the model. As a side effect, the embedding comes with an easy way of visualizing what specific parts of the sentence are encoded into the embedding. We evaluate our model on 3 different tasks: author profiling, sentiment classification and textual entailment. Results show that our model yields a significant performance gain compared to other sentence embedding methods in all of the 3 tasks.

The model is able to encode any sequence with variable length into a fixed size representation, without suffering from long-term dependency problems. This brings a lot of scalability to the model: without any modification, it can be applied directly to longer contents like paragraphs, articles, etc.
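
A minimal numpy sketch of the 2-D embedding idea: r attention distributions over the tokens are computed as A = softmax(W2 tanh(W1 H^T)) and the sentence embedding is M = A H, with a penalty term that pushes the r rows to focus on different parts of the sentence. Dimension names and the random inputs are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attentive_embedding(H, W1, W2):
    """H: (n, 2u) BiLSTM hidden states over n tokens.
    W1: (d_a, 2u), W2: (r, d_a)  ->  A: (r, n), M: (r, 2u)."""
    A = softmax(W2 @ np.tanh(W1 @ H.T), axis=-1)  # r attention distributions over tokens
    M = A @ H                                     # 2-D sentence embedding, one row per "view"
    I = np.eye(A.shape[0])
    penalty = np.linalg.norm(A @ A.T - I) ** 2    # encourages the r rows to differ
    return M, A, penalty

n, u, d_a, r = 7, 4, 5, 3
rng = np.random.default_rng(0)
H = rng.normal(size=(n, 2 * u))
M, A, penalty = self_attentive_embedding(H, rng.normal(size=(d_a, 2 * u)),
                                          rng.normal(size=(r, d_a)))
print(M.shape, A.shape)   # (3, 8) (3, 7)
```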

https://explosion.ai/blog/deep-learning-formula-nlp Embed, encode, attend, predict: The new deep learning formula for state-of-the-art NLP models

https://arxiv.org/abs/1611.01462 Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling

Recurrent neural networks have been very successful at predicting sequences of words in tasks such as language modeling. However, all such models are based on the conventional classification framework, where the model is trained against one-hot targets and each word is represented both as an input and as an output in isolation. This causes inefficiencies in learning, both in terms of utilizing all of the available information and in terms of the number of parameters needed to train. We introduce a novel theoretical framework that facilitates better learning in language modeling, and show that our framework leads to tying together the input embedding and the output projection matrices, greatly reducing the number of trainable variables. Our LSTM model lowers the state-of-the-art word-level perplexity on the Penn Treebank to 68.5.
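
The practical consequence of the framework (tying the input embedding to the output projection) can be sketched in a few lines; the shapes are illustrative, and a real model would of course learn E rather than leave it random.

```python
import numpy as np

V, d = 10000, 256                    # vocabulary size, embedding/hidden size
E = np.random.randn(V, d) * 0.01     # single shared matrix

def embed(token_ids):
    return E[token_ids]              # input lookup uses E

def output_logits(hidden):
    return hidden @ E.T              # output projection reuses E (tied weights)

h = np.random.randn(d)
print(output_logits(h).shape)        # (10000,) -- no separate V x d softmax matrix to train
```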

https://arxiv.org/abs/1605.07725 Adversarial Training Methods for Semi-Supervised Text Classification

We extend adversarial and virtual adversarial training to the text domain by applying perturbations to the word embeddings in a recurrent neural network rather than to the original input itself. The proposed method achieves state-of-the-art results on multiple benchmark semi-supervised and purely supervised tasks. We provide visualizations and analysis showing that the learned word embeddings have improved in quality and that the model is less prone to overfitting during training.
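
A hedged sketch of the core step: the adversarial perturbation is the loss gradient with respect to the word embeddings, rescaled to a small norm and added back before a second forward pass. The gradient is passed in as a placeholder here because numpy has no autodiff; `epsilon` and the function names are illustrative.

```python
import numpy as np

def adversarial_perturbation(grad_wrt_embeddings, epsilon=0.02):
    """Fast-gradient style perturbation r_adv = epsilon * g / ||g||_2,
    applied to the (normalized) word embeddings rather than to raw text."""
    g = grad_wrt_embeddings
    return epsilon * g / (np.linalg.norm(g) + 1e-12)

# Usage inside a training step (pseudocode around the numpy core):
#   emb     = embed(batch)                         # (T, d) word embeddings
#   g       = grad(loss(model(emb), labels), emb)  # from the autodiff framework
#   emb_adv = emb + adversarial_perturbation(g)
#   total   = loss(model(emb), labels) + loss(model(emb_adv), labels)
```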

https://arxiv.org/pdf/1612.04629v1.pdf How Grammatical is Character-level Neural Machine Translation? Assessing MT Quality with Contrastive Translation Pairs

We propose a novel method to assess how well NMT systems model specific linguistic phenomena such as agreement over long distances, the production of novel words, and the faithful translation of polarity.

http://dlacombejr.github.io/2016/11/13/deep-learning-for-regex.html#prusaEmbeddings2016 (note the reference to an IEEE article about character-level embeddings)

https://arxiv.org/abs/1611.01724 Words or Characters? Fine-grained Gating for Reading Comprehension

Previous work combines word-level and character-level representations using concatenation or scalar weighting, which is suboptimal for high-level tasks like reading comprehension. We present a fine-grained gating mechanism to dynamically combine word-level and character-level representations based on properties of the words. We also extend the idea of fine-grained gating to modeling the interaction between questions and paragraphs for reading comprehension. Experiments show that our approach can improve the performance on reading comprehension tasks, achieving new state-of-the-art results on the Children's Book Test dataset. To demonstrate the generality of our gating mechanism, we also show improved results on a social media tag prediction task.
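
A minimal sketch of the gating idea described above: a per-dimension gate, computed from word-level features (e.g. part-of-speech, named-entity and frequency indicators), mixes the character-level and word-level vectors element by element instead of using a single scalar weight. Shapes and feature choices are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fine_grained_gate(word_vec, char_vec, word_features, Wg, bg):
    """word_vec, char_vec: (d,) representations of the same token.
    word_features: (f,) token properties (e.g. POS/NER/frequency indicators).
    Gate g: (d,) -- a separate mixing weight per dimension, not one scalar."""
    g = sigmoid(Wg @ word_features + bg)
    return g * char_vec + (1.0 - g) * word_vec

d, f = 8, 5
rng = np.random.default_rng(1)
h = fine_grained_gate(rng.normal(size=d), rng.normal(size=d),
                      rng.normal(size=f), rng.normal(size=(d, f)), np.zeros(d))
print(h.shape)  # (8,)
```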

https://arxiv.org/pdf/1701.02025v1.pdf Multi-level Representations for Fine-Grained Typing of Knowledge Base Entities

Entities are essential elements of natural language. In this paper, we present methods for learning multi-level representations of entities on three complementary levels: character (character patterns in entity names extracted, e.g., by neural networks), word (embeddings of words in entity names) and entity (entity embeddings).

We confirm experimentally that each level of representation contributes complementary information and a joint representation of all three levels improves the existing embedding based baseline for fine-grained entity typing by a large margin. Additionally, we show that adding information from entity descriptions further improves multi-level representations of entities.
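
A rough sketch of how the three levels could be combined, assuming each level yields a fixed-size vector: here the character level is a toy hashed-trigram average and the word level is an embedding average, standing in for the neural encoders used in the paper, and the joint representation is sketched as a simple concatenation.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 16
C = rng.normal(size=(1000, d))        # toy hashed character-trigram embedding table
word_emb = {"barack": rng.normal(size=d), "obama": rng.normal(size=d)}

def char_level(name, n=3):
    """Mean of hashed character-trigram embeddings of the entity name (illustration only)."""
    grams = [name[i:i + n] for i in range(max(len(name) - n + 1, 1))]
    return C[[hash(g) % C.shape[0] for g in grams]].mean(axis=0)

def word_level(name):
    """Mean of (pretrained) embeddings of the words in the entity name."""
    return np.mean([word_emb.get(w, np.zeros(d)) for w in name.lower().split()], axis=0)

def multi_level(name, entity_vec):
    """Joint representation: concatenation of character-, word- and entity-level vectors."""
    return np.concatenate([char_level(name), word_level(name), entity_vec])

print(multi_level("Barack Obama", rng.normal(size=d)).shape)   # (48,)
```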

https://arxiv.org/abs/1103.0398 Natural Language Processing (almost) from Scratch

https://arxiv.org/pdf/1702.00764v1.pdf Symbolic, Distributed and Distributional Representations for Natural Language Processing in the Era of Deep Learning: A Survey

https://arxiv.org/pdf/1703.04213v1.pdf MetaPAD: Meta Pattern Discovery from Massive Text Corpora

Mining textual patterns in news, tweets, papers, and many other kinds of text corpora has been an active theme in text mining and NLP research. Previous studies adopt a dependency parsing-based pattern discovery approach. However, the parsing results lose rich context around entities in the patterns, and the process is costly for a large-scale corpus. In this study, we propose a novel typed textual pattern structure, called meta pattern, which is extended to a frequent, informative, and precise subsequence pattern in a given context. We propose an efficient framework, called MetaPAD, which discovers meta patterns from massive corpora with three techniques: (1) it develops a context-aware segmentation method to carefully determine the boundaries of patterns with a learned pattern quality assessment function, which avoids costly dependency parsing and generates high-quality patterns; (2) it identifies and groups synonymous meta patterns from multiple facets (their types, contexts, and extractions); and (3) it examines the type distributions of entities in the instances extracted by each group of patterns, and looks for appropriate type levels to make the discovered patterns precise. Experiments demonstrate that the proposed framework discovers high-quality typed textual patterns efficiently from different genres of massive corpora and facilitates information extraction. https://github.com/mjiang89/MetaPAD

https://www.youtube.com/watch?v=nFCxTtBqF5U Representations for Language: From Word Embeddings to Sentence Meanings

https://arxiv.org/abs/1704.00559 Neural Lattice-to-Sequence Models for Uncertain Inputs

The input to a neural sequence-to-sequence model is often determined by an upstream system, e.g. a word segmenter, part-of-speech tagger, or speech recognizer. These upstream models are potentially error-prone. Representing inputs through word lattices makes this uncertainty explicit by capturing alternative sequences and their posterior probabilities in a compact form. In this work, we extend the TreeLSTM (Tai et al., 2015) into a LatticeLSTM that is able to consume word lattices and can be used as an encoder in an attentional encoder-decoder model.

https://arxiv.org/pdf/1607.04492v2.pdf Neural Tree Indexers for Text Understanding

In this paper, we introduce a robust, syntactic-parsing-independent tree-structured model, Neural Tree Indexers (NTI), that provides a middle ground between sequential RNNs and syntactic tree-based recursive models. NTI constructs a full n-ary tree by processing the input text with its node function in a bottom-up fashion. An attention mechanism can then be applied to both the structure and the node function.

Code for the experiments and NTI is available at https://bitbucket.org/tsendeemts/nti
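
A hedged sketch of the bottom-up composition NTI performs: leaves are token vectors, and a node function (an MLP here; the paper also considers LSTM-style node functions) repeatedly combines adjacent children until a single root vector remains, building a full binary tree over the sequence without any syntactic parse. The attention over nodes is omitted.

```python
import numpy as np

def node_fn(left, right, W, b):
    """Node function: composes two child vectors into a parent vector (MLP variant)."""
    return np.tanh(W @ np.concatenate([left, right]) + b)

def nti_encode(leaves, W, b):
    """Bottom-up full binary-tree composition over the token vectors `leaves`."""
    level = list(leaves)
    while len(level) > 1:
        nxt = [node_fn(level[i], level[i + 1], W, b)
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:              # an odd node is carried up unchanged
            nxt.append(level[-1])
        level = nxt
    return level[0]                     # root vector = sequence representation

d = 8
rng = np.random.default_rng(3)
tokens = [rng.normal(size=d) for _ in range(5)]
print(nti_encode(tokens, rng.normal(size=(d, 2 * d)), np.zeros(d)).shape)   # (8,)
```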

https://arxiv.org/pdf/1704.04859v1.pdf Learning Character-level Compositionality with Visual Features

https://arxiv.org/abs/1704.06918 Neural Machine Translation via Binary Code Prediction

We propose a new method for calculating the output layer in neural machine translation systems. The method is based on predicting a binary code for each word and can reduce the computation time and memory requirements of the output layer to be logarithmic in the vocabulary size in the best case.

In this study, we proposed neural machine translation models that indirectly predict output words via binary codes, together with two model improvements: a hybrid prediction model that uses both softmax and binary codes, and error-correcting codes that make binary code prediction more robust. Experiments show that the proposed model achieves translation quality comparable to standard softmax prediction while significantly reducing the number of parameters in the output layer and improving computation speed during training and, especially, testing.
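
The key trick can be sketched as follows: instead of a V-way softmax, the output layer predicts B = ceil(log2 V) independent bits, and each word is identified by its assigned bit pattern, so the output parameters scale with log V rather than V. The paper's hybrid softmax model and error-correcting codes are omitted, and the identity-based code assignment below is purely illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

V, d = 10000, 256
B = int(np.ceil(np.log2(V)))            # 14 bits instead of a 10000-way softmax
W = np.random.randn(B, d) * 0.01        # output layer: B x d instead of V x d
codes = {w: tuple((w >> i) & 1 for i in range(B)) for w in range(V)}   # word id -> bit pattern
inverse = {c: w for w, c in codes.items()}

def predict_word(hidden):
    bits = tuple(int(p > 0.5) for p in sigmoid(W @ hidden))
    return inverse.get(bits, 0)          # unused bit patterns fall back to a default id (e.g. UNK)

print(B, predict_word(np.random.randn(d)))
```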

https://arxiv.org/abs/1606.01549v3 Gated-Attention Readers for Text Comprehension

Our model, the Gated-Attention (GA) Reader, integrates a multi-hop architecture with a novel attention mechanism, which is based on multiplicative interactions between the query embedding and the intermediate states of a recurrent neural network document reader. https://github.com/bdhingra/ga-reader
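
The multiplicative interaction at the heart of the GA Reader can be sketched directly: at each hop, every document token's intermediate state attends over the query token states and is then gated (elementwise-multiplied) by the resulting query-aware vector before being passed to the next recurrent layer. Shapes are illustrative and the multi-hop BiGRU machinery is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_attention(D, Q):
    """D: (n_doc, d) document token states, Q: (n_query, d) query token states.
    Each document token attends over the query and is gated by the result."""
    scores = D @ Q.T                   # (n_doc, n_query) token-pair interactions
    alpha = softmax(scores, axis=-1)   # attention over query tokens, per document token
    q_tilde = alpha @ Q                # (n_doc, d) query-aware vector for each token
    return D * q_tilde                 # elementwise (multiplicative) gating

rng = np.random.default_rng(4)
X = gated_attention(rng.normal(size=(20, 8)), rng.normal(size=(5, 8)))
print(X.shape)   # (20, 8) -- fed to the next recurrent layer in the multi-hop reader
```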

https://arxiv.org/pdf/1704.08531v1.pdf A Survey of Neural Network Techniques for Feature Extraction from Text

https://homes.cs.washington.edu/~luheng/files/acl2017_hllz.pdf Deep Semantic Role Labeling: What Works and What’s Next

We introduce a new deep learning model for semantic role labeling (SRL) that significantly improves the state of the art, along with detailed analyses that reveal its strengths and limitations. We use a deep highway BiLSTM architecture with constrained decoding, while observing a number of recent best practices for initialization and regularization. Our 8-layer ensemble model achieves 83.2 F1 on the CoNLL 2005 test set and 83.4 F1 on CoNLL 2012, roughly a 10% relative error reduction over the previous state of the art. Extensive empirical analysis of these gains shows that (1) deep models excel at recovering long-distance dependencies but can still make surprisingly obvious errors, and (2) there is still room for syntactic parsers to improve these results.
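
The "highway" part of the architecture refers to gated skip connections between the stacked LSTM layers; the sketch below shows just that combination step, with an illustrative stand-in for the LSTM output and a gate parameterization that may differ in detail from the paper's.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway(x, layer_out, Wt, bt):
    """Highway connection: a transform gate t decides, per dimension, whether to pass
    the layer's output or to copy the layer's input straight through."""
    t = sigmoid(Wt @ x + bt)
    return t * layer_out + (1.0 - t) * x

d = 8
rng = np.random.default_rng(5)
x = rng.normal(size=d)                       # input to an LSTM layer
h = np.tanh(rng.normal(size=(d, d)) @ x)     # stand-in for that layer's LSTM output
print(highway(x, h, rng.normal(size=(d, d)), np.zeros(d)).shape)   # (8,)
```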

https://arxiv.org/abs/1706.03757v2 Semantic Entity Retrieval Toolkit

We describe the Semantic Entity Retrieval Toolkit (SERT) that provides implementations of our previously published entity representation models. The toolkit provides a unified interface to different representation learning algorithms, fine-grained parsing configuration and can be used transparently with GPUs. In addition, users can easily modify existing models or implement their own models in the framework. After model training, SERT can be used to rank entities according to a textual query and extract the learned entity/word representation for use in downstream algorithms, such as clustering or recommendation. https://github.com/cvangysel/SERT

http://www.biorxiv.org/content/early/2017/07/11/162099 Text mining of 15 million full-text scientific articles

We subsequently compare the findings to corresponding results obtained on 16.5 million abstracts included in MEDLINE and show that text mining of full-text articles consistently outperforms using abstracts only.

https://arxiv.org/abs/1708.00055 SemEval-2017 Task 1: Semantic Textual Similarity - Multilingual and Cross-lingual Focused Evaluation

https://github.com/facebookresearch/SentEval A python tool for evaluating the quality of sentence embeddings.

http://deeplearningkit.org/2016/04/23/deep-learning-for-named-entity-recognition/?twitter=@bigdata

https://github.com/salesforce/awd-lstm-lm

https://arxiv.org/abs/1708.06426v1 Cold Fusion: Training Seq2Seq Models Together with Language Models

Sequence-to-sequence (Seq2Seq) models with attention have excelled at tasks that involve generating natural language sentences, such as machine translation, image captioning and speech recognition. Performance has further been improved by leveraging unlabeled data, often in the form of a language model. In this work, we present the Cold Fusion method, which leverages a pre-trained language model during training, and show its effectiveness on the speech recognition task. We show that Seq2Seq models with Cold Fusion are able to better utilize language information, enjoying (i) faster convergence and better generalization, and (ii) almost complete transfer to a new domain while using less than 10% of the labeled training data.
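
Roughly, Cold Fusion projects the pre-trained language model's output into a hidden vector, gates it with a learned gate conditioned on the decoder state, and concatenates the result with the decoder state before the final softmax. The sketch below follows that idea with illustrative shapes and a simplified parameterization rather than the paper's exact one.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cold_fusion_step(s_dec, lm_logits, W_lm, W_gate, b_gate, W_out):
    """s_dec: (d,) decoder state; lm_logits: (V,) pre-trained LM output for this step."""
    h_lm = np.maximum(W_lm @ lm_logits, 0.0)              # project LM logits to a hidden vector
    g = sigmoid(W_gate @ np.concatenate([s_dec, h_lm]) + b_gate)
    fused = np.concatenate([s_dec, g * h_lm])             # gated LM features joined to decoder state
    return W_out @ fused                                  # logits of the fused model

d, V = 8, 50
rng = np.random.default_rng(6)
logits = cold_fusion_step(rng.normal(size=d), rng.normal(size=V),
                          rng.normal(size=(d, V)), rng.normal(size=(d, 2 * d)),
                          np.zeros(d), rng.normal(size=(V, 2 * d)))
print(logits.shape)   # (50,)
```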

https://arxiv.org/pdf/1708.00781v1.pdf Dynamic Entity Representations in Neural Language Models

Understanding a long document requires tracking how entities are introduced and evolve over time. We present a new type of language model, ENTITYNLM, that can explicitly model entities, dynamically update their representations, and contextually generate their mentions. Our model is generative and flexible; it can model an arbitrary number of entities in context while generating each entity mention at an arbitrary length. In addition, it can be used for several different tasks such as language modeling, coreference resolution, and entity prediction. Experimental results with all these tasks demonstrate that our model consistently outperforms strong baselines and prior work.

https://arxiv.org/abs/1806.11532v1 TextWorld: A Learning Environment for Text-based Games https://www.microsoft.com/en-us/research/project/textworld/