Encoder-Decoder is a pattern for learning a transformation from one representation to another. An encoder network maps the input to a context vector, and a decoder network then decodes that context vector to produce the output. The pattern underlies sequence-to-sequence training of RNNs for automatic language translation, where the input is a text string in one language and the output is a text string in another language. More generally, the encoder-decoder pattern is not limited to RNNs or to text. Visual captioning systems are one example: a CNN-based encoder processes the image and a decoder generates the caption text.
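
A minimal sketch of the pattern in PyTorch, assuming a GRU encoder and decoder and a toy vocabulary; all class names and sizes here are illustrative, not a specific paper's model.

```python
# Minimal encoder-decoder sketch (illustrative only). The encoder compresses a source
# token sequence into a context vector; the decoder generates target tokens from it.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)

    def forward(self, src):                      # src: (batch, src_len)
        _, h = self.rnn(self.embed(src))         # h: (1, batch, hidden) -- the context vector
        return h

class Decoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, tgt, context):             # tgt: (batch, tgt_len)
        outputs, _ = self.rnn(self.embed(tgt), context)
        return self.out(outputs)                 # logits: (batch, tgt_len, vocab)

# Usage: encode the source once, then decode the target conditioned on the context vector.
enc, dec = Encoder(1000, 256), Decoder(1000, 256)
src = torch.randint(0, 1000, (4, 12))
tgt = torch.randint(0, 1000, (4, 9))
logits = dec(tgt, enc(src))
```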

https://arxiv.org/abs/1605.07912v4 Review Networks for Caption Generation

The review network performs a number of review steps with an attention mechanism on the encoder hidden states, and outputs a thought vector after each review step; the thought vectors are used as the input of the attention mechanism in the decoder. We show that conventional encoder-decoders are a special case of our framework.

The intuition behind the review network is to review all the information encoded by the encoder and produce vectors that are a more compact, abstractive, and global representation than the original encoder hidden states.
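
A hedged sketch of the review idea: a fixed number of review steps, each attending over all encoder hidden states and emitting a "thought vector". The attention form, cell type, and number of steps are simplifying assumptions, not the authors' exact implementation.

```python
# Reviewer module sketch: produces a small set of thought vectors that summarize the
# encoder hidden states; the decoder would then attend over these thought vectors.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Reviewer(nn.Module):
    def __init__(self, hidden_size, num_review_steps=8):
        super().__init__()
        self.steps = num_review_steps
        self.attn = nn.Linear(2 * hidden_size, 1)
        self.cell = nn.GRUCell(hidden_size, hidden_size)

    def forward(self, enc_states):                     # enc_states: (batch, src_len, hidden)
        b, n, h = enc_states.shape
        state = enc_states.mean(dim=1)                 # initial reviewer state
        thoughts = []
        for _ in range(self.steps):
            # attention of the current reviewer state over all encoder hidden states
            query = state.unsqueeze(1).expand(b, n, h)
            scores = self.attn(torch.cat([enc_states, query], dim=-1)).squeeze(-1)
            weights = F.softmax(scores, dim=-1)        # (batch, src_len)
            context = (weights.unsqueeze(-1) * enc_states).sum(dim=1)
            state = self.cell(context, state)          # update reviewer state
            thoughts.append(state)
        # (batch, num_review_steps, hidden): a compact, global summary for the decoder
        return torch.stack(thoughts, dim=1)
```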

https://arxiv.org/abs/1610.10099v1 Neural Machine Translation in Linear Time

We present a neural translation model, the ByteNet, and a neural language model, the ByteNet Decoder, that aim at addressing these drawbacks. The ByteNet uses convolutional neural networks with dilation for both the source network and the target network. The ByteNet connects the source and target networks via stacking and unfolds the target network dynamically to generate variable length output sequences. We view the ByteNet as an instance of a wider family of sequence-mapping architectures that stack the sub-networks and use dynamic unfolding. The sub-networks themselves may be convolutional or recurrent.
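
A rough sketch of one ingredient the abstract mentions: a stack of dilated 1-D convolutions for the source network, with dilation doubling per layer so the receptive field grows exponentially. Kernel size, depth, and the absence of residual blocks are assumptions for brevity, not the ByteNet's exact configuration.

```python
# Dilated convolutional stack sketch: dilation doubles each layer (1, 2, 4, ...),
# giving an exponentially growing receptive field while preserving sequence length.
import torch
import torch.nn as nn

def dilated_stack(channels, num_layers=6, kernel_size=3):
    layers = []
    for i in range(num_layers):
        d = 2 ** i
        layers += [
            nn.Conv1d(channels, channels, kernel_size,
                      dilation=d, padding=d * (kernel_size - 1) // 2),
            nn.ReLU(),
        ]
    return nn.Sequential(*layers)

source_net = dilated_stack(channels=128)
x = torch.randn(2, 128, 50)                  # (batch, channels, source length)
source_repr = source_net(x)                  # same length; the target network is stacked
print(source_repr.shape)                     # on top and unfolded step by step at decode
                                             # time -> torch.Size([2, 128, 50])
```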

https://arxiv.org/abs/1610.02415 Automatic chemical design using a data-driven continuous representation of molecules

We report a method to convert discrete representations of molecules to and from a multidimensional continuous representation. This generative model allows efficient search and optimization through open-ended spaces of chemical compounds. We train deep neural networks on hundreds of thousands of existing chemical structures to construct two coupled functions: an encoder and a decoder. The encoder converts the discrete representation of a molecule into a real-valued continuous vector, and the decoder converts these continuous vectors back to the discrete representation from this latent space. Continuous representations allow us to automatically generate novel chemical structures by performing simple operations in the latent space, such as decoding random vectors, perturbing known chemical structures, or interpolating between molecules. Continuous representations also allow the use of powerful gradient-based optimization to efficiently guide the search for optimized functional compounds. We demonstrate our method in the design of drug-like molecules as well as organic light-emitting diodes.
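
The latent-space operations the abstract lists can be illustrated with a small sketch in which `encode` and `decode` stand in for the paper's trained networks; both functions below are placeholders, and the 56-dimensional latent space is an arbitrary assumption.

```python
# Illustration of decoding random vectors, perturbing a known molecule, and interpolating
# between two molecules in a continuous latent space. encode/decode are stand-ins only.
import numpy as np

LATENT_DIM = 56

def encode(smiles: str) -> np.ndarray:
    # placeholder: the real encoder maps a discrete molecule (e.g. a SMILES string)
    # to a real-valued latent vector
    rng = np.random.default_rng(abs(hash(smiles)) % (2**32))
    return rng.normal(size=LATENT_DIM)

def decode(z: np.ndarray) -> str:
    # placeholder: the real decoder maps a latent vector back to a discrete molecule
    return f"<molecule near z with norm {np.linalg.norm(z):.2f}>"

# 1) decode a random vector -> novel structure
print(decode(np.random.normal(size=LATENT_DIM)))

# 2) perturb a known molecule
z = encode("CCO")
print(decode(z + 0.1 * np.random.normal(size=LATENT_DIM)))

# 3) interpolate between two molecules
z_a, z_b = encode("CCO"), encode("c1ccccc1")
for t in np.linspace(0.0, 1.0, 5):
    print(decode((1 - t) * z_a + t * z_b))
```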

http://smerity.com/articles/2016/google_nmt_arch.html

http://gabgoh.github.io/ThoughtVectors/index.html

https://arxiv.org/abs/1804.00823v2 Graph2Seq: Graph to Sequence Learning with Attention-based Neural Networks

In this work, we present a general end-to-end approach that maps the input graph to a sequence of vectors and then uses an attention-based LSTM to decode the target sequence from these vectors. Specifically, to address the inevitable information loss of data conversion, we introduce a novel graph-to-sequence neural network model that follows the encoder-decoder architecture. Our method first uses an improved graph-based neural network to generate the node and graph embeddings, with a novel aggregation strategy that incorporates edge direction information into the node embeddings. We also propose an attention-based mechanism that aligns node embeddings and the decoding sequence to better cope with large graphs. Experimental results on the bAbI, Shortest Path, and Natural Language Generation tasks demonstrate that our model achieves state-of-the-art performance and significantly outperforms other baselines. We also show that with the proposed aggregation strategy, our model converges quickly to good performance.
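
A sketch of direction-aware neighborhood aggregation in the spirit of the abstract: each node aggregates its in-neighbors and out-neighbors separately and concatenates the results with its own features. Mean aggregation and a single propagation hop are simplifying assumptions, not Graph2Seq's exact aggregator.

```python
# Direction-aware aggregation sketch: forward (along edges) and backward (against edges)
# neighborhoods are pooled separately, so edge direction shapes the node embeddings.
import torch
import torch.nn as nn

class DirectionalAggregator(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.update = nn.Linear(3 * dim, dim)   # [self | forward agg | backward agg]

    def forward(self, node_feats, edges):       # node_feats: (N, dim); edges: list of (src, dst)
        N, dim = node_feats.shape
        fwd, bwd = torch.zeros(N, dim), torch.zeros(N, dim)
        fwd_cnt, bwd_cnt = torch.zeros(N, 1), torch.zeros(N, 1)
        for src, dst in edges:                  # accumulate along and against edge direction
            fwd[dst] += node_feats[src]; fwd_cnt[dst] += 1
            bwd[src] += node_feats[dst]; bwd_cnt[src] += 1
        fwd = fwd / fwd_cnt.clamp(min=1)
        bwd = bwd / bwd_cnt.clamp(min=1)
        h = torch.cat([node_feats, fwd, bwd], dim=-1)
        return torch.relu(self.update(h))       # updated node embeddings (N, dim)

# A graph embedding (e.g. pooled node embeddings) would initialize the decoder, which then
# attends over the node embeddings at every decoding step.
agg = DirectionalAggregator(dim=16)
out = agg(torch.randn(4, 16), edges=[(0, 1), (1, 2), (2, 3), (3, 0)])
```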

https://pdfs.semanticscholar.org/bc54/9b2a22c25f5e50efb2ccabf624a8d156feac.pdf Fast Decoding in Sequence Models Using Discrete Latent Variables

https://openreview.net/pdf?id=B14TlG-RW QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension

We propose a new Q&A architecture called QANet, which does not require recurrent networks: Its encoder consists exclusively of convolution and self-attention, where convolution models local interactions and self-attention models global interactions. On the SQuAD dataset, our model is 3x to 13x faster in training and 4x to 9x faster in inference, while achieving equivalent accuracy to recurrent models. The speed-up gain allows us to train the model with much more data.

Our core innovation is to completely remove the recurrent networks in the encoder. The resulting model is fully feedforward, composed entirely of separable convolutions, attention, linear layers, and layer normalization, which makes it suitable for parallel computation. The resulting model is both fast and accurate: it surpasses the best published results on the SQuAD dataset while being up to 13x/9x faster than a competitive recurrent model for a training/inference iteration. Additionally, we find that we are able to achieve significant gains by utilizing data augmentation consisting of translating context and passage pairs to and from another language as a way of paraphrasing the questions and contexts.
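
A hedged sketch of one recurrence-free encoder block in this style: a depthwise separable convolution for local interactions, multi-head self-attention for global interactions, and a position-wise feedforward layer, all with layer norm and residual connections. The layer counts, kernel size, and head count are assumptions, not the paper's exact configuration.

```python
# Convolution + self-attention encoder block sketch (no recurrence), fully parallelizable.
import torch
import torch.nn as nn

class ConvSelfAttentionBlock(nn.Module):
    def __init__(self, dim=128, kernel_size=7, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.depthwise = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
        self.pointwise = nn.Conv1d(dim, dim, 1)
        self.norm2 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):                                   # x: (batch, seq_len, dim)
        # separable convolution (local interactions) with residual connection
        y = self.norm1(x).transpose(1, 2)                   # Conv1d expects (batch, dim, seq)
        x = x + self.pointwise(self.depthwise(y)).transpose(1, 2)
        # self-attention (global interactions) with residual connection
        y = self.norm2(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]
        # position-wise feedforward with residual connection
        return x + self.ff(self.norm3(x))

block = ConvSelfAttentionBlock()
out = block(torch.randn(2, 40, 128))                        # shape (batch, seq_len, dim) preserved
```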

https://arxiv.org/abs/1808.07233 Neural Architecture Optimization

We call this new approach neural architecture optimization (NAO). There are three key components in our proposed approach: (1) An encoder embeds/maps neural network architectures into a continuous space. (2) A predictor takes the continuous representation of a network as input and predicts its accuracy. (3) A decoder maps a continuous representation of a network back to its architecture.
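
A schematic of the three components as the abstract lists them, assuming the architecture is serialized as a token sequence; all layer choices and sizes below are illustrative assumptions, not NAO's actual networks.

```python
# NAO-style sketch: (1) encoder maps an architecture to a continuous code, (2) a predictor
# maps the code to an accuracy estimate, (3) a decoder maps the code back to an architecture.
import torch
import torch.nn as nn

class NAOSketch(nn.Module):
    def __init__(self, vocab_size=20, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)       # (1) arch -> continuous
        self.predictor = nn.Sequential(nn.Linear(hidden, hidden),     # (2) continuous -> accuracy
                                       nn.ReLU(), nn.Linear(hidden, 1))
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)       # (3) continuous -> arch
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, arch_tokens):                  # arch_tokens: (batch, seq_len)
        emb = self.embed(arch_tokens)
        _, z = self.encoder(emb)                     # z: (1, batch, hidden) continuous code
        acc = self.predictor(z.squeeze(0))           # predicted accuracy (searched by gradient
        dec_out, _ = self.decoder(emb, z)            # ascent in the continuous space)
        logits = self.out(dec_out)                   # reconstruct the architecture tokens
        return z.squeeze(0), acc, logits

model = NAOSketch()
z, acc, logits = model(torch.randint(0, 20, (4, 10)))
```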

https://arxiv.org/abs/1808.03867 Pervasive Attention: 2D Convolutional Neural Networks for Sequence-to-Sequence Prediction