This is a curated list of papers that I have encountered in some capacity and deem worth including in the NLP practitioner's library. Some papers may appear in multiple sub-categories if they don't fit neatly into just one of the boxes.
PRs are absolutely welcome!
Some special designations for certain papers:
💡 LEGEND: This is a game-changer in the NLP literature and worth reading.
📚 RESOURCE: This paper introduces some dataset/resource and hence may be useful for application purposes.
- (2000) A Statistical Part-of-Speech Tagger
- TLDR: Seminal paper demonstrating a powerful HMM-based POS tagger. Many tips and tricks for building such classical systems included.
- (2003) Feature-rich part-of-speech tagging with a cyclic dependency network
- TLDR: Proposes a number of powerful linguistic features for building a (then) SOTA POS-tagging system.
- (2015) Bidirectional LSTM-CRF Models for Sequence Tagging
- TLDR: Proposes an elegant sequence-tagging model combining neural networks with conditional random fields, achieving SOTA in POS-tagging, NER, and chunking.
- (2003) Accurate unlexicalized parsing 💡
- TLDR: Beautiful paper demonstrating that unlexicalized probabilistic context-free grammars can exceed the performance of lexicalized PCFGs.
- (2014) A Fast and Accurate Dependency Parser using Neural Networks
- TLDR: Very important work ushering in a new wave of neural network-based parsing architectures, achieving SOTA performance as well as blazing parsing speeds.
- (2005) Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling
- TLDR: Using cool Monte Carlo methods combined with a conditional random field model, this work achieves a huge error reduction in certain information extraction benchmarks.
- (2015) Bidirectional LSTM-CRF Models for Sequence Tagging
- TLDR: Proposes an elegant sequence-tagging model combining neural networks with conditional random fields, achieving SOTA in POS-tagging, NER, and chunking.
- (2010) A multi-pass sieve for coreference resolution 💡
- TLDR: Proposes a sieve-based approach to coreference resolution that for many years (until deep learning approaches) was SOTA.
- (2015) Entity-Centric Coreference Resolution with Model Stacking
- TLDR: This work offers a nifty approach to building coreference chains iteratively using entity-level features.
- (2016) Improving Coreference Resolution by Learning Entity-Level Distributed Representations
- TLDR: One of the earliest effective approaches to using neural networks for coreference resolution, significantly outperforming the SOTA.
- (2012) Baselines and Bigrams: Simple, Good Sentiment and Topic Classification
- TLDR: Very elegant paper, illustrating that simple Naive Bayes models with bigram features can outperform more sophisticated methods like support vector machines on tasks such as sentiment analysis.
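To make that baseline concrete, here is a minimal sketch of a binarized unigram+bigram Naive Bayes classifier in scikit-learn; the tiny training set is a made-up placeholder, not data from the paper.

```python
# Binarized unigram+bigram multinomial Naive Bayes, in the spirit of the MNB
# baseline from Wang & Manning (2012). The tiny corpus is a made-up placeholder.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "a great, fun movie",
    "an awful, boring film",
    "truly wonderful acting",
    "painfully dull plot",
]
train_labels = [1, 0, 1, 0]  # 1 = positive sentiment, 0 = negative sentiment

model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), binary=True),  # binarized uni+bigram counts
    MultinomialNB(),
)
model.fit(train_texts, train_labels)
print(model.predict(["a wonderful, fun plot"]))
```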
- (2013) Recursive deep models for semantic compositionality over a sentiment treebank 📚
- TLDR: Introduces the Stanford Sentiment Treebank, a wonderful resource for fine-grained sentiment annotation on sentences. Also introduces the Recursive Neural Tensor Network, a neat linguistically-motivated deep learning architecture.
- (2007) Natural Logic for Textual Inference
- TLDR: Proposes a rigorous logic-based approach to the problem of textual inference called natural logic. Very cool mathematically-motivated transforms are used to deduce the relationship between phrases.
- (2008) An Extended Model of Natural Logic
- TLDR: Extends previous work on natural logic for inference, adding phenomena such as semantic exclusion and implicativity to enhance the premise-hypothesis transform process.
- (2014) Recursive Neural Networks Can Learn Logical Semantics
- TLDR: Demonstrates that deep learning architectures such as neural tensor networks can effectively be applied to natural language inference.
- (2015) A large annotated corpus for learning natural language inference 📚
- TLDR: Introduces the Stanford Natural Language Inference corpus, a wonderful NLI resource two orders of magnitude larger than previous datasets.
- (1993) The Mathematics of Statistical Machine Translation 💡
- TLDR: Introduces the IBM machine translation models, several seminal models in statistical MT.
- (2002) BLEU: A Method for Automatic Evaluation of Machine Translation 📚
- TLDR: Proposes BLEU, the de facto evaluation technique used for machine translation (even today!)
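For intuition, a toy sketch of the mechanics behind BLEU (modified n-gram precision combined with a brevity penalty) is below; real evaluations should rely on a standard implementation, and the smoothing here is a simplification.

```python
# Toy sketch of BLEU: modified n-gram precision combined with a brevity penalty.
# For real evaluation use a standard implementation (e.g. sacrebleu); corpus-level
# BLEU-4 is the usual setting, while this only scores a single sentence pair.
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        overlap = sum(min(count, ref_counts[g]) for g, count in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # crude smoothing to avoid log(0)
    brevity = min(1.0, math.exp(1 - len(reference) / max(len(candidate), 1)))
    return brevity * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(bleu("the cat sat on the mat".split(), "the cat is on the mat".split(), max_n=2))
```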
- (2003) Statistical Phrase-Based Translation
- TLDR: Introduces a phrase-based translation model for MT, with a nice analysis demonstrating why phrase-based models outperform word-based ones.
- (2014) Sequence to Sequence Learning with Neural Networks 💡
- TLDR: Introduces the sequence-to-sequence neural network architecture. While only applied to MT in this paper, it has since become one of the cornerstone architectures of modern natural language processing.
- (2015) Neural Machine Translation by Jointly Learning to Align and Translate 💡
- TLDR: Extends previous sequence-to-sequence architectures for MT by using the attention mechanism, a powerful tool for allowing a target word to softly search for important signal from the source sentence.
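A plain-numpy sketch of the additive attention computation: score each encoder state against the decoder state, softmax over source positions, and take the weighted sum as the context vector. All dimensions and values below are arbitrary toy choices, not anything from the paper.

```python
# Additive ("Bahdanau-style") attention over encoder states, in plain numpy.
# All tensors are random toy values; in a real model the weights are learned
# and the states come from the encoder/decoder networks.
import numpy as np

rng = np.random.default_rng(0)
src_len, enc_dim, dec_dim, attn_dim = 5, 8, 8, 16

enc_states = rng.normal(size=(src_len, enc_dim))  # h_1 ... h_T from the encoder
dec_state = rng.normal(size=(dec_dim,))           # previous decoder state s_{t-1}

W_enc = rng.normal(size=(attn_dim, enc_dim))
W_dec = rng.normal(size=(attn_dim, dec_dim))
v = rng.normal(size=(attn_dim,))

# e_i = v^T tanh(W_dec s_{t-1} + W_enc h_i)
scores = np.tanh(enc_states @ W_enc.T + dec_state @ W_dec.T) @ v
weights = np.exp(scores - scores.max())
weights /= weights.sum()            # softmax over source positions
context = weights @ enc_states      # context vector: weighted sum of encoder states
print(weights.round(3), context.shape)
```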
- (2015) Effective approaches to attention-based neural machine translation
- TLDR: Introduces two new attention mechanisms for MT, using them to achieve SOTA over existing neural MT systems.
- (2016) Neural Machine Translation of Rare Words with Subword Units
- TLDR: Introduces byte pair encoding, an effective technique for allowing neural MT systems to handle (more) open-vocabulary translation.
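A compact sketch of the BPE merge-learning loop in the spirit of the paper: repeatedly merge the most frequent adjacent symbol pair in a word-frequency dictionary. The toy vocabulary below is illustrative, not the paper's data.

```python
# Byte pair encoding merge learning, in the spirit of Sennrich et al. (2016):
# repeatedly merge the most frequent adjacent symbol pair. Toy vocabulary only.
from collections import Counter

vocab = {
    ("l", "o", "w", "</w>"): 5,
    ("l", "o", "w", "e", "r", "</w>"): 2,
    ("n", "e", "w", "e", "s", "t", "</w>"): 6,
    ("w", "i", "d", "e", "s", "t", "</w>"): 3,
}

def merge_step(vocab):
    pairs = Counter()
    for word, freq in vocab.items():
        for pair in zip(word, word[1:]):
            pairs[pair] += freq
    best = max(pairs, key=pairs.get)  # most frequent adjacent symbol pair
    merged = {}
    for word, freq in vocab.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged, best

for _ in range(5):
    vocab, merge = merge_step(vocab)
    print("learned merge:", merge)
```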
- (2016) Pointing the Unknown Words
- TLDR: Proposes a copy mechanism for allowing MT systems to more effectively copy words from a source context sequence.
- (2016) Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
- TLDR: A wonderful case study demonstrating what a production-scale machine translation system (in this case Google's) looks like.
- (2013) Semantic Parsing on Freebase from Question-Answer Pairs 💡 📚
- TLDR: Proposes an elegant technique for semantic parsing that learns directly from question-answer pairs, without the need for annotated logical forms, allowing the system to scale up to Freebase.
- (2014) Semantic Parsing via Paraphrasing
- TLDR: Develops a unique paraphrase model for learning appropriate candidate logical forms from question-answer pairs, improving SOTA on existing Q/A datasets.
- (2015) Building a Semantic Parser Overnight 📚
- TLDR: Neat paper showing that a semantic parser can be built from scratch starting with no training examples!
- (2015) Bringing Machine Learning and Computational Semantics Together
- TLDR: A nice overview of a computational semantics framework that uses machine learning to effectively learn logical forms for semantic parsing.
- (2016) A Thorough Examination of the CNN/Daily Mail Reading Comprehension Task
- TLDR: A great wake-up call paper, demonstrating that SOTA performance can be achieved on certain reading comprehension datasets using simple systems with carefully chosen features. Don't forget non-deep learning methods!
- (2016) SQuAD: 100,000+ Questions for Machine Comprehension of Text 📚
- TLDR: Introduces the SQuAD dataset, a question-answering corpus that has become one of the de facto benchmarks used today.
- (2004) ROUGE: A Package for Automatic Evaluation of Summaries 📚
- TLDR: Introduces ROUGE, an evaluation metric for summarization that is used to this day on a variety of sequence transduction tasks.
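As a flavor of what ROUGE measures, here is a minimal ROUGE-1 recall computation (unigram overlap with the reference); serious evaluation should use an official ROUGE package, since the full metric family also covers higher-order n-grams, longest common subsequence, and skip-grams.

```python
# ROUGE-1 recall sketch: the fraction of reference unigrams recovered by the summary.
# Use an official ROUGE implementation for real evaluation; this is purely illustrative.
from collections import Counter

def rouge1_recall(candidate, reference):
    cand_counts = Counter(candidate.split())
    ref_counts = Counter(reference.split())
    overlap = sum(min(count, cand_counts[w]) for w, count in ref_counts.items())
    return overlap / max(sum(ref_counts.values()), 1)

print(rouge1_recall("the cat was found under the bed",
                    "the tiny cat was under the big bed"))
```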
- (2015) Semantically Conditioned LSTM-based Natural Language Generation for Spoken Dialogue Systems
- TLDR: Proposes a neural natural language generator that jointly optimises sentence planning and surface realization, outperforming other systems on human eval.
- (2016) Pointing the Unknown Words
- TLDR: Proposes a copy mechanism for allowing MT systems to more effectively copy words from a source context sequence.
- (2017) Get To The Point: Summarization with Pointer-Generator Networks
- TLDR: This work offers an elegant soft copy mechanism that drastically outperforms the SOTA on abstractive summarization.
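A small numpy sketch of the soft copy idea: the final word distribution mixes the decoder's vocabulary distribution with the attention distribution over source tokens, weighted by a generation probability. All numbers below are made up for illustration.

```python
# Pointer-generator-style soft copy: mix the vocabulary distribution with the
# attention (copy) distribution, weighted by p_gen, as in See et al. (2017).
# All probabilities below are made-up toy numbers.
import numpy as np

vocab = ["<unk>", "the", "dog", "barked", "loudly"]
source_tokens = ["the", "puppy", "barked"]          # "puppy" is out-of-vocabulary

p_vocab = np.array([0.05, 0.40, 0.15, 0.30, 0.10])  # decoder softmax over the vocab
attention = np.array([0.2, 0.7, 0.1])               # attention weights over source tokens
p_gen = 0.6                                         # probability of generating vs. copying

# Extend the vocabulary with in-source OOV words, then mix the two distributions.
extended = vocab + [w for w in source_tokens if w not in vocab]
final = np.zeros(len(extended))
final[:len(vocab)] = p_gen * p_vocab
for tok, attn in zip(source_tokens, attention):
    final[extended.index(tok)] += (1 - p_gen) * attn

print(dict(zip(extended, final.round(3))))  # "puppy" gets probability mass via copying
```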
- (2011) Data-driven Response Generation in Social Media
- TLDR: Proposes applying phrase-based statistical machine translation methods to the problem of response generation.
- (2015) Semantically Conditioned LSTM-based Natural Language Generation for Spoken Dialogue Systems
- TLDR: Proposes a neural natural language generator that jointly optimises sentence planning and surface realization, outperforming other systems on human eval.
- (2016) How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation 💡
- TLDR: Important work demonstrating that existing automatic metrics used for dialogue correlate poorly with human judgment.
- (2016) A Network-based End-to-End Trainable Task-oriented Dialogue System
- TLDR: Proposes a neat architecture for decomposing a dialogue system into a number of individually-trained neural network components.
- (2016) A Diversity-Promoting Objective Function for Neural Conversation Models
- TLDR: Introduces a maximum mutual information objective function for training dialogue systems, which discourages bland, generic responses.
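Roughly, the MMI-antiLM variant re-scores a candidate response T for source S as log p(T|S) − λ·log p(T), penalizing responses that are likely regardless of context. A tiny re-ranking sketch with made-up log-probabilities:

```python
# Re-ranking candidate responses with an MMI-antiLM style criterion:
# score(T) = log p(T | S) - lambda * log p(T). The candidates and their
# log-probabilities below are made-up toy numbers, not model outputs.
import math

candidates = {
    "i don't know": {"logp_t_given_s": math.log(0.20), "logp_t": math.log(0.10)},
    "the 7pm showing at the plaza": {"logp_t_given_s": math.log(0.05), "logp_t": math.log(0.001)},
}
lam = 0.5

def mmi_score(response):
    scores = candidates[response]
    return scores["logp_t_given_s"] - lam * scores["logp_t"]

# The generic response is likelier under p(T | S) alone, but MMI penalizes it
# for also being likely under the unconditional language model p(T).
print(max(candidates, key=mmi_score))
```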
- (2016) The Dialogue State Tracking Challenge Series: A Review
- TLDR: A nice overview of the dialogue state tracking challenges for dialogue systems.
- (2017) A Copy-Augmented Sequence-to-Sequence Architecture Gives Good Performance on Task-Oriented Dialogue
- TLDR: Shows that simple sequence-to-sequence architectures with a copy mechanism can perform competitively on existing task-oriented dialogue datasets.
- (2017) Key-Value Retrieval Networks for Task-Oriented Dialogue 📚
- TLDR: Introduces a new multi-domain dataset for task-oriented dialogue as well as an architecture for softly incorporating information from structured knowledge bases into dialogue systems.
- (2017) Learning Symmetric Collaborative Dialogue Agents with Dynamic Knowledge Graph Embeddings 📚
- TLDR: Introduces a new collaborative dialogue dataset, as well as an architecture for representing structured knowledge via knowledge graph embeddings.
- (2017) Hybrid Code Networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning
- TLDR: Introduces a hybrid dialogue architecture that can be jointly trained via supervised learning as well as reinforcement learning and combines neural network techniques with fine-grained rule-based approaches.
- (1971) Procedures as a Representation for Data in a Computer Program for Understanding Natural Language
- TLDR: One of the seminal papers in computer science, introducing SHRDLU, an early system for computers to understand human language commands.
- (2016) Learning language games through interaction
- TLDR: Introduces a novel setting for interacting with computers to accomplish a task where only natural language can be used to communicate with the system!
- (2017) Naturalizing a programming language via interactive learning
- TLDR: Very cool work allowing a community of workers to iteratively naturalize a language starting with a core set of commands in an interactive task.
- (1996) An Empirical Study of Smoothing Techniques for Language Modelling
- TLDR: Performs an extensive survey of smoothing techniques in traditional language modelling systems.
- (2003) A Neural Probabilistic Language Model 💡
- TLDR: A seminal work in deep learning for NLP, introducing one of the earliest effective models for neural network-based language modelling.
- (2014) One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling 📚
- TLDR: Introduces the Google One Billion Word language modelling benchmark.
- (2015) Character-Aware Neural Language Models
- TLDR: Proposes a language model using convolutional neural networks that can employ character-level information, performing on-par with word-level LSTM systems.
- (2016) Exploring the Limits of Language Modeling
- TLDR: Introduces a mega language modelling system using deep learning that combines a variety of techniques and significantly outperforms the SOTA on the One Billion Word Benchmark.
- (2018) Deep contextualized word representations 💡 📚
- TLDR: This paper introduces ELMo, a super powerful collection of word embeddings learned from the intermediate representations of a deep bidirectional LSTM language model. Achieved SOTA on 6 diverse NLP tasks.
- (2018) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding 💡
- TLDR: One of the most important papers of 2018, introducing BERT a powerful architecture pretrained using language modelling which is then effectively transferred to other domain-specific tasks.
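As a hedged illustration of the pretrain-then-fine-tune recipe (not something from the paper itself), here is how one might pull contextual features from a pretrained BERT encoder using the Hugging Face transformers library, assuming it is installed:

```python
# Pulling contextual features from a pretrained BERT encoder with the Hugging Face
# transformers library (assumed installed); the paper itself predates this library,
# so treat this purely as an illustration of the pretrain-then-transfer idea.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Pretrain once, fine-tune everywhere.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per word piece; a small task-specific head (e.g. a classifier
# over the [CLS] position) would be fine-tuned on top of these representations.
print(outputs.last_hidden_state.shape)
```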
- (2019) XLNet: Generalized Autoregressive Pretraining for Language Understanding 💡
- TLDR: Generalized autoregressive pretraining method that improves upon BERT by maximizing the expected likelihood over all permutations of the factorization order.
- (1997) Long Short-Term Memory 💡
- TLDR: Introduces the LSTM recurrent unit, a cornerstone of modern neural network-based NLP.
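For reference, a single step of a standard LSTM cell in numpy, following the usual input/forget/output gate equations; the weights and inputs are random placeholders rather than anything from the paper.

```python
# One step of a standard LSTM cell in numpy: input (i), forget (f), and output (o)
# gates plus a candidate cell state (g). Weights and inputs are random placeholders.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3
x_t = rng.normal(size=(input_dim,))   # current input
h_prev = np.zeros(hidden_dim)         # previous hidden state
c_prev = np.zeros(hidden_dim)         # previous cell state

W = {g: rng.normal(size=(hidden_dim, input_dim)) for g in "ifog"}
U = {g: rng.normal(size=(hidden_dim, hidden_dim)) for g in "ifog"}
b = {g: np.zeros(hidden_dim) for g in "ifog"}

i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])   # input gate
f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])   # forget gate
o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])   # output gate
g = np.tanh(W["g"] @ x_t + U["g"] @ h_prev + b["g"])   # candidate cell state
c_t = f * c_prev + i * g                               # gated cell state update
h_t = o * np.tanh(c_t)                                 # new hidden state
print(h_t)
```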
- (2000) Maximum Entropy Markov Models for Information Extraction and Segmentation 💡
- TLDR: Introduces maximum entropy Markov models for information extraction, a commonly used ML technique in classical NLP.
- (2010) From Frequency to Meaning: Vector Space Models of Semantics
- TLDR: A wonderful survey of existing vector space models for learning semantics in text.
- (2012) An Introduction to Conditional Random Fields
- TLDR: A nice, in-depth overview of conditional random fields, a commonly-used sequence-labelling model.
- (2014) GloVe: Global Vectors for Word Representation 💡 📚
- TLDR: Introduces GloVe word embeddings, one of the most commonly used pretrained word embedding techniques across all flavors of NLP models.
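Pretrained GloVe vectors are distributed as plain-text files of a word followed by its vector; a minimal loading and nearest-neighbor sketch is below (the file name glove.6B.50d.txt is an assumption about which pretrained file was downloaded).

```python
# Load pretrained GloVe vectors (plain-text "word v1 v2 ..." format) and find
# nearest neighbors by cosine similarity. The file name glove.6B.50d.txt is an
# assumption about which pretrained file was downloaded from the GloVe site.
import numpy as np

def load_glove(path):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            vectors[word] = np.asarray(values, dtype=np.float32)
    return vectors

def nearest(vectors, query, k=5):
    q = vectors[query]
    sims = {w: float(v @ q / (np.linalg.norm(v) * np.linalg.norm(q)))
            for w, v in vectors.items() if w != query}
    return sorted(sims, key=sims.get, reverse=True)[:k]

glove = load_glove("glove.6B.50d.txt")
print(nearest(glove, "france"))
```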
- (2014) Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors
- TLDR: Important paper demonstrating that context-predicting distributional semantics approaches outperform count-based techniques.
- (2015) Improving Distributional Similarity with Lessons Learned From Word Embeddings 💡
- TLDR: Demonstrates that traditional distributional semantics techniques can be enhanced with certain design choices and hyperparameter optimizations that make their performance rival that of neural network-based embedding methods.
- (2018) Universal Language Model Fine-tuning for Text Classification
- TLDR: Provides a smorgasbord of nice techniques for finetuning language models that can be effectively transferred to text classification tasks.
- (2019) Analogies Explained: Towards Understanding Word Embeddings
- TLDR: Very nice work providing a mathematical formalism for understanding some of the paraphrasing properties of modern word embeddings.