NLP Guide

This is an open guide to natural language processing.

Edit this guide at github.com/NLPGuide

Progress

See also: NLP Guide for 2018 and 2017

NLP Progress
nlpprogress.com

EFF AI Metrics - Written Language, Spoken Language
eff.org/ai/metrics

AI Index - Natural Language Understanding
aiindex.org

Demos

explosion.ai/demos

Language

Definitions of language are often political and inconsistent. For our purposes, by default we defer to the ISO 639-1 codes: https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes. Note that some major libs and services use non-standard codes. (For example, Google Translate uses jw not jv for Javanese, and iw not he for Hebrew.) Confusion with country codes is also common.
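
As a minimal sketch of coping with this, a small lookup table can normalize non-standard codes back to ISO 639-1. The mapping below covers only the two Google Translate quirks mentioned above, and the helper name is our own:

```python
# Map non-standard language codes used by some services back to ISO 639-1.
# Illustrative sketch: covers only the two documented Google Translate quirks.
GOOGLE_TO_ISO639_1 = {
    "jw": "jv",  # Javanese
    "iw": "he",  # Hebrew
}

def to_iso639_1(code: str) -> str:
    """Normalize a possibly non-standard two-letter code to ISO 639-1."""
    return GOOGLE_TO_ISO639_1.get(code, code)

assert to_iso639_1("iw") == "he"
assert to_iso639_1("en") == "en"  # standard codes pass through unchanged
```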

Tasks

parsing
translation
named-entity recognition
natural language understanding
natural language generation
speech - recognition, synthesis
classification - sentiment analysis, spam detection
relation extraction
coreference resolution
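
As a minimal sketch of two of these tasks, here is dependency parsing and named-entity recognition with spaCy (assuming spaCy and its small English model en_core_web_sm are installed):

```python
# Parsing and named-entity recognition in a few lines with spaCy.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Google Translate was launched by Google in April 2006.")

for token in doc:                     # dependency parsing
    print(token.text, token.dep_, token.head.text)

for ent in doc.ents:                  # named-entity recognition
    print(ent.text, ent.label_)
```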

Applications

translation: Google, Microsoft, Baidu, Yandex, DeepL, Babelfish
agents: Siri, Google Now, Google Assistant, Amazon Alexa, Microsoft Cortana, IBM Watson

Most natural language processing ships not as standalone products but as integrations in nearly all major applications, for example:

search engines: Google, Bing, Baidu, Yandex
marketplaces: eBay, Amazon, Alibaba
social networks: Facebook, Twitter, Reddit, Quora, StackExchange
integrated as recommendations, spam filtering, spelling correction, advert matching, and entity recognition

Libs

NLTK, Moses, Giza++, OpenNLP
spaCy, Stanford NLP, Berkeley, TF syntaxnet
Universal Dependencies
AllenNLP
THUMT, TF seq2seq, pytorch seq2seq

GitHub: natural-language-processing, nlp

Which programming languages? The field has shifted from C/C++ and Java toward Python and Node.js.
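
For a taste of the older generation of libraries above, here is tokenization and part-of-speech tagging with NLTK (a sketch; assumes NLTK is installed plus the two resource downloads shown):

```python
# Tokenize and POS-tag a sentence with NLTK.
# Assumes: pip install nltk
import nltk

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("NLTK and spaCy are popular Python NLP libraries.")
print(nltk.pos_tag(tokens))  # [('NLTK', 'NNP'), ('and', 'CC'), ...]
```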

APIs

Google Cloud, Azure ML, AWS, IBM, Nuance, Indico, ABBYY
EventRegistry

How to choose? Consider language support (programming and human), latency, and price… and last but not least, quality.
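
For a feel of the API style, here is a hedged sketch of sentiment analysis with the Google Cloud Natural Language Python client (google-cloud-language). It assumes a 2.x client and already-configured credentials; exact signatures vary across client versions:

```python
# Sentiment analysis via the Google Cloud Natural Language API.
# Assumes: pip install google-cloud-language (2.x) and GCP credentials set up.
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()
document = language_v1.Document(
    content="I love this product, but shipping was slow.",
    type_=language_v1.Document.Type.PLAIN_TEXT,
)
sentiment = client.analyze_sentiment(request={"document": document}).document_sentiment
print(sentiment.score, sentiment.magnitude)  # polarity in [-1, 1], strength >= 0
```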

OS/Browser Integrations

speech recognition, speech synthesis, OCR, handwriting recognition

Datasets

Wikipedia, Common Crawl
Universal Dependencies
Penn Treebank
WMT workshop data
CLEVR
SQuAD
Enron emails
OPUS open parallel corpus
WordNet
NLTK Corpora

For which tasks is there data? For which languages is there data?
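
One quick way to poke at these questions: NLTK bundles samples of several of the datasets above, e.g. a slice of the Penn Treebank (the full treebank is licensed separately). A sketch:

```python
# Load NLTK's bundled sample of the Penn Treebank and inspect it.
import nltk

nltk.download("treebank")
from nltk.corpus import treebank

print(treebank.words()[:10])       # raw tokens
print(treebank.parsed_sents()[0])  # one parse tree
```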

Benchmarks

GLUE

Metrics: BLEU, METEOR
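
BLEU is cheap to compute for a toy example, e.g. with NLTK's implementation. A sketch; note that sentence-level BLEU is noisy, and corpus_bleu is preferred for real evaluation:

```python
# Toy BLEU computation with NLTK. Smoothing avoids a zero score when some
# n-gram orders have no matches on short sentences.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "is", "on", "the", "mat"]]  # list of reference translations
candidate = ["the", "cat", "sat", "on", "the", "mat"]

score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(round(score, 3))
```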

Research Orgs

Google, Bing, Baidu, Yandex, Amazon, Facebook, Twitter, Fuji Xerox, NTT, Rakuten, SAP, eBay

Stanford, Johns Hopkins, U of Washington, U of Maryland, USC ISI, CMU, Cornell, U de Montréal
U of Edinburgh, Sheffield, Cambridge, DFKI, U of Aachen, Stuttgart, Heidelberg, TU Berlin, Munich, Darmstadt, Humboldt U
Bar-Ilan U, Hebrew U, Tsinghua U, Peking U, Chinese Academy of Sciences, U Kyoto, Tokyo

SRI/DARPA

Who is missing? Relative to their output in general: Apple, Samsung, eBay, Oracle, Xerox, MIT and Ivy League, Technion, UCx, Switzerland, India, France, Russia and Eastern Europe, OpenAI

Papers

arXiv cs.CL: archive, recent
arXiv cs.LG: archive, recent

[1] Antoine Bordes, et al. “Joint Learning of Words and Meaning Representations for Open-Text Semantic Parsing.” AISTATS(2012) [pdf]

[2] Mikolov, et al. “Distributed representations of words and phrases and their compositionality.” NIPS(2013): 3111-3119 [pdf] (word2vec)

[3] Sutskever, et al. “Sequence to sequence learning with neural networks.” NIPS(2014) [pdf]

[4] Ankit Kumar, et al. “Ask Me Anything: Dynamic Memory Networks for Natural Language Processing.” arXiv preprint arXiv:1506.07285(2015) [pdf]

[5] Yoon Kim, et al. “Character-Aware Neural Language Models.” AAAI(2016); arXiv preprint arXiv:1508.06615(2015) [pdf]

[6] Jason Weston, et al. “Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks.” arXiv preprint arXiv:1502.05698(2015) [pdf] (bAbI tasks)

[7] Karl Moritz Hermann, et al. “Teaching Machines to Read and Comprehend.” arXiv preprint arXiv:1506.03340(2015) [pdf] (CNN/DailyMail cloze style questions)

[8] Alexis Conneau, et al. “Very Deep Convolutional Networks for Natural Language Processing.” arXiv preprint arXiv:1606.01781(2016) [pdf] (state-of-the-art in text classification)

[9] Armand Joulin, et al. “Bag of Tricks for Efficient Text Classification.” arXiv preprint arXiv:1607.01759(2016) [pdf] (slightly worse than state-of-the-art, but a lot faster)

RNN and Sequence-to-Sequence Models

[34] Graves, Alex. “Generating sequences with recurrent neural networks.” arXiv preprint arXiv:1308.0850 (2013). [pdf] (LSTM; impressive generation results that show the power of RNNs)

[35] Cho, Kyunghyun, et al. “Learning phrase representations using RNN encoder-decoder for statistical machine translation.” arXiv preprint arXiv:1406.1078 (2014). [pdf] (First Seq-to-Seq Paper)

[36] Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. “Sequence to sequence learning with neural networks.” Advances in neural information processing systems. 2014. [pdf] (Outstanding Work)

[37] Bahdanau, Dzmitry, KyungHyun Cho, and Yoshua Bengio. “Neural Machine Translation by Jointly Learning to Align and Translate.” arXiv preprint arXiv:1409.0473 (2014). [pdf]

[38] Vinyals, Oriol, and Quoc Le. “A neural conversational model.” arXiv preprint arXiv:1506.05869 (2015). [pdf] (seq-to-seq applied to chatbots)

Speech

[8] Hinton, Geoffrey, et al. “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups.” IEEE Signal Processing Magazine 29.6 (2012): 82-97. [pdf] (Breakthrough in speech recognition)

[9] Graves, Alex, Abdel-rahman Mohamed, and Geoffrey Hinton. “Speech recognition with deep recurrent neural networks.” 2013 IEEE international conference on acoustics, speech and signal processing. IEEE, 2013. [pdf] (RNN)

[10] Graves, Alex, and Navdeep Jaitly. “Towards End-To-End Speech Recognition with Recurrent Neural Networks.” ICML. Vol. 14. 2014. [pdf]

[11] Sak, Haşim, et al. “Fast and accurate recurrent neural network acoustic models for speech recognition.” arXiv preprint arXiv:1507.06947 (2015). [pdf] (Google Speech Recognition System)

[12] Amodei, Dario, et al. “Deep speech 2: End-to-end speech recognition in English and Mandarin.” arXiv preprint arXiv:1512.02595 (2015). [pdf] (Baidu Speech Recognition System)

[13] W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, G. Zweig “Achieving Human Parity in Conversational Speech Recognition.” arXiv preprint arXiv:1610.05256 (2016). [pdf] (State-of-the-art in speech recognition, Microsoft)

Startups

angel.co/natural-language-processing

crunchbase.com/…/natural-language-processing

Unbabel, Lilt, Aylien

acquired: Siri (Apple), Viv (Samsung), Word Lens (Google), wit.ai (Facebook)

Events

ACL

EMNLP

CoNLL

EACL

NIST TAC

PyData

RAAIS

NIPS

Google I/O, F8, AWS DevDay…

Competitions

WMT workshops, conferences
Kaggle text-data, linguistics, languages, literature
annotated datasets like SQuAD and CLEVR
Winograd Schema Challenge (wiki)

Learning

Courses

http://cs224n.stanford.edu / http://cs224d.stanford.edu
https://www.youtube.com/watch?v=OQQ-W_63UgQ&list=PL3FW7Lu3i5Jsnh1rnUwq_TcylNr7EkRe6

https://github.com/oxford-cs-deepnlp-2017/lectures

https://github.com/jacobeisenstein/gt-nlp-class/

http://phontron.com/class/mtandseq2seq2018/

https://www.youtube.com/watch?v=OzZIOiMVUyM&index=79&list=PL6Xpj9I5qXYEcOhn7TqghAJ6NAPrNmUBH

Tutorials

https://fasttext.cc/docs/en/supervised-tutorial.html

http://thestraightdope.mxnet.io/chapter05_recurrent-neural-networks/simple-rnn.html

http://pytorch.org/tutorials/beginner/deep_learning_nlp_tutorial.html

http://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html

https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html

Books

Linguistic Fundamentals for Natural Language Processing: 100 Essentials from Morphology and Syntax
2013
Emily M. Bender, University of Washington

Neural Network Methods for Natural Language Processing
2017
Yoav Goldberg, Bar-Ilan University
Graeme Hirst, University of Toronto

A Primer on Neural Network Models for Natural Language Processing
2015 draft
Yoav Goldberg, Bar-Ilan University

Speech and Language Processing (3rd ed. draft)
2017 draft
Dan Jurafsky, Stanford University
James H. Martin, University of Colorado

Foundations of Statistical Natural Language Processing
1999
Christopher Manning, Stanford University
Hinrich Schütze, University of Munich

Introduction to Information Retrieval
2008
Christopher Manning, Stanford University
Prabhakar Raghavan, Google
Hinrich Schütze, University of Munich

News Sources

https://nlp.stanford.edu/read/

https://explosion.ai/blog/

https://yerevann.github.io/

http://approximatelycorrect.com/category/natural-language-processing/

http://newsletter.ruder.io/

https://www.producthunt.com/@bittlingmayer/collections

http://nathan.ai

More Reading

http://mitp.nautil.us/article/170/last-words-computational-linguistics-and-deep-learning

http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/

https://explosion.ai/blog/quora-deep-text-pair-classification

http://ruder.io/highlights-emnlp-2017/index.html
http://ruder.io/word-embeddings-2017/

http://approximatelycorrect.com/2017/09/26/a-random-walk-through-emnlp-2017/

http://norvig.com/spell-correct.html
http://norvig.com/chomsky.html

Fora

#nlproc

https://www.reddit.com/r/LanguageTechnology/

https://www.reddit.com/r/machinetranslation/

https://plus.google.com/communities/112547995826249627629

http://linguistics.stackexchange.com/

https://stackoverflow.com/questions/tagged/nlp

https://datascience.stackexchange.com/questions/tagged/nlp

https://opendata.stackexchange.com/questions/tagged/nlp

TODO

related fields - machine learning, linguistics
subfields
guide for engineers vs guide for ML researchers