# Intro
NLP is a broad field, encompassing a variety of tasks, including:
- Part-of-speech tagging: identify whether each word is a noun, verb, adjective, etc.
- Named entity recognition (NER): identify person names, organizations, locations, medical codes, time expressions, quantities, monetary values, etc.
- Question answering
- Speech recognition
- Text-to-speech and speech-to-text
- Topic modeling
- Sentiment classification
- Language modeling
- Translation
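As a taste of one of these tasks, a language model at its simplest is just next-word statistics. Here is a minimal bigram sketch in pure Python; the toy corpus is invented purely for illustration:

```python
from collections import Counter, defaultdict

# Toy corpus -- a real language model would be trained on a large text collection.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count bigrams: how often does each word follow the current word?
bigrams = defaultdict(Counter)
for cur, nxt in zip(corpus, corpus[1:]):
    bigrams[cur][nxt] += 1

def most_likely_next(word):
    """Return the word that most frequently follows `word` in the corpus."""
    return bigrams[word].most_common(1)[0][0]

print(most_likely_next("sat"))  # "on"
```

Modern language models replace these raw counts with neural networks, but the underlying task is the same: predict the next word.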
# Things Change
# Spell Checkers
Historically, spell checkers required thousands of lines of hard-coded rules:

A version that uses historical data and probabilities can be written in far fewer lines:
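A sketch of such a probabilistic corrector, in the spirit of Peter Norvig's well-known short spell checker (the tiny `TEXT` corpus here is a stand-in for real historical word-frequency data):

```python
import re
from collections import Counter

# Stand-in corpus; in practice this would be a large body of real text.
TEXT = "spelling corrected corrected spelling corrector spelling"
WORDS = Counter(re.findall(r"\w+", TEXT.lower()))

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from `word`."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correction(word):
    """Most probable correction: known candidates ranked by corpus frequency."""
    candidates = ({word} & WORDS.keys()) or (edits1(word) & WORDS.keys()) or {word}
    return max(candidates, key=WORDS.get)

print(correction("speling"))  # "spelling"
```

The hard-coded rules are replaced by a single idea: among candidate words within one edit, pick the one that appears most often in historical data.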

With modern approaches, we no longer need to stem words or remove stop words.
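For context, here is a toy illustration of the classic preprocessing that is no longer required. The crude suffix-stripper and stop list below are invented for this sketch; real pipelines typically used nltk's PorterStemmer and curated stop-word lists:

```python
# Toy stop list -- real lists contain a hundred or more words.
STOP_WORDS = {"the", "a", "an", "is", "of", "and"}

def crude_stem(word):
    """Naive suffix stripping (illustrative only; not a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Classic pipeline: lowercase, drop stop words, stem the rest."""
    tokens = text.lower().split()
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("the cats liked playing"))  # ['cat', 'lik', 'play']
```

Note how "liked" becomes the non-word "lik": stemming discards information, which is part of why modern models that operate on raw text can learn more.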

# Yann LeCun vs. Chris Manning
Another interesting discussion on the topic of how much linguistic structure to incorporate into NLP models is between Yann LeCun and Chris Manning:
Deep Learning, Structure and Innate Priors: A Discussion between Yann LeCun and Christopher Manning:
On one side, Manning is a prominent advocate for incorporating more linguistic structure into deep learning systems. On the other, LeCun is a leading proponent for the ability of simple but powerful neural architectures to perform sophisticated tasks without extensive task-specific feature engineering. For this reason, anticipation for disagreement between the two was high, with one Twitter commentator describing the event as “the AI equivalent of Batman vs Superman”.
...
Manning described structure as a “necessary good” (9:14), arguing that we should have a positive attitude towards structure as a good design decision. In particular, structure allows us to design systems that can learn more from less data, and at a higher level of abstraction, compared to those without structure.
Conversely, LeCun described structure as a “necessary evil” (2:44), and warned that imposing structure requires us to make certain assumptions, which are invariably wrong for at least some portion of the data, and may become obsolete within the near future. As an example, he hypothesized that ConvNets may be obsolete in 10 years (29:57).
# Python Libraries
- nltk: first released in 2001, very broad NLP library
- spaCy: creates parse trees, excellent tokenizer, opinionated
- gensim: topic modeling and similarity detection

Specialized tools:
- PyText
- fastText: has library of embeddings

General ML/DL libraries with text features:
- sklearn: general purpose Python ML library
- fastai: fast & accurate neural nets using modern best practices, on top of PyTorch
Introducing Metadata Enhanced ULMFiT: classifying quotes from articles, using metadata (such as publication, country, and source) together with the text of the quote to improve the accuracy of the classifier.
