# Preprocessing
# Stop words
Some extremely common words which would appear to be of little value in helping select documents matching a user need are excluded from the vocabulary entirely. These words are called stop words.
The general trend in IR systems over time has been from standard use of quite large stop lists (200-300 terms) to very small stop lists (7-12 terms) to no stop list whatsoever. Web search engines generally do not use stop lists.
```python
# In scikit-learn >= 1.0 the stop-word set lives in sklearn.feature_extraction.text
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

sklearn_stop_words = sorted(ENGLISH_STOP_WORDS)
sklearn_stop_words[:20]
```
The passage above on stop-list trends is quoted from *Introduction to Information Retrieval* (Manning, Raghavan, and Schütze, 2008).
# Stemming and Lemmatization
Stemming and lemmatization both reduce words to a root form.

Lemmatization applies the vocabulary and morphological rules of a language, so the resulting tokens are actual words (lemmas).

"Stemming is the poor-man’s lemmatization." (Noah Smith, 2011) Stemming is a crude heuristic that chops the ends off words, so the resulting tokens may not be actual words. In exchange, stemming is much faster.
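To make the "chops the ends off" idea concrete, here is a deliberately naive toy stemmer (it is not the rule set of any real stemmer such as Porter's, just an illustration of suffix stripping):

```python
def crude_stem(word):
    """Strip a few common English suffixes; a deliberately naive sketch."""
    # Longest suffixes first, so "izing" is tried before "ing"
    for suffix in ("ization", "izing", "izes", "ize", "ing", "es", "s"):
        # Require a reasonably long remaining stem before stripping
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

for w in ["organize", "organizes", "organizing", "democracy"]:
    print(w, "->", crude_stem(w))
```

The organize family collapses to the non-word stem "organ", while "democracy" slips through unchanged — both the power and the crudeness of the heuristic in one example.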
The spaCy library offers only lemmatization, not stemming, because its developers do not consider stemming a sound approach.
Should the following word groups be reduced to the same root?

- organize, organizes, and organizing
- democracy, democratic, and democratization
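Assuming NLTK is installed (its Porter stemmer is pure Python and needs no extra data downloads, unlike the WordNet lemmatizer), you can check how a real stemmer treats these groups:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["organize", "organizes", "organizing",
         "democracy", "democratic", "democratization"]

# The Porter stemmer applies ordered suffix-stripping rules
for w in words:
    print(w, "->", stemmer.stem(w))
```

The organize family collapses to a single stem, while the democracy family is only partially unified: "democracy" ends up with a different stem than "democratic" and "democratization".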

Note: Stemming and lemmatization are language dependent. They are most useful for morphologically rich languages (those where words, especially verbs, take many inflected forms), like Sanskrit.