Python – Stemming and Lemmatization

In Natural Language Processing we often come across situations where two or more words share a common root. For example, the three words agreed, agreeing and agreeable have the same root word, agree. A search involving any of these words should treat them as the same word, namely the root word. So it becomes essential to link all such words to their root. The NLTK library has methods to do this linking and return the root form of each word.
The program below uses the Porter stemming algorithm, which strips common suffixes according to a fixed set of rules.

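import nltk
from nltk.stem.porter import PorterStemmer

# word_tokenize needs the Punkt tokenizer data, e.g. nltk.download('punkt')
# (recent NLTK releases ask for 'punkt_tab' instead)
porter_stemmer = PorterStemmer()
word_data = "Capitalism is better than communism"
# First, split the sentence into word tokens
nltk_tokens = nltk.word_tokenize(word_data)
# Next, find the stem of each word
for w in nltk_tokens:
    print("Actual: %s  Stem: %s" % (w, porter_stemmer.stem(w)))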

When we execute the above code, it produces the following result.

Actual: Capitalism  Stem: capit
Actual: is  Stem: is
Actual: better  Stem: better
Actual: than  Stem: than
Actual: communism  Stem: commun
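
The linking idea from the introduction can be sketched by grouping words under the stem they produce. This is a minimal illustration, not part of the original lesson: group_by_stem is a hypothetical helper name, and the word list is the agreed, agreeing, agreeable example from above. Only PorterStemmer from NLTK is used, so no extra data download should be needed.

```python
from collections import defaultdict

from nltk.stem.porter import PorterStemmer


def group_by_stem(words):
    """Return a dict mapping each Porter stem to the words that produced it.

    group_by_stem is a hypothetical helper written for this illustration.
    """
    stemmer = PorterStemmer()
    groups = defaultdict(list)
    for word in words:
        # Words that reduce to the same stem end up in the same list
        groups[stemmer.stem(word)].append(word)
    return dict(groups)


if __name__ == "__main__":
    words = ["agreed", "agreeing", "agreeable"]
    for stem, members in group_by_stem(words).items():
        print(stem, "<-", members)
```

Running it shows which words the stemmer treats as the same search term. Keep in mind that a Porter stem is the result of rule-based suffix stripping, so it is not always a dictionary word, and words that a reader would link to one root can still end up under different stems.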

Lemmatization is similar to stemming, but it brings context to the words. Instead of stripping suffixes by rule, it uses a vocabulary and the word's part of speech to map each word to its dictionary form, called the lemma, so the result is always a valid word. For example, cars becomes car, and better, when treated as an adjective, becomes good. In the program below we use the WordNet lexical database for lemmatization. Note that the NLTK lemmatizer treats every word as a noun unless told otherwise.

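import nltk
from nltk.stem import WordNetLemmatizer

# The lemmatizer needs the WordNet data, e.g. nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
data = "Capitalism is better than communism"
# First, split the sentence into word tokens
nltk_tokens = nltk.word_tokenize(data)
# Next, look up the lemma of each word (treated as a noun by default)
for w in nltk_tokens:
    print("Actual: %s  Lemma: %s" % (w, lemmatizer.lemmatize(w)))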

When we execute the above code, it produces the following result.

Actual: Capitalism  Lemma: Capitalism
Actual: is  Lemma: is
Actual: better  Lemma: better
Actual: than  Lemma: than
Actual: communism  Lemma: communism
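
The lemmas above are unchanged largely because lemmatize() treats each word as a noun when no part of speech is given. A rough sketch of passing the part of speech explicitly is shown below. The helper names penn_to_wordnet and lemmatize_tagged are hypothetical, and the tags are written by hand for this illustration; in practice a tagger such as nltk.pos_tag could supply them. The WordNet data is assumed to be installed.

```python
from nltk.stem import WordNetLemmatizer


def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'JJR', 'VBZ') to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"  # fall back to noun, the lemmatizer's own default


def lemmatize_tagged(tagged_tokens, lemmatizer):
    """Lemmatize (word, Penn tag) pairs using the mapped part of speech."""
    return [lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
            for word, tag in tagged_tokens]


if __name__ == "__main__":
    # Hand-written tags: an assumption of this sketch, not tagger output
    tagged = [("Capitalism", "NN"), ("is", "VBZ"), ("better", "JJR"),
              ("than", "IN"), ("communism", "NN")]
    try:
        print(lemmatize_tagged(tagged, WordNetLemmatizer()))
    except LookupError:
        # NLTK raises LookupError when a required corpus is missing
        print("WordNet data not found; try nltk.download('wordnet')")
```

With the part of speech supplied, verbs and adjectives can be looked up in their own WordNet entries instead of being treated as nouns, which is usually where lemmatization differs most from stemming.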