Python Lemmatization with NLTK

Natural language contains many words that share the same root. For example, the verbs playing and played both derive from the word play. Similarly, the adjectives better and best are both forms of the adjective good. The process of reducing words to a single dictionary form, called a lemma, is known as lemmatization.

In natural language processing, you often need to reduce words to their dictionary form. For instance, if you want to develop a machine learning based text classification model, reducing words to their dictionary form can shrink the feature set, which in turn can improve the training speed of the model.
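To make the feature-set reduction concrete, here is a minimal sketch that counts distinct tokens before and after mapping each word to its dictionary form. The tiny LEMMAS table and the to_lemma() helper are hand-written stand-ins for a real lemmatizer and are purely illustrative; they are not part of NLTK.

```python
# Hand-written lookup table standing in for a real lemmatizer (illustrative only).
LEMMAS = {
    "playing": "play",
    "played": "play",
    "plays": "play",
    "better": "good",
    "best": "good",
}


def to_lemma(word):
    """Return the dictionary form from LEMMAS, or the word itself if unknown."""
    return LEMMAS.get(word, word)


tokens = ["play", "playing", "played", "plays", "good", "better", "best"]

# Each distinct token would become its own feature in a bag-of-words model.
vocab_before = set(tokens)
vocab_after = {to_lemma(token) for token in tokens}

print(len(vocab_before), "distinct features before lemmatization")
print(len(vocab_after), "distinct features after lemmatization")
```

In this toy example the seven distinct tokens collapse into two features, play and good.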

Natural Language Toolkit (NLTK) is one of the most commonly used Natural Language Processing (NLP) libraries for lemmatization. In this tutorial, we’ll show you exactly how to perform lemmatization in Python with NLTK.

Installing Required Libraries

To run the Python scripts in this tutorial, you’ll need to install the NLTK library. Run the following command from your terminal to install it.

pip install nltk

Lemmatization with NLTK

To perform lemmatization via the NLTK library in Python, you can use the WordNetLemmatizer class. The WordNetLemmatizer reduces words to their dictionary forms as they appear in the WordNet database.

First, you need to create an object of the WordNetLemmatizer class.
Next, you need to pass the word that is to be lemmatized to the lemmatize() method of the WordNetLemmatizer object.

Here is an example where the word computer is being lemmatized.

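import nltk
from nltk.stem import WordNetLemmatizer

# Download the WordNet data used by the lemmatizer (only needed once).
nltk.download('wordnet', quiet=True)

wn_lemma = WordNetLemmatizer()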
token = "computer"

lemma = wn_lemma.lemmatize(token)
print(token, "=>", lemma)

Output:

computer => computer

The output above shows that the word computer is lemmatized as computer with no change. This is because, by default, the lemmatize() method treats all the words passed to it as nouns. Since the lemmatized form of the noun computer is computer, the output echoes the input.

Similarly, by default the word computed doesn’t change after lemmatization since it’s treated as a noun. Here’s what we mean.

token = "computed"

lemma = wn_lemma.lemmatize(token)
print(token, "=>", lemma)

Output:

computed => computed

Lemmatizing Verbs

The lemmatize() method accepts a second argument, pos. This stands for part of speech and tells the NLTK lemmatizer what type of word you’re trying to lemmatize.
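As a quick reference, the sketch below names the four single-letter part-of-speech codes used in this tutorial: n for nouns (the default), v for verbs, a for adjectives, and r for adverbs. POS_CODES and pos_code() are hypothetical helper names chosen for this illustration; they are not part of NLTK.

```python
# The four single-letter part-of-speech codes used in this tutorial.
POS_CODES = {
    "noun": "n",  # the default when no second argument is given
    "verb": "v",
    "adjective": "a",
    "adverb": "r",
}


def pos_code(name):
    """Translate a readable part-of-speech name into its single-letter code."""
    try:
        return POS_CODES[name.lower()]
    except KeyError:
        raise ValueError(f"unknown part of speech: {name!r}") from None


print(pos_code("verb"))
```

The value returned by pos_code() can be passed as the second argument to lemmatize().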

For example, if you want a word to be treated as a verb instead of a noun, you need to pass 'v' as the second argument. The NLTK lemmatizer will then reduce the word to its base verb form. We’ll use the word computed again in this example, but instead of defaulting to a noun, let’s tell the NLTK lemmatizer to treat computed as a verb:

token = "computed"

lemma = wn_lemma.lemmatize(token, 'v')
print(token, "=>", lemma)

Output:

computed => compute

In the above script, the word computed is treated as the past tense of the verb compute. Hence, when you perform NLTK lemmatization you can see that the word computed has been reduced to its root form, compute.

Let’s see another example of lemmatization. In the following example, the verb computing is also lemmatized to compute.

token = "computing"

lemma = wn_lemma.lemmatize(token, 'v')
print(token, "=>", lemma)

Output:

computing => compute

Lemmatizing Adjectives

Like verbs, you can also lemmatize adjectives using the WordNetLemmatizer. To treat a word as an adjective, you have to pass 'a' as the second parameter to the lemmatize() method, as shown below:

import nltk
from nltk.stem import WordNetLemmatizer
wn_lemma = WordNetLemmatizer()
token = "worst"

lemma = wn_lemma.lemmatize(token, 'a')
print(token, "=>", lemma)

Output:

worst => bad

The output shows that the adjective worst has been reduced to its root form, bad. In the same way, we can reduce the word quickest to its root form, quick, by treating it as an adjective. Here’s an example.

token = "quickest"

lemma = wn_lemma.lemmatize(token, 'a')
print(token, "=>", lemma)

Output:

quickest => quick

Lemmatizing Adverbs

You can also lemmatize adverbs by passing 'r' as the second parameter to the lemmatize() method.

The Python Tutorials Blog
