Natural language contains many different words that share the same root. For example, the verbs playing and played both originate from the word play. Similarly, the adjectives better and best are based on the adjective good. The process of reducing words to a single dictionary form is called lemmatization.
In natural language processing, you often need to reduce words to their dictionary form. For instance, if you want to develop a machine learning based text classification model, reducing words to their dictionary form can help reduce the feature set, which in turn can improve the training speed of the model.
Natural Language Toolkit (NLTK) is one of the most commonly used Natural Language Processing (NLP) libraries for lemmatization. In this tutorial, we’ll show you exactly how to perform lemmatization in Python with NLTK.
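To make the feature-reduction point concrete, here is a minimal sketch in plain Python. The `lemma_map` dictionary is a hypothetical, hand-written lookup used only for illustration; it stands in for a real lemmatizer and is not part of any library.

```python
# Hypothetical hand-written lookup standing in for a real lemmatizer
lemma_map = {
    "playing": "play", "played": "play", "plays": "play",
    "better": "good", "best": "good",
}

tokens = ["play", "playing", "played", "plays", "good", "better", "best"]

# Distinct features before lemmatization
raw_vocab = set(tokens)
# Distinct features after mapping every token to its dictionary form
lemma_vocab = {lemma_map.get(token, token) for token in tokens}

print(len(raw_vocab), "distinct tokens ->", len(lemma_vocab), "distinct lemmas")
# 7 distinct tokens -> 2 distinct lemmas
```

Seven surface forms collapse into two features here, which is the kind of shrinkage that can speed up training on a real corpus.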
Installing Required Libraries
To run the Python scripts in this tutorial, you’ll need to install the NLTK library. Run the following command in your terminal to install it.
pip install nltk
Lemmatization with NLTK
To perform lemmatization via the NLTK library in Python, you can use the WordNetLemmatizer class.
- First, you need to create an object of the WordNetLemmatizer class.
- Next, you need to pass the word that is to be lemmatized to the lemmatize() method of the WordNetLemmatizer object.
Here is an example where the word computer is being lemmatized.
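import nltk
from nltk.stem import WordNetLemmatizer
# Download the WordNet data the lemmatizer relies on (only needed once)
nltk.download('wordnet', quiet=True)
wn_lemma = WordNetLemmatizer()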
token = "computer"
lemma = wn_lemma.lemmatize(token)
print(token, "=>", lemma)
Output:
computer => computer
The output above shows that the word computer is lemmatized as computer with no change. This is because, by default, the lemmatize() method treats all the words passed to it as nouns. Since the lemmatized form of the noun computer is computer, the output echoes the input.
Similarly, by default the word computed doesn’t change after lemmatization since it’s treated as a noun. Here’s what we mean.
token = "computed"
lemma = wn_lemma.lemmatize(token)
print(token, "=>", lemma)
Output:
computed => computed
Lemmatizing Verbs
The lemmatize() method accepts a second argument, pos, which specifies the part of speech of the word. For example, if you want a word to be treated as a verb instead of a noun, you need to pass 'v' as the second argument.
token = "computed"
lemma = wn_lemma.lemmatize(token, 'v')
print(token, "=>", lemma)
Output:
computed => compute
In the above script, the word computed is treated as the past tense of the verb compute. Hence, when you perform NLTK lemmatization you can see that the word computed has been reduced to its root form, compute.
Let’s see another example of lemmatization. In the following example, the verb computing is also lemmatized to compute.
token = "computing"
lemma = wn_lemma.lemmatize(token, 'v')
print(token, "=>", lemma)
Output:
computing => compute
Lemmatizing Adjectives
Like verbs, you can also lemmatize adjectives using the lemmatize() method by passing 'a' as the second argument, as shown below:
import nltk
from nltk.stem import WordNetLemmatizer
wn_lemma = WordNetLemmatizer()
token = "worst"
lemma = wn_lemma.lemmatize(token, 'a')
print(token, "=>", lemma)
Output:
worst => bad
The output shows that the adjective worst has been reduced to its root form, bad. In the same way, we can reduce the word quickest to its root form of quick by treating it as an adjective. Here’s an example.
token = "quickest"
lemma = wn_lemma.lemmatize(token, 'a')
print(token, "=>", lemma)
Output:
quickest => quick
Lemmatizing Adverbs
You can also lemmatize adverbs by passing 'r' as the second argument to the lemmatize() method.