Today we’re going to teach you how to perform stemming in Python using the Natural Language Toolkit (NLTK) library. Before we can do that, though, we need to review what stemming is.

Natural languages contain many different words that share the same stem. A stem, also known as a base, is the part of a word that's left after removing its ending. For example, the stem sav- is used to build words like saves, saved, saving, and saver. Similarly, the stem of the words computer, computed, and computing is comput-. The process of reducing words to their stem forms is called stemming.

Stemming is one of the most important pre-processing tasks in Natural Language Processing. For instance, when developing a machine learning-based text classification system, reducing words to their stems shrinks the vocabulary and therefore the size of the textual feature space, which in turn reduces variance and speeds up training of the machine learning model.
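To see how this plays out, here's a minimal sketch (using the PorterStemmer we'll introduce below; the toy word list is just our own illustration) showing how stemming collapses several surface forms into a much smaller vocabulary:

from nltk.stem import PorterStemmer

# Toy word list: several surface forms of just two underlying stems
words = ["compute", "computes", "computed", "computing",
         "save", "saves", "saved", "saving"]

stemmer = PorterStemmer()
stems = [stemmer.stem(word) for word in words]

# The stemmed vocabulary is much smaller than the raw vocabulary
print(len(set(words)), "distinct words ->", len(set(stems)), "distinct stems")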

Let’s see how to perform stemming with the Python NLTK library.

Installing Required Libraries

To run the scripts in this article, you'll need to install the NLTK library. Execute the following command in your terminal to install NLTK for Python.

pip install nltk
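Depending on your NLTK version and environment, the sentence examples later in this article (which use nltk.word_tokenize) may also require the Punkt tokenizer models. If you run into a LookupError, a one-time download like the following usually resolves it (recent NLTK releases may prompt you to download "punkt_tab" instead):

import nltk

# One-time download of the tokenizer models used by nltk.word_tokenize
nltk.download("punkt")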

The NLTK library ships with several stemmers for reducing words to their root forms. In this tutorial we'll look at two of them: the Porter stemmer and the Snowball stemmer.

Stemming using Porter Stemmer

The Porter stemming algorithm was written by Martin Porter and published in 1980. The NLTK library incorporates his algorithm, and in this section we're going to apply it to a list of words and then to a sentence.

To perform stemming with the NLTK Porter stemmer, you first import the PorterStemmer class and create an instance of it. You then pass each word to its stem() method, which returns the stem of the word. The following script performs stemming on a list of words.

from nltk.stem import PorterStemmer

tokens = ["computer", "computed", "compute", "computing"]

port_stem = PorterStemmer()
for token in tokens:
    stem = port_stem.stem(token)
    print(token, "=>", stem)

Output:

computer => comput
computed => comput
compute => comput
computing => comput

From the output, you can see that the words computer, computed, compute, and computing are all reduced to their stem, comput-.

In addition to applying stemming to a list of words, you can also stem complete sentences in Python. First, divide the sentence into tokens using the NLTK word tokenizer. Then iteratively apply the stem() method of the PorterStemmer to reduce each token in the sentence to its stem. Here's an example:

import nltk
from nltk.stem import PorterStemmer

sentence = "Computers are used for computing"
tokens = nltk.word_tokenize(sentence)

port_stem = PorterStemmer()
for token in tokens:
    stem = port_stem.stem(token)
    print(token, "=>", stem)

Output:

Computers => comput
are => are
used => use
for => for
computing => comput
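If you'd rather get the stemmed sentence back as a single string instead of printing it token by token, one simple approach (a sketch, not the only way) is to collect the stems and join them with spaces:

import nltk
from nltk.stem import PorterStemmer

sentence = "Computers are used for computing"
port_stem = PorterStemmer()

# Stem each token and glue the results back together with spaces
stemmed_sentence = " ".join(port_stem.stem(token) for token in nltk.word_tokenize(sentence))
print(stemmed_sentence)  # comput are use for comput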

Stemming using Snowball Stemmer

To perform stemming with the Snowball stemmer, you have to import the SnowballStemmer class from the NLTK library. The process is similar to the Porter stemmer, except that with the Snowball stemmer you need to specify the language of the text you want to stem. A list of all the languages supported by the Snowball stemmer can be printed with the following script.

from nltk.stem import SnowballStemmer

print(SnowballStemmer.languages)

Output:

('arabic', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')

Let's look at a simple stemming example using the Snowball stemmer. We'll apply stemming to the same sentence we used in our Porter stemmer example.

import nltk
from nltk.stem import SnowballStemmer

sentence = "Computers are used for computing"
tokens = nltk.word_tokenize(sentence)

sb_stem = SnowballStemmer("english")
for token in tokens:
    stem = sb_stem.stem(token)
    print(token, "=>", stem)

Output:

Computers => comput
are => are
used => use
for => for
computing => comput
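Because the Snowball stemmer supports all of the languages listed earlier, stemming non-English text only requires changing the language argument. Here's a minimal sketch with a few Spanish words (the example words are our own illustration):

from nltk.stem import SnowballStemmer

# A few Spanish verb forms that share the same root
spanish_words = ["canto", "cantando", "cantaron"]

es_stem = SnowballStemmer("spanish")
for word in spanish_words:
    print(word, "=>", es_stem.stem(word))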

The main practical difference between the two is that the Snowball stemmer implements an updated version of Porter's algorithm (sometimes called Porter2) and supports multiple languages, and it is often considered faster than the original Porter stemmer. Both of them are great for performing stemming with Python! If you enjoyed this tutorial, I hope you'll subscribe using the form below. We'll continue with our natural language processing theme in our next tutorial.

