Stop words are common words that don't play a big role in the classification of text. Search engines often ignore them because they don't help narrow down the results for a given search phrase. A, the, it, he, she, and an are common stop words in English.
Because they don't provide much value, it's beneficial to remove stop words before processing text for natural language processing (NLP) tasks. Imagine how much larger your indexes and databases would have to be if all these common words were included!
The Python NLTK library contains a default list of stop words. To remove stop words, you need to divide your text into tokens (words), and then check if each token matches words in your list of stop words. If the token matches a stop word, you ignore the token. Otherwise you add the token to the list of valid words.
In this tutorial, we’ll teach you how to remove stop words from text using the NLTK library for Python.
Installing Required Libraries
To run the Python scripts in this tutorial, you'll need to install the NLTK library. Run the following command in your terminal to install it:
pip install nltk
Stop Words Removal
The NLTK library supports stop word removal for a variety of languages. To see which languages are supported, call the fileids() function on the stopwords corpus:
from nltk.corpus import stopwords
print(stopwords.fileids())
Here is a list of all the languages supported by the NLTK library for stop word removal.
['arabic', 'azerbaijani', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hungarian', 'indonesian', 'italian', 'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'slovene', 'spanish', 'swedish', 'tajik', 'turkish']
We’re going to work with English stop words first, then we’ll show you a French example in case you happen to be developing a multi-lingual NLP tool.
Removing English Stop Words
To get a list of English stop words, pass 'english' to the stopwords.words() function, as shown below.
print(stopwords.words('english'))
A list of all English stop words included in the NLTK library is included below:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
To remove stop words from a text string, you need to divide your text into tokens (words). Next, you’ll iterate through the list of tokens and keep only those tokens that are not present in the list of stop words. Here’s an example.
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the data files used by stopwords and word_tokenize (one-time setup)
nltk.download('stopwords')
nltk.download('punkt')

document = "In Python, you do not need to end a statement with a semicolon."
tokens = word_tokenize(document)
filtered_text = [t for t in tokens if t not in stopwords.words("english")]
print(" ".join(filtered_text))
The output shows that stop words like you, do, not, to, a, and with have been removed from the text:
In Python , need end statement semicolon .
Remember, you can use the RegexpTokenizer to remove punctuation from your list of tokens.
Removing French Stop Words
To get a list of French stop words, pass 'french' to the stopwords.words() function, as shown below.
print(stopwords.words('french'))
Here’s a list of all the French stop words:
['au', 'aux', 'avec', 'ce', 'ces', 'dans', 'de', 'des', 'du', 'elle', 'en', 'et', 'eux', 'il', 'ils', 'je', 'la', 'le', 'les', 'leur', 'lui', 'ma', 'mais', 'me', 'même', 'mes', 'moi', 'mon', 'ne', 'nos', 'notre', 'nous', 'on', 'ou', 'par', 'pas', 'pour', 'qu', 'que', 'qui', 'sa', 'se', 'ses', 'son', 'sur', 'ta', 'te', 'tes', 'toi', 'ton', 'tu', 'un', 'une', 'vos', 'votre', 'vous', 'c', 'd', 'j', 'l', 'à', 'm', 'n', 's', 't', 'y', 'été', 'étée', 'étées', 'étés', 'étant', 'étante', 'étants', 'étantes', 'suis', 'es', 'est', 'sommes', 'êtes', 'sont', 'serai', 'seras', 'sera', 'serons', 'serez', 'seront', 'serais', 'serait', 'serions', 'seriez', 'seraient', 'étais', 'était', 'étions', 'étiez', 'étaient', 'fus', 'fut', 'fûmes', 'fûtes', 'furent', 'sois', 'soit', 'soyons', 'soyez', 'soient', 'fusse', 'fusses', 'fût', 'fussions', 'fussiez', 'fussent', 'ayant', 'ayante', 'ayantes', 'ayants', 'eu', 'eue', 'eues', 'eus', 'ai', 'as', 'avons', 'avez', 'ont', 'aurai', 'auras', 'aura', 'aurons', 'aurez', 'auront', 'aurais', 'aurait', 'aurions', 'auriez', 'auraient', 'avais', 'avait', 'avions', 'aviez', 'avaient', 'eut', 'eûmes', 'eûtes', 'eurent', 'aie', 'aies', 'ait', 'ayons', 'ayez', 'aient', 'eusse', 'eusses', 'eût', 'eussions', 'eussiez', 'eussent']
The following example shows how to remove stop words from French text. The process is the same as for English: check whether each token in your original text appears in the list of French stop words, discard the tokens that do, and keep the rest. Finally, pass the remaining words to the join() function to reconstruct the string without stop words.
Here’s an example:
document = "Je suis un étudiant en littérature."
tokens = word_tokenize(document)
filtered_text = [t for t in tokens if t not in stopwords.words("french")]
print(" ".join(filtered_text))
In the output, you'll see that French stop words such as suis, un, and en have been removed from the text. (Je survives because the list contains only the lowercase je and the comparison is case-sensitive.)
Je étudiant littérature .
Get Our Python Developer Kit for Free
I put together a Python Developer Kit with over 100 pre-built Python scripts covering data structures, Pandas, NumPy, Seaborn, machine learning, file processing, web scraping and a whole lot more - and I want you to have it for free. Enter your email address below and I'll send a copy your way.
Adding or Removing Stop Words in NLTK's Default Lists
Since NLTK stores its stop words as a plain list of strings, you can add items to or remove items from the list just as you would with any other Python list. We'll show you how to customize your stop word list in this section.
Adding Stop Words to the NLTK Stop Words List
To add a new stop word to the default list of stop words in NLTK, simply use the append() function, like this:
document = "In Python, you do not need to end a statement with a semicolon."
tokens = word_tokenize(document)
updated_stopwords = stopwords.words("english")
updated_stopwords.append("end")
filtered_text = [t for t in tokens if t not in updated_stopwords]
print(" ".join(filtered_text))
In the script above, we added the word end to the default list of English stop words. In the output below, you should see that the word end has been removed from the text.
Output:
In Python , need statement semicolon .
In addition to adding a single word to the default list of NLTK stop words, you can use the extend()
function to add a list of words to the default stop words list. Here’s an example:
document = "In Python, you do not need to end a statement with a semicolon."
tokens = word_tokenize(document)
updated_stopwords = stopwords.words("english")
updated_stopwords.extend(["end", "need"])
filtered_text = [t for t in tokens if t not in updated_stopwords]
print(" ".join(filtered_text))
Output:
In Python , statement semicolon .
Removing Stop Words from the NLTK Stop Words List
Finally, you can remove stop words from the default NLTK list of stop words, too. To do so, use the remove()
function and pass it the stop word you want removed. For reference, have a look at the following example where we remove the stop word with from the default list of English stop words in NLTK.
document = "In Python, you do not need to end a statement with a semicolon."
tokens = word_tokenize(document)
updated_stopwords = stopwords.words("english")
updated_stopwords.remove("with")
filtered_text = [t for t in tokens if t not in updated_stopwords]
print(" ".join(filtered_text))
In the output below, you can see that the stop word with is no longer stripped from the tokenized text:
In Python , need end statement with semicolon .