Text documents contain sentences, which in turn contain words. When working with languages in Python, you often need to divide walls of text into individual sentences, and sentences into individual words.

For instance, if you want to perform part-of-speech tagging or named entity recognition, you’ll first need to pull out the words from a series of sentences. Similarly, if you want to perform Python sentiment analysis for each sentence in your document, you’ll need to divide the document into sentences. This is where tokenization comes into play.

Tokenization is the process of dividing text into sentences, and sentences into words. Specifically, the process of dividing a text document into sentences is called sentence tokenization while dividing a sentence into words is called word tokenization.

The Natural Language Toolkit (NLTK) is one of the most widely used Natural Language Processing (NLP) libraries for both sentence and word tokenization. In this tutorial, we’ll show you how to perform NLTK tokenization with Python.

Installing Required Libraries

To run the scripts in this tutorial, install the NLTK library by executing the following command in your terminal:

pip install nltk

NLTK’s word and sentence tokenizers also rely on the pre-trained Punkt model. Download it once from a Python shell (depending on your NLTK version, the error message may ask you to download punkt_tab instead):

import nltk
nltk.download('punkt')

Python Word Tokenization

The first thing we’ll do is divide a sentence into individual tokens, or words. Word tokenization is often a prerequisite to many other NLP tasks, such as part-of-speech tagging and chunking. It’s related to Python lemmatization, but lemmatization reduces an individual word to its dictionary form.

NLTK Tokenization with Punctuations

The word_tokenize() function from the NLTK library tokenizes a sentence into words, as shown in the following example. By default, word_tokenize() returns a list of tokens that includes punctuation marks.

import nltk
document = "In Python, you do not need to end a statement with a semicolon."
word_tokens = nltk.word_tokenize(document)
print(word_tokens)

The output shows the list of tokenized words. Notice that the comma (,) and the period (.) are included as separate tokens.

Output:

['In', 'Python', ',', 'you', 'do', 'not', 'need', 'to', 'end', 'a', 'statement', 'with', 'a', 'semicolon', '.']

NLTK Tokenization without Punctuations

What if you don’t want punctuation marks in your tokenized output? In that case, the RegexpTokenizer class from the nltk.tokenize module is your best bet. Its tokenize() method splits a sentence into words according to a regular expression you supply, so punctuation can simply be left unmatched.

Here is an example:

from nltk.tokenize import RegexpTokenizer

document = "In Python, you do not need to end a statement with a semicolon."
tokenizer = RegexpTokenizer(r'\w+')

word_tokens = tokenizer.tokenize(document)
print(word_tokens)

From the output, you can see that the comma and period are no longer included in the list of tokenized items. That’s because \w is a predefined regular expression character class that matches only alphanumeric characters and the underscore, and the + quantifier groups consecutive matches into a single token, so punctuation never makes it into the output.

Output:

['In', 'Python', 'you', 'do', 'not', 'need', 'to', 'end', 'a', 'statement', 'with', 'a', 'semicolon']
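The pattern you pass to RegexpTokenizer is entirely up to you. As one sketch (the pattern and example sentence are our own choices, not from NLTK’s documentation), allowing apostrophes inside a token keeps contractions whole:

```python
from nltk.tokenize import RegexpTokenizer

# \w+ alone would split "doesn't" into 'doesn' and 't';
# permitting apostrophes inside a token keeps contractions intact.
tokenizer = RegexpTokenizer(r"[\w']+")
print(tokenizer.tokenize("Python doesn't require semicolons."))
# -> ['Python', "doesn't", 'require', 'semicolons']
```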

Python Sentence Tokenization

In sentence tokenization, a text file or other document is divided into a list of sentences. To perform sentence tokenization with the Python NLTK library, the sent_tokenize() function is used. Let’s take a look at an example:

import nltk
document = """Recent research has increasingly focused on unsupervised and semi-supervised learning algorithms. Such algorithms can learn from data that has not been hand-annotated with the desired answers or using a combination of annotated and non-annotated data. Generally, this task is much more difficult than supervised learning, and typically produces less accurate results for a given amount of input data. However, there is an enormous amount of non-annotated data available (including, among other things, the entire content of the World Wide Web), which can often make up for the inferior results if the algorithm used has a low enough time complexity to be practical."""

sentence_tokens = nltk.sent_tokenize(document)
for sent in sentence_tokens:
    print("-", sent)

Output:

- Recent research has increasingly focused on unsupervised and semi-supervised learning algorithms.
- Such algorithms can learn from data that has not been hand-annotated with the desired answers or using a combination of annotated and non-annotated data.
- Generally, this task is much more difficult than supervised learning, and typically produces less accurate results for a given amount of input data.
- However, there is an enormous amount of non-annotated data available (including, among other things, the entire content of the World Wide Web), which can often make up for the inferior results if the algorithm used has a low enough time complexity to be practical.

You can see that each sentence was successfully extracted from the original paragraph and printed on its own line in the output. From here, you can take the individual sentence_tokens and process them with the rest of your NLP code. Play around and try performing word tokenization on each tokenized sentence.
