Python Named Entity Recognition with NLTK & spaCy

Named entity recognition (NER) is the task of identifying named entities in text. Named entities are real-world objects such as people, products, locations, and dates. For example, “Eiffel Tower” is a named entity because it refers to a specific tower in Paris. Similarly, “Cristiano Ronaldo” is a named entity because it refers to a specific person. Automatic named entity recognition is one of the most important tasks in Natural Language Processing (NLP) because it helps you identify what a text is about and which people, places, organizations, and other entities it discusses.

In this tutorial, we’re going to show you exactly how to perform named entity recognition using Python’s NLTK and spaCy libraries.

Installing Required Libraries

To run the scripts in this tutorial, you’ll first need to install the NLTK and spaCy libraries. Run the following command in your terminal to install the NLTK library.

pip install nltk

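Similarly, run the following commands in your terminal to install the spaCy library and the small English model (en_core_web_sm) that the spaCy section of this tutorial loads.

pip install spacy
python -m spacy download en_core_web_sm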

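To download the data that NLTK needs for the steps in this tutorial (the tokenizer, the POS tagger, the named entity chunker, and a word list), run the following inside your Python application. Resource names can differ between NLTK releases; if NLTK raises a LookupError, the error message names the resource to download.

import nltk
# Tokenizer, POS tagger, named entity chunker, and word list used in this tutorial
for resource in ['punkt', 'averaged_perceptron_tagger', 'maxent_ne_chunker', 'words']:
    nltk.download(resource)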

Named Entity Recognition with NLTK

Python’s NLTK library contains a named entity recognizer called the MaxEnt Chunker, which stands for maximum entropy chunker. To use it for named entity recognition, you pass the part-of-speech (POS) tags of a text to NLTK’s ne_chunk() function. As an example, in this section we’ll try to find the named entities in the following sentence.

sentence = """Born and raised in Madeira, Ronaldo began his senior club career playing for Sporting CP,
before signing with Manchester United in 2003, aged 18. After winning the FA Cup in his first season,
he helped United win three successive Premier League titles, the UEFA Champions League, and the FIFA Club World Cup"""

To perform named entity recognition with NLTK, you have to perform three steps:

1. Convert your text to tokens using the word_tokenize() function.
2. Find the part-of-speech tag for each token using the pos_tag() function.
3. Pass the list of (token, POS tag) tuples to the ne_chunk() function.

The following script performs the first step. It tokenizes the input sentence into individual words using the word_tokenize() function.

from nltk import word_tokenize, pos_tag, ne_chunk

tokens = word_tokenize(sentence)
print(tokens)

Output:

['Born', 'and', 'raised', 'in', 'Madeira', ',', 'Ronaldo', 'began', 'his', 'senior', 'club', 'career', 'playing', 'for', 'Sporting', 'CP', ',', 'before', 'signing', 'with', 'Manchester', 'United', 'in', '2003', ',', 'aged', '18', '.', 'After', 'winning', 'the', 'FA', 'Cup', 'in', 'his', 'first', 'season', ',', 'he', 'helped', 'United', 'win', 'three', 'successive', 'Premier', 'League', 'titles', ',', 'the', 'UEFA', 'Champions', 'League', ',', 'and', 'the', 'FIFA', 'Club', 'World', 'Cup']

Next, to find the POS tags, pass the list of tokens to the pos_tag() function, which returns a list of (token, POS tag) tuples.

pos_tags = pos_tag(tokens)
print(pos_tags)

The output below shows the words along with their POS tags. You can look up the full names of the POS tags by running nltk.help.upenn_tagset(), which may require an additional nltk.download() call first.

Output:

[('Born', 'NNP'), ('and', 'CC'), ('raised', 'VBN'), ('in', 'IN'), ('Madeira', 'NNP'), (',', ','), ('Ronaldo', 'NNP'), ('began', 'VBD'), ('his', 'PRP$'), ('senior', 'JJ'), ('club', 'NN'), ('career', 'NN'), ('playing', 'NN'), ('for', 'IN'), ('Sporting', 'VBG'), ('CP', 'NNP'), (',', ','), ('before', 'IN'), ('signing', 'VBG'), ('with', 'IN'), ('Manchester', 'NNP'), ('United', 'NNP'), ('in', 'IN'), ('2003', 'CD'), (',', ','), ('aged', 'VBD'), ('18', 'CD'), ('.', '.'), ('After', 'IN'), ('winning', 'VBG'), ('the', 'DT'), ('FA', 'NNP'), ('Cup', 'NNP'), ('in', 'IN'), ('his', 'PRP$'), ('first', 'JJ'), ('season', 'NN'), (',', ','), ('he', 'PRP'), ('helped', 'VBD'), ('United', 'NNP'), ('win', 'VB'), ('three', 'CD'), ('successive', 'JJ'), ('Premier', 'NNP'), ('League', 'NNP'), ('titles', 'NNS'), (',', ','), ('the', 'DT'), ('UEFA', 'NNP'), ('Champions', 'NNP'), ('League', 'NNP'), (',', ','), ('and', 'CC'), ('the', 'DT'), ('FIFA', 'NNP'), ('Club', 'NNP'), ('World', 'NNP'), ('Cup', 'NNP')]
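
Because pos_tag() returns plain (token, tag) tuples, ordinary Python is enough to inspect them. As a minimal sketch, the snippet below copies the first few tuples from the output above into a sample list, counts the tags, and collects the tokens tagged NNP (singular proper noun). The names sample_tags, tag_counts, and proper_nouns are illustrative only and are not part of NLTK.

```python
from collections import Counter

# First few (token, tag) tuples copied from the pos_tag() output above
sample_tags = [('Born', 'NNP'), ('and', 'CC'), ('raised', 'VBN'), ('in', 'IN'),
               ('Madeira', 'NNP'), (',', ','), ('Ronaldo', 'NNP'), ('began', 'VBD')]

# Count how often each POS tag occurs in the sample
tag_counts = Counter(tag for _, tag in sample_tags)

# Keep only the tokens tagged as singular proper nouns (NNP)
proper_nouns = [token for token, tag in sample_tags if tag == 'NNP']

print(tag_counts['NNP'])
print(proper_nouns)
```

Note that “Born” appears among the proper nouns: the tagger labeled the capitalized first word of the sentence NNP, which is worth keeping in mind when reading the chunker’s results.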

The last step is to pass the list returned by the pos_tag() function to the ne_chunk() function, as shown below:

named_entities = ne_chunk(pos_tags)
print(named_entities)

In the output below, you can see the named entities along with the remaining words in the form of a tree, where S is the root. For instance, “Ronaldo” has been tagged as a person and “UEFA” has been recognized as an organization, but “Manchester United” has also been tagged as a person, which is incorrect.

Output:

(S
  (GPE Born/NNP)
  and/CC
  raised/VBN
  in/IN
  (GPE Madeira/NNP)
  ,/,
  (PERSON Ronaldo/NNP)
  began/VBD
  his/PRP$
  senior/JJ
  club/NN
  career/NN
  playing/NN
  for/IN
  Sporting/VBG
  (ORGANIZATION CP/NNP)
  ,/,
  before/IN
  signing/VBG
  with/IN
  (PERSON Manchester/NNP United/NNP)
  in/IN
  2003/CD
  ,/,
  aged/VBD
  18/CD
  ./.
  After/IN
  winning/VBG
  the/DT
  FA/NNP
  Cup/NNP
  in/IN
  his/PRP$
  first/JJ
  season/NN
  ,/,
  he/PRP
  helped/VBD
  (GPE United/NNP)
  win/VB
  three/CD
  successive/JJ
  Premier/NNP
  League/NNP
  titles/NNS
  ,/,
  the/DT
  (ORGANIZATION UEFA/NNP)
  Champions/NNP
  League/NNP
  ,/,
  and/CC
  the/DT
  (ORGANIZATION FIFA/NNP Club/NNP)
  World/NNP
  Cup/NNP)
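
The tree above can be flattened into a simple list of (entity text, label) pairs. The following is a minimal sketch rather than an official NLTK recipe: the helper tree_to_entities and the hand-built sample_tree are hypothetical names introduced here, and the sample mirrors only a fragment of the tree so that the snippet runs without any downloaded NLTK data.

```python
from nltk.tree import Tree

def tree_to_entities(tree):
    """Return (entity text, label) pairs for the labeled subtrees of a chunk tree."""
    entities = []
    for node in tree:
        # Named entities are nested Tree objects; ordinary words are (token, tag) tuples
        if isinstance(node, Tree):
            text = " ".join(token for token, _tag in node.leaves())
            entities.append((text, node.label()))
    return entities

# Hand-built fragment mirroring part of the tree above (assumption: for illustration only)
sample_tree = Tree('S', [
    Tree('GPE', [('Madeira', 'NNP')]),
    (',', ','),
    Tree('PERSON', [('Ronaldo', 'NNP')]),
    ('began', 'VBD'),
    Tree('PERSON', [('Manchester', 'NNP'), ('United', 'NNP')]),
])

print(tree_to_entities(sample_tree))
```

With the full result from this section, the call would be tree_to_entities(named_entities).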

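The tree shows that although NLTK provides named entity recognition, it is not very accurate on this sentence: “Born” and the second “United” are tagged as geopolitical entities, “Manchester United” is tagged as a person, and “FA Cup” and “Premier League” are not recognized at all. In the next section, you’ll see how to perform named entity recognition with the Python spaCy library instead.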

Named Entity Recognition with spaCy

The spaCy library also provides named entity recognition. First, import the spaCy library along with its small English web model, as shown in the following script:

import spacy
import en_core_web_sm
spacy_model = en_core_web_sm.load()

To perform named entity recognition, pass the text to the spaCy model object. In this demo, we’re going to use the same sentence defined in our NLTK example:

entity_doc = spacy_model(sentence)

Next, to find extracted entities, you can use the ents attribute as shown below:

entity_doc.ents

The output shows all the entities extracted from the text.

Output:

(Ronaldo,
 Sporting CP,
 Manchester United,
 2003,
 18,
 the FA Cup,
 his first season,
 United,
 three,
 Premier League,
 the UEFA Champions League,
 the FIFA Club World Cup)

To print the text and type of each extracted entity, you can use the text and label_ attributes of the individual entities, as shown below:

print([(entity.text, entity.label_) for entity in entity_doc.ents])

Output:

[('Ronaldo', 'PERSON'), ('Sporting CP', 'ORG'), ('Manchester United', 'PERSON'), ('2003', 'DATE'), ('18', 'DATE'), ('the FA Cup', 'EVENT'), ('his first season', 'DATE'), ('United', 'ORG'), ('three', 'CARDINAL'), ('Premier League', 'ORG'), ('the UEFA Champions League', 'ORG'), ('the FIFA Club World Cup', 'ORG')]

The above output shows much better results: “Sporting CP” has been identified as an organization, “the FA Cup” as an event, and “Premier League” as an organization. However, “Manchester United” is still identified as a person, which is not correct. This is an example of why NLP is so challenging, but also so impressive when it works.
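
One possible way to correct a known misclassification such as “Manchester United” is a rule-based override. The sketch below uses spaCy’s entity_ruler component, which is a different technique from the statistical recognizer used above, and it assumes spaCy v3. To stay self-contained it runs on a blank English pipeline created with spacy.blank() rather than on en_core_web_sm, so only the rule produces entities. The names blank_nlp, ruler, and ruled_entities are illustrative.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained model required
blank_nlp = spacy.blank("en")

# Add the rule-based entity ruler and declare this exact phrase an organization
ruler = blank_nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "Manchester United"}])

doc = blank_nlp("before signing with Manchester United in 2003, aged 18.")
ruled_entities = [(entity.text, entity.label_) for entity in doc.ents]
print(ruled_entities)
```

With the trained pipeline, the ruler would typically be added ahead of the statistical recognizer, for example spacy_model.add_pipe("entity_ruler", before="ner"), so that the rule takes precedence. Treat this as a starting point rather than a complete fix, since every phrase has to be listed by hand.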

For convenience, here’s the full spaCy Python code for named entity recognition:

import spacy
import en_core_web_sm
spacy_model = en_core_web_sm.load()

sentence = """Born and raised in Madeira, Ronaldo began his senior club career playing for Sporting CP,
before signing with Manchester United in 2003, aged 18. After winning the FA Cup in his first season,
he helped United win three successive Premier League titles, the UEFA Champions League, and the FIFA Club World Cup"""

entity_doc = spacy_model(sentence)
# Print the extracted entities, then each entity's text and label
print(entity_doc.ents)

print([(entity.text, entity.label_) for entity in entity_doc.ents])
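
As a possible follow-up to the script, the (text, label) pairs it prints can be grouped by label, which makes it easier to see everything spaCy considered an organization, a date, and so on. This is a minimal sketch in plain Python: entity_pairs is copied from the output shown earlier, and the names entity_pairs and entities_by_label are illustrative.

```python
from collections import defaultdict

# (text, label) pairs copied from the spaCy output shown earlier
entity_pairs = [('Ronaldo', 'PERSON'), ('Sporting CP', 'ORG'), ('Manchester United', 'PERSON'),
                ('2003', 'DATE'), ('18', 'DATE'), ('the FA Cup', 'EVENT'),
                ('his first season', 'DATE'), ('United', 'ORG'), ('three', 'CARDINAL'),
                ('Premier League', 'ORG'), ('the UEFA Champions League', 'ORG'),
                ('the FIFA Club World Cup', 'ORG')]

# Map each label to the list of entity texts that carry it
entities_by_label = defaultdict(list)
for text, label in entity_pairs:
    entities_by_label[label].append(text)

for label, texts in entities_by_label.items():
    print(label, texts)
```

In a live script, the same loop could iterate over entity_doc.ents and use entity.text and entity.label_ directly.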

This article showed how to perform named entity recognition with Python’s NLTK and spaCy libraries. Although both libraries can extract named entities, their results are rarely 100 percent correct, as is the case with any trained model.

If you want better results, you can use machine learning or deep learning techniques to train your own named entity recognizer. However, that requires more advanced programming and statistical knowledge.


The Python Tutorials Blog
