Named entity recognition (NER) refers to identifying named entities in text. Named entities are real-world objects, like people, products, locations, and dates. For example, “Eiffel Tower” is a named entity since it refers to a specific tower located in Paris. Similarly, “Cristiano Ronaldo” is a named entity because it refers to a specific person. Automatic named entity recognition is one of the most important tasks in Natural Language Processing (NLP) since it helps you identify what a text is about and which people, places, objects, and organizations are being discussed in it.
In this tutorial, we’re going to show you exactly how to perform named entity recognition using Python’s NLTK and spaCy libraries.
Installing Required Libraries
To run the scripts in this tutorial, you’ll first need to install the NLTK and spaCy libraries. Execute the following command in your terminal to install the NLTK library.
pip install nltk
Similarly, execute the following command in your terminal to install the spaCy library.
pip install spacy
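The spaCy examples in this tutorial use spaCy’s small English model, en_core_web_sm. If you haven’t downloaded it before, run the following command in your terminal:
python -m spacy download en_core_web_sm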
To download the models NLTK needs for tokenization, POS tagging, and named entity chunking, run the following commands inside your Python application:
import nltk

for resource in ['punkt', 'averaged_perceptron_tagger', 'maxent_ne_chunker', 'words']:
    nltk.download(resource)
Named Entity Recognition with NLTK
Python’s NLTK library includes a named entity chunker, the ne_chunk() function. As an example, in this section we’ll use it to find the named entities in the following sentence.
sentence = """Born and raised in Madeira, Ronaldo began his senior club career playing for Sporting CP,
before signing with Manchester United in 2003, aged 18. After winning the FA Cup in his first season,
he helped United win three successive Premier League titles, the UEFA Champions League, and the FIFA Club World Cup"""
To perform named entity recognition with NLTK, you have to perform three steps:
- Convert your text to tokens using the word_tokenize() function.
- Find the part-of-speech (POS) tag for each token using the pos_tag() function.
- Pass the list of (token, POS tag) tuples to the ne_chunk() function.
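Put together, the three steps boil down to a single chained call, sketched below using the sentence defined above. The rest of this section walks through each step individually so you can inspect the intermediate results.
from nltk import word_tokenize, pos_tag, ne_chunk

# Tokenize, POS tag, and chunk the named entities in one chained call
named_entities = ne_chunk(pos_tag(word_tokenize(sentence)))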
The following script performs the first step: it tokenizes the input sentence into individual words using the word_tokenize() function.
from nltk import word_tokenize, pos_tag, ne_chunk
tokens = word_tokenize(sentence)
print(tokens)
Output:
['Born', 'and', 'raised', 'in', 'Madeira', ',', 'Ronaldo', 'began', 'his', 'senior', 'club', 'career', 'playing', 'for', 'Sporting', 'CP', ',', 'before', 'signing', 'with', 'Manchester', 'United', 'in', '2003', ',', 'aged', '18', '.', 'After', 'winning', 'the', 'FA', 'Cup', 'in', 'his', 'first', 'season', ',', 'he', 'helped', 'United', 'win', 'three', 'successive', 'Premier', 'League', 'titles', ',', 'the', 'UEFA', 'Champions', 'League', ',', 'and', 'the', 'FIFA', 'Club', 'World', 'Cup']
Next, to find POS tags, pass the list of tokens to the pos_tag() function, which returns a list of tuples containing each token and its corresponding POS tag.
pos_tags = pos_tag(tokens)
print(pos_tags)
The output below shows the words along with their POS tags. You can see descriptions of all the POS tags by running nltk.help.upenn_tagset().
Output:
[('Born', 'NNP'), ('and', 'CC'), ('raised', 'VBN'), ('in', 'IN'), ('Madeira', 'NNP'), (',', ','), ('Ronaldo', 'NNP'), ('began', 'VBD'), ('his', 'PRP$'), ('senior', 'JJ'), ('club', 'NN'), ('career', 'NN'), ('playing', 'NN'), ('for', 'IN'), ('Sporting', 'VBG'), ('CP', 'NNP'), (',', ','), ('before', 'IN'), ('signing', 'VBG'), ('with', 'IN'), ('Manchester', 'NNP'), ('United', 'NNP'), ('in', 'IN'), ('2003', 'CD'), (',', ','), ('aged', 'VBD'), ('18', 'CD'), ('.', '.'), ('After', 'IN'), ('winning', 'VBG'), ('the', 'DT'), ('FA', 'NNP'), ('Cup', 'NNP'), ('in', 'IN'), ('his', 'PRP$'), ('first', 'JJ'), ('season', 'NN'), (',', ','), ('he', 'PRP'), ('helped', 'VBD'), ('United', 'NNP'), ('win', 'VB'), ('three', 'CD'), ('successive', 'JJ'), ('Premier', 'NNP'), ('League', 'NNP'), ('titles', 'NNS'), (',', ','), ('the', 'DT'), ('UEFA', 'NNP'), ('Champions', 'NNP'), ('League', 'NNP'), (',', ','), ('and', 'CC'), ('the', 'DT'), ('FIFA', 'NNP'), ('Club', 'NNP'), ('World', 'NNP'), ('Cup', 'NNP')]
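If you’d like to check a single tag, such as NNP, NLTK can print its definition. This assumes the tagsets resource has been downloaded, which is not included in the downloads above:
import nltk

nltk.download('tagsets')  # documentation for the POS tag sets
nltk.help.upenn_tagset('NNP')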
The last step in finding named entities with Python is to pass the list returned by the pos_tag() function to the ne_chunk() function, as shown below:
named_entities = ne_chunk(pos_tags)
print(named_entities)
In the output below, you can see the sentence represented as a tree in which recognized named entities are grouped into labeled chunks such as PERSON, GPE (geopolitical entity), and ORGANIZATION.
Output:
(S
(GPE Born/NNP)
and/CC
raised/VBN
in/IN
(GPE Madeira/NNP)
,/,
(PERSON Ronaldo/NNP)
began/VBD
his/PRP$
senior/JJ
club/NN
career/NN
playing/NN
for/IN
Sporting/VBG
(ORGANIZATION CP/NNP)
,/,
before/IN
signing/VBG
with/IN
(PERSON Manchester/NNP United/NNP)
in/IN
2003/CD
,/,
aged/VBD
18/CD
./.
After/IN
winning/VBG
the/DT
FA/NNP
Cup/NNP
in/IN
his/PRP$
first/JJ
season/NN
,/,
he/PRP
helped/VBD
(GPE United/NNP)
win/VB
three/CD
successive/JJ
Premier/NNP
League/NNP
titles/NNS
,/,
the/DT
(ORGANIZATION UEFA/NNP)
Champions/NNP
League/NNP
,/,
and/CC
the/DT
(ORGANIZATION FIFA/NNP Club/NNP)
World/NNP
Cup/NNP)
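The tree format isn’t always convenient to work with. If you’d rather have a flat list of (entity, label) pairs, a minimal sketch that walks the named_entities tree built above looks like this:
# Labeled subtrees are named entity chunks; plain (token, POS) tuples are not
for chunk in named_entities:
    if hasattr(chunk, 'label'):
        entity = ' '.join(token for token, pos in chunk.leaves())
        print(entity, '->', chunk.label())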
The chunk tree shows that though NLTK provides functionality for named entity recognition, it is not very accurate: “Born” is tagged as a GPE, “Manchester United” as a PERSON, and “Sporting” isn’t chunked at all. In the next section, you’ll see how to perform named entity recognition with the Python spaCy library instead.
Named Entity Recognition with spaCy
The spaCy library also provides named entity recognition functionality. First, import the spaCy library along with its small English model, en_core_web_sm, as shown in the following script:
import spacy
import en_core_web_sm
spacy_model = en_core_web_sm.load()
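If you prefer, you can load the same model through spaCy’s standard loader instead of importing the model package directly:
import spacy

# Equivalent to en_core_web_sm.load(), provided the model has been downloaded
spacy_model = spacy.load("en_core_web_sm")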
To perform named entity recognition, pass the text to the spaCy model object. In this demo, we’re going to use the same sentence we defined in the NLTK section:
entity_doc = spacy_model(sentence)
Next, to see the extracted entities, you can use the ents attribute of the resulting Doc object:
entity_doc.ents
The output shows all the entities extracted from the text.
Output:
(Ronaldo,
Sporting CP,
Manchester United,
2003,
18,
the FA Cup,
his first season,
United,
three,
Premier League,
the UEFA Champions League,
the FIFA Club World Cup)
To print the text and type of each extracted entity, you can use the text and label_ attributes:
print([(entity.text, entity.label_) for entity in entity_doc.ents])
Output:
[('Ronaldo', 'PERSON'), ('Sporting CP', 'ORG'), ('Manchester United', 'PERSON'), ('2003', 'DATE'), ('18', 'DATE'), ('the FA Cup', 'EVENT'), ('his first season', 'DATE'), ('United', 'ORG'), ('three', 'CARDINAL'), ('Premier League', 'ORG'), ('the UEFA Champions League', 'ORG'), ('the FIFA Club World Cup', 'ORG')]
The above output shows much better results: “Sporting CP” and “Premier League” have been identified as organizations and “the FA Cup” as an event. However, “Manchester United” is still identified as a person, which is not correct. This is an example of why NLP is so challenging, but also so impressive when it works.
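If a particular mistake matters for your application, spaCy lets you patch it with rules before the statistical ner component runs. Here’s a minimal sketch, assuming spaCy v3 or newer and reusing the spacy_model and sentence objects defined above:
# Add a rule-based entity_ruler ahead of the statistical "ner" component
ruler = spacy_model.add_pipe("entity_ruler", before="ner")

# Force "Manchester United" to be labeled as an organization
ruler.add_patterns([{"label": "ORG", "pattern": "Manchester United"}])

# Re-run the pipeline so the new rule takes effect
entity_doc = spacy_model(sentence)
print([(entity.text, entity.label_) for entity in entity_doc.ents])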
For convenience, here’s the full spaCy Python code for named entity recognition:
import spacy
import en_core_web_sm
spacy_model = en_core_web_sm.load()
sentence = """Born and raised in Madeira, Ronaldo began his senior club career playing for Sporting CP,
before signing with Manchester United in 2003, aged 18. After winning the FA Cup in his first season,
he helped United win three successive Premier League titles, the UEFA Champions League, and the FIFA Club World Cup"""
entity_doc = spacy_model(sentence)
print(entity_doc.ents)
print([(entity.text, entity.label_) for entity in entity_doc.ents])
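spaCy also ships with a built-in visualizer, displacy, which highlights the recognized entities in HTML. Appended to the script above, a short sketch looks like this:
from spacy import displacy

# Render the recognized entities as highlighted HTML markup
html = displacy.render(entity_doc, style="ent", page=True)

# Save the markup so it can be opened in a web browser
with open("entities.html", "w", encoding="utf-8") as f:
    f.write(html)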
This article showed how to perform named entity recognition with Python’s NLTK and spaCy libraries. Though both libraries provide named entity extraction capabilities, the results are rarely 100 percent correct, just as when you train your own deep learning model.
If you want better results, you can try machine learning or deep learning techniques to train your own named entity parsers. However, that requires more advanced programming and statistical knowledge. You can subscribe below for more Python tutorials like this one. Otherwise, here is an article to get you started.
Get Our Python Developer Kit for Free
I put together a Python Developer Kit with over 100 pre-built Python scripts covering data structures, Pandas, NumPy, Seaborn, machine learning, file processing, web scraping and a whole lot more - and I want you to have it for free. Enter your email address below and I'll send a copy your way.