Hugging Face Pipelines for Python Natural Language Processing

Hugging Face is an open-source data science platform that provides tools for natural language and image processing tasks.

This tutorial explains how to use Hugging Face pipelines to help with your natural language processing (NLP) tasks. We’re going to show you how to classify, generate, complete, translate and summarize text using the pipelines from the transformers library.

What are Hugging Face pipelines

Hugging Face pipelines are objects whose APIs abstract away most of the complex code required to perform advanced natural language processing tasks, like text classification and generation, named entity recognition and question answering.

The pipeline() function from the transformers library is used to create Hugging Face pipelines. In this tutorial, we’re going to perform the following tasks using Hugging Face pipelines:

Sentiment Classification
Token Classification
Text Generation
Text Completion
Text Translation
Text Summarization
Zero Shot Classification
Question Answering

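Run the following pip command to install the transformers library from Hugging Face. The leading ! is notebook syntax (for example, Jupyter or Colab); drop it when you run the command in a terminal.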

! pip install datasets transformers[sentencepiece]

Sentiment Classification

Sentiment classification is the task of assigning a sentiment to a piece of text. Hugging Face pipelines allow you to predict the sentiment of a text in a few lines of code.

The pipeline() function accepts the task to perform as a string. To perform sentiment classification, pass sentiment-analysis as the task. The pipeline() function returns an object that predicts the sentiment of the sentence, or list of sentences, passed to it.

The following script predicts the sentiment of a single sentence. The input sentence clearly contains a positive sentiment, which is correctly predicted by the sentiment classifier. The output also contains a confidence score for the prediction.

Working with the Default `pipeline()` Function

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
output = classifier("The new movie is fantastic - I really liked it")
print(output)

Output:

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
[{'label': 'POSITIVE', 'score': 0.999884843826294}]

You could also pass a list of sentences to the sentiment classifier pipeline. In the output, you’ll then see a list of results corresponding to each input sentence.

In the following script, the first input sentence is classified as having a positive sentiment, while the second input sentence is predicted to have a negative sentiment.

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
sentences = ["Tennis is my favorite sport",
           "I do not like football much though"]
output = classifier(sentences)
for items in output:
  print(items)

Output:

{'label': 'POSITIVE', 'score': 0.9995262026786804}
{'label': 'NEGATIVE', 'score': 0.9992342591285706}
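
Because the results come back in the same order as the inputs, you can pair them up with zip(). The following is a minimal sketch that works on the output printed above rather than calling the model again. The THRESHOLD value and the confident_labels name are hypothetical and only used for illustration.

```python
# Sentences and results copied from the example above.
sentences = ["Tennis is my favorite sport",
             "I do not like football much though"]
output = [{'label': 'POSITIVE', 'score': 0.9995262026786804},
          {'label': 'NEGATIVE', 'score': 0.9992342591285706}]

THRESHOLD = 0.9  # hypothetical cut-off; tune it for your own data

# Keep only the predictions the model is reasonably sure about.
confident_labels = [
    (sentence, result["label"])
    for sentence, result in zip(sentences, output)
    if result["score"] >= THRESHOLD
]

for sentence, label in confident_labels:
    print(f"{label:8} {sentence}")
```

Filtering on the score like this is one simple way to route low-confidence predictions to a manual review step.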

Passing a Model from the Hugging Face Hub to a Pipeline

The pipeline() function has a default model for each of the tasks. The default model for the sentiment analysis task is distilbert-base-uncased-finetuned-sst-2-english.

You can classify sentiments with any other text classification model from the Hugging Face model hub.

To do so, go to the Hugging Face model hub and pick a model from the category that matches your task. Copy the name of the model and pass it to the model parameter of the pipeline() function.

For instance, in the script below we pass the finiteautomata/bertweet-base-sentiment-analysis model to the pipeline() function. This model classifies a sentiment into three categories: positive, negative and neutral.

In the output, you can see that the model assigns each of the three input sentences to one of the positive, negative or neutral categories.

from transformers import pipeline

classifier = pipeline("sentiment-analysis", model = "finiteautomata/bertweet-base-sentiment-analysis")
sentences = ["The movie is fantastic",
           "I did not like the ice cream today",
           "I am visiting Paris today"]
output = classifier(sentences)
print(output)

Output:

[{'label': 'POS', 'score': 0.9904395341873169}, {'label': 'NEG', 'score': 0.9752851724624634}, {'label': 'NEU', 'score': 0.7367361783981323}]
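
The warning printed in the first example recommends naming both a model and a revision explicitly. Here is a minimal sketch of how that could look, assuming the revision string from that warning is still valid on the hub. PINNED and build_pinned_classifier are hypothetical names used only for illustration, and the function is defined but not called here because calling it downloads the model.

```python
# Hypothetical settings: the model name and revision are copied from the
# warning message shown earlier; check the model page before relying on them.
PINNED = {
    "task": "sentiment-analysis",
    "model": "distilbert-base-uncased-finetuned-sst-2-english",
    "revision": "af0f99b",
}

def build_pinned_classifier(settings=PINNED):
    """Create a pipeline with an explicit model name and revision."""
    # Imported lazily so this sketch can be read without transformers installed.
    from transformers import pipeline
    return pipeline(
        settings["task"],
        model=settings["model"],
        revision=settings["revision"],
    )

# Usage (downloads the model on first call):
# classifier = build_pinned_classifier()
# print(classifier("The new movie is fantastic - I really liked it"))
```

Pinning the revision means a later update to the model repository should not silently change your predictions.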

The Internal Working of the `pipeline()` Function

There are three main tasks performed by the pipeline() function:

Tokenization
Generating Model Outputs
Post-processing Model output

Tokenization

The first step performed by the pipeline() function is the tokenization of the input text. Tokenization refers to splitting the text into tokens (words, subwords or characters) and mapping each token to a numeric ID. We’ve done tokenization before with NLTK.

The Hugging Face AutoTokenizer class can be used to tokenize the input text. You need to pass the name of the Hugging Face model you want to use for tokenization to the from_pretrained() method of AutoTokenizer.

The tokenizer object returns the numeric representation of the input text, as shown in the following script. It’s important to note that the numeric representation will differ depending on the Hugging Face model used for tokenization.

from transformers import AutoTokenizer

model_name = "finiteautomata/bertweet-base-sentiment-analysis"
tokenizer = AutoTokenizer.from_pretrained(model_name)

sentences = ["The movie is fantastic",
           "I did not like the ice cream today",
           "I am visiting Paris today"]

inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors = "pt")
print(inputs)

Output:

{'input_ids': tensor([[   0,   47,  566,   17, 2877,    2,    1,    1,    1,    1],
        [   0,    8,  103,   46,   43,    6, 1110, 2074,  128,    2],
        [   0,    8,  155, 6512, 3177,  128,    2,    1,    1,    1]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]])}
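
To make the padding and attention mask in this output easier to read, here is a toy sketch in plain Python. It is not how the Hugging Face tokenizer works internally: real tokenizers use learned subword vocabularies and add special start and end tokens. The build_vocab and encode helpers and the PAD_ID value are hypothetical and only meant to illustrate the idea.

```python
# Toy illustration only: real Hugging Face tokenizers use learned subword
# vocabularies and special tokens, not a simple whitespace split.
PAD_ID = 1  # hypothetical padding id, chosen to mirror the output above

def build_vocab(sentences):
    """Assign an integer id (starting at 2) to every distinct lowercase word."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab) + 2)
    return vocab

def encode(sentences, vocab):
    """Return padded id lists and the matching attention masks."""
    ids = [[vocab[word] for word in s.lower().split()] for s in sentences]
    longest = max(len(row) for row in ids)
    input_ids = [row + [PAD_ID] * (longest - len(row)) for row in ids]
    attention_mask = [[1] * len(row) + [0] * (longest - len(row)) for row in ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

sentences = ["The movie is fantastic",
             "I did not like the ice cream today",
             "I am visiting Paris today"]

vocab = build_vocab(sentences)
batch = encode(sentences, vocab)
for row, mask in zip(batch["input_ids"], batch["attention_mask"]):
    print(row, mask)
```

Shorter sentences are padded to the length of the longest one, and the attention mask marks real tokens with 1 and padding with 0, which is the same pattern visible in the tokenizer output above.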

Generating Model Outputs

The next step is to pass the tokenized input to the Hugging Face model. The shape of the model output depends on the model type. For example, the classification model in the following script returns a 3 x 3 tensor: one row for each of the three input sentences and one column for each of the three output labels. The values are logits, which are raw, unnormalized scores rather than probabilities.

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_name)
outputs = model(**inputs)
print(outputs.logits)

Output:

tensor([[-2.7136, -1.0891,  3.7312],
        [ 3.2211, -0.6129, -2.3733],
        [-4.0080,  2.4012,  1.3675]], grad_fn=<AddmmBackward>)

Post-processing Model output

As we just mentioned, a Hugging Face model returns outputs in the form of logits, and these need further processing depending upon the task we’re trying to accomplish. For instance, for the classification task, the pipeline() function passes the logits to a softmax function, which converts each row into probabilities that sum to 1.

import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

Output:

tensor([[0.0016, 0.0080, 0.9904],
        [0.9753, 0.0211, 0.0036],
        [0.0012, 0.7367, 0.2621]], grad_fn=<SoftmaxBackward>)

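The output of the softmax function is then passed to the argmax function, which returns the index of the highest probability in each row. The model’s id2label mapping translates that index into a label name.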

print(torch.argmax(predictions, dim=1))
print(model.config.id2label)

Output:

tensor([2, 0, 1])
{0: 'NEG', 1: 'NEU', 2: 'POS'}
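
The two post-processing steps can also be reproduced in plain Python, which makes the arithmetic easy to check by hand. This is a minimal sketch using the rounded logits printed earlier. The softmax and postprocess helpers are hypothetical names, not part of the transformers API.

```python
import math

# Logits copied from the model output shown earlier (rounded to 4 decimals).
logits = [[-2.7136, -1.0891, 3.7312],
          [3.2211, -0.6129, -2.3733],
          [-4.0080, 2.4012, 1.3675]]
id2label = {0: "NEG", 1: "NEU", 2: "POS"}

def softmax(row):
    """Turn one row of logits into probabilities that sum to 1."""
    peak = max(row)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [value / total for value in exps]

def postprocess(logits, id2label):
    """Build pipeline-style results: the best label and its probability."""
    results = []
    for row in logits:
        probs = softmax(row)
        best = max(range(len(probs)), key=probs.__getitem__)  # argmax
        results.append({"label": id2label[best], "score": probs[best]})
    return results

results = postprocess(logits, id2label)
for item in results:
    print(item)
```

The labels and scores should line up with what the pipeline returned for the same three sentences, up to rounding of the logits.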

Now that we’ve manually stepped through what’s happening behind the scenes, let’s see some of the other tasks the pipeline() function can perform.

Token classification

Token classification refers to assigning labels to individual tokens (e.g., words inside a string). Named entity recognition is an example of a token classification task. The Hugging Face pipeline() function can be used to perform named entity recognition. To do so, you need to pass ner as the task name to the function. To merge consecutive tokens that belong to the same entity into a single result (such as the three words of a city name), you can pass True as the value for the grouped_entities parameter. Newer releases of transformers deprecate grouped_entities in favor of aggregation_strategy="simple", so you may see a warning when running this script.

Here’s an example of how you can perform named entity recognition with the pipeline() function using a simple sentence with 3 named entities.

from transformers import pipeline

named_entities = pipeline("ner", grouped_entities=True)
output = named_entities("John is working at Google, in New york City.")
for items in output:
  print(items)

Output:

{'entity_group': 'PER', 'score': 0.99814034, 'word': 'John', 'start': 0, 'end': 4}
{'entity_group': 'ORG', 'score': 0.9978345, 'word': 'Google', 'start': 19, 'end': 25}
{'entity_group': 'LOC', 'score': 0.9739967, 'word': 'New york City', 'start': 30, 'end': 43}

Notice how all three named entities were recognized, even though “york” is written in lowercase in the input.
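
The start and end values in each result are character offsets into the input string, so you can use them to slice the original text. Here is a minimal sketch that works on the output printed above instead of calling the model again. The by_type dictionary is a hypothetical helper for grouping the entities.

```python
text = "John is working at Google, in New york City."

# Results copied from the pipeline output above.
entities = [
    {'entity_group': 'PER', 'score': 0.99814034, 'word': 'John', 'start': 0, 'end': 4},
    {'entity_group': 'ORG', 'score': 0.9978345, 'word': 'Google', 'start': 19, 'end': 25},
    {'entity_group': 'LOC', 'score': 0.9739967, 'word': 'New york City', 'start': 30, 'end': 43},
]

# Group the matched text by entity type, slicing the original string
# with the character offsets reported by the pipeline.
by_type = {}
for entity in entities:
    span = text[entity["start"]:entity["end"]]
    by_type.setdefault(entity["entity_group"], []).append(span)

print(by_type)
```

Slicing the original string is handy when you need the exact text as the user wrote it, for example to highlight entities in a document.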

Text generation

You can pass the text-generation task to the pipeline() function to generate text based on an input text string. Here’s an example.

from transformers import pipeline

generator = pipeline("text-generation")
output = generator("If you look from the top of Eiffel Tower, you will see")
print(output)

Output:

[{'generated_text': 'If you look from the top of Eiffel Tower, you will see a whole section of buildings, with dozens of shops, the main train station and many more that you can buy for a decent price. There is also yet another part of town'}]

It generates text, but, as you can see, it may not make sense or be factually accurate. It’s still fun, though!

Text Completion

In the text completion task, a missing (or masked) value is predicted based on its context. For instance, in the input of the following script, we want to predict the capital of France. You can use the fill-mask task for text completion with the pipeline() function, as shown below. Here the top_k parameter tells the pipeline to return the two highest-scoring suggestions. The <mask> placeholder matches the mask token of the default model; other models may use a different token, such as [MASK].

from transformers import pipeline

text_detector = pipeline("fill-mask")
output = text_detector("Eiffel tower is located in <mask> the capital of France", top_k=2)
for items in output:
  print(items)

Output:

{'score': 0.6108735799789429, 'token': 2201, 'token_str': ' Paris', 'sequence': 'Eiffel tower is located in Paris the capital of France'}
{'score': 0.03842312470078468, 'token': 6497, 'token_str': ' Brussels', 'sequence': 'Eiffel tower is located in Brussels the capital of France'}

Text Translation

You can use the translation task for text translation with the pipeline() function, too.

As an example, the following script translates input text from Spanish to English using the Helsinki-NLP/opus-mt-es-en text translation model.

from transformers import pipeline

translator = pipeline("translation", model = "Helsinki-NLP/opus-mt-es-en")
output = translator("Me encanta comer pasteles y frutas.")
print(output)

Output:

[{'translation_text': 'I love eating cakes and fruits.'}]

Text Summarization

The summarization task from the pipeline() function can generate a summary of an input text, as shown in the following example.

Note: The input text is taken from the Eiffel Tower Wikipedia Page.

from transformers import pipeline

text_summarizer = pipeline("summarization")
output = text_summarizer(
    """
The design of the Eiffel Tower is attributed to Maurice Koechlin and Émile Nouguier, two senior engineers working for the Compagnie des Établissements Eiffel. It was envisioned after discussion about a suitable centerpiece for the proposed 1889 Exposition Universelle, a world's fair to celebrate the centennial of the French Revolution. Eiffel openly acknowledged that inspiration for a tower came from the Latting Observatory built in New York City in 1853. In May 1884, working at home, Koechlin made a sketch of their idea, described by him as "a great pylon, consisting of four lattice girders standing apart at the base and coming together at the top, joined together by metal trusses at regular intervals". Eiffel initially showed little enthusiasm, but he did approve further study, and the two engineers then asked Stephen Sauvestre, the head of the company's architectural department, to contribute to the design. Sauvestre added decorative arches to the base of the tower, a glass pavilion to the first level, and other embellishments.
First drawing of the Eiffel Tower by Maurice Koechlin including size comparison with other Parisian landmarks such as Notre Dame de Paris, the Statue of Liberty and the Vendôme Column
The new version gained Eiffel's support: he bought the rights to the patent on the design which Koechlin, Nougier, and Sauvestre had taken out, and the design was put on display at the Exhibition of Decorative Arts in the autumn of 1884 under the company name. On 30 March 1885, Eiffel presented his plans to the Société des Ingénieurs Civils; after discussing the technical problems and emphasising the practical uses of the tower, he finished his talk by saying the tower would symbolise
"""
)
print(output)

Output:

[{'summary_text': " The design of the Eiffel Tower is attributed to Maurice Koechlin and Émile Nouguier . It was envisioned after discussion about a suitable centerpiece for the 1889 Exposition Universelle, a world's fair to celebrate the centennial of the French Revolution . The two engineers then asked Stephen Sauvestre, the head of the company's architectural department, to contribute to the design ."}]

Zero Shot classification

Zero-shot classification refers to classifying text into categories that the model was never explicitly trained on.

With the zero-shot-classification task, the pipeline() function can classify text without any labeled training examples. You pass the input text and the candidate labels to the pipeline, which scores how well each label fits the input.

For example, the following script correctly classifies the input text into the “sports” category. Though the model was not trained to classify text into the “sports”, “politics” or “cinema” categories, it nevertheless infers the relation between the input text and the appropriate label.

from transformers import pipeline

zs_classifier = pipeline("zero-shot-classification")
output = zs_classifier(
    "England is the Cricket World Champion",
    candidate_labels=["sports", "politics", "cinema"],
)
print(output)

Output:

{'sequence': 'England is the Cricket World Champion', 'labels': ['sports', 'cinema', 'politics'], 'scores': [0.9976509213447571, 0.0015886325854808092, 0.0007604386773891747]}
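
The labels and scores lists in this output are aligned by position and sorted from best to worst match. A minimal sketch of reading the result, using the output printed above rather than calling the model again; the ranked and best_label names are hypothetical.

```python
# Result copied from the pipeline output above.
output = {
    'sequence': 'England is the Cricket World Champion',
    'labels': ['sports', 'cinema', 'politics'],
    'scores': [0.9976509213447571, 0.0015886325854808092, 0.0007604386773891747],
}

# Pair each candidate label with its score.
ranked = dict(zip(output["labels"], output["scores"]))

# Pick the highest-scoring label explicitly instead of relying on list order.
best_label = max(ranked, key=ranked.get)

print(best_label, round(ranked[best_label], 4))
```

In this output the three scores add up to roughly 1, so each one can be read as the share of confidence assigned to that candidate label.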

Question answering

The pipeline() function, using the question-answering task, allows you to answer questions based on some context. The answer is extracted from the context text rather than generated from scratch. For instance, the following script correctly predicts “Old Trafford” as the answer to the question “What is the name of the home ground of Manchester United?”

from transformers import pipeline

answer_model = pipeline("question-answering")
output = answer_model(
    question="What is the name of the home ground of Manchester United",
    context="Manchester United plays their home matches at Old Trafford",
)
print(output)

Output:

{'score': 0.9809378981590271, 'start': 46, 'end': 58, 'answer': 'Old Trafford'}
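
The start and end values are character offsets into the context string, which you can verify by slicing the context yourself. Here is a minimal sketch using the output printed above; the extracted variable name is hypothetical.

```python
context = "Manchester United plays their home matches at Old Trafford"

# Result copied from the pipeline output above.
output = {'score': 0.9809378981590271, 'start': 46, 'end': 58, 'answer': 'Old Trafford'}

# The answer is a span of the context, located by its character offsets.
extracted = context[output["start"]:output["end"]]

print(extracted)
print(extracted == output["answer"])
```

This is why the quality of the answer depends on the context you supply: if the answer is not in the context, the pipeline can only return some other span from it.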

Natural language processing is getting more sophisticated each day. Hopefully this tutorial helped shed some light on what the technology is capable of using open-source Hugging Face pipelines.

The Python Tutorials Blog