This tutorial explains how to do topic modeling with the BERT transformer using the BERTopic library in Python.
Topic modeling refers to the use of statistical techniques for extracting abstract topics from text. Several topic modeling approaches exist, including Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF). Transformer architectures like BERT, however, often outperform these classical baselines for topic modeling.
In this tutorial, we’re going to implement the BERT model using the BERTopic library to extract topics from tweets related to the 2020 US Presidential Election (yikes!).
Install the BERTopic Library
You can use the pip command to install the BERTopic library. If you do not want visualization functionalities, you can install the BERTopic library with the following command:
!pip install bertopic
If you want visualization functionalities to be installed with BERTopic, go ahead and run the following pip command:
!pip install bertopic[visualization]
Import the Dataset
We’re going to be using the US Election 2020 Tweets dataset from Kaggle in this tutorial. Download the dataset and unzip it. The dataset contains two CSV files: hashtag_donaldtrump.csv and hashtag_joebiden.csv. The former contains tweets that mention #DonaldTrump and #Trump, while the latter contains tweets that mention #JoeBiden and #Biden as keywords. These are large datasets so we’re not going to mirror them here, but you can create a free Kaggle account to download them yourself.
We’ll be working with the hashtag_donaldtrump.csv file in this tutorial. You can use the other file if you want.
The following script imports the dataset:
import pandas as pd
import numpy as np
tweets_trump = pd.read_csv("/content/us-election-2020-tweets/hashtag_donaldtrump.csv", engine='python')
Let’s print the dataset size and its first five rows:
print(tweets_trump.shape)
tweets_trump.head()
Output:
The output shows that the dataset contains more than 970K tweets and 21 columns in total. However, we’re only interested in the tweet column, which contains the text of each tweet.
For computational reasons, we’re only going to extract topics from the first 50K tweets in the dataset.
tweets_trump_short = tweets_trump[:50000]
Code More, Distract Less: Support Our Ad-Free Site
You might have noticed we removed ads from our site - we hope this enhances your learning experience. To help sustain this, please take a look at our Python Developer Kit and our comprehensive cheat sheets. Each purchase directly supports this site, ensuring we can continue to offer you quality, distraction-free tutorials.
Create Topic Model with BERTopic
The first step in creating a topic model with BERTopic is to import the BERTopic class from the bertopic module.
from bertopic import BERTopic
NOTE: At the time of writing this tutorial, a keyword error was encountered when importing BERTopic, related to the hdbscan library, which is installed alongside BERTopic. The cachedir argument was removed from joblib.Memory in joblib version 1.2.0, but this change hasn’t been reflected in the hdbscan release on PyPI (though it has been fixed on GitHub). Until an updated release lands on PyPI, the workaround is to downgrade joblib to version 1.1.0 using the code below:
!pip install joblib==1.1.0
The BERTopic class expects text data as a list of strings, one string per document. Therefore, we convert the tweet column of our input dataset to a list of strings.
tweet_list = tweets_trump_short["tweet"].tolist()
Next, we’ll create an instance of the BERTopic class. Since the dataset we’re using contains tweets in multiple languages, let’s go ahead and use the multilingual BERT model. Finally, to train the model and extract topics, pass the list of documents (tweets, in our case) to the fit_transform() method of the BERTopic class.
topic_model = BERTopic(language="multilingual")
topics, probs = topic_model.fit_transform(tweet_list)
You can use the get_topic_info() method to print information about the extracted topics. This method returns a Pandas DataFrame with three columns:
- Topic: The id of the topic. Note that topic id -1 is reserved for the outlier topic, which contains tweets that could not be assigned to any other topic.
- Count: The number of tweets belonging to the topic.
- Name: The automatically generated topic label. Ideally, you should analyze the words within a given topic and replace these auto-generated labels with better semantic ones.
topic_model.get_topic_info().head(10)
Output:
The above output shows the 10 topics with the highest number of tweets in the dataset. To get your bearings from here, notice that topic #1 is about social media censorship and topic #5 is about California wildfires. Note that the Topic column is different from the bold Pandas index in the far-left column.
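Once you’ve inspected a topic’s words, you can build a more readable label yourself. Here’s a minimal sketch, assuming the (word, score) tuples have the same shape as those returned by BERTopic’s get_topic() method; the example tuples below are illustrative, not real model output:

```python
# Build a short, human-readable label from a topic's top words.
def make_label(word_scores, top_n=3):
    """Join the top_n highest-scoring words into a label."""
    ranked = sorted(word_scores, key=lambda ws: ws[1], reverse=True)
    return "_".join(word for word, _ in ranked[:top_n])

# Hypothetical (word, score) pairs shaped like get_topic() output.
topic_words = [("twitter", 0.0245), ("censorship", 0.0141), ("facebook", 0.0137)]
print(make_label(topic_words))  # twitter_censorship_facebook
```

In practice, you’d feed this the output of topic_model.get_topic(topic_id) and keep a dict mapping topic ids to your curated labels.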
You can also retrieve the most commonly occurring words within a topic using the get_topic() method. The script below shows them for topic id 1; from the output, you can clearly see that topic 1 is indeed related to social media censorship.
topic_model.get_topic(1)
Output:
[('twitter', 0.0245202247992172),
('censorship', 0.014119126944339979),
('facebook', 0.013714167221682292),
('account', 0.01071044953638521),
('twittercensorship', 0.009144698655008785),
('cuenta', 0.006393199223631192),
('jack', 0.00616044263073788),
('censoring', 0.005659671021833165),
('social', 0.005453005316660101),
('tweet', 0.005112612403601746)]
You can also get a sample of representative documents for a topic. For instance, the script below shows three tweets that belong to topic 1.
topic_model.get_representative_docs(1)
Output:
['Way to go #Twitter!\n#Twitter locks official #Trump campaign account \nOver sharing #HunterBiden video https://t.co/bJYwYnr5QJ #FoxNews \n\n#JoeBiden\n#TwitterCensorship\n#TedCruz #LindseyGraham\n#FoxNews #RudyGiuliani👎👎👎\n#Congress #Republican #Senate \n#CNN #FoxNews #MSNBC \n#Facebook',
'#Trump/Pence2020 #4MoreYears #voteinperson #NoMailInVoting #VoterFraud #VoterID #Backlaw #protecttheConstitution \n#FREEDOMOFSPEACH spoken and written \nTime to stop Censorship. https://t.co/UFIu3YQBuQ',
'@SouthwestAir More censoring of #Trump #BlackVoicesforTrump #Trump2020 #DoubleStandard #Apologize https://t.co/dzIQD7Qn7d']
Finally, the BERTopic library lets you find the topics most relevant to a specific term. For instance, the script below returns the five topics most similar to the term censor, along with their similarity scores.
topic_model.find_topics("censor")
Output:
([1, 303, 370, 43, 573],
[0.6236835438684443,
0.5472915885730678,
0.5323406451203374,
0.5217645602427222,
0.5126032860291148])
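The two returned lists line up element-wise, so you can zip them together and, say, keep only topics that clear a similarity cutoff. A small sketch using the ids and (rounded) scores from the output above; the 0.55 threshold is an arbitrary choice for illustration:

```python
# Topic ids and similarity scores, as returned by find_topics("censor") above.
topic_ids = [1, 303, 370, 43, 573]
scores = [0.6237, 0.5473, 0.5323, 0.5218, 0.5126]

# Keep only topics whose similarity clears an (arbitrary) threshold.
threshold = 0.55
relevant = [(t, s) for t, s in zip(topic_ids, scores) if s >= threshold]
print(relevant)  # [(1, 0.6237)]
```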
Visualize Topics
The BERTopic library provides several nice visualization functions. For instance, you can use the visualize_barchart() function to plot a bar chart of the most frequently occurring words within a topic. Here’s an example:
topic_model.visualize_barchart()
Output:
You can also plot a scatter plot that shows the size of and distance between the various topics using the visualize_topics() function.
topic_model.visualize_topics()
Output:
A topic could be a sub-topic of another topic. You can use the visualize_hierarchy() function to visualize this type of topic hierarchy, as shown in the following script:
topic_model.visualize_hierarchy()
Output:
Finally, you can plot a heatmap that shows similarities between multiple topics using the visualize_heatmap() function.
You can hover over the heatmap to see similarity scores between different topics. For example, in the output below you can see a similarity score of 0.713 between topic 1 and topic 69.
topic_model.visualize_heatmap()
Output:
Predicting Topics for Unseen Documents
You can assign previously identified topics to a new document using the transform() method of the BERTopic class.
For instance, the following script classifies the sample input string as topic 1 (social media censorship), which is the appropriate classification.
text = ["The government plans to put a ban on Twitter in some areas"]
topics, probabilities = topic_model.transform(text)
print(topics)
Output:
[1]
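To turn predicted topic ids into readable names, you can look them up in a mapping built from get_topic_info(). A minimal sketch with a hypothetical, hand-written mapping (in practice you’d build it from the trained model, e.g. with dict(zip(info["Topic"], info["Name"]))):

```python
# Hypothetical id-to-name mapping; the names here are made-up examples
# shaped like BERTopic's auto-generated topic labels.
topic_names = {-1: "-1_outliers", 1: "1_twitter_censorship_facebook"}

predicted = [1]  # the output of topic_model.transform(text) above
labels = [topic_names.get(t, "unknown") for t in predicted]
print(labels)  # ['1_twitter_censorship_facebook']
```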
Saving and Loading a BERTopic Topic Model
Once you train your topic model using BERTopic, you can save it for later use with the help of the save() method, as shown below:
topic_model.save("tweets_topic_model")
Similarly, you can load an already saved model using the load() function, as demonstrated in the following script:
loaded_topic_model = BERTopic.load("tweets_topic_model")
The BERTopic library is really great for grouping or classifying documents, or in this case tweets. If you found this tutorial helpful and you want more tips like this, enter your email address in the form below!