This tutorial explains how to use Python’s Wikipedia Library to extract information from Wikipedia articles. We’ll show you how to extract titles, links, content, summaries, images, references, and more from a Wikipedia page. The Python Wikipedia Library is a fantastic wrapper for the MediaWiki API so it’s really easy to grab the content and data you need from Wikipedia pages.
Installing the Python Wikipedia Library
Before you can pull data from Wikipedia pages with the Python Wikipedia library, you need to install it. Use pip by entering the following command in your terminal:
pip install wikipedia
Getting Page Suggestions
The first task we’ll demonstrate in this tutorial is how to return page suggestions based on a search query. To get suggestions relevant to a particular query, pass it to the search() function of the Python Wikipedia module. Here’s how you get all the page suggestions for the word “Ronaldo”:
import wikipedia
print(wikipedia.search("Ronaldo"))
Output:
['Cristiano Ronaldo', 'Ronaldo (Brazilian footballer)', 'Ronaldo', 'Ronaldo–Messi rivalry', 'Ronaldo Souza', 'Ronaldinho', 'List of career achievements by Cristiano Ronaldo', 'Ronaldo Oliveira', 'Madeira Airport', 'Ronaldo Bôscoli']
The output is a list of the page titles relevant to the search query “Ronaldo”. You can limit the number of suggestions by passing a number to the results parameter of the search() function, as shown below:
print(wikipedia.search("Ronaldo", results = 3))
Output:
['Cristiano Ronaldo', 'Ronaldo (Brazilian footballer)', 'Ronaldo']
The output shows only the first three suggestions because we set the results parameter of the search() function to 3.
If you don’t know the exact spelling of a word, you can still get a suggestion using the suggest() function. In the following script, we pass a misspelling of “Cristiano Ronaldo” to suggest(). Despite our typos, the function returns the correctly spelled query, in lowercase. If Wikipedia has no suggestion to offer, suggest() returns None.
print(wikipedia.suggest("Cristian Ronadlo"))
Output:
cristiano ronaldo
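Since suggest() can return None, it may help to wrap it in a small fallback before using the result. The following is only a sketch: best_query is a hypothetical helper name that is not part of the library, and it receives the suggest function as an argument so it can be tried without network access.

```python
def best_query(query, suggest):
    """Return the suggested spelling, or the original query if there is none.

    `suggest` is any callable with the same shape as wikipedia.suggest:
    it takes a query string and returns a string or None.
    """
    suggestion = suggest(query)
    return suggestion if suggestion else query


# Usage with the library (requires network access):
# import wikipedia
# print(best_query("Cristian Ronadlo", wikipedia.suggest))
```

Passing the function in, rather than calling wikipedia.suggest() directly, is just a design choice that keeps the helper easy to test.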
Getting Article Summary
You can get the summary of an article by passing the article name to the summary() function of the Wikipedia library. The following script returns the summary of the Wikipedia article on Cristiano Ronaldo.
print(wikipedia.summary("cristiano ronaldo"))
Output:
Cristiano Ronaldo dos Santos Aveiro GOIH ComM (European Portuguese: [kɾiʃˈtjɐnu ʁɔˈnaɫdu]; born 5 February 1985) is a Portuguese professional footballer who plays as a forward for Serie A club Juventus and captains the Portugal national team. Often considered the best player in the world and widely regarded as one of the greatest players of all time, Ronaldo has won five Ballons d'Or and four European Golden Shoes, both of which are records for a European player. He has won 29 trophies in his career, including six league titles, five UEFA Champions Leagues, one UEFA European Championship, and one UEFA Nations League. A prolific goalscorer, Ronaldo holds the records for the most goals scored in the UEFA Champions League (128) and the joint-most goals scored in the UEFA European Championship (9). He is one of the few recorded players to have made over 1,000 professional career appearances and has scored over 700 senior career goals for club and country.Born and raised in Madeira, Ronaldo began his senior club career playing for Sporting CP, before signing with Manchester United in 2003, aged 18. After winning the FA Cup in his first season, he helped United win three successive Premier League titles, the UEFA Champions League, and the FIFA Club World Cup; at age 23, he won his first Ballon d'Or ...
Note: The summary reported above has been truncated.
You can limit the number of sentences in the summary by passing an integer value to the sentences parameter of the summary() function. For instance, the following script returns only the first three sentences of the summary of the Wikipedia article on Cristiano Ronaldo.
print(wikipedia.summary("cristiano ronaldo", sentences = 3))
Output:
Cristiano Ronaldo dos Santos Aveiro GOIH ComM (European Portuguese: [kɾiʃˈtjɐnu ʁɔˈnaɫdu]; born 5 February 1985) is a Portuguese professional footballer who plays as a forward for Serie A club Juventus and captains the Portugal national team. Often considered the best player in the world and widely regarded as one of the greatest players of all time, Ronaldo has won five Ballons d'Or and four European Golden Shoes, both of which are records for a European player. He has won 29 trophies in his career, including six league titles, five UEFA Champions Leagues, one UEFA European Championship, and one UEFA Nations League.
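A title such as “Ronaldo” can match several pages, and a title with no matching page can’t be summarized at all. The library raises DisambiguationError and PageError in those cases. Here is a minimal sketch of how you might handle both; safe_summary is a hypothetical helper name, and the wikipedia module is passed in as the wiki argument purely so the function can be exercised without a network connection.

```python
def safe_summary(wiki, title, sentences=3):
    """Return a short summary, or None if the title is ambiguous or missing.

    `wiki` is expected to be the imported `wikipedia` module.
    """
    try:
        return wiki.summary(title, sentences=sentences)
    except wiki.DisambiguationError as err:
        # err.options holds the candidate page titles for an ambiguous query
        print("Ambiguous title. Some options:", err.options[:5])
    except wiki.PageError:
        print("No page found for:", title)
    return None


# Usage with the library (requires network access):
# import wikipedia
# print(safe_summary(wikipedia, "cristiano ronaldo", sentences=2))
```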
Searching Page Elements
The Python Wikipedia library provides several ways to retrieve elements of a Wikipedia page. To do so, you first need to create a page object, which isn’t as complicated as it sounds. Pass the page name to the page() function, which returns a WikipediaPage object, as shown below.
ronaldo_page = wikipedia.page("cristiano ronaldo")
The above script creates a page object named ronaldo_page. You can give the object any other name, of course. Using this object, you can read a variety of page attributes.
Retrieving Page Title
Let’s first retrieve the title of a Wikipedia page. You can use the title attribute of the page object for this purpose, as shown below:
print(ronaldo_page.title)
Output:
Cristiano Ronaldo
Retrieving Page Content
To retrieve the text content of a Wikipedia page, you can use the content attribute of the page object. Have a look at the following script:
print(ronaldo_page.content)
Output:
Cristiano Ronaldo dos Santos Aveiro GOIH ComM (European Portuguese: [kɾiʃˈtjɐnu ʁɔˈnaɫdu]; born 5 February 1985) is a Portuguese professional footballer who plays as a forward for Serie A club Juventus and captains the Portugal national team. Often considered the best player in the world and widely regarded as one of the greatest players of all time, Ronaldo has won five Ballons d'Or and four European Golden Shoes, both of which are records for a European player. He has won 29 trophies in his career, including six league titles, five UEFA Champions Leagues, one UEFA European Championship, and one UEFA Nations League. A prolific goalscorer, Ronaldo holds the records for the most goals scored in the UEFA Champions League (128) and the joint-most goals scored in the UEFA European Championship (9). He is one of the few recorded players to have made over 1,000 professional career appearances and has scored over 700 senior career goals for club and country.Born and raised in Madeira, Ronaldo began his senior club career playing for Sporting CP ...
Note: The page content reported above has been truncated.
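The content attribute gives you one long string, so you may want to break it into sections. The sketch below assumes that section headings appear in the text on their own lines in the form “== Early life ==”, which is an assumption about the returned format rather than something shown in the output above. The function name split_sections and the “Introduction” label for the lead text are hypothetical.

```python
import re

# A heading line looks like "== Title ==" or "=== Subtitle ===" (assumed format)
HEADING = re.compile(r"^(={2,})\s*(.+?)\s*\1$")


def split_sections(content):
    """Split plain page text into a dict of {section title: section text}."""
    sections = {}
    title = "Introduction"  # label for text before the first heading
    lines = []
    for line in content.splitlines():
        match = HEADING.match(line.strip())
        if match:
            # Store the section collected so far, then start a new one
            sections[title] = "\n".join(lines).strip()
            title, lines = match.group(2), []
        else:
            lines.append(line)
    sections[title] = "\n".join(lines).strip()
    return sections


# Usage with the page object created earlier:
# sections = split_sections(ronaldo_page.content)
# print(list(sections))
```

Note that if two headings share the same title, the later section overwrites the earlier one in this simple version.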
Retrieving HTML Content of a Page
The content attribute returns the text content, but you can also return the raw HTML content. To get the raw HTML content of a Wikipedia page, use the html() method of the page object.
print(ronaldo_page.html())
Output:
<div class="mw-parser-output"><div role="note" class="hatnote navigation-not-searchable">Not to be confused with <a href="/wiki/Ronaldo_(Brazilian_footballer)" title="Ronaldo (Brazilian footballer)">Ronaldo (Brazilian footballer)</a>.</div>
<div role="note" class="hatnote navigation-not-searchable">"Cristiano" redirects here. For other people named Cristiano, see <a href="/wiki/Cristiano_(given_name)" title="Cristiano (given name)">Cristiano (given name)</a> and <a href="/wiki/Cristiano_(surname)" title="Cristiano (surname)">Cristiano (surname)</a>.</div>
<p class="mw-empty-elt">
</p>
<div class="shortdescription nomobile noexcerpt noprint searchaux" style="display:none">Portuguese footballer</div>
<p class="mw-empty-elt"> ....
Note: The complete HTML content is not displayed in the output owing to space constraints.
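Once you have the raw HTML, you can process it with Python’s standard library. As a rough sketch, the following collects the href values of internal links, meaning those that start with /wiki/ as in the output above. The class name LinkCollector and the function name internal_links are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that point to /wiki/ paths."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href and href.startswith("/wiki/"):
                self.hrefs.append(href)


def internal_links(html):
    """Return the internal link paths found in an HTML string, in order."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.hrefs


# Usage with the page object created earlier:
# print(internal_links(ronaldo_page.html())[:10])
```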
Retrieving URL of a Page
The url attribute of the page object returns the full URL of a Wikipedia page:
print(ronaldo_page.url)
Output:
https://en.wikipedia.org/wiki/Cristiano_Ronaldo
Retrieving References of a Page
A Wikipedia page may contain several references. To retrieve all the references, you can use the references attribute of the page object, as shown in the following example.
print(ronaldo_page.references)
Output:
['http://www.espn.com.au/football/blog/marcottis-musings/62/post/3606914/cristiano-ronaldo-conundrum-how-should-max-allegri-use-juventus-star', 'http://theworldgame.sbs.com.au/article/2016/07/23/madeira-airport-renamed-after-cristiano-ronaldo', 'http://www.chapters.indigo.ca/books/Moments-Cristiano-Ronaldo/9780230706699-item.html', 'http://cantic.bnc.cat/registres/CUCId/a11371201', 'http://english.peopledaily.com.cn/200506/13/eng20050613_189948.html' ...
Notice the output is a Python list of reference URLs.
Note: Not all references have been displayed.
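Because the references come back as a plain list of URLs, you can analyze them with the standard library. For example, this sketch counts how many references point to each domain; count_domains is a hypothetical helper name.

```python
from collections import Counter
from urllib.parse import urlparse


def count_domains(urls):
    """Count how many URLs in the list belong to each domain."""
    return Counter(urlparse(url).netloc for url in urls)


# Usage with the page object created earlier:
# for domain, count in count_domains(ronaldo_page.references).most_common(5):
#     print(domain, count)
```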
Retrieving Titles of Linked Pages
This is a very powerful feature. A Wikipedia page normally contains links to several other Wikipedia pages. You can retrieve the titles of all the pages that a particular Wikipedia page links to using the links attribute. This returns a list of the titles of the internal Wikipedia links found on the page you’re interested in.
print(ronaldo_page.links)
Output:
["1956 Ballon d'Or", "1957 Ballon d'Or", "1958 Ballon d'Or", "1959 Ballon d'Or", "1960 Ballon d'Or", "1960 European Nations' Cup", "1961 Ballon d'Or", "1962 Ballon d'Or", "1963 Ballon d'Or", "1964 Ballon d'Or", "1964 European Nations' Cup", "1965 Ballon d'Or", "1966 Ballon d'Or", "1967 Ballon d'Or" ...
Note: Not all page links have been displayed.
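The list of link titles can be long, so you might want to narrow it down. Here is one simple way to keep only the titles that contain a keyword, ignoring case; titles_containing is a hypothetical helper name.

```python
def titles_containing(titles, keyword):
    """Return the titles that contain the keyword, ignoring case."""
    needle = keyword.casefold()
    return [title for title in titles if needle in title.casefold()]


# Usage with the page object created earlier:
# print(titles_containing(ronaldo_page.links, "ballon d'or"))
```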
Retrieving Images from a Wikipedia Page
A Wikipedia page may contain one or more images. You can get a list of all image URLs on a Wikipedia page using the images attribute, or you can grab a single image URL by indexing into that list. The following Python script returns the link for the first image on the Wikipedia page for “Cristiano Ronaldo.”
print(ronaldo_page.images[0])
Output:
https://upload.wikimedia.org/wikipedia/commons/e/e1/1_cristiano_ronaldo_2016.jpg
If you copy and paste the above link in a browser, you should see the image.
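The images list can mix photos with other files such as icons. If you only want certain file types, you could filter the URLs by extension. This is only a sketch; photo_urls is a hypothetical helper name and the default extensions are an arbitrary choice.

```python
import os
from urllib.parse import urlparse


def photo_urls(image_urls, extensions=(".jpg", ".jpeg", ".png")):
    """Keep only the URLs whose file extension is in `extensions`."""
    keep = []
    for url in image_urls:
        # Look at the path part only, so query strings don't interfere
        ext = os.path.splitext(urlparse(url).path)[1].lower()
        if ext in extensions:
            keep.append(url)
    return keep


# Usage with the page object created earlier:
# print(photo_urls(ronaldo_page.images))
```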
Searching for Pages via Geographical Coordinates
With the Python Wikipedia library, you can even search pages via geographical coordinates. The geographical coordinates of Paris are 48.856 and 2.35. Let’s pass these coordinates to the geosearch() function of the wikipedia module and see which pages we get in the output:
print(wikipedia.geosearch(48.856, 2.35))
Output:
['2018 attack on the Iran Embassy in France', '13 Vendémiaire', "Place de l'Hôtel-de-Ville – Esplanade de la Libération", 'Saint-Denis de La Chartre', 'Siege of Paris (1590)', 'Siege of Paris (845)', 'Battle of Paris (1814)', 'Siege of Paris (1870–71)', 'Siege of Paris (1429)', 'Timeline of Paris']
The output lists pages associated with locations near these coordinates in the city of Paris.
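The geosearch() function also accepts results and radius keyword arguments, where radius is the search radius in meters. The sketch below validates the coordinates before searching; nearby_pages is a hypothetical helper name, and the geosearch function is passed in as an argument so the helper can be tried without network access.

```python
def nearby_pages(geosearch, latitude, longitude, radius=500, limit=5):
    """Return up to `limit` page titles within `radius` meters of a point.

    `geosearch` is expected to behave like wikipedia.geosearch.
    """
    if not -90 <= latitude <= 90:
        raise ValueError("latitude must be between -90 and 90")
    if not -180 <= longitude <= 180:
        raise ValueError("longitude must be between -180 and 180")
    return geosearch(latitude, longitude, results=limit, radius=radius)


# Usage with the library (requires network access):
# import wikipedia
# print(nearby_pages(wikipedia.geosearch, 48.856, 2.35))
```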
Changing Page Languages
The Python Wikipedia library also lets you retrieve the contents of a Wikipedia page in other languages. To see all the supported languages, you can use the following call, which returns a dictionary mapping each language code to its name.
print(wikipedia.languages())
To change the language settings, you need to pass a language code, such as “fr” for French, to the set_lang() function of the Wikipedia library. The following script returns the first three sentences of the article summary on “Cristiano Ronaldo” in French.
wikipedia.set_lang("fr")
print(wikipedia.summary("cristiano ronaldo", sentences = 3))
Output:
Cristiano Ronaldo dos Santos Aveiro, couramment appelé Ronaldo ou Cristiano Ronaldo et surnommé CR7, né le 5 février 1985 à Funchal sur l'île de Madère, est un footballeur international portugais qui évolue au poste d'attaquant à la Juventus de Turin. Considéré comme l'un des meilleurs joueurs de l'histoire de son sport, il est le seul footballeur avec Lionel Messi à avoir remporté le Ballon d'or au moins cinq fois : en 2008, 2013, 2014, 2016 et en 2017. Auteur de plus de 720 buts en carrière, il est le meilleur buteur de la Ligue des champions de l'UEFA, des coupes d'Europe, du Real Madrid, du derby madrilène, de la Coupe du monde des clubs de la FIFA et de la sélection portugaise, dont il est le capitaine depuis 2007.
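Keep in mind that set_lang() changes the language for every later call, not just the next one. If you want summaries in several languages, one approach is to switch languages in a loop and then switch back at the end. This is a sketch under a few assumptions: summaries_by_language is a hypothetical helper name, “en” is assumed to be the language you want to return to, and the same title is assumed to resolve in every language, which will not always hold.

```python
def summaries_by_language(wiki, title, languages, sentences=1, default="en"):
    """Return {language code: summary} for each requested language.

    `wiki` is expected to be the imported `wikipedia` module.
    """
    results = {}
    try:
        for lang in languages:
            wiki.set_lang(lang)
            results[lang] = wiki.summary(title, sentences=sentences)
    finally:
        # Switch back so later calls are not affected by the last language used
        wiki.set_lang(default)
    return results


# Usage with the library (requires network access):
# import wikipedia
# print(summaries_by_language(wikipedia, "cristiano ronaldo", ["fr", "de"]))
```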