This tutorial explains how to use Python’s Wikipedia library to extract information from Wikipedia articles. We’ll show you how to extract titles, links, content, summaries, images, references, and more from a Wikipedia page. The library is a wrapper for the MediaWiki API, so it’s easy to grab the content and data you need from Wikipedia pages.

Installing the Python Wikipedia Library

Before you can scrape Wikipedia pages with the Python Wikipedia library, you need to install it. Use the pip installer by entering the following command in your terminal:

pip install wikipedia

Getting Page Suggestions

The first task we’ll demonstrate in this tutorial is how to return page suggestions for a search query. To get suggestions relevant to a particular query, pass it to the search() method of the Python Wikipedia module. Here’s how you get page suggestions for the word “Ronaldo”:

import wikipedia
print(wikipedia.search("Ronaldo"))

Output:

['Cristiano Ronaldo', 'Ronaldo (Brazilian footballer)', 'Ronaldo', 'Ronaldo–Messi rivalry', 'Ronaldo Souza', 'Ronaldinho', 'List of career achievements by Cristiano Ronaldo', 'Ronaldo Oliveira', 'Madeira Airport', 'Ronaldo Bôscoli']

The output is a list of page titles relevant to the search query “Ronaldo”. By default, search() returns up to 10 results. You can change the number of suggestions by passing a number to the results parameter of the search() method, as shown below:

print(wikipedia.search("Ronaldo", results = 3))

Output:

['Cristiano Ronaldo', 'Ronaldo (Brazilian footballer)', 'Ronaldo']

The output shows only the first three suggestions since we specified 3 as the value for the results parameter of the search() method.

If you don’t know the exact spelling of a word, you can still get a suggestion using the suggest() method. In the following script, we pass a misspelling of “Cristiano Ronaldo” to the suggest() method. Despite our typos, the output returns the correctly spelled name.

print(wikipedia.suggest("Cristian Ronadlo"))

Output:

cristiano ronaldo
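
The suggest() method returns None when it has no suggestion to offer, so it can help to fall back on search() in that case. The helper below is a minimal sketch of that idea, not part of the library: best_title is a hypothetical name, and the two callables are passed in as arguments so the logic can be tried without a network connection.

```python
def best_title(query, suggest, search):
    """Return the most likely page title for query, or None.

    suggest and search are expected to behave like wikipedia.suggest
    and wikipedia.search; best_title itself is a hypothetical helper.
    """
    suggestion = suggest(query)
    if suggestion:  # suggest() returns None when it has nothing to offer
        return suggestion
    hits = search(query, results=1)
    return hits[0] if hits else None


# Typical use with the real library (needs network access):
#   import wikipedia
#   print(best_title("Cristian Ronadlo", wikipedia.suggest, wikipedia.search))
```

Passing the library functions in as arguments keeps the fallback logic separate from the network calls, which makes it easy to check with stand-in functions.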

Getting Article Summary

You can get the summary of an article by passing the article name to the summary() method of the Wikipedia library. The following script returns the summary of the Wikipedia article on Cristiano Ronaldo.

print(wikipedia.summary("cristiano ronaldo"))

Output:

Cristiano Ronaldo dos Santos Aveiro GOIH ComM (European Portuguese: [kɾiʃˈtjɐnu ʁɔˈnaɫdu]; born 5 February 1985) is a Portuguese professional footballer who plays as a forward for Serie A club Juventus and captains the Portugal national team. Often considered the best player in the world and  widely regarded as one of the greatest players of all time, Ronaldo has won five Ballons d'Or and four European Golden Shoes, both of which are records for a European player. He has won 29 trophies in his career, including six league titles, five UEFA Champions Leagues, one UEFA European Championship, and one UEFA Nations League. A prolific goalscorer, Ronaldo holds the records for the most goals scored in the UEFA Champions League (128) and the joint-most goals scored in the UEFA European Championship (9). He is one of the few recorded players to have made over 1,000 professional career appearances and has scored over 700 senior career goals for club and country.Born and raised in Madeira, Ronaldo began his senior club career playing for Sporting CP, before signing with Manchester United in 2003, aged 18. After winning the FA Cup in his first season, he helped United win three successive Premier League titles, the UEFA Champions League, and the FIFA Club World Cup; at age 23, he won his first Ballon d'Or ...

Note: The summary reported above has been truncated.

You can limit the number of sentences in the summary by passing an integer value to the sentences parameter of the summary() method. For instance, the following script returns only the first three sentences of the summary of the Wikipedia article on Cristiano Ronaldo.

print(wikipedia.summary("cristiano ronaldo", sentences = 3))

Output:

Cristiano Ronaldo dos Santos Aveiro GOIH ComM (European Portuguese: [kɾiʃˈtjɐnu ʁɔˈnaɫdu]; born 5 February 1985) is a Portuguese professional footballer who plays as a forward for Serie A club Juventus and captains the Portugal national team. Often considered the best player in the world and  widely regarded as one of the greatest players of all time, Ronaldo has won five Ballons d'Or and four European Golden Shoes, both of which are records for a European player. He has won 29 trophies in his career, including six league titles, five UEFA Champions Leagues, one UEFA European Championship, and one UEFA Nations League.
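
The summary() method raises an exception when a title is ambiguous (DisambiguationError) or when no matching page exists (PageError). The following sketch shows one way to handle both cases. The helper name summary_or_options and its wiki parameter are our own assumptions, added so the function can be exercised with a stand-in object; by default it uses the real library.

```python
def summary_or_options(title, sentences=3, wiki=None):
    """Return (summary, options) for title.

    summary is None when the title is ambiguous or missing; options then
    lists the candidate titles (empty if the page does not exist).
    This is a hypothetical helper, not a function of the library.
    """
    if wiki is None:
        import wikipedia as wiki  # the real library; needs network access

    try:
        return wiki.summary(title, sentences=sentences), []
    except wiki.exceptions.DisambiguationError as error:
        # error.options holds the titles the ambiguous name could refer to
        return None, list(error.options)
    except wiki.exceptions.PageError:
        return None, []


# Typical use (needs network access):
#   text, options = summary_or_options("Mercury")
#   print(text if text else options)
```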

Searching Page Elements

The Python Wikipedia library provides several attributes and methods for retrieving elements of a Wikipedia page. To use them, you first need to create a page object, which isn’t as complicated as it sounds. You can create one by passing the page name to the page() function, as shown below.

ronaldo_page = wikipedia.page("cristiano ronaldo")

The above script creates a page object named ronaldo_page. You can give the object any other name, of course. Using this object, you can retrieve a variety of information about the page.

Retrieving Page Title

Let’s first retrieve the title of a Wikipedia page. You can use the title attribute of the page object for this purpose, as shown below:

print(ronaldo_page.title)

Output:

Cristiano Ronaldo

Retrieving Page Content

To retrieve the text content of a Wikipedia page, you can use the content attribute of the page object. Have a look at the following script:

print(ronaldo_page.content)

Output:

Cristiano Ronaldo dos Santos Aveiro GOIH ComM (European Portuguese: [kɾiʃˈtjɐnu ʁɔˈnaɫdu]; born 5 February 1985) is a Portuguese professional footballer who plays as a forward for Serie A club Juventus and captains the Portugal national team. Often considered the best player in the world and  widely regarded as one of the greatest players of all time, Ronaldo has won five Ballons d'Or and four European Golden Shoes, both of which are records for a European player. He has won 29 trophies in his career, including six league titles, five UEFA Champions Leagues, one UEFA European Championship, and one UEFA Nations League. A prolific goalscorer, Ronaldo holds the records for the most goals scored in the UEFA Champions League (128) and the joint-most goals scored in the UEFA European Championship (9). He is one of the few recorded players to have made over 1,000 professional career appearances and has scored over 700 senior career goals for club and country.Born and raised in Madeira, Ronaldo began his senior club career playing for Sporting CP ...

Note: The page content reported above has been truncated.

Retrieving HTML Content of a Page

The content attribute returns the text content, but you can also get the raw HTML. To do so, use the html() method of the page object.

print(ronaldo_page.html())

Output:

<div class="mw-parser-output"><div role="note" class="hatnote navigation-not-searchable">Not to be confused with <a href="/wiki/Ronaldo_(Brazilian_footballer)" title="Ronaldo (Brazilian footballer)">Ronaldo (Brazilian footballer)</a>.</div>
<div role="note" class="hatnote navigation-not-searchable">"Cristiano" redirects here. For other people named Cristiano, see <a href="/wiki/Cristiano_(given_name)" title="Cristiano (given name)">Cristiano (given name)</a> and <a href="/wiki/Cristiano_(surname)" title="Cristiano (surname)">Cristiano (surname)</a>.</div>
<p class="mw-empty-elt">
</p>
<div class="shortdescription nomobile noexcerpt noprint searchaux" style="display:none">Portuguese footballer</div>
<p class="mw-empty-elt"> ....

Note: The complete HTML content is not displayed in the output owing to space constraints.
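
Once you have the HTML as a string, you can process it with any HTML parser. As a small illustration, the sketch below uses Python’s built-in html.parser module to collect the href value of every link. The names LinkCollector and extract_hrefs are our own, and the sample string is a shortened piece of the output shown above; in practice you would pass ronaldo_page.html() instead.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_hrefs(html_text):
    """Return all link targets found in html_text, in document order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.hrefs


sample = (
    '<div role="note">Not to be confused with '
    '<a href="/wiki/Ronaldo_(Brazilian_footballer)" '
    'title="Ronaldo (Brazilian footballer)">Ronaldo (Brazilian footballer)</a>.</div>'
)
print(extract_hrefs(sample))
# prints ['/wiki/Ronaldo_(Brazilian_footballer)']
```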

Retrieving URL of a Page

The url attribute of the page object returns the full URL of a Wikipedia page:

print(ronaldo_page.url)

Output:

https://en.wikipedia.org/wiki/Cristiano_Ronaldo

Retrieving References of a Page

A Wikipedia page may contain several references. To retrieve all the references, you can use the references attribute of the page object as shown in the following example.

print(ronaldo_page.references)

Output:

['http://www.espn.com.au/football/blog/marcottis-musings/62/post/3606914/cristiano-ronaldo-conundrum-how-should-max-allegri-use-juventus-star', 'http://theworldgame.sbs.com.au/article/2016/07/23/madeira-airport-renamed-after-cristiano-ronaldo', 'http://www.chapters.indigo.ca/books/Moments-Cristiano-Ronaldo/9780230706699-item.html', 'http://cantic.bnc.cat/registres/CUCId/a11371201', 'http://english.peopledaily.com.cn/200506/13/eng20050613_189948.html' ...

Notice the output is a Python list of reference URLs.

Note: Not all references have been displayed.
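
Because the references come back as a plain list of URLs, you can analyze them with the standard library. The sketch below counts how many references point to each host name. The function name count_domains is our own, and the sample list reuses a few URLs from the output above; you would normally pass ronaldo_page.references.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_domains(urls):
    """Count how many reference URLs come from each host name."""
    hosts = (urlsplit(url).hostname for url in urls)
    # Entries without a recognizable host are skipped
    return Counter(host for host in hosts if host)


sample_references = [
    "http://www.espn.com.au/football/blog/marcottis-musings/62/post/3606914/cristiano-ronaldo-conundrum-how-should-max-allegri-use-juventus-star",
    "http://theworldgame.sbs.com.au/article/2016/07/23/madeira-airport-renamed-after-cristiano-ronaldo",
    "http://cantic.bnc.cat/registres/CUCId/a11371201",
]
for host, count in count_domains(sample_references).most_common():
    print(host, count)
```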

Retrieving Titles of Linked Pages

This is a very powerful feature. A Wikipedia page normally contains links to several other Wikipedia pages. You can retrieve the titles of all the pages that a particular Wikipedia page links to using the links attribute. This returns a list of titles of the internal Wikipedia links found on the page you’re interested in.

print(ronaldo_page.links)

Output:

["1956 Ballon d'Or", "1957 Ballon d'Or", "1958 Ballon d'Or", "1959 Ballon d'Or", "1960 Ballon d'Or", "1960 European Nations' Cup", "1961 Ballon d'Or", "1962 Ballon d'Or", "1963 Ballon d'Or", "1964 Ballon d'Or", "1964 European Nations' Cup", "1965 Ballon d'Or", "1966 Ballon d'Or", "1967 Ballon d'Or" ...

Note: Not all page links have been displayed.
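
The list of link titles can be long, so you will often want to filter it. Here is a minimal sketch that keeps only the titles containing a given phrase, ignoring case. The helper name titles_containing is hypothetical, and the sample list is taken from the output above; in practice you would pass ronaldo_page.links.

```python
def titles_containing(titles, phrase):
    """Return the titles that contain phrase, ignoring case."""
    needle = phrase.casefold()
    return [title for title in titles if needle in title.casefold()]


sample_links = ["1956 Ballon d'Or", "1960 European Nations' Cup", "1961 Ballon d'Or"]
print(titles_containing(sample_links, "european"))
# prints ["1960 European Nations' Cup"]
```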

Retrieving Images from a Wikipedia Page

A Wikipedia page may contain one or multiple images. You can get a list of the URLs of all images on a Wikipedia page using the images attribute, or you can grab a single image URL by indexing into that list. The following Python script returns the link for the first image on the Wikipedia page for “Cristiano Ronaldo.”

print(ronaldo_page.images[0])

Output:

https://upload.wikimedia.org/wikipedia/commons/e/e1/1_cristiano_ronaldo_2016.jpg

If you copy and paste the above link in a browser, you should see the image.
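
The images list may include more than photographs, for example icons or SVG graphics. If you only want certain file types, you can filter the URLs by extension. The sketch below is one way to do that with the standard library; the helper name images_with_extension is our own, and the second sample URL is a made-up placeholder used only for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def images_with_extension(image_urls, extensions=(".jpg", ".jpeg", ".png")):
    """Keep only the URLs whose path ends in one of the given extensions."""
    wanted = {ext.lower() for ext in extensions}
    return [
        url
        for url in image_urls
        if PurePosixPath(urlsplit(url).path).suffix.lower() in wanted
    ]


sample_images = [
    "https://upload.wikimedia.org/wikipedia/commons/e/e1/1_cristiano_ronaldo_2016.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/x/xx/Example_icon.svg",  # hypothetical URL
]
print(images_with_extension(sample_images))
```

With the real library you would call images_with_extension(ronaldo_page.images).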

Searching for Pages via Geographical Coordinates

With the Python Wikipedia library, you can even search pages via geographical coordinates. The geographical coordinates of Paris are approximately latitude 48.856 and longitude 2.35. Let’s pass these coordinates, latitude first, to the geosearch() method of the Wikipedia module and see which pages we get in the output:

print(wikipedia.geosearch(48.856, 2.35))

Output:

['2018 attack on the Iran Embassy in France', '13 Vendémiaire', "Place de l'Hôtel-de-Ville – Esplanade de la Libération", 'Saint-Denis de La Chartre', 'Siege of Paris (1590)', 'Siege of Paris (845)', 'Battle of Paris (1814)', 'Siege of Paris (1870–71)', 'Siege of Paris (1429)', 'Timeline of Paris']

The output lists pages associated with locations near these coordinates in the city of Paris.
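
The geosearch() method also accepts results and radius keyword arguments, where the radius is given in meters. Since it is easy to swap latitude and longitude by accident, the sketch below checks the coordinate ranges before making the call. The helper name nearby_pages is hypothetical, and the search function is passed in as an argument so the check can be tried without a network connection.

```python
def nearby_pages(latitude, longitude, geosearch, results=5, radius=1000):
    """Validate the coordinates, then delegate to a geosearch-like callable.

    geosearch is expected to behave like wikipedia.geosearch;
    nearby_pages itself is a hypothetical helper.
    """
    if not -90 <= latitude <= 90:
        raise ValueError("latitude must be between -90 and 90")
    if not -180 <= longitude <= 180:
        raise ValueError("longitude must be between -180 and 180")
    return geosearch(latitude, longitude, results=results, radius=radius)


# Typical use with the real library (needs network access):
#   import wikipedia
#   print(nearby_pages(48.856, 2.35, wikipedia.geosearch, results=3, radius=500))
```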

Changing the Page Language

The Python Wikipedia library also lets you gather the page contents of a Wikipedia page in other languages. To see all the supported languages, you can use the following query. It returns a dictionary that maps each language code to the name of the language.

print(wikipedia.languages())

To change the language settings, you need to pass a language code, usually two letters, to the set_lang() method of the Wikipedia library. The following script returns the first 3 sentences of an article summary on “Cristiano Ronaldo” in the French language.

wikipedia.set_lang("fr")  
print(wikipedia.summary("cristiano ronaldo", sentences = 3))

Output:

Cristiano Ronaldo dos Santos Aveiro, couramment appelé Ronaldo ou Cristiano Ronaldo et surnommé CR7, né le 5 février 1985 à Funchal sur l'île de Madère, est un footballeur international portugais qui évolue au poste d'attaquant à la Juventus de Turin.
Considéré comme l'un des meilleurs joueurs de l'histoire de son sport, il est le seul footballeur avec Lionel Messi à avoir remporté le Ballon d'or au moins cinq fois : en 2008, 2013, 2014, 2016 et  en 2017. Auteur de plus de 720 buts en carrière, il est le meilleur buteur de la Ligue des champions de l'UEFA, des coupes d'Europe, du Real Madrid, du derby madrilène, de la Coupe du monde des clubs de la FIFA et de la sélection portugaise, dont il est le capitaine depuis 2007.
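
Keep in mind that set_lang() changes a module-wide setting, so every later call uses the new language until you change it again. The sketch below collects a summary in several languages and then switches back. The helper name summaries_by_language and its wiki and restore parameters are our own assumptions; it also assumes the article has the same title in each language, which is not always the case.

```python
def summaries_by_language(title, codes, sentences=1, wiki=None, restore="en"):
    """Return {language code: summary} for title, then restore the language.

    Hypothetical helper; assumes the article title is the same in every
    requested language and that the session started in `restore`.
    """
    if wiki is None:
        import wikipedia as wiki  # the real library; needs network access

    summaries = {}
    try:
        for code in codes:
            wiki.set_lang(code)
            summaries[code] = wiki.summary(title, sentences=sentences)
    finally:
        # Switch back even if one of the requests fails
        wiki.set_lang(restore)
    return summaries


# Typical use (needs network access):
#   print(summaries_by_language("Cristiano Ronaldo", ["fr", "de"]))
```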
