Web Scraping with Python Scrapy

Python is one of the best programming languages you can use for web scraping, and the Python Scrapy library makes it exceptionally easy. Web scraping is the process of programmatically extracting data from websites. It’s particularly useful when you want to extract a large amount of data from a website or to automate a process that would be too tedious to perform manually.

In this tutorial, we’ll show you how to scrape names, links and movie ratings from IMDB. We’ll store our scraped data in a CSV file.

Installing Scrapy Library for Python

Before, you can perform web scraping with Scrapy, you need to install the Scrapy library. To do so, execute the following command on your command prompt:

$ pip install Scrapy

Alternatively, if you are using Anaconda’s distribution for Python, execute the following command on your Anaconda prompt:

$ conda install -c conda-forge scrapy

Scraping IMDB Movie Records with Python

Data to be Scraped

We’re going to be scraping our data from IMDB’s list of top 250 movies.

movie page

Creating Scrapy Project

To scrape data with Scrapy, you have to create a Scrapy project. To do so, execute the following command on your command prompt inside the directory where you want to create the project. The code below creates a Scrapy project named imdb_records.

scrapy startproject imdb_records

Once you execute the above script, you’ll see a message similar this one on your command prompt.

Output:

New Scrapy project 'imdb_records', using template directory 'c:\programdata\anaconda3\lib\site-packages\scrapy\templates\project', created in:
    E:\IMDB_Records\imdb_records

You can start your first spider with:
    cd imdb_records
    scrapy genspider example example.com

Inside the directory where you created your imdb_records Scrapy project, you should see a new folder called “imdb_records” with the following directory structure:

Python Scrapy folder

Creating Spiders for scraping

Once you’ve created a Scrapy project, the next step is to create a spider that will crawl the website you want to scrape. The script below creates a spider named imdb_spider that crawls the IMDB Top 250 webpage. Note that the following script should be executed inside the “spiders” folder in your “imdb_records” project.

scrapy genspider imdb_spider www.imdb.com/search/title/?groups=top_250

If your spider is successfully created, you should see the following output:

Output:

Created spider 'imdb_spider' using template 'basic' in module:
  imdb_records.spiders.imdb_spider

Code More, Distract Less: Support Our Ad-Free Site

You might have noticed we removed ads from our site - we hope this enhances your learning experience. To help sustain this, please take a look at our Python Developer Kit and our comprehensive cheat sheets. Each purchase directly supports this site, ensuring we can continue to offer you quality, distraction-free tutorials.

Get Our Python Dev Kit

Scraping Movie Records

Once you create a spider, a Python file named “imdb_spider” will appear in your “spiders” folder. The new Python script will look like this:

import scrapy

class ImdbSpiderSpider(scrapy.Spider):
    name = 'imdb_spider'
    allowed_domains = ['imdb.com']
    start_urls = ["https://www.imdb.com/search/title/?groups=top_250"]

    def parse(self, response):
        pass

In the above file, the variable allowed_domains is a list of domains that will be searched for your desired data and the start_urls variable is a list of urls where the spider starts searching for the data you want. The data that you want scraped is specified inside the parse() method.

All the data in a webpage resides inside HTML tags. Scrapy basically searches HTML tags for your desired data, just like we showed you how to do in our VBA web scraping tutorial. Because Scrapy blindly crawls the HTML, you need to have a basic knowledge of HTML in order to know which HTML elements in a webpage contain your desired data.

One of the ways to familiarize yourself with the HTML elements in a website is to open the website in a browser, right click the webpage and click “Inspect”. For example, if you open the IMDB movie page we’re trying to scrape in Chrome, then “Right Click -> Inspect”, you’ll see HTML code like this:

Inspect Element in Chrome

From the HTML elements on the right, you can see that movie names are inside anchor tags, which are inside <h3> tags. These tags are further nested in a div tag with a class “lister-item-content”.

With Scrapy, there are two ways to search data inside an HTML page: You can either use CSS Selectors or you can use XPath Selectors. In this tutorial, we’ll use XPath selectors. Let’s briefly explain the functionality of each “XPATH” selector.

First, modify the “imdb_spider.py” file you generated earlier such that it looks like this:

import scrapy


class ImdbSpiderSpider(scrapy.Spider):
    name = 'imdb_spider'
    allowed_domains = ['imdb.com']
    start_urls = ["https://www.imdb.com/search/title/?groups=top_250"]

    def parse(self, response):

        movies = response.xpath("//div[@class='lister-item-content']")

        for movie in movies:
            movie_name = movie.xpath(".//h3//a//text()").get()
            movie_link = "https://www.imdb.com" + movie.xpath(".//h3//a//@href").get()
            movie_rating = movie.xpath(".//div[@class='ratings-bar']//strong//text()").get()

            yield{
                    'movie name':movie_name,
                    'movie link':movie_link,
                    'movie_rating':movie_rating
                  }

In the above script the response object fetches all the <div> tags with class “lister-item-content”. Next, a loop iterates over the div tags. To fetch the movie name in each iteration of the loop, the div tag fetched by the response object is appended with .//h3//a//text(). These strange sequence of characters tells Scrapy that inside the h3 tag, search for the anchor tag and retrieve the text of the anchor tag which basically contains movie names. Similarly to fetch the title, the “XPATH” query .//h3//a//@href is appended with the div tag. Finally, ratings are retrieved by appending .//div[@class='ratings-bar']//strong//text() query with the div tag retrieved by the response object.

I know this sounds a little complicated, but once you’re comfortable with the patterns in the HTML layout of the webpage you’re trying to scrape, it will start to make more sense.

The last step is to execute the parse method of the “imdb_spider.py” file and store the scraped data in a CSV file. To do so, execute the following script.

scrapy crawl imdb_spider -o imdb_movies.csv

Once you execute the above script, you’ll see that a CSV file named imdb_movies.cs is created inside the “spiders” folder of the imdb_records project.

CSV file path

If you open the CSV file in Excel or any spreadsheet software, you should see that it contains a clean list of all the names, links and movie ratings you just scraped from the website.

CSV file output

That’s all there is to it! As you can see, scrapy is one of the easiest and most reliable ways to rapidly extract data from a website using Python. It’s a whole lot easier than programming it with VBA! Join our team using the form below to see more creative ways you can use Python to handle data like we did in this tutorial.