The aim of this tutorial is to extract data from an Excel sheet and load the data into an SQLite database using Python. We will use the sqlite3 standard module, and the pandas and xrld third-party libraries. Since data can be stored in different ways in an Excel sheet, we will address three case studies: a plain Excel sheet, an Excel sheet with named ranges, and pivot tables.


The pandas and xlrd Modules

Before trying to read data from an Excel sheet with Python, we have to introduce the pandas and xlrd libraries. In order to make the algorithms in the case studies more readable, we’re going to define two classes, namely ExcelDocument and ExcelSheet.

Setting up your Execution Environment

Since both pandas and xlrd are third-party libraries, we have to integrate them in our Python environment. You can install them using the following commands:

pythonexec -m pip install pandas
pythonexec -m pip install xlrd

where pythonexec is the name of the executable file for the Python interpreter, e.g. python3.6, python3.7, and the like. The -m option of this command executes the pip script and passes some arguments to it. Take a look at the official documentation for a complete guide on using pip. After installing all required packages, both libraries can be used in a Python module with the appropriate import statements:

import pandas
import xlrd

Mapping Excel Types into Python Types

Much like SQLite, Excel stores the data type along with the value of a cell. In xlrd, cells are represented by a xlrd.sheet.Cell object, whose type is ctype, and whose value is value. The following table lists all data types that can be stored in a cell, and the corresponding Python type of the cell value:

Type Constant Python Type
XL_CELL_EMPTY Empty String
XL_CELL_TEXT Unicode String
XL_CELL_NUMBER float
XL_CELL_DATE float
XL_CELL_BOOLEAN int, 1 stands for True, 0 stands for False.
XL_CELL_ERROR int representing an error code. Use error_text_from_code() to get the respective error message.
XL_CELL_BLANK Empty String.

As we have already pointed out in another tutorial, all SQLite data types are represented by just five Python types, as indicated in the following table:

Python Type SQLite Type
None NULL
int Integer
float Real
str Text
bytes BLOB

If we compare the right column of the previous table with the right column of this table, we can see that all Excel types can be represented by an SQLite type, but beware that:

  • All numbers in Excel are represented as float objects, even if they don’t have a decimal part. There are cases when an int type is required instead, as we will see in an example below.
  • All dates in Excel are represented as float objects. The next subsection is devoted to the problem of converting between various date formats.

Mapping Dates

The Date type actually stores both date and time as a float object. The xlrd.xldate module provides many functions for converting from and to float dates:

Function Description
xldate_as_tuple(xldate, datemode) Converts a float date into a (year, month, day, hour, minutes, seconds) tuple. Then this tuple can be fed to datetime.datetime(*tuple).
xldate_as_datetime(xldate, datemode) Converts a float date into a datetime.datetime object.
xldate.xldate_from_datetime(datetime_tuple, datemode) Converts a date and time tuple into an Excel float. The datemode parameter should be set to 0 if the date is computed from 1900-01-01, it should be set to 1 if the date is computed from 1904-01-01.
xldate_from_date_tuple(date_tuple, datemode) Converts a (year, month, day) tuple into an Excel float. It equals xldate_from_datetime_tuple(date_tuple+(0, 0, 0), datemode).
xldate_from_time_tuple(time_tuple, datemode) Converts a (hour, minutes, seconds) tuple into an Excel float. It equals xldate_from_datetime_tuple((1899, 12, 31)+time_tuple, 0) or xldate_from_datetime_tuple((1904, 1, 1)+time_tuple, 1), depending on the datemode parameter.

The following examples make use of all these functions:

import xlrd
import datetime

def test_xlrd_datetime_functions():
    """ Test all function from the `xlrd.xldate` module. """
    # Get the current date and time
    today_ms = datetime.datetime.now()
    today = datetime.datetime(*today_ms.utctimetuple()[ : 6])
    # Converts a `tuple` date into a `float` date
    fdate = xlrd.xldate.xldate_from_date_tuple(today.utctimetuple()[ : 3], 0)
    # Converts a `tuple` time into a `float` time
    ftime = xlrd.xldate.xldate_from_time_tuple(
        (today.hour, today.minute, today.second))
    # Converts a `tuple` date and time into a `float` date and time
    fdatetime = xlrd.xldate.xldate_from_datetime_tuple(
        today.utctimetuple()[ : 6], 0)
    # Converts a `float` date into a tuple
    tdate = xlrd.xldate.xldate_as_tuple(fdatetime, 0)
    # Converts a `float` date into a `datetime.datetime` object.
    date = xlrd.xldate.xldate_as_datetime(fdatetime, 0)
    print('Today is %s' % today)
    print('float date is %.3f (%s)' \
        % (fdate, xlrd.xldate.xldate_as_datetime(fdate, 0)))
    print('float time is %.3f (%s)' \
        % (ftime, xlrd.xldate.xldate_as_datetime(ftime, 0)))
    print('float datetime is %3.f' % fdatetime)
    print('date tuple is ', tdate)
    print('today == date? %s' % (today == date))
    print('today == datetime.datetime(*tdate)? %s' \
        % (today == datetime.datetime(*tdate)))

Let’s have a closer look at the code:

  • the today_ms variable holds the current date and time as a datetime.datetime object. This is an immutable object, so once it has been created, you can’t change any of the date or time components.
  • the utctimetuple() method returns the date and time as a (year, month, day, hour, minute, second, microsecond) tuple. - the today date is just like the today_ms date, but we stripped the microsecond component, since all functions in the xlrd.xldate module require time objects with just hour, minute, and second components.
  • the fdate date is created from a (year, month, day) tuple, which we get from the utctimetuple() method. The time is set to 00:00:00 am. In all our examples we will set the datemode parameter of the various xldate functions to 0, meaning that all dates will be computed starting form 1900-01-01.
  • the ftime time is created from a (hour, minute, second) tuple, which we get from the properties with the same names of the today object. The date is set to 1899-12-31, the day before the first valid date in this date mode.
  • the fdatetime time is created from the first 6 items of the tuple returned by the utctimetuple() method.
  • the tdate tuple can be used to create an object that must equal to today, i.e. today == datetime.datetime(*tdate).
  • the date object must also equal to today, since it has been created using the same date and time components.

A possible output of test_xlrd_datetime_functions() is:

Today is 2019-04-04 09:22:25
float date is 43559.000 (2019-04-04 00:00:00)
float time is 0.391 (1899-12-31 09:22:25)
float datetime is 43559.391
date tuple is  (2019, 4, 4, 9, 22, 25)
today == date? True
today == datetime.datetime(*tdate)? True

Regardless of the current date and time, both tests should evaluate to True.

Now, we know how to represent an Excel date as a Python string or float. On the SQLite side, there isn’t a DATE type, but dates can be stored in the following ways:

  • as a TEXT, using ISO-8601 strings.
  • as a REAL, using Julian day numbers.
  • as an INTEGER, using the Unix Time format.

An ISO-8601 date is in the format YYYY-MM-DDThh:mm:ss.sTZD, where:

  • YYYY is a four-digit year.
  • MM is a two-digit month (1-12).
  • DD is a two-digit day.
  • T separates the date from the the time. It must be omitted if the time is omitted.
  • hh is a two-digit hour.
  • mm is a two-digit minute (0-59).
  • ss is a two-digit second (0-59). It can be omitted.
  • s is a decimal fraction of a second. It can be omitted.
  • TZD can be either the letter Z, meaning the it is an UTC (Coordinated Universal Time) time, or an offset from the UTC time +hh:mm or -hh:mm; e.g. 2019-04-04T19:20:30.45+01:00 indicates the 4th April 2019, at 8:30 pm, 1 hour ahead the UTC time.

SQLite provides various scalar functions to convert between these formats. In all examples in this tutorial, we will stick to the string representation of dates. As we will see below, the ExcelSheet class will automatically take care of the conversion between Excel float dates and string dates.

Representing an Excel Document with the ExcelDocument Class

Now, we will describe two classes, ExcelDocument, which represents an Excel document, and ExcelSheet, which represents a single sheet within an Excel workbook. These classes will do most of the hard work, allowing us to describe the algorithms for our case studies in a straightforward way. Both classes are located in the exceldoc.py Python module, which has been provided as an attachment to this tutorial.

The ExcelDocument Class

import itertools

import pandas
import pandas.io.excel as excel
import xlrd


class ExcelDocument(excel.ExcelFile):
    """ Wrapper around the `excel.ExcelFile` class. """
    
    def __getitem__(self, name):
        """ Gets a sheet by name. """
        if name in self.sheet_names:
            return ExcelSheet(self.book.sheet_by_name(name))
        else:
            raise KeyError('Invalid sheet name "%s"' % name)
            
    def __iter__(self):
        """ Iterates over the sheets in the document. """
        for name in self.sheet_names:
            yield ExcelSheet(self.book.sheet_by_name(name))
            
    def __len__(self):
        """ Number of sheets in the document. """
        return self.book.nsheets
      
    @property
    def global_names(self):
        """ Lists all global names. """
        return { name : self.book.name_map[name] \
            for name, scope in self.book.name_and_scope_map \
            if scope == -1 }

Get Our Python Developer Kit for Free

I put together a Python Developer Kit with over 100 pre-built Python scripts covering data structures, Pandas, NumPy, Seaborn, machine learning, file processing, web scraping and a whole lot more - and I want you to have it for free. Enter your email address below and I'll send a copy your way.

Yes, I'll take a free Python Developer Kit

This class adds the following methods to its base class:

  • the __getitem__(name) method allows access to all Excel sheets in a workbook by name. It can raise a KeyError exception if there isn’t any sheet by that name; e.g. doc['Sheet1'] will return an ExcelSheet object for the sheet named Sheet1.
  • the __iter__() method iterates over the sheets in the workbook.
  • the __len__() method returns the number of sheets in the workbook.
  • the global_names property returns a dictionary mapping all global names in the Excel document to their values. We will discuss global and local names below.

We won’t give any example of this class right now, but we will put it in good use when we will address our case studies.

The ExcelFile Class

The base class ExcelFile has the following noteworthy methods and attributes:

  • To create a new object from this class, you must provide the path to the Excel file, e.g. ExcelFile('doc.xls') creates a new object for the doc.xls Excel file.
  • the book read-only attribute represents an Excel workbook, collecting all sheets in the file. book.nsheets is the number of sheets. Its sheet_by_name(name) method retrieves a sheet object by its name. name_map and name_and_scope_map are two dictionary holding named ranges and functions, which can be accessed by name or by a (name, scope) tuple, respectively. We will tell more about named ranges below.

Representing an Excel Sheet with the ExcelSheet Class

class ExcelSheet:
    """ Class that represents an Excel sheet. """
    
    def __init__(self, sheet, has_header=True):
        
        def find_sheet_idx(sheet):
            idx = -1
            for sh in sheet.book.sheets():
                idx += 1
                if sh.name == sheet.name:
                    break
            if idx == -1:
                raise ValueError("Couldn't locate sheet %s." % sheet.name)
                
            return idx
            
        self._sheet = sheet
        self._has_header = bool(has_header)
        self._sheet_idx = find_sheet_idx(sheet)
        
    def iter_rows(self):
        """ Iterates over the rows of data, skipping the header. """
        start = 1 if self._has_header else 0
        for r in range(start, self._sheet.nrows):
            yield [ xlrd.xldate.xldate_as_datetime(c.value, 0).date() \
                if c.ctype == xlrd.XL_CELL_DATE else c.value \
                for c in self._sheet.row(r) ]
                
    def data_frame(self):
        """ Build a `pandas` data frame from all named ranges. """
        
        def column_from_range(named_range):
            sh, rlo, rhi, clo, chi = named_range.area2d()
            return [ sh.cell(r, clo).value \
                for r in range(rlo, rhi) ]
                
        ranges = self.named_ranges
        return pandas.DataFrame(
            { name : column_from_range(ranges[name]) \
            for name in ranges })
                
    def iter_named_ranges(self):
        """ Iterates over rows of data using named ranges. """
        
        def iter_single_range(named_range):
            sh, rlo, rhi, clo, chi = named_range.area2d()
            for r in range(rlo, rhi):
                yield sh.cell(r, clo).value
                
        ranges = self.named_ranges
        header = tuple(ranges.keys())
        for row in itertools.zip_longest(
            *[ iter_single_range(ranges[name]) for name in ranges ]):
            yield { key : value for key, value in zip(header, row) }
    
    @property
    def name(self):
        return self._sheet.name
        
    @property
    def nrows(self):
        return self._sheet.nrows
        
    @property
    def ncols(self):
        return self._sheet.ncols
        
    def __getitem__(self, n):
        """ Returns the `n`-th row. """
        if not (0 <= n < self.nrows):
            raise ValueError('0 >= row > %d, but %d given.'
                % (self.nrows, n))
        return self._sheet.row(n)
        
    def __iter__(self):
        """ Iterates over all cells left to right, top to bottom. """
        for r in range(self.nrows):
            for c in range(self.ncols):
                yield self._sheet.row(r)[c].value
       
    @property
    def local_names(self):
        """ List all local names. """
        return { name : self._sheet.book.name_and_scope_map[(name, scope)] \
            for name, scope in self._sheet.book.name_and_scope_map \
            if scope == self._sheet_idx }
            
    @property
    def named_ranges(self):
        """ List of all named ranges for this sheet. """
        names = self.local_names
        return { name: names[name] \
            for name in names \
            if names[name].result.kind in (xlrd.oREF, xlrd.oREL) }

This class adds the following method to the underlying class:

  • the __init__(sheet, has_header=True) method builds a new object. Since the official documentation strongly advises against subclassing the xlrd.sheet.Sheet class, we will use this workaround: we will store the sheet object, that all methods in the ExcelSheet class will access as needed. This method has an optional has_header parameter, which specifies if the first row of data should be regarded as a header, i.e. the list of all column names. The find_sheet_idx() function maps a sheet name to its 0-based index into the sheets’ list. We need this index just to assign a named range or function to the appropriate scope, as we will see below.
  • the iter_rows() method iterates over the rows of the Excel sheet. It skips the first row if the excel sheet has a header. It is used mainly to feed data to the executemany() method, as we will see below.
  • the data_frame() method builds a pandas.DataFrame object holding all columns of the local named ranges. See case study 3 for an example of its usage.
  • the iter_named_ranges() method iterates over all named ranges, building a dictionary for each row of data. The iter_single_range() function builds a column of cells from a named range. The itertools.zip_longest() function creates a row by extracting a cell from each column. If a column is too short, it fills the remaining cells with a None value. Finally, it uses a dictionary comprehension to map all column names to their value. In case study 2, the iter_named_ranges() method will be called by the executemany() method to build an INSERT statement from a template by replacing named placeholders.
  • the name property returns the name of the sheet.
  • the nrows and ncols properties return, respectively, the number of rows and columns in the Excel sheet, including the header.
  • the __getitem__(n) method returns the n-th row from the Excel sheet. Both rows and columns are indexed starting from 0; e.g. sheet[r][c] will return the value of the cell located in the r-th row and c-th column.
  • the __iter__() method iterates over the cell values from top to bottom, from left to right, including the header.
  • the local_names property returns a dictionary mapping all local names of an Excel sheet to their values.
  • the local_ranges property returns a dictionary mapping all local names referring to a range into their absolute range coordinates. We will deal with local names below.

Get Our Python Developer Kit for Free

I put together a Python Developer Kit with over 100 pre-built Python scripts covering data structures, Pandas, NumPy, Seaborn, machine learning, file processing, web scraping and a whole lot more - and I want you to have it for free. Enter your email address below and I'll send a copy your way.

Yes, I'll take a free Python Developer Kit

The xlrd.sheet.Sheet Class

These are the methods and attributes of the xlrd.sheet.Sheet that will be useful for us:

  • the nrows and ncols read-only attributes are, respectively, the number of rows and the number of columns of the Excel sheet.
  • the name attribute is the name of the Excel sheet.
  • The row(n) method returns the n-th row of the sheet, where n must be between 0 and nrows - 1.
  • The cell(r, c) method retrieves the cell on the r-th row and c-th column.

Detecting Local Named Ranges

In Excel, you can assign a name to either a formula or a range. A name can be used only within its scope. A name can have one of the following scopes:

  • global: (-1) the name can be used by all the sheets in the workbook into which it is defined.
  • macro: (-2) the name can be used only inside a macro or VBA module. We won’t deal with these kinds of names here.
  • local: (from 0 to book.nsheets - 1) the name can be used only inside a sheet.

The ExcelDocument.global_names property detects all global names, while the ExcelSheet.local_names property lists all names belonging to a specific sheet. The ExcelSheet.named_ranges property selects all local names that are named ranges. Let’s take a closer look at them, since they should make more sense now:

@property
def local_names(self):
    return { name : self._sheet.book.name_and_scope_map[(name, scope)] \
        for name, scope in self._sheet.book.name_and_scope_map \
        if scope == self._sheet_idx }
            
@property
def named_ranges(self):
    """ List of all named ranges for this sheet. """
    names = self.local_names
    return { name: names[name] \
        for name in names \
        if names[name].result.kind in (xlrd.oREF, xlrd.oREL) }
  • the local_names property builds a dict object from a dictionary comprehension, selecting all names whose scope matches the sheet index.
  • the named_ranges property selects all names whose value is of type xlrd.oREF or xlrd.oREL, i.e. the value is a range.

The following table lists all possible values of the kind attribute:

Numeric Constant Symbolic Constant Python Represenation
-2 xlrd.oREL None, or a non-empty list of fully or partially relative Ref3D instances.
-1 xlrd.oREF None, or a non-empty list of absolute Ref3D instances.
0 xlrd.oSTRG An Unicode string.
1 xlrd.oUNK The type of the result is unknown or ambiguous. Its value is None.
2 xlrd.oNUM A float.
3 xlrd.oBOOL An integer, with 0 meaning False, and 1 meaning True.
4 xlrd.oERR None, or an integer error code.
5 xlrd.oMSNG Placeholder for missing data. Its value is None.

Many of these types are also Excel data types. Ranges have an area2d() method, that returns a (sheet, rlo, rhi, clo, chi) tuple, where:

  • sheet is the sheet to which the range belongs.
  • rlo and rhi are, respectively, the lowest and highest row index.
  • clo and chi are, respectively, the lowest and highest column index.

So if we want to iterate over the cell of the range from top to bottom, and from left to right, we can use the following loop:

sh, rlo, rhi, clo, cli = name.area2d()
for r in range(rlo, rhi):
    for c in range(clo, chi):
        yield sh.cell(r, c)

Both the ExcelSheet.iter_named_ranges() and the ExcelSheet.data_frame() methods use similar loops.


After setting up the basic tools of the trade, it is time to put them into practice. All our case studies will share a common pattern:

  • The Shortcomings of this Approach section explains where our algorithm for extracting data from an Excel sheet falls short.
  • The Data Layout section describes all columns’ names and data type for our sample data.
  • The Header Detection section describes how columns’ names are detected by the ExcelSheet class.
  • The Implementation section gives an example of how to extract data from an Excel file and load them into an SQLite table.

Case Study 1: A Simple Excel Sheet

The easiest case for importing into a database is that of a spreadsheet where the first row lists all column names, and all other rows contain data.

Shortcomings of this Approach

This approach has several shortcomings, depending both on the way data is presented in the Excel file, and on our particular implementation:

  • The algorithm assumes that the first row stores all column names.
  • The algorithm doesn’t check if all rows have the same number of columns. It will pad all remaining cells with None values.
  • It doesn’t check if the number of columns in the Excel source matches the number of columns in the SQLite table. This may lead to an sqlite3.Error exception while trying to insert Excel data into an SQLite table.
  • It doesn’t check if the Excel data type matches the SQLite data type. The only type conversion performed implicitly is from float dates to datetime.datetime dates, as we have seen above
  • The executemany() method uses positional placeholders instead of named placeholders, which are less error-prone, and which allows a more free arrangement of the columns in the Excel source. We will show an example of named placeholders in case study 2.

Data Layout

In the following table we will list the name and data type of both the Excel sheets and the SQLite table. The Excel sample can be found in the albums.xlsx file, where each sheet stores an album. The SQLite database is located in the albums.db3 file, where all albums will be stored in the albums table. Again, all supporting files can be downloaded in this ZIP file.

Name Excel Type SQLite Type Description
id INTEGER PRIMARY KEY The Song ID.
nr XL_CELL_NUMBER INTEGER NOT NULL The track number.
band XL_CELL_TEXT TEXT NOT NULL The band's name.
song XL_CELL_TEXT TEXT NOT NULL The song's title.
album XL_CELL_TEXT TEXT NOT NULL The album's title.
duration XL_CELL_TEXT TEXT NOT NULL The song's duration, in minutes:seconds format.

The id column on the SQLite schema has no Excel counterpart, since the song’s ID is not stored in the Excel sheets, but it is automatically generated while loading data in the table.

Header Detection

The ExcelSheet class assumes the first row of data holds the header. The header will be skipped when iterating over the data by using the ExcelSheet.iter_rows() method.

Implementation

We will now implement an algorithm for extracting data from a basic Excel sheet and importing it into a sqlite database using Python pandas and xlrd (via our custom classes in the exceldoc.py file).

import sqlite3
from exceldoc import * 

def case_study_1():
    """ Test for case study 1. """
    # Open the database connection and the Excel file
    with sqlite3.connect('albums.db3') as db, \
        ExcelDocument('albums.xlsx') as src:
        insert_template = "INSERT INTO albums " \
            "(nr, band, song, album, duration) " \
            "VALUES (?, ?, ?, ?, ?);"
            
        # Clear the database
        db.execute('DELETE FROM albums;')
        
        # Load data from each Excel sheet into the database
        for sheet in src:
            try:
                db.executemany(insert_template, sheet.iter_rows())
            except sqlite3.Error as e:
                print(e)
                db.rollback()
            else:
                db.commit()
        
        # Check if all data have been loaded
        select_stmt = 'SELECT DISTINCT band, album FROM albums;'
        for row in db.execute(select_stmt).fetchall():
            print(';'.join(row))

Get Our Python Developer Kit for Free

I put together a Python Developer Kit with over 100 pre-built Python scripts covering data structures, Pandas, NumPy, Seaborn, machine learning, file processing, web scraping and a whole lot more - and I want you to have it for free. Enter your email address below and I'll send a copy your way.

Yes, I'll take a free Python Developer Kit

Let’s have a closer look at this example:

  • First, we create an SQLite connection for the albums.db3 database, and we open the albums.xlsx file, where all albums data are stored.
  • The insert_template string stores the INSERT statement template used to load data into the SQLite table. The Excel data will be matched to the corresponding SQLite column using the positional placeholder ?.
  • The DELETE FROM albums; statement deletes all data from the albums table.
  • Then, we iterate over each sheet in the albums.xlsx workbook, assigning the current sheet to sheet. The executemany() method will replace each ? placeholder of the insert_template statement with the corresponding field of a row from the iterator returned by the sheet.iter_rows() method. If the insertion succeeds, then all changes to the database will be committed, otherwise they will be discarded.
  • Finally, we will check if all data have been loaded correctly by asking SQLite to list all bands and albums in the albums table.

The output of the program should be:

Amaranthe;The Nexus
Amberian Dawn;Innuendo
Within Temptation;The Unforgiving
Epica;The Quantum Enigma
Sleeping Romance;Alba
Unleash the Archers;Apex
Metalite;Heroes in Time

Case Study 2: An Excel Sheet with Named Ranges

A slightly more elaborate case is that of an Excel sheet where each column is defined as a local named range. There may be a header in the first row, just to make it more human readable, but this algorithm will ignore it.

Shortcomings of this Approach

  • This algorithm works only for XLS files. That’s where things starts to get really messy. As of the time of writing (xlrd version 1.1.0), there is a little difference between reading *.xls and *.xlsx files. The API provides a convenient Name.area2d() method to return an absolute reference to the cells within a named range, but it doesn’t work for *.xlsx files. The API provides a Name.formula_text attribute for XLSX files, which stores the named ranges as a qualified range literal; e.g. innuendo!$C$1:$C$12 represents all cells from C1 to C12 of the innuendo sheet. These references can be parsed using regular expressions, but this approach would clutter the algorithm, so we won’t address it here.
  • it doesn’t check for overlapping columns ranges.
  • it doesn’t check if all ranges have the same number of rows. It will pad shorter columns with None values.
  • it doesn’t check if a range has more that one column, it just uses the leftmost column.
  • it assumes that all column names from the Excel sheet will match a column name in the SQLite schema. This may lead to an sqlite3.Error exception while trying to insert Excel data into an SQLite table.

Data Layout

The data layout is the same as in our first case study. The sample SQLite database is located in the albums-named.db3 file. The sample Excel source is located in the albums-named.xlsx file. There are only three sheets, and all sheets have no header in the first row. If you haven’t already downloaded it, all supporting files for this case study are in this ZIP file.

Header Detection

This algorithm gives no particular meaning to the first row of each Excel sheet, because it relies on named ranges to determine each column’s name. It will iterate over the ExcelSheet.named_ranges objects, and it will build rows of data from them using the ExcelSheet.iter_named_ranges() method.

Implementation

We will now implement the algorithm for extracting data from an Excel sheet which complies with all properties listed above. The extracted Excel data will be stored in a database using the sqlite3 module of Python.

import sqlite3
from exceldoc import *

def case_study_2():
    """ Test for case study 2. """
    # Open the database connection and the Excel file
    with sqlite3.connect('albums-named.db3') as db, \
        ExcelDocument('albums-named.xls') as src:
        insert_template = "INSERT INTO albums " \
            "(nr, band, song, album, duration) " \
            "VALUES (:nr, :band, :song, :album, :duration);"
            
        # Clear the database
        db.execute('DELETE FROM albums;')
        
        # Load data from each Excel sheet into the database
        for sheet in src:
            try:
                db.executemany(insert_template, sheet.iter_named_ranges())
            except sqlite3.Error as e:
                print(e)
                db.rollback()
            else:
                db.commit()
        
        # Check if all data have been loaded
        select_stmt = 'SELECT DISTINCT band, album FROM albums;'
        for row in db.execute(select_stmt).fetchall():
            print(';'.join(row))

This example closely resembles that of the first case study, but it is worth noting that:

  • Once again, we create a database connection for the albums-named.db3 file, and we open the albums-named.xls Excel file.
  • We delete all data in the albums table.
  • The template for the INSERT statement is slightly different, since it employs named placeholders instead of positional placeholders. Named placeholders requires a dictionary instead of a tuple to match the row fields to the table columns.
  • We iterate over the sheets in the workbook, and we use executemany(stmt, iterable), but this time the iterable parameter must be an iterator of dictionaries, provided by the iter_named_ranges() method of the sheet (see above).
  • Finally, we will check if all data have been loaded correctly by asking SQLite to list all bands and albums in the albums table of our newly created database.

Let’s see an example of how the replacement takes place in the executemany() method. The first row of the innuendo sheet is:

Nr Band Song Album Duration
1 Amberian Dawn Fame & Gloria Innuendo 4:24

This is the SQL statement before and after the substitution:

/* Statement with named placeholders */
INSERT INTO albums (nr, band, song, album, duration)
VALUES (:nr, :band, :song, :album, :duration);

/* Statement after the replacement */
INSERT INTO albums (nr, band, song, album, duration)
VALUES (1, 'Amberian Dawn', 'Fame & Gloria', 'Innuendo', '4:24');

The output of the case_study_2() function should be:

Amaranthe;The Nexus
Amberian Dawn;Innuendo
Within Temptation;The Unforgiving

Case Study 3: An Excel Pivot Table

The xlrd library has no way of detecting pivot tables, so in this section we will actually try to simulate an Excel pivot table by using the pandas.pivot_table() function. Once you have the table, you can add it to your database with python sqlite3.

Shortcomings of this Approach

  • This algorithm doesn’t actually read the pivot table, it just uses its data source to emulate the pivot table in pandas.
  • Since header detection works the same way as for case study 2, it can only be used with XLS files.
  • Since it uses named ranges to detect the headers of the source sheet, all shortcomings of case study 2 will also apply to this approach.

Data Layout

The pivot.xls file (download file) has two sheets: pivot-data, which contains the data source for the pivot table, and pivot-table, which contains the pivot table. The pivot-data table has the following schema:

Name Excel Type SQLite Type Description
Date XL_CELL_DATE TEXT NOT NULL Date in YYYY-MM-DD format.
Seller XL_CELL_TEXT TEXT NOT NULL Name of the seller.
Income XL_CELL_NUMBER INTEGER NOT NULL Daily income.
Month XL_CELL_NUMBER INTEGER NOT NULL Month when the sale took place.

The pivot table uses the Seller column as row field, the Month column as column field, and the Income column as data field. It uses SUM() as an aggregate function for the Income data.

Header Detection

The header is detected using named ranges. The header names are used as keys for each column of data in the pandas.DataFrame object.

Implementation

We will now implement the algorithm for extracting data from an Excel sheet which complies with all properties listed above.

import pandas
from exceldoc import *

def case_study_3():
    """ Test for case study 3. """
    with ExcelDocument('pivot.xls') as src:
        table = pandas.pivot_table(
            src['pivot-data'].data_frame(),
            columns = 'month',
            values = 'income',
            index = 'seller',
            aggfunc = {'income' : sum},
            fill_value = 0 )
        print(table)

The output of the case_study_3() function is:

month     1.0   2.0   3.0   4.0
seller                         
Annie       0  2196  4736     0
Ian         0  3352  7806     0
John     1758  3086  5850   545
Mark      746  5038  6221  2299
Matthew   703  3936  2130  2896
Nina     1164  2371  5359   855

which resembles the contents of the pivot-table sheet of the pivot.xls file, as you can see from this (LibreOffice) screenshot:

The Pivot Table from the pivot-table sheet


Closing Thoughts

In this tutorial we learned how to extract data from Excel sheets with different data layouts and import the data into a sqlite database using Python.

First, we described the Excel native data types, with particular emphasis on the date type. Then we became acquainted with the pandas and xlrd third-party libraries, admittedly learning as little as needed to serve our purposes.

We also presented two powerful classes, ExcelDocument and ExcelSheet, in order to simplify some implementation details (notably on named range’s detection), and to make our algorithms for data extraction easier to read.

Finally, we provided two algorithms for reading data from a simple Excel sheet and from a sheet with named ranges. Though the xlrd library doesn’t provide support for pivot tables, we even showed how to emulate them by using the pandas.pivot_table() function.

Did you find this free tutorial helpful? Share this article with your friends, classmates, and coworkers on Facebook and Twitter! When you spread the word on social media, you’re helping us grow so we can continue to provide free tutorials like this one for years to come.


Get Our Python Developer Kit for Free

I put together a Python Developer Kit with over 100 pre-built Python scripts covering data structures, Pandas, NumPy, Seaborn, machine learning, file processing, web scraping and a whole lot more - and I want you to have it for free. Enter your email address below and I'll send a copy your way.

Yes, I'll take a free Python Developer Kit

References

  • pandas’ pivot tables: look at the official documentation for more examples of pivot tables.
  • xlrd: You shouldn’t need to interact directly with the xlrd API, but just in case you want to dig deeper, refer to the official documentation.
  • the SQL INSERT statement: see our tutorial for many examples of replacing placeholders on a INSERT statement template.