The aim of this tutorial is to extract data from an Excel sheet and load the data into an SQLite database using Python. We will use the sqlite3
standard module, and the pandas
and xrld
third-party libraries. Since data can be stored in different ways in an Excel sheet, we will address three case studies: a plain Excel sheet, an Excel sheet with named ranges, and pivot tables.
- The pandas and xlrd Modules
- Case Study 1: A Simple Excel Sheet
- Case Study 2: An Excel Sheet with Named Ranges
- Case Study 3: An Excel Pivot Table
- Closing Thoughts
- References
The pandas and xlrd Modules
Before trying to read data from an Excel sheet with Python, we have to introduce the pandas
and xlrd
libraries. In order to make the algorithms in the case studies more readable, we’re going to define two classes, namely ExcelDocument
and ExcelSheet
.
Setting up your Execution Environment
Since both pandas
and xlrd
are third-party libraries, we have to integrate them in our Python environment. You can install them using the following commands:
pythonexec -m pip install pandas pythonexec -m pip install xlrd
where python3.6
, python3.7
, and the like. The pip
script and passes some arguments to it. Take a look at the official documentation for a complete guide on using pip
. After installing all required packages, both libraries can be used in a Python module with the appropriate import
statements:
import pandas
import xlrd
Mapping Excel Types into Python Types
Much like SQLite, Excel stores the data type along with the value of a cell. In xlrd
, cells are represented by a xlrd.sheet.Cell
object, whose type is
Type Constant | Python Type |
---|---|
XL_CELL_EMPTY | Empty String |
XL_CELL_TEXT | Unicode String |
XL_CELL_NUMBER | float |
XL_CELL_DATE | float |
XL_CELL_BOOLEAN | int, 1 stands for True , 0 stands for False . |
XL_CELL_ERROR | int representing an error code. Use error_text_from_code() to get the respective error message. |
XL_CELL_BLANK | Empty String. |
As we have already pointed out in another tutorial, all SQLite data types are represented by just five Python types, as indicated in the following table:
Python Type | SQLite Type |
---|---|
None | NULL |
int | Integer |
float | Real |
str | Text |
bytes | BLOB |
If we compare the right column of the previous table with the right column of this table, we can see that all Excel types can be represented by an SQLite type, but beware that:
- All numbers in Excel are represented as
float
objects, even if they don’t have a decimal part. There are cases when anint
type is required instead, as we will see in an example below. - All dates in Excel are represented as
float
objects. The next subsection is devoted to the problem of converting between various date formats.
Mapping Dates
The Date type actually stores both date and time as a float
object. The xlrd.xldate
module provides many functions for converting from and to float
dates:
Function | Description |
---|---|
xldate_as_tuple(xldate, datemode) | Converts a float date into a (year, month, day, hour, minutes, seconds) tuple . Then this tuple can be fed to datetime.datetime(*tuple) . |
xldate_as_datetime(xldate, datemode) | Converts a float date into a datetime.datetime object. |
xldate.xldate_from_datetime(datetime_tuple, datemode) | Converts a date and time tuple into an Excel float . The |
xldate_from_date_tuple(date_tuple, datemode) | Converts a (year, month, day) tuple into an Excel float . It equals xldate_from_datetime_tuple(date_tuple+(0, 0, 0), datemode) . |
xldate_from_time_tuple(time_tuple, datemode) | Converts a (hour, minutes, seconds) tuple into an Excel float . It equals xldate_from_datetime_tuple((1899, 12, 31)+time_tuple, 0) or xldate_from_datetime_tuple((1904, 1, 1)+time_tuple, 1) , depending on the |
The following examples make use of all these functions:
import xlrd
import datetime
def test_xlrd_datetime_functions():
""" Test all function from the `xlrd.xldate` module. """
# Get the current date and time
today_ms = datetime.datetime.now()
today = datetime.datetime(*today_ms.utctimetuple()[ : 6])
# Converts a `tuple` date into a `float` date
fdate = xlrd.xldate.xldate_from_date_tuple(today.utctimetuple()[ : 3], 0)
# Converts a `tuple` time into a `float` time
ftime = xlrd.xldate.xldate_from_time_tuple(
(today.hour, today.minute, today.second))
# Converts a `tuple` date and time into a `float` date and time
fdatetime = xlrd.xldate.xldate_from_datetime_tuple(
today.utctimetuple()[ : 6], 0)
# Converts a `float` date into a tuple
tdate = xlrd.xldate.xldate_as_tuple(fdatetime, 0)
# Converts a `float` date into a `datetime.datetime` object.
date = xlrd.xldate.xldate_as_datetime(fdatetime, 0)
print('Today is %s' % today)
print('float date is %.3f (%s)' \
% (fdate, xlrd.xldate.xldate_as_datetime(fdate, 0)))
print('float time is %.3f (%s)' \
% (ftime, xlrd.xldate.xldate_as_datetime(ftime, 0)))
print('float datetime is %3.f' % fdatetime)
print('date tuple is ', tdate)
print('today == date? %s' % (today == date))
print('today == datetime.datetime(*tdate)? %s' \
% (today == datetime.datetime(*tdate)))
Let’s have a closer look at the code:
- the
today_ms variable holds the current date and time as adatetime.datetime
object. This is an immutable object, so once it has been created, you can’t change any of the date or time components. - the
utctimetuple()
method returns the date and time as a (year, month, day, hour, minute, second, microsecond) tuple. - thetoday date is just like thetoday_ms date, but we stripped themicrosecond component, since all functions in thexlrd.xldate
module require time objects with just hour, minute, and second components. - the
fdate date is created from a (year, month, day) tuple, which we get from theutctimetuple()
method. The time is set to 00:00:00 am. In all our examples we will set thedatemode parameter of the variousxldate
functions to 0, meaning that all dates will be computed starting form 1900-01-01. - the
ftime time is created from a (hour, minute, second) tuple, which we get from the properties with the same names of thetoday
object. The date is set to 1899-12-31, the day before the first valid date in this date mode. - the
fdatetime time is created from the first 6 items of the tuple returned by theutctimetuple()
method. - the
tdate tuple can be used to create an object that must equal totoday , i.e.today == datetime.datetime(*tdate)
. - the
date object must also equal totoday , since it has been created using the same date and time components.
A possible output of test_xlrd_datetime_functions()
is:
Today is 2019-04-04 09:22:25 float date is 43559.000 (2019-04-04 00:00:00) float time is 0.391 (1899-12-31 09:22:25) float datetime is 43559.391 date tuple is (2019, 4, 4, 9, 22, 25) today == date? True today == datetime.datetime(*tdate)? True
Regardless of the current date and time, both tests should evaluate to True
.
Now, we know how to represent an Excel date as a Python string or float
. On the SQLite side, there isn’t a DATE
type, but dates can be stored in the following ways:
- as a TEXT, using ISO-8601 strings.
- as a REAL, using Julian day numbers.
- as an INTEGER, using the Unix Time format.
An ISO-8601 date is in the format YYYY-MM-DDThh:mm:ss.sTZD
, where:
- YYYY is a four-digit year.
- MM is a two-digit month (1-12).
- DD is a two-digit day.
- T separates the date from the the time. It must be omitted if the time is omitted.
- hh is a two-digit hour.
- mm is a two-digit minute (0-59).
- ss is a two-digit second (0-59). It can be omitted.
- s is a decimal fraction of a second. It can be omitted.
- TZD can be either the letter
Z
, meaning the it is an UTC (Coordinated Universal Time) time, or an offset from the UTC time+hh:mm
or-hh:mm
; e.g.2019-04-04T19:20:30.45+01:00
indicates the 4th April 2019, at 8:30 pm, 1 hour ahead the UTC time.
SQLite provides various scalar functions to convert between these formats. In all examples in this tutorial, we will stick to the string representation of dates. As we will see below, the ExcelSheet
class will automatically take care of the conversion between Excel float dates and string dates.
Representing an Excel Document with the ExcelDocument Class
Now, we will describe two classes, ExcelDocument
, which represents an Excel document, and ExcelSheet
, which represents a single sheet within an Excel workbook. These classes will do most of the hard work, allowing us to describe the algorithms for our case studies in a straightforward way. Both classes are located in the exceldoc.py
Python module, which has been provided as an attachment to this tutorial.
The ExcelDocument Class
import itertools
import pandas
import pandas.io.excel as excel
import xlrd
class ExcelDocument(excel.ExcelFile):
""" Wrapper around the `excel.ExcelFile` class. """
def __getitem__(self, name):
""" Gets a sheet by name. """
if name in self.sheet_names:
return ExcelSheet(self.book.sheet_by_name(name))
else:
raise KeyError('Invalid sheet name "%s"' % name)
def __iter__(self):
""" Iterates over the sheets in the document. """
for name in self.sheet_names:
yield ExcelSheet(self.book.sheet_by_name(name))
def __len__(self):
""" Number of sheets in the document. """
return self.book.nsheets
@property
def global_names(self):
""" Lists all global names. """
return { name : self.book.name_map[name] \
for name, scope in self.book.name_and_scope_map \
if scope == -1 }
Get Our Python Developer Kit for Free
I put together a Python Developer Kit with over 100 pre-built Python scripts covering data structures, Pandas, NumPy, Seaborn, machine learning, file processing, web scraping and a whole lot more - and I want you to have it for free. Enter your email address below and I'll send a copy your way.
This class adds the following methods to its base class:
- the
__getitem__(name)
method allows access to all Excel sheets in a workbook by name. It can raise aKeyError
exception if there isn’t any sheet by that name; e.g.doc['Sheet1']
will return anExcelSheet
object for the sheet named Sheet1. - the
__iter__()
method iterates over the sheets in the workbook. - the
__len__()
method returns the number of sheets in the workbook. - the
global_names
property returns a dictionary mapping all global names in the Excel document to their values. We will discuss global and local names below.
We won’t give any example of this class right now, but we will put it in good use when we will address our case studies.
The ExcelFile Class
The base class ExcelFile
has the following noteworthy methods and attributes:
- To create a new object from this class, you must provide the path to the Excel file, e.g.
ExcelFile('doc.xls')
creates a new object for thedoc.xls
Excel file. - the
book read-only attribute represents an Excel workbook, collecting all sheets in the file.book.nsheets is the number of sheets. Itssheet_by_name(name)
method retrieves a sheet object by its name.name_map andname_and_scope_map are two dictionary holding named ranges and functions, which can be accessed by name or by a (name, scope) tuple, respectively. We will tell more about named ranges below.
Representing an Excel Sheet with the ExcelSheet Class
class ExcelSheet:
""" Class that represents an Excel sheet. """
def __init__(self, sheet, has_header=True):
def find_sheet_idx(sheet):
idx = -1
for sh in sheet.book.sheets():
idx += 1
if sh.name == sheet.name:
break
if idx == -1:
raise ValueError("Couldn't locate sheet %s." % sheet.name)
return idx
self._sheet = sheet
self._has_header = bool(has_header)
self._sheet_idx = find_sheet_idx(sheet)
def iter_rows(self):
""" Iterates over the rows of data, skipping the header. """
start = 1 if self._has_header else 0
for r in range(start, self._sheet.nrows):
yield [ xlrd.xldate.xldate_as_datetime(c.value, 0).date() \
if c.ctype == xlrd.XL_CELL_DATE else c.value \
for c in self._sheet.row(r) ]
def data_frame(self):
""" Build a `pandas` data frame from all named ranges. """
def column_from_range(named_range):
sh, rlo, rhi, clo, chi = named_range.area2d()
return [ sh.cell(r, clo).value \
for r in range(rlo, rhi) ]
ranges = self.named_ranges
return pandas.DataFrame(
{ name : column_from_range(ranges[name]) \
for name in ranges })
def iter_named_ranges(self):
""" Iterates over rows of data using named ranges. """
def iter_single_range(named_range):
sh, rlo, rhi, clo, chi = named_range.area2d()
for r in range(rlo, rhi):
yield sh.cell(r, clo).value
ranges = self.named_ranges
header = tuple(ranges.keys())
for row in itertools.zip_longest(
*[ iter_single_range(ranges[name]) for name in ranges ]):
yield { key : value for key, value in zip(header, row) }
@property
def name(self):
return self._sheet.name
@property
def nrows(self):
return self._sheet.nrows
@property
def ncols(self):
return self._sheet.ncols
def __getitem__(self, n):
""" Returns the `n`-th row. """
if not (0 <= n < self.nrows):
raise ValueError('0 >= row > %d, but %d given.'
% (self.nrows, n))
return self._sheet.row(n)
def __iter__(self):
""" Iterates over all cells left to right, top to bottom. """
for r in range(self.nrows):
for c in range(self.ncols):
yield self._sheet.row(r)[c].value
@property
def local_names(self):
""" List all local names. """
return { name : self._sheet.book.name_and_scope_map[(name, scope)] \
for name, scope in self._sheet.book.name_and_scope_map \
if scope == self._sheet_idx }
@property
def named_ranges(self):
""" List of all named ranges for this sheet. """
names = self.local_names
return { name: names[name] \
for name in names \
if names[name].result.kind in (xlrd.oREF, xlrd.oREL) }
This class adds the following method to the underlying class:
- the
__init__(sheet, has_header=True)
method builds a new object. Since the official documentation strongly advises against subclassing thexlrd.sheet.Sheet
class, we will use this workaround: we will store thesheet object, that all methods in theExcelSheet
class will access as needed. This method has an optionalhas_header parameter, which specifies if the first row of data should be regarded as a header, i.e. the list of all column names. Thefind_sheet_idx()
function maps a sheet name to its 0-based index into the sheets’ list. We need this index just to assign a named range or function to the appropriate scope, as we will see below. - the
iter_rows()
method iterates over the rows of the Excel sheet. It skips the first row if the excel sheet has a header. It is used mainly to feed data to theexecutemany()
method, as we will see below. - the
data_frame()
method builds apandas.DataFrame
object holding all columns of the local named ranges. See case study 3 for an example of its usage. - the
iter_named_ranges()
method iterates over all named ranges, building a dictionary for each row of data. Theiter_single_range()
function builds a column of cells from a named range. Theitertools.zip_longest()
function creates a row by extracting a cell from each column. If a column is too short, it fills the remaining cells with aNone
value. Finally, it uses a dictionary comprehension to map all column names to their value. In case study 2, theiter_named_ranges()
method will be called by theexecutemany()
method to build anINSERT
statement from a template by replacing named placeholders. - the
name
property returns the name of the sheet. - the
nrows
andncols
properties return, respectively, the number of rows and columns in the Excel sheet, including the header. - the
__getitem__(n)
method returns the n-th row from the Excel sheet. Both rows and columns are indexed starting from 0; e.g.sheet[r][c]
will return the value of the cell located in the r-th row and c-th column. - the
__iter__()
method iterates over the cell values from top to bottom, from left to right, including the header. - the
local_names
property returns a dictionary mapping all local names of an Excel sheet to their values. - the
local_ranges
property returns a dictionary mapping all local names referring to a range into their absolute range coordinates. We will deal with local names below.
Get Our Python Developer Kit for Free
I put together a Python Developer Kit with over 100 pre-built Python scripts covering data structures, Pandas, NumPy, Seaborn, machine learning, file processing, web scraping and a whole lot more - and I want you to have it for free. Enter your email address below and I'll send a copy your way.
The xlrd.sheet.Sheet Class
These are the methods and attributes of the xlrd.sheet.Sheet
that will be useful for us:
- the
nrows andncols read-only attributes are, respectively, the number of rows and the number of columns of the Excel sheet. - the
name attribute is the name of the Excel sheet. - The
row(n)
method returns the n-th row of the sheet, wheren must be between 0 andnrows - 1. - The
cell(r, c)
method retrieves the cell on the r-th row and c-th column.
Detecting Local Named Ranges
In Excel, you can assign a name to either a formula or a range. A name can be used only within its scope. A name can have one of the following scopes:
- global: (-1) the name can be used by all the sheets in the workbook into which it is defined.
- macro: (-2) the name can be used only inside a macro or VBA module. We won’t deal with these kinds of names here.
- local: (from 0 to
book.nsheets - 1) the name can be used only inside a sheet.
The ExcelDocument.global_names
property detects all global names, while the ExcelSheet.local_names
property lists all names belonging to a specific sheet. The ExcelSheet.named_ranges
property selects all local names that are named ranges. Let’s take a closer look at them, since they should make more sense now:
@property
def local_names(self):
return { name : self._sheet.book.name_and_scope_map[(name, scope)] \
for name, scope in self._sheet.book.name_and_scope_map \
if scope == self._sheet_idx }
@property
def named_ranges(self):
""" List of all named ranges for this sheet. """
names = self.local_names
return { name: names[name] \
for name in names \
if names[name].result.kind in (xlrd.oREF, xlrd.oREL) }
- the
local_names property builds adict
object from a dictionary comprehension, selecting all names whosescope matches the sheet index. - the
named_ranges property selects all names whose value is of typexlrd.oREF
orxlrd.oREL
, i.e. the value is a range.
The following table lists all possible values of the
Numeric Constant | Symbolic Constant | Python Represenation |
---|---|---|
-2 | xlrd.oREL | None , or a non-empty list of fully or partially relative Ref3D instances. |
-1 | xlrd.oREF | None , or a non-empty list of absolute Ref3D instances. |
0 | xlrd.oSTRG | An Unicode string. |
1 | xlrd.oUNK | The type of the result is unknown or ambiguous. Its value is None . |
2 | xlrd.oNUM | A float . |
3 | xlrd.oBOOL | An integer, with 0 meaning False , and 1 meaning True . |
4 | xlrd.oERR | None , or an integer error code. |
5 | xlrd.oMSNG | Placeholder for missing data. Its value is None . |
Many of these types are also Excel data types. Ranges have an area2d()
method, that returns a (sheet, rlo, rhi, clo, chi) tuple, where:
- sheet is the sheet to which the range belongs.
- rlo and rhi are, respectively, the lowest and highest row index.
- clo and chi are, respectively, the lowest and highest column index.
So if we want to iterate over the cell of the range from top to bottom, and from left to right, we can use the following loop:
sh, rlo, rhi, clo, cli = name.area2d()
for r in range(rlo, rhi):
for c in range(clo, chi):
yield sh.cell(r, c)
Both the ExcelSheet.iter_named_ranges()
and the ExcelSheet.data_frame()
methods use similar loops.
After setting up the basic tools of the trade, it is time to put them into practice. All our case studies will share a common pattern:
- The Shortcomings of this Approach section explains where our algorithm for extracting data from an Excel sheet falls short.
- The Data Layout section describes all columns’ names and data type for our sample data.
- The Header Detection section describes how columns’ names are detected by the
ExcelSheet
class. - The Implementation section gives an example of how to extract data from an Excel file and load them into an SQLite table.
Case Study 1: A Simple Excel Sheet
The easiest case for importing into a database is that of a spreadsheet where the first row lists all column names, and all other rows contain data.
Shortcomings of this Approach
This approach has several shortcomings, depending both on the way data is presented in the Excel file, and on our particular implementation:
- The algorithm assumes that the first row stores all column names.
- The algorithm doesn’t check if all rows have the same number of columns. It will pad all remaining cells with
None
values. - It doesn’t check if the number of columns in the Excel source matches the number of columns in the SQLite table. This may lead to an
sqlite3.Error
exception while trying to insert Excel data into an SQLite table. - It doesn’t check if the Excel data type matches the SQLite data type. The only type conversion performed implicitly is from
float
dates todatetime.datetime
dates, as we have seen above - The
executemany()
method uses positional placeholders instead of named placeholders, which are less error-prone, and which allows a more free arrangement of the columns in the Excel source. We will show an example of named placeholders in case study 2.
Data Layout
In the following table we will list the name and data type of both the Excel sheets and the SQLite table. The Excel sample can be found in the albums.xlsx
file, where each sheet stores an album. The SQLite database is located in the albums.db3
file, where all albums will be stored in the
Name | Excel Type | SQLite Type | Description |
---|---|---|---|
id | INTEGER PRIMARY KEY | The Song ID. | |
nr | XL_CELL_NUMBER | INTEGER NOT NULL | The track number. |
band | XL_CELL_TEXT | TEXT NOT NULL | The band's name. |
song | XL_CELL_TEXT | TEXT NOT NULL | The song's title. |
album | XL_CELL_TEXT | TEXT NOT NULL | The album's title. |
duration | XL_CELL_TEXT | TEXT NOT NULL | The song's duration, in minutes:seconds format. |
The id
column on the SQLite schema has no Excel counterpart, since the song’s ID is not stored in the Excel sheets, but it is automatically generated while loading data in the table.
Header Detection
The ExcelSheet
class assumes the first row of data holds the header. The header will be skipped when iterating over the data by using the ExcelSheet.iter_rows()
method.
Implementation
We will now implement an algorithm for extracting data from a basic Excel sheet and importing it into a sqlite database using Python pandas and xlrd (via our custom classes in the exceldoc.py file).
import sqlite3
from exceldoc import *
def case_study_1():
""" Test for case study 1. """
# Open the database connection and the Excel file
with sqlite3.connect('albums.db3') as db, \
ExcelDocument('albums.xlsx') as src:
insert_template = "INSERT INTO albums " \
"(nr, band, song, album, duration) " \
"VALUES (?, ?, ?, ?, ?);"
# Clear the database
db.execute('DELETE FROM albums;')
# Load data from each Excel sheet into the database
for sheet in src:
try:
db.executemany(insert_template, sheet.iter_rows())
except sqlite3.Error as e:
print(e)
db.rollback()
else:
db.commit()
# Check if all data have been loaded
select_stmt = 'SELECT DISTINCT band, album FROM albums;'
for row in db.execute(select_stmt).fetchall():
print(';'.join(row))
Get Our Python Developer Kit for Free
I put together a Python Developer Kit with over 100 pre-built Python scripts covering data structures, Pandas, NumPy, Seaborn, machine learning, file processing, web scraping and a whole lot more - and I want you to have it for free. Enter your email address below and I'll send a copy your way.
Let’s have a closer look at this example:
- First, we create an SQLite connection for the
albums.db3
database, and we open thealbums.xlsx
file, where all albums data are stored. - The
insert_template string stores theINSERT
statement template used to load data into the SQLite table. The Excel data will be matched to the corresponding SQLite column using the positional placeholder?
. - The
DELETE FROM albums;
statement deletes all data from thealbums table. - Then, we iterate over each sheet in the
albums.xlsx
workbook, assigning the current sheet tosheet . Theexecutemany()
method will replace each?
placeholder of theinsert_template statement with the corresponding field of a row from the iterator returned by thesheet.iter_rows()
method. If the insertion succeeds, then all changes to the database will be committed, otherwise they will be discarded. - Finally, we will check if all data have been loaded correctly by asking SQLite to list all bands and albums in the
albums table.
The output of the program should be:
Amaranthe;The Nexus Amberian Dawn;Innuendo Within Temptation;The Unforgiving Epica;The Quantum Enigma Sleeping Romance;Alba Unleash the Archers;Apex Metalite;Heroes in Time
Case Study 2: An Excel Sheet with Named Ranges
A slightly more elaborate case is that of an Excel sheet where each column is defined as a local named range. There may be a header in the first row, just to make it more human readable, but this algorithm will ignore it.
Shortcomings of this Approach
- This algorithm works only for XLS files. That’s where things starts to get really messy. As of the time of writing (
xlrd
version 1.1.0), there is a little difference between reading*.xls
and*.xlsx
files. The API provides a convenientName.area2d()
method to return an absolute reference to the cells within a named range, but it doesn’t work for*.xlsx
files. The API provides aName.formula_text attribute for XLSX files, which stores the named ranges as a qualified range literal; e.g.innuendo!$C$1:$C$12
represents all cells fromC1
toC12
of theinnuendo sheet. These references can be parsed using regular expressions, but this approach would clutter the algorithm, so we won’t address it here. - it doesn’t check for overlapping columns ranges.
- it doesn’t check if all ranges have the same number of rows. It will pad shorter columns with
None
values. - it doesn’t check if a range has more that one column, it just uses the leftmost column.
- it assumes that all column names from the Excel sheet will match a column name in the SQLite schema. This may lead to an
sqlite3.Error
exception while trying to insert Excel data into an SQLite table.
Data Layout
The data layout is the same as in our first case study. The sample SQLite database is located in the albums-named.db3
file. The sample Excel source is located in the albums-named.xlsx
file. There are only three sheets, and all sheets have no header in the first row. If you haven’t already downloaded it, all supporting files for this case study are in this ZIP file.
Header Detection
This algorithm gives no particular meaning to the first row of each Excel sheet, because it relies on named ranges to determine each column’s name. It will iterate over the ExcelSheet.named_ranges
objects, and it will build rows of data from them using the ExcelSheet.iter_named_ranges()
method.
Implementation
We will now implement the algorithm for extracting data from an Excel sheet which complies with all properties listed above. The extracted Excel data will be stored in a database using the sqlite3 module of Python.
import sqlite3
from exceldoc import *
def case_study_2():
""" Test for case study 2. """
# Open the database connection and the Excel file
with sqlite3.connect('albums-named.db3') as db, \
ExcelDocument('albums-named.xls') as src:
insert_template = "INSERT INTO albums " \
"(nr, band, song, album, duration) " \
"VALUES (:nr, :band, :song, :album, :duration);"
# Clear the database
db.execute('DELETE FROM albums;')
# Load data from each Excel sheet into the database
for sheet in src:
try:
db.executemany(insert_template, sheet.iter_named_ranges())
except sqlite3.Error as e:
print(e)
db.rollback()
else:
db.commit()
# Check if all data have been loaded
select_stmt = 'SELECT DISTINCT band, album FROM albums;'
for row in db.execute(select_stmt).fetchall():
print(';'.join(row))
This example closely resembles that of the first case study, but it is worth noting that:
- Once again, we create a database connection for the
albums-named.db3
file, and we open thealbums-named.xls
Excel file. - We delete all data in the
albums table. - The template for the
INSERT
statement is slightly different, since it employs named placeholders instead of positional placeholders. Named placeholders requires a dictionary instead of a tuple to match the row fields to the table columns. - We iterate over the sheets in the workbook, and we use
executemany(stmt, iterable)
, but this time theiterable parameter must be an iterator of dictionaries, provided by theiter_named_ranges()
method of the sheet (see above). - Finally, we will check if all data have been loaded correctly by asking SQLite to list all bands and albums in the
albums table of our newly created database.
Let’s see an example of how the replacement takes place in the executemany()
method. The first row of the innuendo
sheet is:
Nr | Band | Song | Album | Duration |
---|---|---|---|---|
1 | Amberian Dawn | Fame & Gloria | Innuendo | 4:24 |
This is the SQL statement before and after the substitution:
/* Statement with named placeholders */
INSERT INTO albums (nr, band, song, album, duration)
VALUES (:nr, :band, :song, :album, :duration);
/* Statement after the replacement */
INSERT INTO albums (nr, band, song, album, duration)
VALUES (1, 'Amberian Dawn', 'Fame & Gloria', 'Innuendo', '4:24');
The output of the case_study_2()
function should be:
Amaranthe;The Nexus Amberian Dawn;Innuendo Within Temptation;The Unforgiving
Case Study 3: An Excel Pivot Table
The xlrd
library has no way of detecting pivot tables, so in this section we will actually try to simulate an Excel pivot table by using the pandas.pivot_table()
function. Once you have the table, you can add it to your database with python sqlite3.
Shortcomings of this Approach
- This algorithm doesn’t actually read the pivot table, it just uses its data source to emulate the pivot table in
pandas
. - Since header detection works the same way as for case study 2, it can only be used with XLS files.
- Since it uses named ranges to detect the headers of the source sheet, all shortcomings of case study 2 will also apply to this approach.
Data Layout
The pivot.xls
file (download file) has two sheets:
Name | Excel Type | SQLite Type | Description |
---|---|---|---|
Date | XL_CELL_DATE | TEXT NOT NULL | Date in YYYY-MM-DD format. |
Seller | XL_CELL_TEXT | TEXT NOT NULL | Name of the seller. |
Income | XL_CELL_NUMBER | INTEGER NOT NULL | Daily income. |
Month | XL_CELL_NUMBER | INTEGER NOT NULL | Month when the sale took place. |
The pivot table uses the Seller
column as row field, the Month
column as column field, and the Income
column as data field. It uses SUM()
as an aggregate function for the Income
data.
Header Detection
The header is detected using named ranges. The header names are used as keys for each column of data in the pandas.DataFrame
object.
Implementation
We will now implement the algorithm for extracting data from an Excel sheet which complies with all properties listed above.
import pandas
from exceldoc import *
def case_study_3():
""" Test for case study 3. """
with ExcelDocument('pivot.xls') as src:
table = pandas.pivot_table(
src['pivot-data'].data_frame(),
columns = 'month',
values = 'income',
index = 'seller',
aggfunc = {'income' : sum},
fill_value = 0 )
print(table)
The output of the case_study_3()
function is:
month 1.0 2.0 3.0 4.0 seller Annie 0 2196 4736 0 Ian 0 3352 7806 0 John 1758 3086 5850 545 Mark 746 5038 6221 2299 Matthew 703 3936 2130 2896 Nina 1164 2371 5359 855
which resembles the contents of the pivot.xls
file, as you can see from this (LibreOffice) screenshot:
Closing Thoughts
In this tutorial we learned how to extract data from Excel sheets with different data layouts and import the data into a sqlite database using Python.
First, we described the Excel native data types, with particular emphasis on the date type. Then we became acquainted with the pandas
and xlrd
third-party libraries, admittedly learning as little as needed to serve our purposes.
We also presented two powerful classes, ExcelDocument
and ExcelSheet
, in order to simplify some implementation details (notably on named range’s detection), and to make our algorithms for data extraction easier to read.
Finally, we provided two algorithms for reading data from a simple Excel sheet and from a sheet with named ranges. Though the xlrd
library doesn’t provide support for pivot tables, we even showed how to emulate them by using the pandas.pivot_table()
function.
Did you find this free tutorial helpful? Share this article with your friends, classmates, and coworkers on Facebook and Twitter! When you spread the word on social media, you’re helping us grow so we can continue to provide free tutorials like this one for years to come.
Get Our Python Developer Kit for Free
I put together a Python Developer Kit with over 100 pre-built Python scripts covering data structures, Pandas, NumPy, Seaborn, machine learning, file processing, web scraping and a whole lot more - and I want you to have it for free. Enter your email address below and I'll send a copy your way.
References
- pandas’ pivot tables: look at the official documentation for more examples of pivot tables.
- xlrd: You shouldn’t need to interact directly with the
xlrd
API, but just in case you want to dig deeper, refer to the official documentation. - the SQL INSERT statement: see our tutorial for many examples of replacing placeholders on a
INSERT
statement template.