Importing a SQL Data Query into a Pandas DataFrame

Introduction
Creating a SQLAlchemy Engine
Importing SQL with Pandas read_sql_query
Additional read_sql_query Options
- index_col Option to rename our rows
- chunksize Option to reduce memory use

Introduction to Importing SQL Data with Pandas

In this tutorial, we’ll explain how to import SQL data and SQL queries into a Pandas DataFrame. We’ll primarily use the SQLAlchemy module to import our SQL data. SQLAlchemy can be used in conjunction with Pandas for importing SQL data from a variety of different SQL database types. Once we have a created a connection to a SQL database, we can use standard SQL queries to access databases and store data into Pandas DataFrames.

As an example for importing data, we will use a popular database of Iris species and their identifying characteristics. The database iris.sqlite is an open source database provided by the UCI Machine Learning Depository, and available as a SQLite database via Kaggle. We won’t worry about the details of the Iris data in this tutorial, as we’re only concerned about importing data for now. You’ll want to connect to your own database for your specific application.

Importing data into Pandas from a SQL database using SQLAlchemy involves two steps:

Establish a connection to the database using the SQLAlchemy create_engine function
Import data into a DataFrame using the engine

We’ll go over each part in detail to make importing from a SQL database fast and simple!

Creating a SQLAlchemy Engine

The free SQLAlchemy Python module gives users a standard interface for connecting to a variety of different databases. Once the module is installed we can use the same function, or engine, to connect a host of database types including:

Microsoft SQL Server
MySQL
Oracle
PostgreSQL
SQLite
Sybase

The primary function we will use for creating an engine will be the create_engine function. We can import this function directly into our Python session using:

from sqlalchemy import create_engine

create_engine has the format

[engine] = sqlalchemy.create_engine([database path])

where [engine] is the engine object that we will pass to Pandas later, and [database path] is the database type-specific path string to the database we want to connect to. The [database path] string is highly specific to your database type and location. I encourage you to look up your specific database string URL syntax in the SQLAlchemy Manual

Our example data is in a SQLite database located in our working directory, so we will call our create_engine function as follows:

engine = create_engine("sqlite:///iris.sqlite")  # SQLite database named iris.sqlite in our working directory

Now that our engine has been created, we can use it within Pandas to import our data.

Importing SQL with Pandas read_sql_query

Once we’ve established a method of connecting to the database, we can directly query the database with a standard SQL query submitted through the Pandas function read_sql_query.

This Pandas function has the format

[dataframe] = pandas.read_sql_query([SQL Query], [engine], [other])

where

[dataframe] is the new DataFrame containing the imported data
[SQL Query] is a string containing the SQL query
[engine] is the engine we created in the previous section
[other] is a collection of optional import options

The input SQL query is a string written in standard SQL syntax. If you know SQL, then importing data is as easy as inserting your SQL query into the read_sql_query function. For example, we can select all data from the Iris table of the database with the following commands:

import pandas as pd 
from sqlalchemy import create_engine
engine = create_engine("sqlite:///iris.sqlite")  # Creating the engine

query = "SELECT * FROM Iris"  # String containing the SQL query to select all rows
df = pd.read_sql_query(query, engine)  # Finally, importing the data into DataFrame df
df.info()
> <class 'pandas.core.frame.DataFrame'>
> RangeIndex: 150 entries, 0 to 149
> Data columns (total 6 columns):
> Id               150 non-null int64
> SepalLengthCm    150 non-null float64
> SepalWidthCm     150 non-null float64
> PetalLengthCm    150 non-null float64
> PetalWidthCm     150 non-null float64
> Species          150 non-null object
> dtypes: float64(4), int64(1), object(1)
> memory usage: 7.1+ KB

df.head()
>    Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
> 0   1            5.1           3.5            1.4           0.2  Iris-setosa
> 1   2            4.9           3.0            1.4           0.2  Iris-setosa
> 2   3            4.7           3.2            1.3           0.2  Iris-setosa
> 3   4            4.6           3.1            1.5           0.2  Iris-setosa
> 4   5            5.0           3.6            1.4           0.2  Iris-setosa

This example reads our entire SQL database into a Pandas DataFrame. The head function simply prints the first few rows of our DataFrame to our screen for convenience.

Get Our Python Developer Kit for Free

I put together a Python Developer Kit with over 100 pre-built Python scripts covering data structures, Pandas, NumPy, Seaborn, machine learning, file processing, web scraping and a whole lot more - and I want you to have it for free. Enter your email address below and I'll send a copy your way.

Additional read_sql_query Options

The pandas.read_sql_query has several other options available for importing data. The majority of these options involve formatting the incoming data, much like the formatting in our importing external files to DataFrames tutorial. This tutorial will examine two of the most useful options, index_col and chunksize.

index_col Option

Like our importing external files to DataFrames tutorial, we can specify a column for Pandas to use as indices using the index_col option. This let’s us “renumber” our rows however we want.

import pandas as pd 
from sqlalchemy import create_engine
engine = create_engine("sqlite:///iris.sqlite")  # Creating the engine
query = "SELECT * FROM Iris"  # String containing the SQL query

df = pd.read_sql_query(query, engine, index_col="Id")  # Using the first column as indices
df.head()
>     SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
> Id                                                                       
> 1             5.1           3.5            1.4           0.2  Iris-setosa
> 2             4.9           3.0            1.4           0.2  Iris-setosa
> 3             4.7           3.2            1.3           0.2  Iris-setosa
> 4             4.6           3.1            1.5           0.2  Iris-setosa
> 5             5.0           3.6            1.4           0.2  Iris-setosa

chunksize Option to reduce memory use

If the database we are connecting to is large, we may either run out of memory storing the DataFrame or take an unacceptable amount of time to access it. Instead of importing the entire data set from the SQL query, we can use the chunksize option to return an iterator over the data of a given number of rows. For example, if we specify chunksize=5, then each yield of the iterator object will contain the next 5 rows of the query. These iterators allow us to perform analyses on a large database in chunks without exceeding our computer’s memory limits.

For example, we can use an iterator on the Iris SQL query like this:

import pandas as pd 
from sqlalchemy import create_engine
engine = create_engine("sqlite:///iris.sqlite")  # Creating the engine
query = "SELECT * FROM Iris"  # String containing the SQL query

df = pd.read_sql_query(query, engine, chunksize=5)  # Using a chunk size of 5
type(df)  # The type is now a dataframe generator
> generator
for chunk in df:
	print(chunk.head())  # Print the head of each generated chunk
	break  # Break after one printout, for demonstration
>    Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
> 0   6            5.4           3.9            1.7           0.4  Iris-setosa
> 1   7            4.6           3.4            1.4           0.3  Iris-setosa
> 2   8            5.0           3.4            1.5           0.2  Iris-setosa
> 3   9            4.4           2.9            1.4           0.2  Iris-setosa
> 4  10            4.9           3.1            1.5           0.1  Iris-setosa

Remember: Specifying the chunksize option will return a DataFrame generator, not a DataFrame itself. Attempting to access the generator object as though it were a DataFrame will raise an exception:

df.head()
> AttributeError: 'generator' object has no attribute 'head'

To access DataFrames from the DataFrame generator without creating an error, an iterator loop like in the example above must be used.

Did you find this free tutorial helpful? Share this article with your friends, classmates, and coworkers on Facebook and Twitter! When you spread the word on social media, you’re helping us grow so we can continue to provide free tutorials like this one for years to come.