- Creating a SQLAlchemy Engine
- Importing SQL with Pandas read_sql_query
- Additional read_sql_query Options
Introduction to Importing SQL Data with Pandas
In this tutorial, we’ll explain how to import SQL data and SQL queries into a Pandas DataFrame. We’ll primarily use the SQLAlchemy module to import our SQL data. SQLAlchemy can be used in conjunction with Pandas for importing SQL data from a variety of different SQL database types. Once we have a created a connection to a SQL database, we can use standard SQL queries to access databases and store data into Pandas DataFrames.
As an example for importing data, we will use a popular database of Iris species and their identifying characteristics. The database
iris.sqlite is an open source database provided by the UCI Machine Learning Depository, and available as a SQLite database via Kaggle. We won’t worry about the details of the Iris data in this tutorial, as we’re only concerned about importing data for now. You’ll want to connect to your own database for your specific application.
Importing data into Pandas from a SQL database using SQLAlchemy involves two steps:
- Establish a connection to the database using the SQLAlchemy
- Import data into a DataFrame using the engine
We’ll go over each part in detail to make importing from a SQL database fast and simple!
Creating a SQLAlchemy Engine
The free SQLAlchemy Python module gives users a standard interface for connecting to a variety of different databases. Once the module is installed we can use the same function, or engine, to connect a host of database types including:
- Microsoft SQL Server
The primary function we will use for creating an engine will be the
create_engine function. We can import this function directly into our Python session using:
from sqlalchemy import create_engine
create_engine has the format
[engine] = sqlalchemy.create_engine([database path])
Our example data is in a SQLite database located in our working directory, so we will call our
create_engine function as follows:
engine = create_engine("sqlite:///iris.sqlite") # SQLite database named iris.sqlite in our working directory
Now that our engine has been created, we can use it within Pandas to import our data.
Importing SQL with Pandas read_sql_query
Once we’ve established a method of connecting to the database, we can directly query the database with a standard SQL query submitted through the Pandas function
This Pandas function has the format
[dataframe] = pandas.read_sql_query([SQL Query], [engine], [other])
[dataframe]is the new DataFrame containing the imported data
[SQL Query]is a string containing the SQL query
[engine]is the engine we created in the previous section
[other]is a collection of optional import options
The input SQL query is a string written in standard SQL syntax. If you know SQL, then importing data is as easy as inserting your SQL query into the
read_sql_query function. For example, we can select all data from the Iris table of the database with the following commands:
import pandas as pd from sqlalchemy import create_engine engine = create_engine("sqlite:///iris.sqlite") # Creating the engine query = "SELECT * FROM Iris" # String containing the SQL query to select all rows df = pd.read_sql_query(query, engine) # Finally, importing the data into DataFrame df df.info() > <class 'pandas.core.frame.DataFrame'> > RangeIndex: 150 entries, 0 to 149 > Data columns (total 6 columns): > Id 150 non-null int64 > SepalLengthCm 150 non-null float64 > SepalWidthCm 150 non-null float64 > PetalLengthCm 150 non-null float64 > PetalWidthCm 150 non-null float64 > Species 150 non-null object > dtypes: float64(4), int64(1), object(1) > memory usage: 7.1+ KB df.head() > Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species > 0 1 5.1 3.5 1.4 0.2 Iris-setosa > 1 2 4.9 3.0 1.4 0.2 Iris-setosa > 2 3 4.7 3.2 1.3 0.2 Iris-setosa > 3 4 4.6 3.1 1.5 0.2 Iris-setosa > 4 5 5.0 3.6 1.4 0.2 Iris-setosa
This example reads our entire SQL database into a Pandas DataFrame. The
head function simply prints the first few rows of our DataFrame to our screen for convenience.
Additional read_sql_query Options
pandas.read_sql_query has several other options available for importing data. The majority of these options involve formatting the incoming data, much like the formatting in our importing external files to DataFrames tutorial. This tutorial will examine two of the most useful options,
Like our importing external files to DataFrames tutorial, we can specify a column for Pandas to use as indices using the
index_col option. This let’s us “renumber” our rows however we want.
import pandas as pd from sqlalchemy import create_engine engine = create_engine("sqlite:///iris.sqlite") # Creating the engine query = "SELECT * FROM Iris" # String containing the SQL query df = pd.read_sql_query(query, engine, index_col="Id") # Using the first column as indices df.head() > SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species > Id > 1 5.1 3.5 1.4 0.2 Iris-setosa > 2 4.9 3.0 1.4 0.2 Iris-setosa > 3 4.7 3.2 1.3 0.2 Iris-setosa > 4 4.6 3.1 1.5 0.2 Iris-setosa > 5 5.0 3.6 1.4 0.2 Iris-setosa
chunksize Option to reduce memory use
If the database we are connecting to is large, we may either run out of memory storing the DataFrame or take an unacceptable amount of time to access it. Instead of importing the entire data set from the SQL query, we can use the
chunksize option to return an iterator over the data of a given number of rows. For example, if we specify
chunksize=5, then each yield of the iterator object will contain the next 5 rows of the query. These iterators allow us to perform analyses on a large database in chunks without exceeding our computer’s memory limits.
For example, we can use an iterator on the Iris SQL query like this:
import pandas as pd from sqlalchemy import create_engine engine = create_engine("sqlite:///iris.sqlite") # Creating the engine query = "SELECT * FROM Iris" # String containing the SQL query df = pd.read_sql_query(query, engine, chunksize=5) # Using a chunk size of 5 type(df) # The type is now a dataframe generator > generator for chunk in df: print(chunk.head()) # Print the head of each generated chunk break # Break after one printout, for demonstration > Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species > 0 6 5.4 3.9 1.7 0.4 Iris-setosa > 1 7 4.6 3.4 1.4 0.3 Iris-setosa > 2 8 5.0 3.4 1.5 0.2 Iris-setosa > 3 9 4.4 2.9 1.4 0.2 Iris-setosa > 4 10 4.9 3.1 1.5 0.1 Iris-setosa
Remember: Specifying the
chunksize option will return a DataFrame generator, not a DataFrame itself. Attempting to access the generator object as though it were a DataFrame will raise an exception:
df.head() > AttributeError: 'generator' object has no attribute 'head'
To access DataFrames from the DataFrame generator without creating an error, an iterator loop like in the example above must be used.
Did you find this free tutorial helpful? Share this article with your friends, classmates, and coworkers on Facebook, Twitter, and Google+! When you spread the word on social media, you’re helping us grow so we can continue to provide free tutorials like this one for years to come.This article was written by Cody Gilbert, contributing writer for The Python Tutorials Blog.
About The Python Tutorials Blog
The Python Tutorials Blog was created by Ryan Wells, a Nuclear Engineer and professional VBA Developer. Ryan developed a unique 3-part free Excel training program to help others quickly learn VBA in a natural setting: right inside Excel. After his successful VBA Tutorials, which have helped hundreds of thousands learn to write better macros, he built The Python Tutorials Blog to teach people Python in a similar systematic way.Follow