- Introduction
- Creating a SQLAlchemy Engine
- Importing SQL with Pandas read_sql_query
- Additional read_sql_query Options
Introduction to Importing SQL Data with Pandas
In this tutorial, we’ll explain how to import SQL data and SQL queries into a Pandas DataFrame. We’ll primarily use the SQLAlchemy module to import our SQL data. SQLAlchemy can be used in conjunction with Pandas to import SQL data from a variety of database types. Once we have created a connection to a SQL database, we can use standard SQL queries to access the database and store the results in Pandas DataFrames.
As an example for importing data, we will use a popular database of Iris species and their identifying characteristics. The database iris.sqlite is an open source database provided by the UCI Machine Learning Repository, and available as a SQLite database via Kaggle. We won’t worry about the details of the Iris data in this tutorial, as we’re only concerned with importing data for now. You’ll want to connect to your own database for your specific application.
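If you don’t have the Kaggle file handy, you can build a small stand-in iris.sqlite with Python’s built-in sqlite3 module. The rows below are a few illustrative measurements, not the full 150-row dataset, so the outputs later in this tutorial will differ if you use it:

```python
import sqlite3

# Create (or open) iris.sqlite in the working directory and add an Iris table
conn = sqlite3.connect("iris.sqlite")
conn.execute("""
    CREATE TABLE IF NOT EXISTS Iris (
        Id INTEGER PRIMARY KEY,
        SepalLengthCm REAL,
        SepalWidthCm REAL,
        PetalLengthCm REAL,
        PetalWidthCm REAL,
        Species TEXT
    )
""")

# A handful of sample rows for demonstration only
rows = [
    (1, 5.1, 3.5, 1.4, 0.2, "Iris-setosa"),
    (2, 4.9, 3.0, 1.4, 0.2, "Iris-setosa"),
    (3, 4.7, 3.2, 1.3, 0.2, "Iris-setosa"),
]
conn.executemany("INSERT OR REPLACE INTO Iris VALUES (?, ?, ?, ?, ?, ?)", rows)
conn.commit()
conn.close()
```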
Importing data into Pandas from a SQL database using SQLAlchemy involves two steps:
- Establish a connection to the database using the SQLAlchemy create_engine function
- Import data into a DataFrame using the engine
We’ll go over each part in detail to make importing from a SQL database fast and simple!
Creating a SQLAlchemy Engine
The free SQLAlchemy Python module gives users a standard interface for connecting to a variety of different databases. Once the module is installed, we can use the same function, create_engine, to connect to a host of database types including:
- Microsoft SQL Server
- MySQL
- Oracle
- PostgreSQL
- SQLite
- Sybase
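Each database type uses its own connection string, with the general form dialect+driver://username:password@host:port/database. The examples below are illustrative placeholders only (hypothetical hosts, users, and passwords), and the non-SQLite entries also require the corresponding driver package to be installed:

```python
# Illustrative connection strings -- hostnames, ports, and credentials are placeholders
urls = {
    # A SQLite file in the working directory needs no host or credentials
    "sqlite": "sqlite:///iris.sqlite",
    "postgresql": "postgresql://user:password@localhost:5432/mydb",
    "mysql": "mysql+pymysql://user:password@localhost:3306/mydb",
    "mssql": "mssql+pyodbc://user:password@myserver/mydb?driver=ODBC+Driver+17+for+SQL+Server",
}

for name, url in urls.items():
    print(name, "->", url)
```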
The primary function we will use for creating an engine is the create_engine function. We can import this function directly into our Python session using:
from sqlalchemy import create_engine
create_engine has the format
[engine] = sqlalchemy.create_engine([database path])
where [database path] is a connection string specifying the database type and location (and, for server-based databases, the login credentials).
Our example data is in a SQLite database located in our working directory, so we will call create_engine as follows:
engine = create_engine("sqlite:///iris.sqlite") # SQLite database named iris.sqlite in our working directory
Now that our engine has been created, we can use it within Pandas to import our data.
Importing SQL with Pandas read_sql_query
Once we’ve established a method of connecting to the database, we can directly query the database with a standard SQL query submitted through the Pandas function read_sql_query.
This Pandas function has the format
[dataframe] = pandas.read_sql_query([SQL Query], [engine], [other])
where
- [dataframe] is the new DataFrame containing the imported data
- [SQL Query] is a string containing the SQL query
- [engine] is the engine we created in the previous section
- [other] is a collection of optional import options
The input SQL query is a string written in standard SQL syntax. If you know SQL, then importing data is as easy as inserting your SQL query into the read_sql_query function. For example, we can select all data from the Iris table of the database with the following commands:
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine("sqlite:///iris.sqlite") # Creating the engine
query = "SELECT * FROM Iris" # String containing the SQL query to select all rows
df = pd.read_sql_query(query, engine) # Finally, importing the data into DataFrame df
df.info()
> <class 'pandas.core.frame.DataFrame'>
> RangeIndex: 150 entries, 0 to 149
> Data columns (total 6 columns):
> Id 150 non-null int64
> SepalLengthCm 150 non-null float64
> SepalWidthCm 150 non-null float64
> PetalLengthCm 150 non-null float64
> PetalWidthCm 150 non-null float64
> Species 150 non-null object
> dtypes: float64(4), int64(1), object(1)
> memory usage: 7.1+ KB
df.head()
> Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
> 0 1 5.1 3.5 1.4 0.2 Iris-setosa
> 1 2 4.9 3.0 1.4 0.2 Iris-setosa
> 2 3 4.7 3.2 1.3 0.2 Iris-setosa
> 3 4 4.6 3.1 1.5 0.2 Iris-setosa
> 4 5 5.0 3.6 1.4 0.2 Iris-setosa
This example reads our entire SQL table into a Pandas DataFrame. The head function returns the first five rows of the DataFrame so we can quickly inspect the imported data.
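Because the query is ordinary SQL, you can also push filtering into the database instead of loading every row. The sketch below builds a tiny in-memory SQLite table with the standard sqlite3 module as a stand-in for iris.sqlite (the rows are illustrative); for SQLite, pandas also accepts a plain DBAPI connection in place of a SQLAlchemy engine:

```python
import sqlite3
import pandas as pd

# Tiny in-memory stand-in for iris.sqlite (rows are illustrative)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Iris (Id INTEGER, SepalLengthCm REAL, Species TEXT)")
conn.executemany(
    "INSERT INTO Iris VALUES (?, ?, ?)",
    [(1, 5.1, "Iris-setosa"), (2, 7.0, "Iris-versicolor"), (3, 6.3, "Iris-virginica")],
)

# Only rows matching the WHERE clause are transferred into the DataFrame
query = "SELECT Id, Species FROM Iris WHERE SepalLengthCm > 6.0"
df = pd.read_sql_query(query, conn)
print(df)
```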
Additional read_sql_query Options
The pandas.read_sql_query function has several other options available for importing data. The majority of these options involve formatting the incoming data, much like the formatting in our importing external files to DataFrames tutorial. This tutorial will examine two of the most useful options, index_col and chunksize.
index_col Option
As in our importing external files to DataFrames tutorial, we can specify a column for Pandas to use as the index with the index_col option. This lets us “renumber” our rows however we want.
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine("sqlite:///iris.sqlite") # Creating the engine
query = "SELECT * FROM Iris" # String containing the SQL query
df = pd.read_sql_query(query, engine, index_col="Id") # Using the first column as indices
df.head()
> SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
> Id
> 1 5.1 3.5 1.4 0.2 Iris-setosa
> 2 4.9 3.0 1.4 0.2 Iris-setosa
> 3 4.7 3.2 1.3 0.2 Iris-setosa
> 4 4.6 3.1 1.5 0.2 Iris-setosa
> 5 5.0 3.6 1.4 0.2 Iris-setosa
chunksize Option to Reduce Memory Use
If the database we are connecting to is large, we may either run out of memory storing the DataFrame or take an unacceptable amount of time to access it. Instead of importing the entire result of the SQL query at once, we can use the chunksize option to return an iterator that yields the data a given number of rows at a time. For example, if we specify chunksize=5, then each yield of the iterator will contain the next 5 rows of the query result. These iterators allow us to analyze a large database in chunks without exceeding our computer’s memory limits.
For example, we can use an iterator on the Iris SQL query like this:
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine("sqlite:///iris.sqlite") # Creating the engine
query = "SELECT * FROM Iris" # String containing the SQL query
df = pd.read_sql_query(query, engine, chunksize=5) # Using a chunk size of 5
type(df) # The type is now a DataFrame generator
> generator
for chunk in df:
    print(chunk.head()) # Print the head of each generated chunk
    break # Break after one printout, for demonstration
> Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
> 0 1 5.1 3.5 1.4 0.2 Iris-setosa
> 1 2 4.9 3.0 1.4 0.2 Iris-setosa
> 2 3 4.7 3.2 1.3 0.2 Iris-setosa
> 3 4 4.6 3.1 1.5 0.2 Iris-setosa
> 4 5 5.0 3.6 1.4 0.2 Iris-setosa
Remember: specifying the chunksize option returns a DataFrame generator, not a DataFrame itself. Attempting to access the generator object as though it were a DataFrame will raise an exception:
df.head()
> AttributeError: 'generator' object has no attribute 'head'
To access the DataFrames produced by the generator without raising an error, you must iterate over it, as in the loop above.
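A common pattern is to aggregate each chunk as it arrives, so the full result set never sits in memory at once. This sketch builds a small in-memory SQLite table with the standard sqlite3 module (the rows are illustrative) and counts rows per species chunk by chunk:

```python
import sqlite3
import pandas as pd

# Small in-memory stand-in table (rows are illustrative)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Iris (Id INTEGER, Species TEXT)")
conn.executemany(
    "INSERT INTO Iris VALUES (?, ?)",
    [(i, "Iris-setosa" if i <= 4 else "Iris-virginica") for i in range(1, 11)],
)

counts = {}
# chunksize=3 yields DataFrames of at most 3 rows each
for chunk in pd.read_sql_query("SELECT * FROM Iris", conn, chunksize=3):
    for species, n in chunk["Species"].value_counts().items():
        counts[species] = counts.get(species, 0) + int(n)

print(counts)  # Per-species row counts, accumulated across all chunks
```

Because only one chunk is materialized at a time, the same loop works even when the table is far too large to fit in memory.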