Introduction to Pandas DataFrames
The pandas Python library is one of the most widely-used packages by software engineers and especially data scientists. This is because of its extremely useful pandas DataFrame data structure, which contains rows and columns similar to a spreadsheet.
The DataFrame
object is similar to a spreadsheet in the sense that it has rows and columns. However, pandas DataFrames have much more powerful capabilities than a spreadsheet application like Microsoft Excel. That’s saying a lot because you know how much we like VBA and Microsoft Excel.
In this tutorial, we’ll teach you how to perform SQL-like operations in pandas using the pandas join
method.
This is a guest post by Nick McCullum, who teaches Python and SQL programming on his own website. I hope you enjoy it!
- Collecting our Data
- Left Joins in Pandas DataFrames
- Right Joins in Pandas DataFrames
- Outer Joins in Pandas DataFrames
- Inner Joins in Pandas DataFrames
- Final Thoughts
Collecting our Data
In order to perform SQL-like operations in pandas using the join
method, we’ll first need to create some pandas DataFrames! The first step is to import the pandas library. You can import pandas with the following command:
import pandas as pd
Now that you’ve imported pandas, you can run the following code to create the DataFrame’s we’ll be using in this tutorial.
df1 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
'B': ['B0', 'B1', 'B2']})
We’ll also need to set the index
of both DataFrames. We can do this with the following Python code:
df1.set_index('key', inplace=True)
df2.set_index('key', inplace=True)
With that out of the way, let’s start learning how to use the pandas join
method! There are four types of joins available in pandas:
left
right
outer
inner
We will discuss each of these joins step-by-step through the rest of this tutorial.
Left Joins in Pandas DataFrames
In SQL, a LEFT JOIN returns all rows from the left table, even if there are no matches in the right table.
The operation behaves similarly in pandas. We can execute a left join using pandas DataFrames with the following code:
df1.join(df2, on='key', how = 'left')
In this case,
Here is what the output of this code looks like:
A B key K0 A0 B0 K1 A1 B1 K2 A2 B2 K3 A3 NaN K4 A4 NaN K5 A5 NaN
In this join operation, the final DataFrame retained all of the elements of
Let’s move on to learning how to perform right joins using the pandas programming library.
Get Our Python Developer Kit for Free
I put together a Python Developer Kit with over 100 pre-built Python scripts covering data structures, Pandas, NumPy, Seaborn, machine learning, file processing, web scraping and a whole lot more - and I want you to have it for free. Enter your email address below and I'll send a copy your way.
Right Joins in Pandas DataFrames
In SQL, a RIGHT JOIN returns all rows from the right table, even if there are no matches in the left table.
As before, the operation behaves similarly in pandas. We can execute a right join using pandas DataFrames with the following code:
df1.join(df2, on='key', how = 'right')
Just like in our first example,
Here is the output of this code:
A B key K0 A0 B0 K1 A1 B1 K2 A2 B2
The next type of SQL-like operation we’ll learn how to perform is an outer join.
Outer Joins in Pandas DataFrames
In a SQL OUTER JOIN statement, rows from both tables are matched together. Unmatched rows in one or both tables are included by filling in a blank value in the missing cell.
If this sounds familiar, it’s because the LEFT JOIN and RIGHT JOIN statements are both technically examples of one-way OUTER JOINs. You can think of an OUTER JOIN as a directionless version of the LEFT JOIN and RIGHT JOIN statements.
We can write an OUTER JOIN statement on our pandas DataFrames by changing the how
argument to
df1.join(df2, on='key', how = 'outer')
Here is the output of this code:
A B key K0 A0 B0 K1 A1 B1 K2 A2 B2 K3 A3 NaN K4 A4 NaN K5 A5 NaN
Because of its high tolerance for unmatched data, the OUTER JOIN statement is a very flexible SQL query.
Its close cousin, the INNER JOIN statement, is not so versatile.
Inner Joins in Pandas DataFrames
In SQL, an INNER JOIN statement matches rows from columns in different tables, but only where a matching value can be found in both tables. Said differently, INNER JOIN statements do not include rows where a cell is missing from either table.
INNER JOIN statements applied to pandas DataFrames behave similarly. If data is missing from either DataFrame, then that row will not be included in the DataFrame that is returned by the pandas join
method.
To help understand this better, let’s consider an example. We can write an INNER JOIN statement in pandas
by changing the how
argument to
df1.join(df2, on='key', how = 'inner')
Here is the output of this code
A B key K0 A0 B0 K1 A1 B1 K2 A2 B2
As you can see, the returned pandas DataFrame only contained rows that held valid data in both
Final Thoughts
In this tutorial, you learned how to write SQL-like join
statements on DataFrames using the built-in pandas join
method. Specifically, we discussed four types of SQL joins and how to simulate them with the Python pandas library:
- LEFT JOIN
- RIGHT JOIN
- OUTER JOIN
- INNER JOIN
In my experience, writing quick join
method calls in pandas can be much quicker than connecting to a more robust SQL database and running legitimate SQL queries. Because of this, keeping this tutorial in mind when working with tabular data in the future can help speed up your development process nicely.
Did you find this free tutorial helpful? You can find more great Python tips and tutorials by subscribing to our systematic Python training program below.
Get Our Python Developer Kit for Free
I put together a Python Developer Kit with over 100 pre-built Python scripts covering data structures, Pandas, NumPy, Seaborn, machine learning, file processing, web scraping and a whole lot more - and I want you to have it for free. Enter your email address below and I'll send a copy your way.