Introduction to Pandas DataFrames

The pandas Python library is one of the most widely-used packages by software engineers and especially data scientists. This is because of its extremely useful pandas DataFrame data structure, which contains rows and columns similar to a spreadsheet.

The DataFrame object is similar to a spreadsheet in the sense that it has rows and columns. However, pandas DataFrames have much more powerful capabilities than a spreadsheet application like Microsoft Excel. That’s saying a lot because you know how much we like VBA and Microsoft Excel.

In this tutorial, we’ll teach you how to perform SQL-like operations in pandas using the pandas join method.

This is a guest post by Nick McCullum, who teaches Python and SQL programming on his own website. I hope you enjoy it!

Collecting our Data

In order to perform SQL-like operations in pandas using the join method, we’ll first need to create some pandas DataFrames! The first step is to import the pandas library. You can import pandas with the following command:

import pandas as pd

Now that you’ve imported pandas, you can run the following code to create the DataFrame’s we’ll be using in this tutorial.

df1 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
                   'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
       
df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                      'B': ['B0', 'B1', 'B2']})

We’ll also need to set the key column as the index of both DataFrames. We can do this with the following Python code:

df1.set_index('key', inplace=True)
df2.set_index('key', inplace=True)

With that out of the way, let’s start learning how to use the pandas join method! There are four types of joins available in pandas:

  • left
  • right
  • outer
  • inner

We will discuss each of these joins step-by-step through the rest of this tutorial.

Left Joins in Pandas DataFrames

In SQL, a LEFT JOIN returns all rows from the left table, even if there are no matches in the right table.

The operation behaves similarly in pandas. We can execute a left join using pandas DataFrames with the following code:

df1.join(df2, on='key', how = 'left')

In this case, df1 is the left table and df2 is the right table.

Here is what the output of this code looks like:

      A    B
key         
K0   A0   B0
K1   A1   B1
K2   A2   B2
K3   A3  NaN
K4   A4  NaN
K5   A5  NaN

In this join operation, the final DataFrame retained all of the elements of df1 since it was the left table of the operation.

Let’s move on to learning how to perform right joins using the pandas programming library.

Can't get enough Python?

Enter your email address for more free Python tutorials and tips.

Python is powerful! Show me more free Python tips

Right Joins in Pandas DataFrames

In SQL, a RIGHT JOIN returns all rows from the right table, even if there are no matches in the left table.

As before, the operation behaves similarly in pandas. We can execute a right join using pandas DataFrames with the following code:

df1.join(df2, on='key', how = 'right')

Just like in our first example, df1 is the left table and df2 is the right table.

Here is the output of this code:

      A   B
key        
K0   A0  B0
K1   A1  B1
K2   A2  B2

The next type of SQL-like operation we’ll learn how to perform is an outer join.

Outer Joins in Pandas DataFrames

In a SQL OUTER JOIN statement, rows from both tables are matched together. Unmatched rows in one or both tables are included by filling in a blank value in the missing cell.

If this sounds familiar, it’s because the LEFT JOIN and RIGHT JOIN statements are both technically examples of one-way OUTER JOINs. You can think of an OUTER JOIN as a directionless version of the LEFT JOIN and RIGHT JOIN statements.

We can write an OUTER JOIN statement on our pandas DataFrames by changing the how argument to outer, like this:

df1.join(df2, on='key', how = 'outer')

Here is the output of this code:

      A    B
key         
K0   A0   B0
K1   A1   B1
K2   A2   B2
K3   A3  NaN
K4   A4  NaN
K5   A5  NaN

Because of its high tolerance for unmatched data, the OUTER JOIN statement is a very flexible SQL query.

Its close cousin, the INNER JOIN statement, is not so versatile.

Inner Joins in Pandas DataFrames

In SQL, an INNER JOIN statement matches rows from columns in different tables, but only where a matching value can be found in both tables. Said differently, INNER JOIN statements do not include rows where a cell is missing from either table.

INNER JOIN statements applied to pandas DataFrames behave similarly. If data is missing from either DataFrame, then that row will not be included in the DataFrame that is returned by the pandas join method.

To help understand this better, let’s consider an example. We can write an INNER JOIN statement in pandas by changing the how argument to inner, like this:

df1.join(df2, on='key', how = 'inner')

Here is the output of this code

      A   B
key        
K0   A0  B0
K1   A1  B1
K2   A2  B2

As you can see, the returned pandas DataFrame only contained rows that held valid data in both df1 and df2.

Final Thoughts

In this tutorial, you learned how to write SQL-like join statements on DataFrames using the built-in pandas join method. Specifically, we discussed four types of SQL joins and how to simulate them with the Python pandas library:

  • LEFT JOIN
  • RIGHT JOIN
  • OUTER JOIN
  • INNER JOIN

In my experience, writing quick join method calls in pandas can be much quicker than connecting to a more robust SQL database and running legitimate SQL queries. Because of this, keeping this tutorial in mind when working with tabular data in the future can help speed up your development process nicely.

Did you find this free tutorial helpful? You can find more great Python tips and tutorials by subscribing to our systematic Python training program below.

Can't get enough Python?

Enter your email address for more free Python tutorials and tips.

Python is powerful! Show me more free Python tips