Introduction to Comparing Pandas DataFrames in Python
A common occurrence in data science is the validation of data between two executions of code. For example, suppose a data scientist performed an analysis to produce a data set, and it is your job to replicate that data set to validate the method. Given two Pandas DataFrames, how do we verify that they contain the same data? We can perform a simple True/False check on the two DataFrames to determine their structural equality using the Pandas equal
method, and if some differences exist between them, we can create a new data frame for comparing the individual differences. The ability to compare DataFrames and mark the differences is often overlooked in Pandas tutorials, but it truly is the most valuable part of any data comparison.
For this tutorial we’ll be comparing the following two DataFrames containing slightly different values of student grades
Suppose we now want to compare these spreadsheets to determine if there are differences, and to find out what those differences are. First we’ll need to import the Excel data into two Pandas DataFrames:
import pandas as pd
grades1 = pd.read_excel('exampleData1.xlsx', sheet_name='Grades1', index_col=0)
grades2 = pd.read_excel('exampleData2.xlsx', sheet_name='Grades2', index_col=0)
Now that our data is imported, we can compare these two DataFrames with a few different Python Pandas operations.
Simple DataFrame Comparison: equals Method
If we want to perform a simple comparison between dataframes to determine whether or not they’re identical, we should use the equals DataFrame method. This method has the format:
[pandas.DataFrame].equals([pandas.DataFrame])
where [pandas.DataFrame]
is a particular Pandas DataFrame you wish to compare to another. The equals
method returns a boolean of True if the DataFrames are identical or False if the DataFrames differ.
grades1.equals(grades1) # Compare a DataFrame to itself
> True
grades1.equals(grades2) # Compare unequal DataFrames
> False
Code More, Distract Less: Support Our Ad-Free Site
You might have noticed we removed ads from our site - we hope this enhances your learning experience. To help sustain this, please take a look at our Python Developer Kit and our comprehensive cheat sheets. Each purchase directly supports this site, ensuring we can continue to offer you quality, distraction-free tutorials.
Identifying DataFrame Differences
Once we’ve determined that the DataFrames are not identical using the equals
method, we can look further to determine the exact differences between the two sets of data. Pandas does not contain a standalone “diff” method for comparison of differences, however we can easily code one ourselves using only a few lines of Python code:
gradeBool = (grades1 != grades2).stack() # Create Frame of comparison booleans
gradediff = pd.concat([grades1.stack()[gradeBool],
grades2.stack()[gradeBool]], axis=1)
gradediff.columns=["Old", "New"]
print(gradediff)
> Old New
> StudentID
> 5540 Midterm 90 92
> 6817 Homework 65 66
> Final 89 88
Let’s break down how the above code successfully compares our two DataFrames for differences.
- The
grades1 andgrades2 DataFrames are the two DataFrames we want to compare in our Python Pandas code. - The
gradeBool DataFrame we created contains Boolean values indicating if the individual elements between the two frames are the same.1 Thestack
method will then convert the DataFrame into a “long” form, where each index-column pair is represented as its own row. - The
gradediff DataFrame we created is a horizontal concatenation of the two original data frames, selecting only the differences identified by thegradeBool DataFrame. Theconcat
function with the optionaxis=1
will indicate that the resulting DataFrame should be the two given DataFrames, joined by index with separate columns. Again, thestack
method is used to convert each DataFrame into a “long” format to ease identification of the differences - The resulting
gradediff DataFrame has no column names to indicate which column contains which set of data, so we set the names directly to “Old” and “New”. Feel free to change the names to something like “Grades1” and “Grades2”.
The resulting difference DataFrame we created makes it easy to identify which index and column between the two DataFrames are different, and it shows us the corresponding value changes.
We can further speed up the process by including all of the above code into a Python function, which we can use to easily calculate DataFrame differences anytime we need to:
import pandas as pd # Make sure import aliases are consistent
def dfDiff(oldFrame, newFrame):
dfBool = (oldFrame != newFrame).stack()
diff = pd.concat([oldFrame.stack()[dfBool],
newFrame.stack()[dfBool]], axis=1)
diff.columns=["Old", "New"]
return diff
print(dfDiff(grades1, grades2))
> Old New
> StudentID
> 5540 Midterm 90 92
> 6817 Homework 65 66
> Final 89 88
Did you find this free tutorial helpful? Share this article with your friends, classmates, and coworkers on Facebook and Twitter! When you spread the word on social media, you’re helping us grow so we can continue to provide free tutorials like this one for years to come.
Code More, Distract Less: Support Our Ad-Free Site
You might have noticed we removed ads from our site - we hope this enhances your learning experience. To help sustain this, please take a look at our Python Developer Kit and our comprehensive cheat sheets. Each purchase directly supports this site, ensuring we can continue to offer you quality, distraction-free tutorials.
-
Note: This method considers individual inequalities, rather than structural inequalities of the data. Thus, if columns or rows are missing/added, then this method may identify groups of equivalent data as unequal. Use this method only if the two DataFrames are congruent. ↩