What is Pandas groupby used for?
When dealing with Pandas DataFrames, there are many occasions when we will want to split our data up by some criteria to perform analysis on individual subsets. All database-emulating software provides tools for partitioning data, and for Pandas that tool is the DataFrame groupby
method. Users with SQL experience will notice that the groupby
method provides a set of operations similar in function to the SQL GROUP BY
statement. If you’re comfortable with SQL, it’s probably safe for you to skip this brief introduction on the basics of grouping.
The Pandas groupby
method performs three essential functions:
- Splitting the data into groups based on some criteria.
- Applying a function to each group independently.
- Combining the results into a data structure.
To put these three functions into more tangible terms, lets look at an example. Suppose we have the following Pandas DataFrame (
import pandas as pd # Don't forget to import pandas!
grades.head() # Look at the first few rows (head) of the data
> Test_1 Test_2 AVG_Grade Type Test_1_Grade Test_2_Grade
> 0 69.688330 50.950581 60.319455 1 D F
> 1 89.621556 77.382319 83.501938 2 B C
> 2 75.655292 90.827847 83.241570 2 C A
> 3 88.438613 86.279996 87.359304 1 B B
> 4 78.734453 51.430116 65.082284 3 C F
grades.info() # Output information about the DataFrame itself
> <class 'pandas.core.frame.DataFrame'>
> RangeIndex: 746 entries, 0 to 745
> Data columns (total 6 columns):
> Test_1 746 non-null float64
> Test_2 746 non-null float64
> AVG_Grade 746 non-null float64
> Type 746 non-null int64
> Test_1_Grade 746 non-null object
> Test_2_Grade 746 non-null object
> dtypes: float64(3), int64(1), object(2)
> memory usage: 35.0+ KB
Now that we have the data, we can use Pandas groupby
to answer questions like “What’s the average grade for each of the Types?” or “What was the maximum grade of students who made a B?”
First, let’s look at the mechanics of the groupby
method and what it produces.
Grouping by Columns (or features)
Simply calling the groupby
method on a DataFrame executes step 1 of our process: splitting the data into groups based on some criteria. Now we need to consider what criteria we want to use. The Pandas groupby
method supports grouping by values contained within a column or index, or the output of a function called on the indices. Most analyses perform group operations on column values, therefore we will focus on that method and leave the more advanced grouping options to a later tutorial.
Using our grades data, let’s group on the Type
column by passing the column label to the first by
option:
TypeGroup = grades.groupby("Type")
type(TypeGroup)
> pandas.core.groupby.generic.DataFrameGroupBy
The output of the groupby
method is a dedicated groupby
object we can use in the remaining steps of our grouping procedure. To understand what the groupby
object produces, we will call the mean
method to aggregate the data by calculating a mean. We will discuss more aggregation tools later in this tutorial.
TypeMeans = TypeGroup.mean()
> Test_1 Test_2 AVG_Grade
> Type
> 1 79.938900 65.962697 72.950798
> 2 79.434124 64.801539 72.117832
> 3 79.558135 65.898646 72.728390
Code More, Distract Less: Support Our Ad-Free Site
You might have noticed we removed ads from our site - we hope this enhances your learning experience. To help sustain this, please take a look at our Python Developer Kit and our comprehensive cheat sheets. Each purchase directly supports this site, ensuring we can continue to offer you quality, distraction-free tutorials.
Let’s break down what we did in our procedure:
- The
grades DataFrame was split by the values contained in theType
column - The
mean
function was applied to the values within the group for each column - A single result was calculated for each group and column, and the results were stored in the
TypeMeans DataFrame
Note that the columns containing the letter grades were omitted, as the average of a group of non-numeric objects is meaningless.
The Type
key we produced was included as the index of the resulting DataFrame. There are some uses for having index labels, however if we would prefer to return the group labels as a column we can set the as_index
option to False
:
grades.groupby("Type", as_index=False).mean()
> Type Test_1 Test_2 AVG_Grade
> 0 1 79.938900 65.962697 72.950798
> 1 2 79.434124 64.801539 72.117832
> 2 3 79.558135 65.898646 72.728390
Now we’ve answered one of our questions: what the average grades were for each class Type
. Let’s look at how we can split our data by multiple column values.
Grouping by Multiple Columns (or features)
When calling the groupby
method, we can specify multiple columns (multiple features) over which to group the data by passing a list of column labels to the by
option (by default, the first input):
grades.groupby(["Type", "Test_1_Grade"]).mean()
Test_1 Test_2 AVG_Grade
> Type Test_1_Grade
> 1 A 95.097381 66.191177 80.644279
> B 84.563279 63.884426 74.223853
> C 75.507275 66.969540 71.238408
> D 65.643102 66.880354 66.261728
> F 55.496986 74.204631 64.850808
> 2 A 94.467197 63.051444 78.759320
> B 84.601987 65.286818 74.944403
> C 75.431201 64.794148 70.112675
> D 66.346643 64.248413 65.297528
> F 54.672076 69.601639 62.136858
> 3 A 95.249387 63.881713 79.565550
> B 85.204138 65.584513 75.394325
> C 75.420845 67.192316 71.306581
> D 66.581177 66.615788 66.598483
> F 54.390826 61.281745 57.836285
Now the groupby
method first splits the data by Type
, and data within each Type
group is further split into subgroups by the values in the Test_1_Grade
column. The averages may look a little funny in our Test_2 column because of how we composed our DataFrame, but the groupby methods successfully performs its job.
Aggregating By Groups
So far we’ve explored splitting our data with groupby
, but we’ve only looked at a single aggregate function mean
to apply to our grouped data. Pandas provides a set of pre-built aggregate functions that will cover the majority of tasks needed to analyze groups of data. For further customization, we can supply our own functions to the aggregate
method.
Pandas provides the following groupby
object methods:
`describe()` | Calculates summary statistics for each group |
`count()` | Count the number of non-null items in the group in each column |
`size()` | Count the total number of records in each group |
`mean()` | Numerical mean of the values within the group |
`sum()` | Sum all numerical values in each group |
`std()` | Compute standard deviation of each group |
`var()` | Compute variance of each group |
`sem()` | Compute standard error of the mean of each group |
`first()` | Returns the first value from each group |
`last()` | Returns the last value from each group |
`nth(n)` | Returns the `n`-th value from each group |
`min()` | Computes the minimum value from each group |
`max()` | Computes the maximum value from each group |
Code More, Distract Less: Support Our Ad-Free Site
You might have noticed we removed ads from our site - we hope this enhances your learning experience. To help sustain this, please take a look at our Python Developer Kit and our comprehensive cheat sheets. Each purchase directly supports this site, ensuring we can continue to offer you quality, distraction-free tutorials.
Exploring with describe
A useful tool for exploring a dataset is the describe
method of the groupby
object. This method quickly calculates the count, mean, standard deviation, min, quartiles, and max values in one command.
TypeGroup = grades[["Test_1", "Type"]].groupby("Type")
TypeGroup.describe()
> Test_1 \
> count mean std min 25% 50% 75%
> Type
> 1 269.0 79.938900 10.006357 51.220563 73.749098 79.813291 87.092941
> 2 233.0 79.434124 10.464932 46.755239 72.040727 80.358792 86.999613
> 3 244.0 79.558135 10.265563 46.167158 72.547642 79.647214 86.912218
>
> max
> Type
> 1 100.0
> 2 100.0
> 3 100.0
Notice that we first selected only two columns prior to grouping and using the describe
method. This was because the final output will be 8 times the number of original columns, making assessment difficult with multiple output columns.
Depending on your interpreter settings, you may only see a truncated set of columns. To prevent truncation and to observe all columns like the output above, we can modify the following Pandas attribute:
pandas.set_option('display.max_columns', None)
count versus size
In the above table of builtin aggregate functions, you may have noticed that the count
function seems to have the same functionality as the size
function. The difference is subtle. The count
function counts all the values in all the columns, skipping any null values, whereas the size
function does a count of all records (rows). We can see the difference with this example:
TypeGroup.size() # Creates a single column, including nulls
> Type
> 1 269
> 2 233
> 3 244
> dtype: int64
TypeGroup.count() # Creates a column for each input column, excluding nulls
> Test_1 Test_2 AVG_Grade Test_1_Grade Test_2_Grade
> Type
> 1 269 269 269 269 269
> 2 233 233 233 233 233
> 3 244 244 244 244 244
Note, of course, that these values will be the same if there are no null values in the DataFrame.
Custom Aggregation with aggregate
Now that we’ve looked over some builtin aggregation tools, lets learn how to make our own aggregation functions. Aggregation with a given function is performed using the aggregate(f)
method and passing in some function
The function we pass must take a series of data as input and return a single value that is placed in the final grouped DataFrame. If the function output is a Python data structure like a list or dictionary then the object will be passed into the grouped DataFrame. Because these objects are messy to work with within a Pandas DataFrame, it is best to have your aggregate functions return only a single value.
We can replicate the sum
method by passing the sum
function into the aggregate
method:
TypeGroup.aggregate(sum)
> Test_1 Test_2 AVG_Grade
> Type
> 1 21503.564099 17743.965485 19623.764792
> 2 18508.150955 15098.758677 16803.454816
> 3 19412.184990 16079.269563 17745.727276
We can pass multiple functions with a list and return results with tiered column labels:
TypeGroup.aggregate([sum, max])
> Test_1 Test_2 AVG_Grade
> sum max sum max sum max
> Type
> 1 21503.564099 100.0 17743.965485 100.0 19623.764792 92.510511
> 2 18508.150955 100.0 15098.758677 100.0 16803.454816 98.547899
> 3 19412.184990 100.0 16079.269563 100.0 17745.727276 92.536825
Note that passing functions in this manner can misapply functions to non-numeric columns which also produced the following:
TypeGroup.aggregate([sum, max])
> Test_1_Grade \
> sum max
> Type
> 1 DBCCBCBBCCCBCCABBBDDBCABCDACABCFBCAABBCCDABCBC... F
> 2 BCDCBABBABDCCABCDBBBADABDABBBDCBABCBABBCCFDBBB... F
> 3 CCCCCBCDCABBBBCDCAACCBBCCCFCCCCCBBCBBFBCDCBBBC... F
Therefore additional handling of the data and functions should be done if your DataFrame contains mixtures of data types.
For simple functions, we can pass a lambda function:
TypeGroup.aggregate(lambda x: sum([y + 2 for y in x]))
> Test_1 Test_2 AVG_Grade
> Type
> 1 22041.564099 18281.965485 20161.764792
> 2 18974.150955 15564.758677 17269.454816
> 3 19900.184990 16567.269563 18233.727276
For more complex functions, we can create a named function and pass the name into the aggregate
function:
def Cumulate2(x):
return(sum(x) + 2)
TypeGroup.aggregate(Cumulate2)
> Test_1 Test_2 AVG_Grade
> Type
> 1 21505.564099 17745.965485 19625.764792
> 2 18510.150955 15100.758677 16805.454816
> 3 19414.184990 16081.269563 17747.727276
The Pandas DataFrame groupby
method contains far more features and capabilities than we have discussed in this tutorial, so we encourage you to consult the Pandas User Manual for advanced functions, and be on the look out for more of our tutorials on Pandas groupby
soon! Subscribe using the form below to make sure you don’t miss it.
Code More, Distract Less: Support Our Ad-Free Site
You might have noticed we removed ads from our site - we hope this enhances your learning experience. To help sustain this, please take a look at our Python Developer Kit and our comprehensive cheat sheets. Each purchase directly supports this site, ensuring we can continue to offer you quality, distraction-free tutorials.
Once you subscribe, please share what you just learned on Facebook and Twitter!