What is Pandas groupby used for?

When dealing with Pandas DataFrames, there are many occasions when we will want to split our data up by some criteria to perform analysis on individual subsets. All database-emulating software provides tools for partitioning data, and for Pandas that tool is the DataFrame groupby method. Users with SQL experience will notice that the groupby method provides a set of operations similar in function to the SQL GROUP BY statement. If you’re comfortable with SQL, it’s probably safe for you to skip this brief introduction on the basics of grouping.

The Pandas groupby method performs three essential functions:

  1. Splitting the data into groups based on some criteria.
  2. Applying a function to each group independently.
  3. Combining the results into a data structure.

To put these three functions into more tangible terms, lets look at an example. Suppose we have the following Pandas DataFrame (grades) already prepared. This DataFrame contains the test grades of 747 students split into 3 classes:

import pandas as pd # Don't forget to import pandas!
grades.head()  # Look at the first few rows (head) of the data
>       Test_1     Test_2  AVG_Grade  Type Test_1_Grade Test_2_Grade
> 0  69.688330  50.950581  60.319455     1            D            F
> 1  89.621556  77.382319  83.501938     2            B            C
> 2  75.655292  90.827847  83.241570     2            C            A
> 3  88.438613  86.279996  87.359304     1            B            B
> 4  78.734453  51.430116  65.082284     3            C            F

grades.info()  # Output information about the DataFrame itself
> <class 'pandas.core.frame.DataFrame'>
> RangeIndex: 746 entries, 0 to 745
> Data columns (total 6 columns):
> Test_1          746 non-null float64
> Test_2          746 non-null float64
> AVG_Grade       746 non-null float64
> Type            746 non-null int64
> Test_1_Grade    746 non-null object
> Test_2_Grade    746 non-null object
> dtypes: float64(3), int64(1), object(2)
> memory usage: 35.0+ KB

Now that we have the data, we can use Pandas groupby to answer questions like “What’s the average grade for each of the Types?” or “What was the maximum grade of students who made a B?” First, let’s look at the mechanics of the groupby method and what it produces.

Grouping by Columns (or features)

Simply calling the groupby method on a DataFrame executes step 1 of our process: splitting the data into groups based on some criteria. Now we need to consider what criteria we want to use. The Pandas groupby method supports grouping by values contained within a column or index, or the output of a function called on the indices. Most analyses perform group operations on column values, therefore we will focus on that method and leave the more advanced grouping options to a later tutorial.

Using our grades data, let’s group on the Type column by passing the column label to the first by option:

TypeGroup = grades.groupby("Type")
> pandas.core.groupby.generic.DataFrameGroupBy

The output of the groupby method is a dedicated groupby object we can use in the remaining steps of our grouping procedure. To understand what the groupby object produces, we will call the mean method to aggregate the data by calculating a mean. We will discuss more aggregation tools later in this tutorial.

TypeMeans = TypeGroup.mean()
>          Test_1     Test_2  AVG_Grade
> Type                                 
> 1     79.938900  65.962697  72.950798
> 2     79.434124  64.801539  72.117832
> 3     79.558135  65.898646  72.728390

Get Our Python Developer Kit for Free

I put together a Python Developer Kit with over 100 pre-built Python scripts covering data structures, Pandas, NumPy, Seaborn, machine learning, file processing, web scraping and a whole lot more - and I want you to have it for free. Enter your email address below and I'll send a copy your way.

Yes, I'll take a free Python Developer Kit

Let’s break down what we did in our procedure:

  1. The grades DataFrame was split by the values contained in the Type column
  2. The mean function was applied to the values within the group for each column
  3. A single result was calculated for each group and column, and the results were stored in the TypeMeans DataFrame

Note that the columns containing the letter grades were omitted, as the average of a group of non-numeric objects is meaningless.

The Type key we produced was included as the index of the resulting DataFrame. There are some uses for having index labels, however if we would prefer to return the group labels as a column we can set the as_index option to False:

grades.groupby("Type", as_index=False).mean()
>    Type     Test_1     Test_2  AVG_Grade
> 0     1  79.938900  65.962697  72.950798
> 1     2  79.434124  64.801539  72.117832
> 2     3  79.558135  65.898646  72.728390

Now we’ve answered one of our questions: what the average grades were for each class Type. Let’s look at how we can split our data by multiple column values.

Grouping by Multiple Columns (or features)

When calling the groupby method, we can specify multiple columns (multiple features) over which to group the data by passing a list of column labels to the by option (by default, the first input):

grades.groupby(["Type", "Test_1_Grade"]).mean()
                      Test_1     Test_2  AVG_Grade
> Type Test_1_Grade                                 
> 1    A             95.097381  66.191177  80.644279
>      B             84.563279  63.884426  74.223853
>      C             75.507275  66.969540  71.238408
>      D             65.643102  66.880354  66.261728
>      F             55.496986  74.204631  64.850808
> 2    A             94.467197  63.051444  78.759320
>      B             84.601987  65.286818  74.944403
>      C             75.431201  64.794148  70.112675
>      D             66.346643  64.248413  65.297528
>      F             54.672076  69.601639  62.136858
> 3    A             95.249387  63.881713  79.565550
>      B             85.204138  65.584513  75.394325
>      C             75.420845  67.192316  71.306581
>      D             66.581177  66.615788  66.598483
>      F             54.390826  61.281745  57.836285

Now the groupby method first splits the data by Type, and data within each Type group is further split into subgroups by the values in the Test_1_Grade column. The averages may look a little funny in our Test_2 column because of how we composed our DataFrame, but the groupby methods successfully performs its job.

Aggregating By Groups

So far we’ve explored splitting our data with groupby, but we’ve only looked at a single aggregate function mean to apply to our grouped data. Pandas provides a set of pre-built aggregate functions that will cover the majority of tasks needed to analyze groups of data. For further customization, we can supply our own functions to the aggregate method.

Pandas provides the following groupby object methods:

`describe()` Calculates summary statistics for each group
`count()` Count the number of non-null items in the group in each column
`size()` Count the total number of records in each group
`mean()` Numerical mean of the values within the group
`sum()` Sum all numerical values in each group
`std()` Compute standard deviation of each group
`var()` Compute variance of each group
`sem()` Compute standard error of the mean of each group
`first()` Returns the first value from each group
`last()` Returns the last value from each group
`nth(n)` Returns the `n`-th value from each group
`min()` Computes the minimum value from each group
`max()` Computes the maximum value from each group

Get Our Python Developer Kit for Free

I put together a Python Developer Kit with over 100 pre-built Python scripts covering data structures, Pandas, NumPy, Seaborn, machine learning, file processing, web scraping and a whole lot more - and I want you to have it for free. Enter your email address below and I'll send a copy your way.

Yes, I'll take a free Python Developer Kit

Exploring with describe

A useful tool for exploring a dataset is the describe method of the groupby object. This method quickly calculates the count, mean, standard deviation, min, quartiles, and max values in one command.

TypeGroup = grades[["Test_1", "Type"]].groupby("Type")

>      Test_1                                                                    \
>       count       mean        std        min        25%        50%        75%   
> Type                                                                            
> 1     269.0  79.938900  10.006357  51.220563  73.749098  79.813291  87.092941   
> 2     233.0  79.434124  10.464932  46.755239  72.040727  80.358792  86.999613   
> 3     244.0  79.558135  10.265563  46.167158  72.547642  79.647214  86.912218   
>         max  
> Type         
> 1     100.0  
> 2     100.0  
> 3     100.0

Notice that we first selected only two columns prior to grouping and using the describe method. This was because the final output will be 8 times the number of original columns, making assessment difficult with multiple output columns.

Depending on your interpreter settings, you may only see a truncated set of columns. To prevent truncation and to observe all columns like the output above, we can modify the following Pandas attribute:

pandas.set_option('display.max_columns', None)

count versus size

In the above table of builtin aggregate functions, you may have noticed that the count function seems to have the same functionality as the size function. The difference is subtle. The count function counts all the values in all the columns, skipping any null values, whereas the size function does a count of all records (rows). We can see the difference with this example:

TypeGroup.size() # Creates a single column, including nulls
> Type
> 1    269
> 2    233
> 3    244
> dtype: int64

TypeGroup.count() # Creates a column for each input column, excluding nulls
>       Test_1  Test_2  AVG_Grade  Test_1_Grade  Test_2_Grade
> Type                                                       
> 1        269     269        269           269           269
> 2        233     233        233           233           233
> 3        244     244        244           244           244

Note, of course, that these values will be the same if there are no null values in the DataFrame.

Custom Aggregation with aggregate

Now that we’ve looked over some builtin aggregation tools, lets learn how to make our own aggregation functions. Aggregation with a given function is performed using the aggregate(f) method and passing in some function f.

The function we pass must take a series of data as input and return a single value that is placed in the final grouped DataFrame. If the function output is a Python data structure like a list or dictionary then the object will be passed into the grouped DataFrame. Because these objects are messy to work with within a Pandas DataFrame, it is best to have your aggregate functions return only a single value.

We can replicate the sum method by passing the sum function into the aggregate method:

>             Test_1        Test_2     AVG_Grade
> Type                                          
> 1     21503.564099  17743.965485  19623.764792
> 2     18508.150955  15098.758677  16803.454816
> 3     19412.184990  16079.269563  17745.727276

We can pass multiple functions with a list and return results with tiered column labels:

TypeGroup.aggregate([sum, max])
>             Test_1               Test_2            AVG_Grade             
>                sum    max           sum    max           sum        max   
> Type                                                                      
> 1     21503.564099  100.0  17743.965485  100.0  19623.764792  92.510511   
> 2     18508.150955  100.0  15098.758677  100.0  16803.454816  98.547899   
> 3     19412.184990  100.0  16079.269563  100.0  17745.727276  92.536825

Note that passing functions in this manner can misapply functions to non-numeric columns which also produced the following:

TypeGroup.aggregate([sum, max])
>                                            Test_1_Grade      \
>                                                     sum max   
> Type                                                          

Therefore additional handling of the data and functions should be done if your DataFrame contains mixtures of data types.

For simple functions, we can pass a lambda function:

TypeGroup.aggregate(lambda x: sum([y + 2 for y in x]))
>             Test_1        Test_2     AVG_Grade
> Type                                          
> 1     22041.564099  18281.965485  20161.764792
> 2     18974.150955  15564.758677  17269.454816
> 3     19900.184990  16567.269563  18233.727276

For more complex functions, we can create a named function and pass the name into the aggregate function:

def Cumulate2(x):
	return(sum(x) + 2)
>             Test_1        Test_2     AVG_Grade
> Type                                          
> 1     21505.564099  17745.965485  19625.764792
> 2     18510.150955  15100.758677  16805.454816
> 3     19414.184990  16081.269563  17747.727276

The Pandas DataFrame groupby method contains far more features and capabilities than we have discussed in this tutorial, so we encourage you to consult the Pandas User Manual for advanced functions, and be on the look out for more of our tutorials on Pandas groupby soon! Subscribe using the form below to make sure you don’t miss it.

Get Our Python Developer Kit for Free

I put together a Python Developer Kit with over 100 pre-built Python scripts covering data structures, Pandas, NumPy, Seaborn, machine learning, file processing, web scraping and a whole lot more - and I want you to have it for free. Enter your email address below and I'll send a copy your way.

Yes, I'll take a free Python Developer Kit

Once you subscribe, please share what you just learned on Facebook and Twitter!