**What is Pandas groupby used for?**

When dealing with Pandas DataFrames, there are many occasions when we will want to split our data up by some criteria to perform analysis on individual subsets. All database-emulating software provides tools for partitioning data, and for Pandas that tool is the DataFrame `groupby`

method. Users with SQL experience will notice that the `groupby`

method provides a set of operations similar in function to the SQL `GROUP BY`

statement. If you’re comfortable with SQL, it’s probably safe for you to skip this brief introduction on the basics of grouping.

The Pandas `groupby`

method performs three essential functions:

- Splitting the data into groups based on some criteria.
- Applying a function to each group independently.
- Combining the results into a data structure.

To put these three functions into more tangible terms, lets look at an example. Suppose we have the following Pandas DataFrame (

```
import pandas as pd # Don't forget to import pandas!
grades.head() # Look at the first few rows (head) of the data
> Test_1 Test_2 AVG_Grade Type Test_1_Grade Test_2_Grade
> 0 69.688330 50.950581 60.319455 1 D F
> 1 89.621556 77.382319 83.501938 2 B C
> 2 75.655292 90.827847 83.241570 2 C A
> 3 88.438613 86.279996 87.359304 1 B B
> 4 78.734453 51.430116 65.082284 3 C F
grades.info() # Output information about the DataFrame itself
> <class 'pandas.core.frame.DataFrame'>
> RangeIndex: 746 entries, 0 to 745
> Data columns (total 6 columns):
> Test_1 746 non-null float64
> Test_2 746 non-null float64
> AVG_Grade 746 non-null float64
> Type 746 non-null int64
> Test_1_Grade 746 non-null object
> Test_2_Grade 746 non-null object
> dtypes: float64(3), int64(1), object(2)
> memory usage: 35.0+ KB
```

Now that we have the data, we can use Pandas `groupby`

to answer questions like “What’s the average grade for each of the Types?” or “What was the maximum grade of students who made a B?”
First, let’s look at the mechanics of the `groupby`

method and what it produces.

**Grouping by Columns (or features)**

Simply calling the `groupby`

method on a DataFrame executes step 1 of our process: splitting the data into groups based on some criteria. Now we need to consider what criteria we want to use. The Pandas `groupby`

method supports grouping by values contained within a column or index, or the output of a function called on the indices. Most analyses perform group operations on column values, therefore we will focus on that method and leave the more advanced grouping options to a later tutorial.

Using our grades data, let’s group on the `Type`

column by passing the column label to the first `by`

option:

```
TypeGroup = grades.groupby("Type")
type(TypeGroup)
> pandas.core.groupby.generic.DataFrameGroupBy
```

The output of the `groupby`

method is a dedicated `groupby`

object we can use in the remaining steps of our grouping procedure. To understand what the `groupby`

object produces, we will call the `mean`

method to aggregate the data by calculating a mean. We will discuss more aggregation tools later in this tutorial.

```
TypeMeans = TypeGroup.mean()
> Test_1 Test_2 AVG_Grade
> Type
> 1 79.938900 65.962697 72.950798
> 2 79.434124 64.801539 72.117832
> 3 79.558135 65.898646 72.728390
```

Enter your email address for more free Python tutorials and tips.

Let’s break down what we did in our procedure:

- The
grades DataFrame was split by the values*contained*in the`Type`

column - The
`mean`

function was applied to the values within the group for each column - A single result was calculated for each group and column, and the results were stored in the
TypeMeans DataFrame

Note that the columns containing the letter grades were omitted, as the average of a group of non-numeric objects is meaningless.

The `Type`

key we produced was included as the index of the resulting DataFrame. There are some uses for having index labels, however if we would prefer to return the group labels as a column we can set the `as_index`

option to `False`

:

```
grades.groupby("Type", as_index=False).mean()
> Type Test_1 Test_2 AVG_Grade
> 0 1 79.938900 65.962697 72.950798
> 1 2 79.434124 64.801539 72.117832
> 2 3 79.558135 65.898646 72.728390
```

Now we’ve answered one of our questions: what the average grades were for each class `Type`

. Let’s look at how we can split our data by multiple column values.

### Grouping by Multiple Columns (or features)

When calling the `groupby`

method, we can specify multiple columns (multiple features) over which to group the data by passing a list of column labels to the `by`

option (by default, the first input):

```
grades.groupby(["Type", "Test_1_Grade"]).mean()
Test_1 Test_2 AVG_Grade
> Type Test_1_Grade
> 1 A 95.097381 66.191177 80.644279
> B 84.563279 63.884426 74.223853
> C 75.507275 66.969540 71.238408
> D 65.643102 66.880354 66.261728
> F 55.496986 74.204631 64.850808
> 2 A 94.467197 63.051444 78.759320
> B 84.601987 65.286818 74.944403
> C 75.431201 64.794148 70.112675
> D 66.346643 64.248413 65.297528
> F 54.672076 69.601639 62.136858
> 3 A 95.249387 63.881713 79.565550
> B 85.204138 65.584513 75.394325
> C 75.420845 67.192316 71.306581
> D 66.581177 66.615788 66.598483
> F 54.390826 61.281745 57.836285
```

Now the `groupby`

method first splits the data by `Type`

, and data within each `Type`

group is further split into subgroups by the values in the `Test_1_Grade`

column. The averages may look a little funny in our Test_2 column because of how we composed our DataFrame, but the groupby methods successfully performs its job.

**Aggregating By Groups**

So far we’ve explored splitting our data with `groupby`

, but we’ve only looked at a single aggregate function `mean`

to apply to our grouped data. Pandas provides a set of pre-built aggregate functions that will cover the majority of tasks needed to analyze groups of data. For further customization, we can supply our own functions to the `aggregate`

method.

Pandas provides the following `groupby`

object methods:

`describe()` | Calculates summary statistics for each group |

`count()` | Count the number of non-null items in the group in each column |

`size()` | Count the total number of records in each group |

`mean()` | Numerical mean of the values within the group |

`sum()` | Sum all numerical values in each group |

`std()` | Compute standard deviation of each group |

`var()` | Compute variance of each group |

`sem()` | Compute standard error of the mean of each group |

`first()` | Returns the first value from each group |

`last()` | Returns the last value from each group |

`nth(n)` | Returns the `n`-th value from each group |

`min()` | Computes the minimum value from each group |

`max()` | Computes the maximum value from each group |

Enter your email address for more free Python tutorials and tips.

### Exploring with describe

A useful tool for exploring a dataset is the `describe`

method of the `groupby`

object. This method quickly calculates the count, mean, standard deviation, min, quartiles, and max values in one command.

```
TypeGroup = grades[["Test_1", "Type"]].groupby("Type")
TypeGroup.describe()
> Test_1 \
> count mean std min 25% 50% 75%
> Type
> 1 269.0 79.938900 10.006357 51.220563 73.749098 79.813291 87.092941
> 2 233.0 79.434124 10.464932 46.755239 72.040727 80.358792 86.999613
> 3 244.0 79.558135 10.265563 46.167158 72.547642 79.647214 86.912218
>
> max
> Type
> 1 100.0
> 2 100.0
> 3 100.0
```

Notice that we first selected only two columns prior to grouping and using the `describe`

method. This was because the final output will be 8 times the number of original columns, making assessment difficult with multiple output columns.

Depending on your interpreter settings, you may only see a truncated set of columns. To prevent truncation and to observe all columns like the output above, we can modify the following Pandas attribute:

`pandas.set_option('display.max_columns', None)`

### count versus size

In the above table of builtin aggregate functions, you may have noticed that the `count`

function seems to have the same functionality as the `size`

function. The difference is subtle. The `count`

function counts all the values in all the columns, skipping any null values, whereas the `size`

function does a count of all records (rows). We can see the difference with this example:

```
TypeGroup.size() # Creates a single column, including nulls
> Type
> 1 269
> 2 233
> 3 244
> dtype: int64
TypeGroup.count() # Creates a column for each input column, excluding nulls
> Test_1 Test_2 AVG_Grade Test_1_Grade Test_2_Grade
> Type
> 1 269 269 269 269 269
> 2 233 233 233 233 233
> 3 244 244 244 244 244
```

Note, of course, that these values will be the same if there are no null values in the DataFrame.

### Custom Aggregation with aggregate

Now that we’ve looked over some builtin aggregation tools, lets learn how to make our own aggregation functions. Aggregation with a given function is performed using the `aggregate(f)`

method and passing in some function

The function we pass must take a series of data as input and return a single value that is placed in the final grouped DataFrame. If the function output is a Python data structure like a list or dictionary then the object will be passed into the grouped DataFrame. Because these objects are messy to work with within a Pandas DataFrame, it is best to have your aggregate functions return only a single value.

We can replicate the `sum`

method by passing the `sum`

function into the `aggregate`

method:

```
TypeGroup.aggregate(sum)
> Test_1 Test_2 AVG_Grade
> Type
> 1 21503.564099 17743.965485 19623.764792
> 2 18508.150955 15098.758677 16803.454816
> 3 19412.184990 16079.269563 17745.727276
```

We can pass multiple functions with a list and return results with tiered column labels:

```
TypeGroup.aggregate([sum, max])
> Test_1 Test_2 AVG_Grade
> sum max sum max sum max
> Type
> 1 21503.564099 100.0 17743.965485 100.0 19623.764792 92.510511
> 2 18508.150955 100.0 15098.758677 100.0 16803.454816 98.547899
> 3 19412.184990 100.0 16079.269563 100.0 17745.727276 92.536825
```

Note that passing functions in this manner can misapply functions to non-numeric columns which also produced the following:

```
TypeGroup.aggregate([sum, max])
> Test_1_Grade \
> sum max
> Type
> 1 DBCCBCBBCCCBCCABBBDDBCABCDACABCFBCAABBCCDABCBC... F
> 2 BCDCBABBABDCCABCDBBBADABDABBBDCBABCBABBCCFDBBB... F
> 3 CCCCCBCDCABBBBCDCAACCBBCCCFCCCCCBBCBBFBCDCBBBC... F
```

Therefore additional handling of the data and functions should be done if your DataFrame contains mixtures of data types.

For simple functions, we can pass a lambda function:

```
TypeGroup.aggregate(lambda x: sum([y + 2 for y in x]))
> Test_1 Test_2 AVG_Grade
> Type
> 1 22041.564099 18281.965485 20161.764792
> 2 18974.150955 15564.758677 17269.454816
> 3 19900.184990 16567.269563 18233.727276
```

For more complex functions, we can create a named function and pass the name into the `aggregate`

function:

```
def Cumulate2(x):
return(sum(x) + 2)
TypeGroup.aggregate(Cumulate2)
> Test_1 Test_2 AVG_Grade
> Type
> 1 21505.564099 17745.965485 19625.764792
> 2 18510.150955 15100.758677 16805.454816
> 3 19414.184990 16081.269563 17747.727276
```

The Pandas DataFrame `groupby`

method contains far more features and capabilities than we have discussed in this tutorial, so we encourage you to consult the Pandas User Manual for advanced functions, and be on the look out for more of our tutorials on Pandas `groupby`

soon! Subscribe using the form below to make sure you don’t miss it.

Enter your email address for more free Python tutorials and tips.

Once you subscribe, please share what you just learned on Facebook and Twitter!