1. Group By Aggregation

What is the groupby() function?

The groupby() function allows you to group data based on selected attributes (one or more columns), Different aggregate functions can be applied to each group independently.

Split-Apply-Combine

The function uses a split-apply-combine approach to aggregate (group) data and apply some sort of function on a group level:

  • The split step involves breaking up and grouping a DataFramedepending on the value of the specified key.

  • The apply step involves computing some function, usually an aggregate, transformation, or filtering, within the individual groups.

  • The combine step merges the results of these operations into an output array.

    Pasted image 20250330210848.png400

We can aggregate grouped data by call the function directly on the grouped object:

df.groupby('Column').agg_function()
Aggregation Function Brief Description General Syntax
sum() Returns the sum of values in each group. df.groupby('column').sum()
mean() Calculates the average of values in each group. df.groupby('column').mean()
count() Counts non-NA/null values in each group. df.groupby('column').count()
min() Returns the minimum value in each group. df.groupby('column').min()
max() Returns the maximum value in each group. df.groupby('column').max()
std() Calculates the standard deviation in each group. df.groupby('column').std()
var() Returns the variance of values in each group. df.groupby('column').var()
first() Returns the first value in each group. df.groupby('column').first()
last() Returns the last value in each group. df.groupby('column').last()
median() Returns the median of values in each group. df.groupby('column').median()
describe() Returns the descriptive statistics of each group. df.groupby('columns').describe()
To return specific statistics df.groupby('column').describe()'count', 'mean'
prod() Returns the product of values in each group. df.groupby('column').prod()
size() Returns the size of each group (including NaN). df.groupby('column').size()
agg()* Allows multiple aggregations at once. df.groupby('column').agg(['sum', 'mean'])
transform()* Returns a DataFrame with the same shape after applying a function. df.groupby('column').transform('mean')
apply()* Allows applying custom functions to each group. df.groupby('column').apply(custom_func)

We can split the data on any of their axes as specified in the axis parameter:

#Vertical, Operation applied on rows (Default)
#This means you group the rows based on the values of the selected columns.
df.groupby(by='column', axis=0)

#Horizontal, Operation applied on columns (not common)
df.groupby(by='column', axis=1)

Special Aggregations

1. Multiple Aggregation with .agg()

The df.groupby().agg() function in Pandas enables multiple aggregation operations on a grouped DataFrame in a single step.

1.1. Applying Multiple Aggregation to Different Columns

df.groupby('column').agg({'column1': 'sum', 'column2': 'mean'})

1.2. Applying Custom Aggregation with Lambda

df.groupby('column1').agg({'column2': lambda x: x.sum() / x.count()})

1.3. Applying Multiple Aggregation on a Single Column

df.groupby('column1')['column2'].agg(['sum', 'mean', 'max'])
          sum  mean  max
Category                  
A           30  15.0   20
B           70  35.0   40

2. Applying custom functions

2.1. Custom Aggregations

Using UDFs and Lambda Expressions, we can create custom aggregation and apply custom transformation on each group in the DataFrame using .apply().

df.groupby('column').apply(custom_function)
How does this Differ from Using .agg()?

.apply() is better at handling more complex operations.

Example:

| Index | Category | Value |
|-------|----------|-------|
| 0     | A        | 10    |
| 1     | A        | 20    |
| 2     | B        | 30    |
| 3     | B        | 40    |
# Custom function
def custom_sum(group):
    return group.sum()

result = df.groupby('Category')['Value'].apply(custom_sum)
print(result)
What Happens During the .groupby().apply()

  1. The df.groupby('Category') splits the DataFrame into groups based on the Category column. This creates a GroupBy Object.
  2. The series from each group under['Value'] is passed to the custom function, and returns a result for each group.
Category

A 30
B 70
Name: Value, dtype: int64

2.2. New Columns Based on Aggregation

df['new_col'] = df['col'].apply(custom_func()) 
Powered by Forestry.md