Pandas Fundamentals VI: Group Operations
April 27, 2026
A function accepts one or more parameters and returns results.
Two key actions:
def, add the function name, parentheses for parameters, and finish the line with a colon :.
my_half() that returns half of its input.lambda <params>: <expression>groupby()
DataFrame.groupby() method creates a DataFrameGroupBy object that facilitates split-apply-combine operations:
DataFrame into separate groups based on unique values in one or more columns.DataFrame.fortune1000_2025.csv contains the 2025 list.import pandas as pd
import numpy as np
# Below is for an interactive display of DataFrame in Colab
from google.colab import data_table
data_table.enable_dataframe_formatter()
fortune1000 = pd.read_csv("https://bcdanl.github.io/data/fortune1000_2025.csv")
fortune = fortune1000[[
"Rank", "Company", "Sector", "Industry",
"Revenues_M", "Profits_M", "Number_of_Employees"
]]groupby()?fortune["Sector"].unique()
in_retailing = fortune["Sector"] == "Retailing"
retail_companies = fortune[ in_retailing ]
retail_companies["Revenues_M"].sum()groupby() does it for us in one line.DataFrameGroupBy Objectgroups — dictionary mapping each group label to the underlying row‑labels.describe() — one‑shot descriptive statistics within each group.agg()fortune.groupby("Sector")["Revenues_M"].mean()
fortune.groupby("Sector")["Revenues_M"].median()
fortune.groupby("Sector")["Revenues_M"].max()
fortune.groupby("Sector")["Revenues_M"].min()
fortune.groupby("Sector")["Revenues_M"].sum()DataFrameGroupBy object.
SeriesGroupBy.size() returns a Series with an alphabetical list of groups and their observation counts, the number of rows in each sector.
groupby(): how many companies fall into each group?size() counts every row in the group. count() counts only non-missing values, so it can differ column by column when NaNs are present.agg() Methodfortune.groupby("Sector")["Revenues_M"] # this is a SeriesGroupBy object
fortune.groupby("Sector")["Revenues_M"].agg('sum')
fortune.groupby("Sector")["Revenues_M"].agg('mean')
fortune.groupby("Sector")["Revenues_M"].agg('max')
fortune.groupby("Sector")["Revenues_M"].agg('min')agg() method can also be used on a SeriesGroupBy.agg() method, and pass the aggregation method we want in there..agg('sum') just duplicates .sum(). So is .agg() actually worth learning?agg()fortune.groupby("Sector").agg(
Revenues_M_min = ("Revenues_M", "min"),
Profits_M_max = ("Profits_M", "max"),
Number_of_Employees_mean = ("Number_of_Employees", "mean")
)agg() method can apply multiple aggregate operations to different variables and can accept a tuple as its argument.
DataFrameGroupBy.agg(VARIABLE_NAME_1 = ("VARIABLE_1", "STATISTICS"), ).agg() pays off: one call, different statistics on different columns, and we name each output column ourselves.| Aggregation | Description |
|---|---|
size() |
Number of observations in each group |
count() |
Number of non-missing values in each group |
nunique() |
Number of unique/distinct values in each group |
value_counts() |
Counts of each unique value for a variable within each group |
mean(), median() |
Mean and median values |
min(), max() |
Minimum and maximum values |
std() |
Standard deviation |
sum() |
Sum of all values |
agg().lambda function that receives a Series and returns a single value:fortune.groupby("Sector").agg(
Revenue_IQR = ("Revenues_M", lambda s: s.quantile(0.75) - s.quantile(0.25) )
)lambda s: ….agg() call:transform()transform() – Return to Original Shapeagg() method, the transform() method can accept the built-in functions (e.g., 'sum', 'mean').agg() method, the transform() method does not collapse DataFrame and goes back to the original index.fortune["Rev_centered"] = (
fortune.groupby("Sector")["Revenues_M"]
.transform(lambda s: s - s.mean())
)transform() with a Custom Functionlambda, we can also define a named UDF and pass it to .transform():DataFrame.assign() and DataFrameGroupBy.transform()sectors = fortune.groupby("Sector")
fortune_new = fortune.assign(
Year = 2025,
nCompanies = sectors.transform('size'),
Revenues_M_min = sectors['Revenues_M'].transform('min'),
Profits_M_max = sectors['Profits_M'].transform('max'),
Number_of_Employees_mean = sectors['Number_of_Employees'].transform('mean')
)assign() can be used to add a new variable to a DataFrame.
assign() is a DataFrame method, not a DataFrameGroupBy method.shift() for Time Comparisonsf_2021_2025 is a long-form DataFrame, containing 705 companies that appeared in the Fortune 1000 list consecutively from 2021 to 2025.The data are sorted by Rank_mean (ascending), then by year (descending).
Here we are interested in calculating the yearly revenue growth rate for each company in fortune_2021_2025 using the following formula: \[
\text{(Revenue Growth Rate)}_{year} = \frac{\text{Revenue}_{year} - \text{Revenue}_{year-1}}{\text{Revenue}_{year-1}}
\]
shift() for Time Comparisonsshift()shift() does in pandas:
shift() moves values up or down along the index (rows).shift(1): shift down (default)shift(-1): shift upshift() for Time Comparisonsf_2021_2025['GrowthRate_Revenue'] = (
(f_2021_2025['Revenues_M'] - f_2021_2025['Revenues_M_last_year']) /
f_2021_2025['Revenues_M_last_year']
)\[ \text{(Revenue Growth Rate)}_{year} = \frac{\text{Revenue}_{year} - \text{Revenue}_{year-1}}{\text{Revenue}_{year-1}} \]
apply()apply() Doesagg() nor transform() can do, like “top‑3 by revenue in each sector”.
apply() and provide a method that operates on each group’s DataFrame.url_path = "https://bcdanl.github.io/data/fortune1000_2025_shuffled.csv"
fortune_shuffled = pd.read_csv(url_path)
sectors = fortune_shuffled.groupby("Sector")sort_values() with groupby()nlargest() with groupby()DataFrameGroupBy has no methods sort_values() or nlargest()..apply():This returns the largest observation(s) within each group.
When we use DataFrame.groupby().apply(), pandas returns a MultiIndex where the first level is the group label and the second level is the original row index.
DataFrameGroupBy object?def get_nlargest_obs(df, n, var):
return df.nlargest(n, var, keep = "all")
sectors.apply(get_nlargest_obs, 2, "Revenues_M")DataFrameGroupBy object, we need to provide the rest of arguments to the apply() method.def filter_by_threshold(df, column, threshold):
return df[ df[column] > threshold ]
sectors.apply(filter_by_threshold, "Revenues_M", 5000)"Revenues_M" and 5000 are passed to filter rows within each group.DataFrameGroupBy object?apply() method!
agg(), transform(), and apply()| Method | Use When… | Result Shape | Example |
|---|---|---|---|
agg() |
You want one summary row per group | One row per group | Average revenue by sector |
transform() |
You want a group statistic added to each original row | Same number of rows as original data | Sector average revenue for each company |
apply() |
You need flexible custom logic | Depends on the function | Top 3 companies in each sector |
# One row per sector
fortune.groupby("Sector").agg(
avg_revenue = ("Revenues_M", "mean")
)
# One value per original row
fortune["sector_avg_revenue"] = (
fortune.groupby("Sector")["Revenues_M"].transform("mean")
)
# Flexible output
fortune.groupby("Sector").apply(
lambda df: df.nlargest(3, "Revenues_M", keep="all")
)Let’s do Classwork 17!
sector_and_industry = fortune.groupby(["Sector", "Industry"])
sector_and_industry.groups
len(sector_and_industry)
sector_and_industry.describe()DataFrameGroupBy object with values from multiple variables.
Let’s do Classwork 18!
I have made every effort to create a meaningful and supportive learning experience in this course.
Your feedback is extremely valuable and helps improve future versions of the course.
I sincerely encourage you to complete the Student Course Experience (SCE) survey.
🔗 https://go.geneseo.edu/scesurveys
Thank you for being part of DANL 210 this semester.
I really appreciate the effort you put into learning Python, working with data, and building new skills.
I hope Python feels a little less intimidating than it did in January, and I hope this course gave you more confidence in handling real-world data in the future.
Thank you again for your participation, feedback, and hard work throughout the semester. It has been a pleasure.
— Byeong-Hak