Group Operations
April 30, 2025
A function accepts one or more parameters and returns results.
Two key actions:
def, add the function name, parentheses for parameters, and finish the line with a colon :.
my_half() that returns half of its input.lambda <params>: <expression>DataFrameGroupBygroupby() : Split-Apply-Combine
DataFrame.groupby() method creates a DataFrameGroupBy object that facilitates split-apply-combine operations:
DataFrame into separate groups based on unique values in one or more columns.DataFrame.fortune1000_2024.csv contains the 2024 list.import pandas as pd
fortune1000 = pd.read_csv("https://bcdanl.github.io/data/fortune1000_2024.csv")
fortune = fortune1000[[
"Rank", "Company", "Sector", "Industry",
"Revenues_M", "Profits_M", "Number_of_Employees"
]]fortune["Sector"].unique()
in_retailing = fortune["Sector"] == "Retailing"
retail_companies = fortune[ in_retailing ]
retail_companies["Revenues_M"].sum()groupby() does it for us in one line.DataFrameGroupBy Methods and Attributesgroups — dictionary mapping each group label to the underlying row‑labels.describe() — one‑shot descriptive statistics within each group.groupby() with agg()DataFrameGroupBy object to apply aggregate operations to every group.
fortune.groupby("Sector").size()
fortune.groupby("Sector").mean()
fortune.groupby("Sector").median()
fortune.groupby("Sector").max()
fortune.groupby("Sector").min()
fortune.groupby("Sector").sum()size() method on the DataFrameGroupBy object returns a Series with an alphabetical list of the groups and their observation counts.fortune.groupby("Sector")["Revenues_M"].mean()
fortune.groupby("Sector")["Revenues_M"].median()
fortune.groupby("Sector")["Revenues_M"].max()
fortune.groupby("Sector")["Revenues_M"].min()
fortune.groupby("Sector")["Revenues_M"].sum()DataFrameGroupBy object.
SeriesGroupBy.agg() Methodfortune.groupby("Sector")["Revenues_M"] # this is a SeriesGroupBy object
fortune.groupby("Sector")["Revenues_M"].agg('sum')
fortune.groupby("Sector")["Revenues_M"].agg('mean')
fortune.groupby("Sector")["Revenues_M"].agg('max')
fortune.groupby("Sector")["Revenues_M"].agg('min')agg() method can also be used on a SeriesGroupBy.agg() method, and pass the aggregation method we want in there.fortune.groupby("Sector").agg(
Revenues_M_min = ("Revenues_M", "min"),
Profits_M_max = ("Profits_M", "max"),
Number_of_Employees_mean = ("Number_of_Employees", "mean")
)agg() method can apply multiple aggregate operations to different variables and can accept a tuple as its argument.
DataFrameGroupBy.agg(VARIABLE_NAME_1 = ("VARIABLE_1", "STATISTICS"), )| Aggregation | Description |
|---|---|
size() |
Number of observations in each group |
count() |
Number of non-missing values in each group |
nunique() |
Number of unique/distinct values in each group |
value_counts() |
Count of unique values for a variable within each group |
mean(), median() |
Mean and median values |
min(), max() |
Minimum and maximum values |
std() |
Standard deviation |
sum() |
Sum of all values |
agg() with a lambda UDFagg().lambda function that receives a Series and returns a single value:fortune.groupby("Sector").agg(
Revenue_IQR = ("Revenues_M", lambda s: s.quantile(0.75) - s.quantile(0.25) )
)lambda s: ….agg() with a Named UDFagg() call:Let’s do Questions 1-3 in Classwork 13!
groupby() with transform()transform() – Return to Original Shapeagg() method, the transform() method can accept the built-in functions (e.g., 'sum', 'mean').agg() method, the transform() method does not collapse DataFrame and goes back to the original index.transform() with a lambda UDFfortune["Rev_centered"] = (
fortune.groupby("Sector")["Revenues_M"]
.transform(lambda s: s - s.mean())
)transform() with a Named UDFlambda, we can also define a named UDF and pass it to .transform():DataFrame.assign() and DataFrameGroupBy.transform()sectors = fortune.groupby("Sector")
fortune_new = fortune.assign(
Year = 2024,
nCompanies = sectors.transform('size'),
Revenues_M_min = sectors['Revenues_M'].transform('min'),
Profits_M_max = sectors['Profits_M'].transform('max'),
Number_of_Employees_mean = sectors['Number_of_Employees'].transform('mean')
)assign() can be used to add a new variable to a DataFrame.
assign() is a DataFrame method, not a DataFrameGroupBy method.transform() with shift()f_2021_2024 is a long-from DataFrame, containing 741 companies that appeared in the Fortune 1000 list consecutively from 2021 to 2024.fortune_2021_2024 using the following formula: \[
\text{(Revenue Growth Rate)}_{year} = \frac{\text{Revenue}_{year} - \text{Revenue}_{year-1}}{\text{Revenue}_{year-1}}
\]shift() does in pandas:
shift() moves values up or down along the index (rows).shift(1): shift down (default)shift(-1): shift upf_2021_2024['GrowthRate_Revenue'] = (
(f_2021_2024['Revenues_M'] - f_2021_2024['Revenues_M_last_year']) /
f_2021_2024['Revenues_M_last_year']
)\[ \text{(Revenue Growth Rate)}_{year} = \frac{\text{Revenue}_{year} - \text{Revenue}_{year-1}}{\text{Revenue}_{year-1}} \]
Let’s do Question 4 in Classwork 13!
groupby() with apply()agg() nor transform() can do, like “top‑3 by revenue in each sector”.
apply() and provide a method that operates on each group’s DataFrame.fortune_shuffled = pd.read_csv("https://bcdanl.github.io/data/fortune1000_2024_shuffled.csv")
sectors = fortune_shuffled.groupby("Sector")sort_values() with groupby()nlargest() with groupby()DataFrameGroupBy has no methods sort_values() or nlargest().lambda UDF with .apply().apply():This returns the largest observation(s) within each group.
When we use DataFrame.groupby().apply(), pandas returns a MultiIndex where the first level is the group label and the second level is the original row index.
DataFrameGroupBy object?def get_nlargest_obs(df, n, var):
return df.nlargest(n, var, keep = "all")
sectors.apply(get_nlargest_obs, 2, "Revenues_M")DataFrameGroupBy object, we need to provide the rest of arguments to the apply() method.def filter_by_threshold(df, column, threshold):
return df[ df[column] > threshold ]
sectors.apply(filter_by_threshold, "Revenues_M", 5000)"Revenues_M" and 5000 are passed to filter rows within each group.DataFrameGroupBy object?apply() method!
agg() vs. transform() vs. apply()agg() with .groupby() when summarizing multiple values into a single aggregated value per group.
transform() with .groupby() when computing one aggregated value per observation.
DataFrame.fortune["Revenue_diff_mean"] = fortune["Revenues_M"] - fortune.groupby("Sector")["Revenues_M"].transform("mean")apply() with .groupby() for custom operations returning multiple observations per group.agg() vs. transform() vs apply()| Method | Purpose | Result Shape | Example Use Case |
|---|---|---|---|
agg() |
Summarize groups | One row per group | Calculate average revenue by sector |
transform() |
Broadcast group-level metrics | Same shape as original DataFrame |
Add a column that repeats each sector’s average revenue for every row |
apply() |
Complex processing | Varies with function | Get top N companies in each sector |
Let’s do Questions 5-6 in Classwork 13!
sector_and_industry = fortune.groupby(["Sector", "Industry"])
sector_and_industry.groups
len(sector_and_industry)
sector_and_industry.describe()DataFrameGroupBy object with values from multiple variables.
agg()transform()apply()pandas for efficient and scalable data wranglingseleniumrequestsLet’s do Classwork 14!