Group Operations
April 30, 2025
A function accepts one or more parameters and returns results.
Two key actions:
def
, add the function name, parentheses for parameters, and finish the line with a colon :
.
my_half()
that returns half of its input.lambda <params>: <expression>
DataFrameGroupBy
groupby()
: Split-Apply-Combine
DataFrame.groupby()
method creates a DataFrameGroupBy
object that facilitates split-apply-combine operations:
DataFrame
into separate groups based on unique values in one or more columns.DataFrame
.fortune1000_2024.csv
contains the 2024 list.import pandas as pd
fortune1000 = pd.read_csv("https://bcdanl.github.io/data/fortune1000_2024.csv")
fortune = fortune1000[[
"Rank", "Company", "Sector", "Industry",
"Revenues_M", "Profits_M", "Number_of_Employees"
]]
fortune["Sector"].unique()
in_retailing = fortune["Sector"] == "Retailing"
retail_companies = fortune[ in_retailing ]
retail_companies["Revenues_M"].sum()
groupby()
does it for us in one line.DataFrameGroupBy
Methods and Attributesgroups
— dictionary mapping each group label to the underlying row‑labels.describe()
— one‑shot descriptive statistics within each group.groupby()
with agg()
DataFrameGroupBy
object to apply aggregate operations to every group.
fortune.groupby("Sector").size()
fortune.groupby("Sector").mean()
fortune.groupby("Sector").median()
fortune.groupby("Sector").max()
fortune.groupby("Sector").min()
fortune.groupby("Sector").sum()
size()
method on the DataFrameGroupBy
object returns a Series
with an alphabetical list of the groups and their observation counts.fortune.groupby("Sector")["Revenues_M"].mean()
fortune.groupby("Sector")["Revenues_M"].median()
fortune.groupby("Sector")["Revenues_M"].max()
fortune.groupby("Sector")["Revenues_M"].min()
fortune.groupby("Sector")["Revenues_M"].sum()
DataFrameGroupBy
object.
SeriesGroupBy
.agg()
Methodfortune.groupby("Sector")["Revenues_M"] # this is a SeriesGroupBy object
fortune.groupby("Sector")["Revenues_M"].agg('sum')
fortune.groupby("Sector")["Revenues_M"].agg('mean')
fortune.groupby("Sector")["Revenues_M"].agg('max')
fortune.groupby("Sector")["Revenues_M"].agg('min')
agg()
method can also be used on a SeriesGroupBy
.agg()
method, and pass the aggregation method we want in there.fortune.groupby("Sector").agg(
Revenues_M_min = ("Revenues_M", "min"),
Profits_M_max = ("Profits_M", "max"),
Number_of_Employees_mean = ("Number_of_Employees", "mean")
)
agg()
method can apply multiple aggregate operations to different variables and can accept a tuple as its argument.
DataFrameGroupBy.agg(VARIABLE_NAME_1 = ("VARIABLE_1", "STATISTICS"), )
Aggregation | Description |
---|---|
size() |
Number of observations in each group |
count() |
Number of non-missing values in each group |
nunique() |
Number of unique/distinct values in each group |
value_counts() |
Count of unique values for a variable within each group |
mean() , median() |
Mean and median values |
min() , max() |
Minimum and maximum values |
std() |
Standard deviation |
sum() |
Sum of all values |
agg()
with a lambda
UDFagg()
.lambda
function that receives a Series and returns a single value:fortune.groupby("Sector").agg(
Revenue_IQR = ("Revenues_M", lambda s: s.quantile(0.75) - s.quantile(0.25) )
)
lambda s: …
.agg()
with a Named UDFagg()
call:Let’s do Questions 1-3 in Classwork 13!
groupby()
with transform()
transform()
– Return to Original Shapeagg()
method, the transform()
method can accept the built-in functions (e.g., 'sum'
, 'mean'
).agg()
method, the transform()
method does not collapse DataFrame
and goes back to the original index.transform()
with a lambda
UDFfortune["Rev_centered"] = (
fortune.groupby("Sector")["Revenues_M"]
.transform(lambda s: s - s.mean())
)
transform()
with a Named UDFlambda
, we can also define a named UDF and pass it to .transform()
:DataFrame.assign()
and DataFrameGroupBy.transform()
sectors = fortune.groupby("Sector")
fortune_new = fortune.assign(
Year = 2024,
nCompanies = sectors.transform('size'),
Revenues_M_min = sectors['Revenues_M'].transform('min'),
Profits_M_max = sectors['Profits_M'].transform('max'),
Number_of_Employees_mean = sectors['Number_of_Employees'].transform('mean')
)
assign()
can be used to add a new variable to a DataFrame
.
assign()
is a DataFrame
method, not a DataFrameGroupBy
method.transform()
with shift()
f_2021_2024
is a long-from DataFrame, containing 741 companies that appeared in the Fortune 1000 list consecutively from 2021 to 2024.fortune_2021_2024
using the following formula: \[
\text{(Revenue Growth Rate)}_{year} = \frac{\text{Revenue}_{year} - \text{Revenue}_{year-1}}{\text{Revenue}_{year-1}}
\]shift()
does in pandas:
shift()
moves values up or down along the index (rows).shift(1)
: shift down (default)shift(-1)
: shift upf_2021_2024['GrowthRate_Revenue'] = (
(f_2021_2024['Revenues_M'] - f_2021_2024['Revenues_M_last_year']) /
f_2021_2024['Revenues_M_last_year']
)
\[ \text{(Revenue Growth Rate)}_{year} = \frac{\text{Revenue}_{year} - \text{Revenue}_{year-1}}{\text{Revenue}_{year-1}} \]
Let’s do Question 4 in Classwork 13!
groupby()
with apply()
agg()
nor transform()
can do, like “top‑3 by revenue in each sector”.
apply()
and provide a method that operates on each group’s DataFrame
.fortune_shuffled = pd.read_csv("https://bcdanl.github.io/data/fortune1000_2024_shuffled.csv")
sectors = fortune_shuffled.groupby("Sector")
sort_values()
with groupby()
nlargest()
with groupby()
DataFrameGroupBy
has no methods sort_values()
or nlargest()
.lambda
UDF with .apply()
.apply()
:This returns the largest observation(s) within each group.
When we use DataFrame.groupby().apply()
, pandas returns a MultiIndex where the first level is the group label and the second level is the original row index.
DataFrameGroupBy
object?def get_nlargest_obs(df, n, var):
return df.nlargest(n, var, keep = "all")
sectors.apply(get_nlargest_obs, 2, "Revenues_M")
DataFrameGroupBy
object, we need to provide the rest of arguments to the apply()
method.def filter_by_threshold(df, column, threshold):
return df[ df[column] > threshold ]
sectors.apply(filter_by_threshold, "Revenues_M", 5000)
"Revenues_M"
and 5000
are passed to filter rows within each group.DataFrameGroupBy
object?apply()
method!
agg()
vs. transform()
vs. apply()
agg()
with .groupby()
when summarizing multiple values into a single aggregated value per group.
transform()
with .groupby()
when computing one aggregated value per observation.
DataFrame
.fortune["Revenue_diff_mean"] = fortune["Revenues_M"] - fortune.groupby("Sector")["Revenues_M"].transform("mean")
apply()
with .groupby()
for custom operations returning multiple observations per group.agg()
vs. transform()
vs apply()
Method | Purpose | Result Shape | Example Use Case |
---|---|---|---|
agg() |
Summarize groups | One row per group | Calculate average revenue by sector |
transform() |
Broadcast group-level metrics | Same shape as original DataFrame |
Add a column that repeats each sector’s average revenue for every row |
apply() |
Complex processing | Varies with function | Get top N companies in each sector |
Let’s do Questions 5-6 in Classwork 13!
sector_and_industry = fortune.groupby(["Sector", "Industry"])
sector_and_industry.groups
len(sector_and_industry)
sector_and_industry.describe()
DataFrameGroupBy
object with values from multiple variables.
agg()
transform()
apply()
pandas
for efficient and scalable data wranglingselenium
requests
Let’s do Classwork 14!