Advanced Group Operations; Reshaping/Joining DataFrames
March 12, 2024
DataFrames
DataFrames
A function can take any number and type of input parameters and return any number and type of output results.
We can do two things with a function:
To define a Python function, we type def
, the function name, parentheses enclosing any input parameters to the function, and then finally, a colon (:
).
Let’s define a very simple function my_half()
that has a parameter x
and returns a value x / 2
.
The values we pass into the function when we call it are known as arguments.
When we call a function with arguments, the values of those arguments are copied to their corresponding parameters inside the function.
fortune_2023.csv
file is a collection of Fortune 1000 companies from 2023 (Source: Kaggle).import pandas as pd
# Below is for an interactive display of DataFrame in Colab
from google.colab import data_table
data_table.enable_dataframe_formatter()
fortune1000 = pd.read_csv("https://bcdanl.github.io/data/fortune_2023.csv")
varlist = ['Company', 'Revenues_M', 'Profits_M', 'Number_of_employees', 'Sector', 'Industry']
fortune = fortune1000[varlist]
GroupBy
object from a datasetSector
value of "Retailing"
.
in_retailing = ( fortune["Sector"] == "Retailing" )
retail_companies = fortune[ in_retailing ]
retail_companies["Revenues"].mean()
fortune
.
GroupBy
object offers the best solution out of the box.GroupBy
object from a datasetWe can count the number of groups in sectors by passing the GroupBy
object into the Python’s built-in len()
function
The size()
method on the GroupBy
object returns a Series
with an alphabetical list of the groups and their observation counts.
GroupBy
objectgroups
attribute stores a dictionary with associations of group-to-observations.GroupBy
objectGroupBy
object’s first()/last()
method extracts the first/last observation listed for each group in a DataFrame
.
fortune
DataFrame
is sorted by Revenue_M
, the first company pulled out for each sector will be the highest-performing company within that sector.GroupBy
objectnth()
method is used with a GroupBy
object to select the nth observation from each group.
fortune
.GroupBy
objecthead(n)/tail(n)
method extracts the first/last n
observations from each group.GroupBy
objectget_group()
method to extract all observations in a given group.
DataFrame
containing the observations.GroupBy
object to apply aggregate operations to every group.
sectors["Revenues_M"]
sectors["Revenues_M"].sum()
sectors["Revenues_M"].mean()
sectors["Revenues_M"].max()
sectors["Revenues_M"].min()
GroupBy
object.
SeriesGroupBy
.sectors["Revenues_M"] # this is a SeriesGroupBy object
sectors["Revenues_M"].agg('sum')
sectors["Revenues_M"].agg('mean')
sectors["Revenues_M"].agg('max')
sectors["Revenues_M"].agg('min')
agg()
method can also be used on a SeriesGroupBy
.agg()
method, and pass the aggregation method we want in there.sectors.agg(
Revenues_M_min = ("Revenues_M", "min"),
Profits_M_max = ("Profits_M", "max"),
Number_of_employees_mean = ("Revenues_M", "mean")
)
agg()
method can apply multiple aggregate operations to different variables and can accept a tuple as its argument.
Let’s do Questions 1-6 in Part 1 of Classwork 6!
GroupBy.transform()
Just like the agg()
method, the transform()
method can accept the aggregation method (e.g., 'sum'
, 'mean'
).
Unlike the agg()
method, the transform()
method does not collapse DataFrame
and goes back to the original index.
GroupBy.transform()
fortune['Revenues_M_min'] = sectors['Revenues_M'].transform('min')
fortune['Profits_M_max'] = sectors['Profits_M'].transform('max')
fortune['Number_of_employees_mean'] = sectors['Number_of_employees'].transform('mean')
transform()
method returns a Series
with the index label that is the same as in the original DataFrame
, it can be used to add a new variable to the original DataFrame
.DataFrame.assign()
and GroupBy.transform()
sectors = fortune.groupby("Sector")
fortune = fortune.assign(
Revenues_M_min = sectors['Revenues_M'].transform('min'),
Profits_M_max = sectors['Profits_M'].transform('max'),
Number_of_employees_mean = sectors['Number_of_employees'].transform('mean')
)
assign()
method allows us to create new variables or overwrite existing ones in a DataFrame
DataFrame
with the new variables added.assign()
method can be useful for method chaining.agg()
vs. transform()
agg()
with groupby()
when you want to take multiple values and return a single (aggregated) value for each group.
agg()
does collapse the DataFrame
, and groups become the new index labels.transform()
with groupby()
when you want to perform computations on your groups but you want to return a single (aggregated) value for each observation.
transform()
does not collapse the DataFrame
.Let’s do Question 7 of Classwork 7!
GroupBy
object.
fortune
.sample(frac=1)
: This samples 100% of the observations from the DataFrame
, effectively shuffling it.
frac
parameter specifies the fraction of observations to return in the random sample, so frac=1
means “return all observations”.GroupBy
object.
fortune
.reset_index()
: This resets the index of the DataFrame
.
drop = True
option is used to prevent the old index from being added as a variable in the new DataFrame
.sort_values()
with groupby()
nlargest()
with groupby()
DataFrameGroupBy
has no methods sort_values()
or nlargest()
.nlargest()
method to each group?
DataFrame
.apply()
method the custom function.def get_largest_obs(df):
return df.nlargest(1, "Revenues_M", keep="all")
sectors.apply(get_largest_obs)
apply()
method when pandas does not support a custom aggregation we would like to apply to each group.get_largest_obs
function as a lambda function with the apply()
method, we can do the following.GroupBy
object with values from multiple variables.
The GroupBy
object’s size()
method now returns a MultiIndex Series
with a count of observations for each internal group.
The reset_index()
method can be used to convert a Series
into a DataFrame
.
name
option renames a variable of index when resetting index.get_group()
method requires a tuple of values to extract a nested DataFrame
from the GroupBy
collection.sector_and_industry.sum()
sector_and_industry["Revenues_M"].mean()
(
sector_and_industry["Revenues_M"]
.mean()
.reset_index(name = "Revenues_mean")
)
MultiIndex DataFrame
with the calculations.Let’s do Question 7 in Part 1 of Classwork 6!
DataFrames
DataFrames
DataFrames
DataFrame
of patient information, each observation could correspond to a single patient’s data record.DataFrames
DataFrames
DataFrame
tidy:
DataFrames
A DataFrame
can be given in a format unsuited for the analysis that we would like to perform on it.
DataFrame
may have larger structural problems that extend beyond the data.DataFrame
stores its values in a format that makes it easy to extract a single row but difficult to aggregate the data.Reshaping a DataFrame
means manipulating it into a different shape.
In this section, we will discuss pandas techniques for molding DataFrame
into the shapes we desire.
DataFrames
DataFrames
DataFrames
measure temperatures in two cities over two days.df_wide = pd.DataFrame({
'Weekday': ['Tuesday', 'Wednesday'],
'Miami': [80, 83],
'Rochester': [57, 62],
'St. Louis': [71, 75]
})
df_long = pd.DataFrame({
'Weekday': ['Tuesday', 'Wednesday', 'Tuesday', 'Wednesday', 'Tuesday', 'Wednesday'],
'City': ['Miami', 'Miami', 'Rochester', 'Rochester', 'St. Louis', 'St. Louis'],
'Temperature': [80, 83, 57, 62, 71, 75]
})
DataFrames
DataFrames
DataFrame
can store its values in wide or long format.DataFrame
increases in width.DataFrame
increases in height.DataFrame
depends on the insight we are trying to glean from it.
DataFrames
wider if one observation is spread across multiple rows.DataFrames
longer if one variable is spread across multiple columns.DataFrames
melt()
and pivot()
melt()
makes DataFrame
longer.pivot()
and pivot_table()
make DataFrame
wider.DataFrames
DataFrame
Longer with melt()
DataFrames
DataFrame
Longer with melt()
melt()
can take a few parameters:
id_vars
is a container (string
, list
, tuple
, or array
) that represents the variables that will remain as is.id_vars
can indicate which column should be the “identifier”.DataFrames
DataFrame
Longer with melt()
df_wide_to_long = (
df_wide
.melt(id_vars = "Weekday",
var_name = "City",
value_name = "Temperature")
)
melt()
can take a few parameters:
var_name
is a string
for the new column name for the variable.value_name
is a string
for the new column name that represents the values for the var_name
.DataFrames
DataFrame
Longer with melt()
df_wide_to_long = (
df_wide
.melt(id_vars = "Weekday",
var_name = "City",
value_name = "Temperature",
value_vars = ['Miami', 'Rochester'])
)
melt()
can take a few parameters:
value_vars
parameter allows us to select which specific columns we want to “melt”.id_vars
parameter.DataFrames
DataFrame
Wider with pivot()
df_long_to_wide = (
df_long
.pivot(index = "Weekday",
columns = "City",
values = "Temperature" # To avoid having MultiIndex
)
.reset_index()
)
pivot()
, we need to specify a few parameters:
index
that takes the column to pivot on;columns
that takes the column to be used to make the new variable names of the wider DataFrame
;values
that takes the column that provides the values of the variables in the wider DataFrame
.DataFrames
DataFrame
, df
, containing information about the number of courses each student took from each department in each year.dict_data = {"Name": ["Donna", "Donna", "Mike", "Mike"],
"Department": ["ECON", "DANL", "ECON", "DANL"],
"2018": [1, 2, 3, 1],
"2019": [2, 3, 4, 2],
"2020": [5, 1, 2, 2]}
df = pd.DataFrame(dict_data)
df_longer = df.melt(id_vars=["Name", "Department"],
var_name="Year",
value_name="Number of Courses")
pivot_table()
method can take both a string
and a list
of variables for the index
parameter.
pivot()
can take only a string
for index
.DataFrames
DataFrame
, df
, containing information about the number of courses each student took from each department in each year.dict_data = {"Name": ["Donna", "Donna", "Mike", "Mike"],
"Department": ["ECON", "DANL", "ECON", "DANL"],
"2018": [1, 2, 3, 1],
"2019": [2, 3, 4, 2],
"2020": [5, 1, 2, 2]}
df = pd.DataFrame(dict_data)
df_longer = df.melt(id_vars=["Name", "Department"],
var_name="Year",
value_name="Number of Courses")
Q. How can we use the df_longer
to create the wide-form DataFrame
, df_wider
, which is equivalent to the df
?
DataFrames
Let’s do Part 2 of Classwork 6!
DataFrames
DataFrames
DataFrame
for county-level data and DataFrame
for geographic information, such as longitude and latitude.DataFrames
based on common data values in those DataFrames
.
merge()
method in Pandas.DataFrames
Why is one data set sometimes scattered across multiple files?
Sometimes we may have two or more DataFrames
that we want to combine based on common data values.
merge()
method in Pandas.DataFrames
DataFrames
with merge()
DataFrames
with merge()
x
.
y
.
DataFrames
with merge()
DataFrame
has duplicate keys (a one-to-many relationship).
DataFrames
with merge()
left_on
and right_on
parameters instead.DataFrames
with merge()
Let’s do Part 3 of Classwork 6!