Lecture 6

Advanced Group Operations; Reshaping/Joining DataFrames

Byeong-Hak Choe

bchoe@geneseo.edu

SUNY Geneseo

March 12, 2024

Learning Objectives

Primer on Custom Functions
Group Operations
Reshaping DataFrames
Joining DataFrames

Primer on Custom Functions

Python Basics

Primer on Custom Functions

A function can take any number and type of input parameters and return any number and type of output results.
We can do two things with a function:
- Define it
- Call it

Python Basics

Primer on Custom Functions

To define a Python function, we type def, the function name, parentheses enclosing any input parameters to the function, and then finally, a colon (:).
Let’s define a very simple function my_half() that has a parameter x and returns a value x / 2.

def my_half(x):
    return x / 2
    
my_half(2)

The values we pass into the function when we call it are known as arguments.
When we call a function with arguments, the values of those arguments are copied to their corresponding parameters inside the function.

Python Basics

Primer on Custom Functions - Lambda Functions

A Python lambda function is an anonymous function expressed as a single statement.
- It is defined with the lambda keyword, which has no meaning other than “we are declaring an anonymous function”:

def my_half(x):
    return x / 2
    
equiv_my_half = lambda x: x / 2

my_half(2)
equiv_my_half(2)

Group Operations

Fortune 1000 dataset

The Fortune 1000 is a listing of the 1,000 largest companies in the United States by revenue.
- The list is updated annually by the business magazine Fortune.
- The fortune_2023.csv file is a collection of Fortune 1000 companies from 2023 (Source: Kaggle).

import pandas as pd
# Below is for an interactive display of DataFrame in Colab
from google.colab import data_table
data_table.enable_dataframe_formatter()
fortune1000 = pd.read_csv("https://bcdanl.github.io/data/fortune_2023.csv")
varlist = ['Company', 'Revenues_M', 'Profits_M', 'Number_of_employees', 'Sector', 'Industry']
fortune = fortune1000[varlist]

A sector can have many companies; An industry is a subcategory within a sector.

Group Operations

Creating a `GroupBy` object from a dataset

Let’s pull out all companies with a Sector value of "Retailing".
- Then, calculate the Retailing sector’s average revenue:

in_retailing = ( fortune["Sector"] == "Retailing" )
retail_companies = fortune[ in_retailing ]
retail_companies["Revenues"].mean()

Without a group operation, we may need to write a lot of additional code to apply the same logic to the other 20 sectors in fortune.
- Pandas’ GroupBy object offers the best solution out of the box.

Group Operations

Creating a `GroupBy` object from a dataset

sectors = fortune.groupby("Sector")
len(sectors)    # fortune["Sector"].nunique()
sectors.size()

We can count the number of groups in sectors by passing the GroupBy object into the Python’s built-in len() function
The size() method on the GroupBy object returns a Series with an alphabetical list of the groups and their observation counts.

Group Operations

Attributes and methods of a `GroupBy` object

sectors.groups

The groups attribute stores a dictionary with associations of group-to-observations.

Group Operations

Attributes and methods of a `GroupBy` object

What if we want to find the highest-performing company (by revenue) within each sector?

sectors.first()
sectors.last()

The GroupBy object’s first()/last() method extracts the first/last observation listed for each group in a DataFrame.
- Since our fortune DataFrame is sorted by Revenue_M, the first company pulled out for each sector will be the highest-performing company within that sector.

Group Operations

Attributes and methods of a `GroupBy` object

sectors.nth(0)
sectors.nth(1)
fortune[fortune["Sector"] == "Apparel"]

The nth() method is used with a GroupBy object to select the nth observation from each group.
- Here we can confirm the output is correct by filtering for the “Apparel” observations in fortune.

Group Operations

Attributes and methods of a `GroupBy` object

sectors.head(2)
sectors.tail(2)

The head(n)/tail(n) method extracts the first/last n observations from each group.

Group Operations

Attributes and methods of a `GroupBy` object

sectors.get_group("Energy")
type( sectors.get_group("Energy") )

We can use the get_group() method to extract all observations in a given group.
- The method returns a DataFrame containing the observations.

Group Operations

Aggregation

We can invoke methods on the GroupBy object to apply aggregate operations to every group.
- Aggregation is the process of taking multiple values and returning a single value.

sectors.sum()
sectors.mean()

Group Operations

Aggregation

sectors["Revenues_M"]
sectors["Revenues_M"].sum()
sectors["Revenues_M"].mean()
sectors["Revenues_M"].max()
sectors["Revenues_M"].min()

We can target a single variable by passing its name inside square brackets after the GroupBy object.
- Pandas returns a new object, a SeriesGroupBy.

Group Operations

Aggregation

sectors["Revenues_M"]  # this is a SeriesGroupBy object
sectors["Revenues_M"].agg('sum')
sectors["Revenues_M"].agg('mean')
sectors["Revenues_M"].agg('max')
sectors["Revenues_M"].agg('min')

The agg() method can also be used on a SeriesGroupBy.
Instead of directly calling the aggregation method, we can call the agg() method, and pass the aggregation method we want in there.

Group Operations

Aggregation

sectors.agg(
  Revenues_M_min = ("Revenues_M", "min"),
  Profits_M_max = ("Profits_M", "max"),
  Number_of_employees_mean = ("Revenues_M", "mean")
)

The agg() method can apply multiple aggregate operations to different variables and can accept a tuple as its argument.

Group Operations

Aggregation

We pass in whatever aggregation we want.
- Some common options are in the table above.

Group Operations

Let’s do Questions 1-6 in Part 1 of Classwork 6!

Group Operations

Add a New Variable with `GroupBy.transform()`

Just like the agg() method, the transform() method can accept the aggregation method (e.g., 'sum', 'mean').
Unlike the agg() method, the transform() method does not collapse DataFrame and goes back to the original index.

sectors['Revenues_M'].transform('min')
sectors['Profits_M'].transform('max')
sectors['Number_of_employees'].transform('mean')

Group Operations

Add a New Variable with `GroupBy.transform()`

fortune['Revenues_M_min'] = sectors['Revenues_M'].transform('min')
fortune['Profits_M_max'] = sectors['Profits_M'].transform('max')
fortune['Number_of_employees_mean'] = sectors['Number_of_employees'].transform('mean')

Since the transform() method returns a Series with the index label that is the same as in the original DataFrame, it can be used to add a new variable to the original DataFrame.

Group Operations

Add New Variables with `DataFrame.assign()` and `GroupBy.transform()`

sectors = fortune.groupby("Sector")

fortune = fortune.assign(
    Revenues_M_min = sectors['Revenues_M'].transform('min'),
    Profits_M_max = sectors['Profits_M'].transform('max'),
    Number_of_employees_mean = sectors['Number_of_employees'].transform('mean')
)

The assign() method allows us to create new variables or overwrite existing ones in a DataFrame
- It returns a new DataFrame with the new variables added.
The assign() method can be useful for method chaining.

Group Operations

`agg()` vs. `transform()`

Use agg() with groupby() when you want to take multiple values and return a single (aggregated) value for each group.
- agg() does collapse the DataFrame, and groups become the new index labels.
Use transform() with groupby() when you want to perform computations on your groups but you want to return a single (aggregated) value for each observation.
- transform() does not collapse the DataFrame.

Group Operations

Let’s do Question 7 of Classwork 7!

Group Operations

Applying a Custom Operation to All Groups

Suppose that we want to apply a custom operation to each group in a GroupBy object.
- For practice, let’s now shuffle observations in fortune.

fortune = (
    fortune
    .sample( frac = 1 )
    .reset_index( drop = True )
)

sample(frac=1): This samples 100% of the observations from the DataFrame, effectively shuffling it.
- The frac parameter specifies the fraction of observations to return in the random sample, so frac=1 means “return all observations”.

Group Operations

Applying a Custom Operation to All Groups

Suppose that we want to apply a custom operation to each group in a GroupBy object.
- For practice, let’s now shuffle observations in fortune.

fortune = (
    fortune
    .sample( frac = 1 )
    .reset_index( drop = True )
)

reset_index(): This resets the index of the DataFrame.
- The drop = True option is used to prevent the old index from being added as a variable in the new DataFrame.

Group Operations

Applying a Custom Operation to All Groups

How can we identify top 5 companies in each sector?
1. sort_values() with groupby()
2. nlargest() with groupby()
However, DataFrameGroupBy has no methods sort_values() or nlargest().

Group Operations

Applying a Custom Operation to All Groups

How can we apply the nlargest() method to each group?
- First, define a custom function that accepts a single argument: a DataFrame.
- Second, pass the apply() method the custom function.

def get_largest_obs(df):
    return df.nlargest(1, "Revenues_M", keep="all")

sectors.apply(get_largest_obs)

We can use the apply() method when pandas does not support a custom aggregation we would like to apply to each group.

Group Operations

Applying a Custom Operation to All Groups

To rewrite our get_largest_obs function as a lambda function with the apply() method, we can do the following.

sectors.apply(lambda df: df.nlargest(1, "Revenues_M", keep="all"))

Group Operations

Grouping by Multiple Variables

sector_and_industry = fortune.groupby(["Sector", "Industry"])

We can create a GroupBy object with values from multiple variables.
- This operation is optimal when a combination of variables serves as the best identifier for a group.

Group Operations

Grouping by Multiple Variables

(
    sector_and_industry
    .size()
    .reset_index(name = "n")
)

The GroupBy object’s size() method now returns a MultiIndex Series with a count of observations for each internal group.
The reset_index() method can be used to convert a Series into a DataFrame.
- The name option renames a variable of index when resetting index.

Group Operations

Grouping by Multiple Variables

(
  sector_and_industry
  .get_group(("Business Services", "Financial Data Services"))
)

The get_group() method requires a tuple of values to extract a nested DataFrame from the GroupBy collection.

Group Operations

Grouping by Multiple Variables

sector_and_industry.sum()
sector_and_industry["Revenues_M"].mean()

(
    sector_and_industry["Revenues_M"]
    .mean()
    .reset_index(name = "Revenues_mean")
)

For all aggregations, pandas returns a MultiIndex DataFrame with the calculations.

Group Operations

Let’s do Question 7 in Part 1 of Classwork 6!

Reshaping `DataFrames`

Tidy `DataFrames`

A variable is the measurements that vary.
- It could be anything from a person’s age, the temperature on a given day, or the sales figures for a product.
An observation represents a unit of analysis.
- It could be a measurement, a survey response, or any other unit of data collected.
- E.g., In a DataFrame of patient information, each observation could correspond to a single patient’s data record.
A value is the actual data point corresponding to the variable and observation.

Reshaping `DataFrames`

Tidy `DataFrames`

There are three interrelated rules that make a DataFrame tidy:
- Each variable is a column; each column is a variable.
- Each observation is a row; each row is an observation.
- Each value is a cell; each cell is a single value.

Reshaping `DataFrames`

A DataFrame can be given in a format unsuited for the analysis that we would like to perform on it.
- A DataFrame may have larger structural problems that extend beyond the data.
- Perhaps the DataFrame stores its values in a format that makes it easy to extract a single row but difficult to aggregate the data.
Reshaping a DataFrame means manipulating it into a different shape.
In this section, we will discuss pandas techniques for molding DataFrame into the shapes we desire.

Reshaping `DataFrames`

Wide vs. Long `DataFrames`

The following DataFrames measure temperatures in two cities over two days.

df_wide = pd.DataFrame({
    'Weekday': ['Tuesday', 'Wednesday'],
    'Miami': [80, 83],
    'Rochester': [57, 62],
    'St. Louis': [71, 75]
})

df_long = pd.DataFrame({
    'Weekday': ['Tuesday', 'Wednesday', 'Tuesday', 'Wednesday', 'Tuesday', 'Wednesday'],
    'City': ['Miami', 'Miami', 'Rochester', 'Rochester', 'St. Louis', 'St. Louis'],
    'Temperature': [80, 83, 57, 62, 71, 75]
})

Reshaping `DataFrames`

Wide vs. Long `DataFrames`

A DataFrame can store its values in wide or long format.
These names reflect the direction in which the data set expands as we add more values to it.
- A wide DataFrame increases in width.
- A long DataFrame increases in height.
The optimal storage format for a DataFrame depends on the insight we are trying to glean from it.
- We consider making DataFrames wider if one observation is spread across multiple rows.
- We consider making DataFrames longer if one variable is spread across multiple columns.

Reshaping `DataFrames`

`melt()` and `pivot()`

melt() makes DataFrame longer.
pivot() and pivot_table() make DataFrame wider.

Reshaping `DataFrames`

Make `DataFrame` Longer with `melt()`

df_wide_to_long = (
    df_wide
    .melt()
)

Reshaping `DataFrames`

Make `DataFrame` Longer with `melt()`

df_wide_to_long = (
    df_wide
    .melt(id_vars = "Weekday")
)

melt() can take a few parameters:
- id_vars is a container (string, list, tuple, or array) that represents the variables that will remain as is.
- id_vars can indicate which column should be the “identifier”.

Reshaping `DataFrames`

Make `DataFrame` Longer with `melt()`

df_wide_to_long = (
    df_wide
    .melt(id_vars = "Weekday",
          var_name = "City",
          value_name = "Temperature")
)

melt() can take a few parameters:
- var_name is a string for the new column name for the variable.
- value_name is a string for the new column name that represents the values for the var_name.

Reshaping `DataFrames`

Make `DataFrame` Longer with `melt()`

df_wide_to_long = (
    df_wide
    .melt(id_vars = "Weekday",
          var_name = "City",
          value_name = "Temperature",
          value_vars = ['Miami', 'Rochester'])
)

melt() can take a few parameters:
- value_vars parameter allows us to select which specific columns we want to “melt”.
- By default, it will melt all the columns not specified in the id_vars parameter.

Reshaping `DataFrames`

Make `DataFrame` Wider with `pivot()`

df_long_to_wide = (
    df_long
    .pivot(index = "Weekday",
           columns = "City",
           values = "Temperature"  # To avoid having MultiIndex
        )
    .reset_index()
    )

When using pivot(), we need to specify a few parameters:
- index that takes the column to pivot on;
- columns that takes the column to be used to make the new variable names of the wider DataFrame;
- values that takes the column that provides the values of the variables in the wider DataFrame.

Reshaping `DataFrames`

Let’s consider the following wide-form DataFrame, df, containing information about the number of courses each student took from each department in each year.

dict_data = {"Name": ["Donna", "Donna", "Mike", "Mike"],
             "Department": ["ECON", "DANL", "ECON", "DANL"],
             "2018": [1, 2, 3, 1],
             "2019": [2, 3, 4, 2],
             "2020": [5, 1, 2, 2]}
df = pd.DataFrame(dict_data)

df_longer = df.melt(id_vars=["Name", "Department"], 
                    var_name="Year", 
                    value_name="Number of Courses")

The pivot_table() method can take both a string and a list of variables for the index parameter.
- The pivot() can take only a string for index.

Reshaping `DataFrames`

Let’s consider the following wide-form DataFrame, df, containing information about the number of courses each student took from each department in each year.

dict_data = {"Name": ["Donna", "Donna", "Mike", "Mike"],
             "Department": ["ECON", "DANL", "ECON", "DANL"],
             "2018": [1, 2, 3, 1],
             "2019": [2, 3, 4, 2],
             "2020": [5, 1, 2, 2]}
df = pd.DataFrame(dict_data)

df_longer = df.melt(id_vars=["Name", "Department"], 
                    var_name="Year", 
                    value_name="Number of Courses")

Q. How can we use the df_longer to create the wide-form DataFrame, df_wider, which is equivalent to the df?

Reshaping `DataFrames`

Let’s do Part 2 of Classwork 6!

Joining `DataFrames`

Relational Data

Sometimes, one data set is scattered across multiple files.
- The size of the files can be huge.
- The data collection process can be scattered across time and space.
- E.g., DataFrame for county-level data and DataFrame for geographic information, such as longitude and latitude.
Sometimes we want to combine two or more DataFrames based on common data values in those DataFrames.
- This task is known in the database world as performing a “join.”
- We can do this with the merge() method in Pandas.

Joining `DataFrames`

Relational Data

Why is one data set sometimes scattered across multiple files?
Sometimes we may have two or more DataFrames that we want to combine based on common data values.
- This task is known in the database world as performing a “join.”
- We can do this with the merge() method in Pandas.

Joining `DataFrames`

Relational Data

The variables that are used to connect each pair of tables are called keys.

Joining `DataFrames` with `merge()`

x = pd.DataFrame({
    'key': [1, 2, 3],
    'val_x': ['x1', 'x2', 'x3']
})

y = pd.DataFrame({
    'key': [1, 2, 4],
    'val_y': ['y1', 'y2', 'y3']
})

The colored column represents the “key” variable.
The grey column represents the “value” column.

An inner join matches pairs of observations whenever their keys are equal:

# the default value for 'how' is 'inner'
# so it doesn't actually need to be specified
merge_inner = pd.merge(x, y, on='key', how='inner')
merge_inner_x = x.merge(y, on='key', how='inner')
merge_inner_x_how = x.merge(y, on='key')

A left join keeps all observations in x.

merge_left = pd.merge(x, y, on='key', how='left')
merge_left_x = x.merge(y, on='key', how='left')

The most commonly used join is the left join.

A right join keeps all observations in y.

merge_right = pd.merge(x, y, on='key', how='right')
merge_right_x = x.merge(y, on='key', how='right')

A full join keeps all observations in x and y.

merge_outer = pd.merge(x, y, on='key', how='outer')
merge_outer_x = x.merge(y, on='key', how='outer')

Joining `DataFrames` with `merge()`

Duplicate keys

one-to-many
many-to-many

One DataFrame has duplicate keys (a one-to-many relationship).

x = pd.DataFrame({
    'key':[1, 2, 2, 3],
    'val_x':['x1', 'x2', 'x3', 'x4']})

y = pd.DataFrame({
    'key':[1, 2],
    'val_y':['y1', 'y2'] })
one_to_many = x.merge(y, on='key', 
                         how='left')

Both DataFrames have duplicate keys (many-to-many relationship).

x = pd.DataFrame({
  'key':[1, 2, 2, 3],
  'val_x':['x1','x2','x3','x4']})

y = pd.DataFrame({
  'key': [1, 2, 2, 3],
  'val_y': ['y1', 'y2', 'y3', 'y4'] })
many_to_many = x.merge(y, on='key', 
                          how='left')

Joining `DataFrames` with `merge()`

Defining the key columns

If the left and right columns do not have the same name for the key variables, we can use the left_on and right_on parameters instead.

x = pd.DataFrame({
  'key_x': [1, 2, 3],
  'val_x': ['x1', 'x2', 'x3']
})

y = pd.DataFrame({
  'key_y': [1, 2],
  'val_y': ['y1', 'y2'] })

keys_xy = 
  x.merge(y, left_on = 'key_x', 
             right_on = 'key_y', 
             how = 'left')

Joining `DataFrames` with `merge()`

Let’s do Part 3 of Classwork 6!

Lecture 6

Learning Objectives

Primer on Custom Functions

Python Basics

Primer on Custom Functions

Python Basics

Primer on Custom Functions

Python Basics

Primer on Custom Functions - Lambda Functions

Group Operations

Group Operations

Fortune 1000 dataset

Group Operations

Creating a GroupBy object from a dataset

Group Operations

Creating a GroupBy object from a dataset

Group Operations

Attributes and methods of a GroupBy object

Group Operations

Attributes and methods of a GroupBy object

Group Operations

Attributes and methods of a GroupBy object

Group Operations

Attributes and methods of a GroupBy object

Group Operations

Attributes and methods of a GroupBy object

Group Operations

Aggregation

Group Operations

Aggregation

Group Operations

Aggregation

Group Operations

Aggregation

Group Operations

Aggregation

Group Operations

Group Operations

Add a New Variable with GroupBy.transform()

Group Operations

Add a New Variable with GroupBy.transform()

Group Operations

Add New Variables with DataFrame.assign() and GroupBy.transform()

Group Operations

agg() vs. transform()

Group Operations

Group Operations

Applying a Custom Operation to All Groups

Group Operations

Applying a Custom Operation to All Groups

Group Operations

Applying a Custom Operation to All Groups

Group Operations

Applying a Custom Operation to All Groups

Group Operations

Applying a Custom Operation to All Groups

Group Operations

Grouping by Multiple Variables

Group Operations

Grouping by Multiple Variables

Group Operations

Grouping by Multiple Variables

Group Operations

Grouping by Multiple Variables

Group Operations

Reshaping DataFrames

Reshaping DataFrames

Tidy DataFrames

Reshaping DataFrames

Tidy DataFrames

Reshaping DataFrames

Reshaping DataFrames

Wide vs. Long DataFrames

Reshaping DataFrames

Wide vs. Long DataFrames

Reshaping DataFrames

melt() and pivot()

Reshaping DataFrames

Make DataFrame Longer with melt()

Reshaping DataFrames

Creating a `GroupBy` object from a dataset

Creating a `GroupBy` object from a dataset

Attributes and methods of a `GroupBy` object

Attributes and methods of a `GroupBy` object

Attributes and methods of a `GroupBy` object

Attributes and methods of a `GroupBy` object

Attributes and methods of a `GroupBy` object

Add a New Variable with `GroupBy.transform()`

Add a New Variable with `GroupBy.transform()`

Add New Variables with `DataFrame.assign()` and `GroupBy.transform()`

`agg()` vs. `transform()`

Reshaping `DataFrames`

Reshaping `DataFrames`

Tidy `DataFrames`

Reshaping `DataFrames`

Tidy `DataFrames`

Reshaping `DataFrames`

Reshaping `DataFrames`

Wide vs. Long `DataFrames`

Reshaping `DataFrames`

Wide vs. Long `DataFrames`

Reshaping `DataFrames`

`melt()` and `pivot()`

Reshaping `DataFrames`

Make `DataFrame` Longer with `melt()`

Reshaping `DataFrames`

Make `DataFrame` Longer with `melt()`

Reshaping `DataFrames`

Make `DataFrame` Longer with `melt()`

Reshaping `DataFrames`

Make `DataFrame` Longer with `melt()`

Reshaping `DataFrames`

Make `DataFrame` Wider with `pivot()`

Reshaping `DataFrames`

Reshaping `DataFrames`

Reshaping `DataFrames`

Joining `DataFrames`

Joining `DataFrames`

Joining `DataFrames`

Joining `DataFrames`

Joining `DataFrames` with `merge()`

Joining `DataFrames` with `merge()`

Joining `DataFrames` with `merge()`

Joining `DataFrames` with `merge()`

Joining `DataFrames` with `merge()`