Lecture 6

Advanced Group Operations; Reshaping/Joining DataFrames

Byeong-Hak Choe

SUNY Geneseo

March 12, 2024

Learning Objectives

  • Primer on Custom Functions
  • Group Operations
  • Reshaping DataFrames
  • Joining DataFrames

Primer on Custom Functions

Python Basics

Primer on Custom Functions

  • A function can take any number and type of input parameters and return any number and type of output results.

  • We can do two things with a function:

    • Define it
    • Call it

Python Basics

Primer on Custom Functions

  • To define a Python function, we type def, the function name, parentheses enclosing any input parameters to the function, and then finally, a colon (:).

  • Let’s define a very simple function my_half() that has a parameter x and returns a value x / 2.

def my_half(x):
    return x / 2
    
my_half(2)
  • The values we pass into the function when we call it are known as arguments.

  • When we call a function with arguments, the values of those arguments are copied to their corresponding parameters inside the function.

Python Basics

Primer on Custom Functions - Lambda Functions

  • A Python lambda function is an anonymous function expressed as a single statement.
    • It is defined with the lambda keyword, which has no meaning other than “we are declaring an anonymous function”:
def my_half(x):
    return x / 2
    
equiv_my_half = lambda x: x / 2

my_half(2)
equiv_my_half(2)

Group Operations

Group Operations

Fortune 1000 dataset

  • The Fortune 1000 is a listing of the 1,000 largest companies in the United States by revenue.
    • The list is updated annually by the business magazine Fortune.
    • The fortune_2023.csv file is a collection of Fortune 1000 companies from 2023 (Source: Kaggle).
import pandas as pd
# Below is for an interactive display of DataFrame in Colab
from google.colab import data_table
data_table.enable_dataframe_formatter()
fortune1000 = pd.read_csv("https://bcdanl.github.io/data/fortune_2023.csv")
varlist = ['Company', 'Revenues_M', 'Profits_M', 'Number_of_employees', 'Sector', 'Industry']
fortune = fortune1000[varlist]
  • A sector can have many companies; An industry is a subcategory within a sector.

Group Operations

Creating a GroupBy object from a dataset

  • Let’s pull out all companies with a Sector value of "Retailing".
    • Then, calculate the Retailing sector’s average revenue:
in_retailing = ( fortune["Sector"] == "Retailing" )
retail_companies = fortune[ in_retailing ]
retail_companies["Revenues"].mean()
  • Without a group operation, we may need to write a lot of additional code to apply the same logic to the other 20 sectors in fortune.
    • Pandas’ GroupBy object offers the best solution out of the box.

Group Operations

Creating a GroupBy object from a dataset

sectors = fortune.groupby("Sector")
len(sectors)    # fortune["Sector"].nunique()
sectors.size()
  • We can count the number of groups in sectors by passing the GroupBy object into the Python’s built-in len() function

  • The size() method on the GroupBy object returns a Series with an alphabetical list of the groups and their observation counts.

Group Operations

Attributes and methods of a GroupBy object

sectors.groups
  • The groups attribute stores a dictionary with associations of group-to-observations.

Group Operations

Attributes and methods of a GroupBy object

  • What if we want to find the highest-performing company (by revenue) within each sector?
sectors.first()
sectors.last()
  • The GroupBy object’s first()/last() method extracts the first/last observation listed for each group in a DataFrame.
    • Since our fortune DataFrame is sorted by Revenue_M, the first company pulled out for each sector will be the highest-performing company within that sector.

Group Operations

Attributes and methods of a GroupBy object

sectors.nth(0)
sectors.nth(1)
fortune[fortune["Sector"] == "Apparel"]
  • The nth() method is used with a GroupBy object to select the nth observation from each group.
    • Here we can confirm the output is correct by filtering for the “Apparel” observations in fortune.

Group Operations

Attributes and methods of a GroupBy object

sectors.head(2)
sectors.tail(2)
  • The head(n)/tail(n) method extracts the first/last n observations from each group.

Group Operations

Attributes and methods of a GroupBy object

sectors.get_group("Energy")
type( sectors.get_group("Energy") )
  • We can use the get_group() method to extract all observations in a given group.
    • The method returns a DataFrame containing the observations.

Group Operations

Aggregation

  • We can invoke methods on the GroupBy object to apply aggregate operations to every group.
    • Aggregation is the process of taking multiple values and returning a single value.
sectors.sum()
sectors.mean()

Group Operations

Aggregation

sectors["Revenues_M"]
sectors["Revenues_M"].sum()
sectors["Revenues_M"].mean()
sectors["Revenues_M"].max()
sectors["Revenues_M"].min()
  • We can target a single variable by passing its name inside square brackets after the GroupBy object.
    • Pandas returns a new object, a SeriesGroupBy.

Group Operations

Aggregation

sectors["Revenues_M"]  # this is a SeriesGroupBy object
sectors["Revenues_M"].agg('sum')
sectors["Revenues_M"].agg('mean')
sectors["Revenues_M"].agg('max')
sectors["Revenues_M"].agg('min')
  • The agg() method can also be used on a SeriesGroupBy.
  • Instead of directly calling the aggregation method, we can call the agg() method, and pass the aggregation method we want in there.

Group Operations

Aggregation

sectors.agg(
  Revenues_M_min = ("Revenues_M", "min"),
  Profits_M_max = ("Profits_M", "max"),
  Number_of_employees_mean = ("Revenues_M", "mean")
)
  • The agg() method can apply multiple aggregate operations to different variables and can accept a tuple as its argument.

Group Operations

Aggregation

  • We pass in whatever aggregation we want.
    • Some common options are in the table above.

Group Operations

Let’s do Questions 1-6 in Part 1 of Classwork 6!

Group Operations

Add a New Variable with GroupBy.transform()

  • Just like the agg() method, the transform() method can accept the aggregation method (e.g., 'sum', 'mean').

  • Unlike the agg() method, the transform() method does not collapse DataFrame and goes back to the original index.

sectors['Revenues_M'].transform('min')
sectors['Profits_M'].transform('max')
sectors['Number_of_employees'].transform('mean')

Group Operations

Add a New Variable with GroupBy.transform()

fortune['Revenues_M_min'] = sectors['Revenues_M'].transform('min')
fortune['Profits_M_max'] = sectors['Profits_M'].transform('max')
fortune['Number_of_employees_mean'] = sectors['Number_of_employees'].transform('mean')
  • Since the transform() method returns a Series with the index label that is the same as in the original DataFrame, it can be used to add a new variable to the original DataFrame.

Group Operations

Add New Variables with DataFrame.assign() and GroupBy.transform()

sectors = fortune.groupby("Sector")

fortune = fortune.assign(
    Revenues_M_min = sectors['Revenues_M'].transform('min'),
    Profits_M_max = sectors['Profits_M'].transform('max'),
    Number_of_employees_mean = sectors['Number_of_employees'].transform('mean')
)
  • The assign() method allows us to create new variables or overwrite existing ones in a DataFrame
    • It returns a new DataFrame with the new variables added.
  • The assign() method can be useful for method chaining.

Group Operations

agg() vs. transform()

  • Use agg() with groupby() when you want to take multiple values and return a single (aggregated) value for each group.
    • agg() does collapse the DataFrame, and groups become the new index labels.
  • Use transform() with groupby() when you want to perform computations on your groups but you want to return a single (aggregated) value for each observation.
    • transform() does not collapse the DataFrame.

Group Operations

Let’s do Question 7 of Classwork 7!

Group Operations

Applying a Custom Operation to All Groups

  • Suppose that we want to apply a custom operation to each group in a GroupBy object.
    • For practice, let’s now shuffle observations in fortune.
fortune = (
    fortune
    .sample( frac = 1 )
    .reset_index( drop = True )
)
  • sample(frac=1): This samples 100% of the observations from the DataFrame, effectively shuffling it.
    • The frac parameter specifies the fraction of observations to return in the random sample, so frac=1 means “return all observations”.

Group Operations

Applying a Custom Operation to All Groups

  • Suppose that we want to apply a custom operation to each group in a GroupBy object.
    • For practice, let’s now shuffle observations in fortune.
fortune = (
    fortune
    .sample( frac = 1 )
    .reset_index( drop = True )
)
  • reset_index(): This resets the index of the DataFrame.
    • The drop = True option is used to prevent the old index from being added as a variable in the new DataFrame.

Group Operations

Applying a Custom Operation to All Groups

  • How can we identify top 5 companies in each sector?
    1. sort_values() with groupby()
    2. nlargest() with groupby()
  • However, DataFrameGroupBy has no methods sort_values() or nlargest().

Group Operations

Applying a Custom Operation to All Groups

  • How can we apply the nlargest() method to each group?
    • First, define a custom function that accepts a single argument: a DataFrame.
    • Second, pass the apply() method the custom function.
def get_largest_obs(df):
    return df.nlargest(1, "Revenues_M", keep="all")

sectors.apply(get_largest_obs)
  • We can use the apply() method when pandas does not support a custom aggregation we would like to apply to each group.

Group Operations

Applying a Custom Operation to All Groups

  • To rewrite our get_largest_obs function as a lambda function with the apply() method, we can do the following.
sectors.apply(lambda df: df.nlargest(1, "Revenues_M", keep="all"))

Group Operations

Grouping by Multiple Variables

sector_and_industry = fortune.groupby(["Sector", "Industry"])
  • We can create a GroupBy object with values from multiple variables.
    • This operation is optimal when a combination of variables serves as the best identifier for a group.

Group Operations

Grouping by Multiple Variables

(
    sector_and_industry
    .size()
    .reset_index(name = "n")
)
  • The GroupBy object’s size() method now returns a MultiIndex Series with a count of observations for each internal group.

  • The reset_index() method can be used to convert a Series into a DataFrame.

    • The name option renames a variable of index when resetting index.

Group Operations

Grouping by Multiple Variables

(
  sector_and_industry
  .get_group(("Business Services", "Financial Data Services"))
)
  • The get_group() method requires a tuple of values to extract a nested DataFrame from the GroupBy collection.

Group Operations

Grouping by Multiple Variables

sector_and_industry.sum()
sector_and_industry["Revenues_M"].mean()

(
    sector_and_industry["Revenues_M"]
    .mean()
    .reset_index(name = "Revenues_mean")
)
  • For all aggregations, pandas returns a MultiIndex DataFrame with the calculations.

Group Operations

Let’s do Question 7 in Part 1 of Classwork 6!

Reshaping DataFrames

Reshaping DataFrames

Tidy DataFrames

  • A variable is the measurements that vary.
    • It could be anything from a person’s age, the temperature on a given day, or the sales figures for a product.
  • An observation represents a unit of analysis.
    • It could be a measurement, a survey response, or any other unit of data collected.
    • E.g., In a DataFrame of patient information, each observation could correspond to a single patient’s data record.
  • A value is the actual data point corresponding to the variable and observation.

Reshaping DataFrames

Tidy DataFrames

  • There are three interrelated rules that make a DataFrame tidy:
    • Each variable is a column; each column is a variable.
    • Each observation is a row; each row is an observation.
    • Each value is a cell; each cell is a single value.

Reshaping DataFrames

  • A DataFrame can be given in a format unsuited for the analysis that we would like to perform on it.

    • A DataFrame may have larger structural problems that extend beyond the data.
    • Perhaps the DataFrame stores its values in a format that makes it easy to extract a single row but difficult to aggregate the data.
  • Reshaping a DataFrame means manipulating it into a different shape.

  • In this section, we will discuss pandas techniques for molding DataFrame into the shapes we desire.

Reshaping DataFrames

Wide vs. Long DataFrames

  • The following DataFrames measure temperatures in two cities over two days.
df_wide = pd.DataFrame({
    'Weekday': ['Tuesday', 'Wednesday'],
    'Miami': [80, 83],
    'Rochester': [57, 62],
    'St. Louis': [71, 75]
})

df_long = pd.DataFrame({
    'Weekday': ['Tuesday', 'Wednesday', 'Tuesday', 'Wednesday', 'Tuesday', 'Wednesday'],
    'City': ['Miami', 'Miami', 'Rochester', 'Rochester', 'St. Louis', 'St. Louis'],
    'Temperature': [80, 83, 57, 62, 71, 75]
})

Reshaping DataFrames

Wide vs. Long DataFrames

  • A DataFrame can store its values in wide or long format.
  • These names reflect the direction in which the data set expands as we add more values to it.
    • A wide DataFrame increases in width.
    • A long DataFrame increases in height.
  • The optimal storage format for a DataFrame depends on the insight we are trying to glean from it.
    • We consider making DataFrames wider if one observation is spread across multiple rows.
    • We consider making DataFrames longer if one variable is spread across multiple columns.

Reshaping DataFrames

melt() and pivot()

  • melt() makes DataFrame longer.
  • pivot() and pivot_table() make DataFrame wider.

Reshaping DataFrames

Make DataFrame Longer with melt()

df_wide_to_long = (
    df_wide
    .melt()
)


Reshaping DataFrames

Make DataFrame Longer with melt()

df_wide_to_long = (
    df_wide
    .melt(id_vars = "Weekday")
)


  • melt() can take a few parameters:
    • id_vars is a container (string, list, tuple, or array) that represents the variables that will remain as is.
    • id_vars can indicate which column should be the “identifier”.

Reshaping DataFrames

Make DataFrame Longer with melt()

df_wide_to_long = (
    df_wide
    .melt(id_vars = "Weekday",
          var_name = "City",
          value_name = "Temperature")
)
  • melt() can take a few parameters:
    • var_name is a string for the new column name for the variable.
    • value_name is a string for the new column name that represents the values for the var_name.

Reshaping DataFrames

Make DataFrame Longer with melt()

df_wide_to_long = (
    df_wide
    .melt(id_vars = "Weekday",
          var_name = "City",
          value_name = "Temperature",
          value_vars = ['Miami', 'Rochester'])
)
  • melt() can take a few parameters:
    • value_vars parameter allows us to select which specific columns we want to “melt”.
    • By default, it will melt all the columns not specified in the id_vars parameter.

Reshaping DataFrames

Make DataFrame Wider with pivot()

df_long_to_wide = (
    df_long
    .pivot(index = "Weekday",
           columns = "City",
           values = "Temperature"  # To avoid having MultiIndex
        )
    .reset_index()
    )
  • When using pivot(), we need to specify a few parameters:
    • index that takes the column to pivot on;
    • columns that takes the column to be used to make the new variable names of the wider DataFrame;
    • values that takes the column that provides the values of the variables in the wider DataFrame.

Reshaping DataFrames

  • Let’s consider the following wide-form DataFrame, df, containing information about the number of courses each student took from each department in each year.
dict_data = {"Name": ["Donna", "Donna", "Mike", "Mike"],
             "Department": ["ECON", "DANL", "ECON", "DANL"],
             "2018": [1, 2, 3, 1],
             "2019": [2, 3, 4, 2],
             "2020": [5, 1, 2, 2]}
df = pd.DataFrame(dict_data)

df_longer = df.melt(id_vars=["Name", "Department"], 
                    var_name="Year", 
                    value_name="Number of Courses")
  • The pivot_table() method can take both a string and a list of variables for the index parameter.
    • The pivot() can take only a string for index.

Reshaping DataFrames

  • Let’s consider the following wide-form DataFrame, df, containing information about the number of courses each student took from each department in each year.
dict_data = {"Name": ["Donna", "Donna", "Mike", "Mike"],
             "Department": ["ECON", "DANL", "ECON", "DANL"],
             "2018": [1, 2, 3, 1],
             "2019": [2, 3, 4, 2],
             "2020": [5, 1, 2, 2]}
df = pd.DataFrame(dict_data)

df_longer = df.melt(id_vars=["Name", "Department"], 
                    var_name="Year", 
                    value_name="Number of Courses")

Q. How can we use the df_longer to create the wide-form DataFrame, df_wider, which is equivalent to the df?

Reshaping DataFrames

Let’s do Part 2 of Classwork 6!

Joining DataFrames

Joining DataFrames

Relational Data

  • Sometimes, one data set is scattered across multiple files.
    • The size of the files can be huge.
    • The data collection process can be scattered across time and space.
    • E.g., DataFrame for county-level data and DataFrame for geographic information, such as longitude and latitude.
  • Sometimes we want to combine two or more DataFrames based on common data values in those DataFrames.
    • This task is known in the database world as performing a “join.”
    • We can do this with the merge() method in Pandas.

Joining DataFrames

Relational Data

  • Why is one data set sometimes scattered across multiple files?

  • Sometimes we may have two or more DataFrames that we want to combine based on common data values.

    • This task is known in the database world as performing a “join.”
    • We can do this with the merge() method in Pandas.

Joining DataFrames

Relational Data

  • The variables that are used to connect each pair of tables are called keys.

Joining DataFrames with merge()

x = pd.DataFrame({
    'key': [1, 2, 3],
    'val_x': ['x1', 'x2', 'x3']
})
y = pd.DataFrame({
    'key': [1, 2, 4],
    'val_y': ['y1', 'y2', 'y3']
})
  • The colored column represents the “key” variable.
  • The grey column represents the “value” column.

Joining DataFrames with merge()

  • An inner join matches pairs of observations whenever their keys are equal:

# the default value for 'how' is 'inner'
# so it doesn't actually need to be specified
merge_inner = pd.merge(x, y, on='key', how='inner')
merge_inner_x = x.merge(y, on='key', how='inner')
merge_inner_x_how = x.merge(y, on='key')
  • A left join keeps all observations in x.

merge_left = pd.merge(x, y, on='key', how='left')
merge_left_x = x.merge(y, on='key', how='left')
  • The most commonly used join is the left join.
  • A right join keeps all observations in y.

merge_right = pd.merge(x, y, on='key', how='right')
merge_right_x = x.merge(y, on='key', how='right')
  • A full join keeps all observations in x and y.

merge_outer = pd.merge(x, y, on='key', how='outer')
merge_outer_x = x.merge(y, on='key', how='outer')

Joining DataFrames with merge()

Duplicate keys

  • One DataFrame has duplicate keys (a one-to-many relationship).

x = pd.DataFrame({
    'key':[1, 2, 2, 3],
    'val_x':['x1', 'x2', 'x3', 'x4']})
y = pd.DataFrame({
    'key':[1, 2],
    'val_y':['y1', 'y2'] })
one_to_many = x.merge(y, on='key', 
                         how='left')
  • Both DataFrames have duplicate keys (many-to-many relationship).

x = pd.DataFrame({
  'key':[1, 2, 2, 3],
  'val_x':['x1','x2','x3','x4']})
y = pd.DataFrame({
  'key': [1, 2, 2, 3],
  'val_y': ['y1', 'y2', 'y3', 'y4'] })
many_to_many = x.merge(y, on='key', 
                          how='left')

Joining DataFrames with merge()

Defining the key columns

  • If the left and right columns do not have the same name for the key variables, we can use the left_on and right_on parameters instead.
x = pd.DataFrame({
  'key_x': [1, 2, 3],
  'val_x': ['x1', 'x2', 'x3']
})
y = pd.DataFrame({
  'key_y': [1, 2],
  'val_y': ['y1', 'y2'] })

keys_xy = 
  x.merge(y, left_on = 'key_x', 
             right_on = 'key_y', 
             how = 'left')

Joining DataFrames with merge()

Let’s do Part 3 of Classwork 6!