Lecture 5

Dealing with Missing Values and Duplicates; Group Operations

Byeong-Hak Choe

SUNY Geneseo

March 5, 2024

Dealing with Missing Values

Pandas Basics

  • Let’s read employment.csv as emp.
import pandas as pd
# Below is for an interactive display of DataFrame in Colab
from google.colab import data_table
data_table.enable_dataframe_formatter()

emp = pd.read_csv("https://bcdanl.github.io/data/employment.csv")

Pandas Basics

Dealing with Missing Values

  • Pandas often marks (1) missing text values and (2) missing numeric values with a NaN (not a number);
    • It also marks missing datetime values with a NaT (not a time).

Pandas Basics

Dealing with Missing Values: The isna() and notna() methods

emp["Team"].isna()
emp["Start Date"].isna()
  • The isna() method returns a Boolean Series in which True denotes that an observation’s value is missing.
    • Is a value of a variable “XYZ” missing?

Pandas Basics

Dealing with Missing Values: The isna() and notna() methods

# Below two are equivalent.
emp["Team"].notna()
~emp["Team"].isna()
  • The notna() method returns the inverse Series, one in which True indicates that an observation’s value is present.

  • We use the tilde symbol (~) to invert a Boolean Series.

  • Q. How can we pull out employees with non-missing Team values?

Pandas Basics

Dealing with Missing Values: The value_counts(dropna = False) method

emp["Mgmt"].isna().sum()
emp["Mgmt"].value_counts()
emp["Mgmt"].value_counts(dropna = False)
  • One way to missing data counts is to use the isna().sum() on a Series.
    • True is 1 and False is 0.
  • Another way to get missing data counts is to use the .value_counts() method on a Series.
    • If we use the dropna=False option, we can also get a missing value count.

Pandas Basics

Dealing with Missing Values: The dropna() method

emp = pd.read_csv("https://bcdanl.github.io/data/employment.csv",
                  parse_dates = ["Start Date"])
emp.dropna()
  • The dropna() method removes observations that hold any NaN or NaT values.

Pandas Basics

Dealing with Missing Values: The dropna() method with how

emp.dropna(how = "all")
  • We can pass the how parameter an argument of "all" to remove observations in which all values are missing.

  • Note that the how parameter’s default argument is "any".

Pandas Basics

Dealing with Missing Values: The dropna() method with subset

emp.dropna(subset = ["Gender"])
  • We can use the subset parameter to target observations with a missing value in a specific variable.
    • The above example removes observations that have a missing value in the Gender variable.

Pandas Basics

Dealing with Missing Values: The dropna() method with subset

emp.dropna(subset = ["Start Date", "Salary"])
  • We can also pass the subset parameter a list of variables.

Pandas Basics

Dealing with Missing Values: The dropna() method with thresh

emp.dropna(thresh = 4)
  • The thresh parameter specifies a minimum threshold of non-missing values that an observation must have for pandas to keep it.

Dealing with Duplicates

Pandas Basics

Dealing with Duplicates with the duplicated() method

  • Missing values are a common occurrence in messy data sets, and so are duplicate values.
emp["Team"].duplicated()
  • The duplicated() method returns a Boolean Series that identifies duplicates in a variable.

Pandas Basics

Dealing with Duplicates with the duplicated() method

emp["Team"].duplicated(keep = "first")
emp["Team"].duplicated(keep = "last")
~emp["Team"].duplicated()
  • The duplicated() method’s keep parameter informs pandas which duplicate occurrence to keep.
    • Its default argument, "first", keeps the first occurrence of each duplicate value.
    • Its argument, "last", keeps the last occurrence of each duplicate value.
  • Q. How can we keep observations with the first occurrences of a value in the Team variable?

Pandas Basics

Dealing with Duplicates with the drop_duplicates() method

emp.drop_duplicates()
  • The drop_duplicates() method removes observations in which all values are equal to those in a previously encountered observations.

Pandas Basics

Dealing with Duplicates with the drop_duplicates() method

  • Below is an example of the drop_duplicates() method:
# Sample DataFrame with duplicate observations
data = {
    'Name': ['John', 'Anna', 'John', 'Mike', 'Anna'],
    'Age': [28, 23, 28, 32, 23],
    'City': ['New York', 'Paris', 'New York', 'London', 'Paris']
}

# pd.DataFrame( Series, List, or Dict ) creates a DataFrame
df = pd.DataFrame(data)  
df_unique = df.drop_duplicates()

Pandas Basics

Dealing with Duplicates with the drop_duplicates() method

emp.drop_duplicates(subset = ["Team"])
  • We can pass the drop_duplicates() method a subset parameter with a list of variables that pandas should use to determine an observation’s uniqueness.

    • The above example finds the first occurrence of each unique value in the ‘Team’ variable.

Pandas Basics

Dealing with Duplicates with the drop_duplicates() method

emp.drop_duplicates(subset = ["Gender", "Team"])
  • The above example uses a combination of values across the Gender and Team variables to identify duplicates.

Pandas Basics

Dealing with Duplicates with the drop_duplicates() method

emp.drop_duplicates(subset = ["Team"], keep = "last")
emp.drop_duplicates(subset = ["Team"], keep = False)
  • The drop_duplicates() method also accepts a keep parameter.

    • We can pass it an argument of "last" to keep the observations with each duplicate value’s last occurrence.
    • We can pass it an argument of False to exclude all observations with duplicate values.
  • Q. What does emp.drop_duplicates(subset = ["First Name"], keep = False) do?

  • Q. Find a subset of all employees with a First Name of “Douglas” and a Gender of “Male”. Then check which “Douglas” is in the DataFrame emp.drop_duplicates(subset = ["Gender", "Team"]).

Pandas Basics

Let’s do Questions 1-3 in Part 1 of Classwork 5!

Group Operations

Group Operations

  • The pandas library’s GroupBy object is a storage container for grouping observations into buckets.

  • It provides a set of methods to aggregate and analyze each independent group in the collection.

  • It allows us to extract observations at specific index positions within each group.

  • It also offers a convenient way to iterate over the groups of observations.

Group Operations

Creating a GroupBy Object

  • pd.DataFrame() converts a List/Dictionary/Series into a DataFrame ojbect
# Creating a dictionary
food_data = {
"Item": ["Apple", "Onion", "Orange", "Tomato", "Watermelon"],
"Type": ["Fruit", "Vegie", "Fruit", "Vegie", "Fruit"],
"Price": [1.05, 1.00, 1.25, 0.85, 4.15]
}

# Converting a dictionary into a DataFrame
supermarket = pd.DataFrame(data = food_data)

Group Operations

Creating a GroupBy Object

  • The Type variable identifies the group to which an Item belongs.
    • There are two groups of items in the supermarket data set: “Fruit” and “Vegie”.
    • We can use terms such as groups, buckets, and clusters interchangeably to describe the same idea.
    • Multiple observations fall into the same category.

Group Operations

Creating a GroupBy Object

  • The GroupBy object implicitly organizes observations into buckets based on shared values in a categorical variable.

Group Operations

Creating a GroupBy Object

  • Suppose that we are interested in the average price of a fruit and the average price of a vegie.
    • If we could isolate the “Fruit” observations and “Vegie” observations into separate groups, it would be easier to perform the calculations.

Group Operations

Creating a GroupBy Object

  • Let’s begin by invoking the groupby() method on the supermarket DataFrame.
    • We need to pass groupby() a categorical variable whose values will be used to create the groups.
groups = supermarket.groupby("Type")
groups
  • The DataFrameGroupBy object is separate and distinct from a DataFrame

Group Operations

Creating a GroupBy Object

  • The “Type” variable has two unique values, so the GroupBy object will store two groups.
groups.get_group("Fruit")
groups.get_group("Vegie")
  • The get_group() method accepts a group name and returns a DataFrame with the corresponding observations.

Group Operations

Creating a GroupBy Object

  • Our original goal was to calculate the average price of the fruits and vegetables in supermarket.
groups.mean()
  • With a few lines of code, we have successfully split, aggregated, and analyzed a data set

Group Operations

Fortune 1000 dataset

  • The Fortune 1000 is a listing of the 1,000 largest companies in the United States by revenue.
    • The list is updated annually by the business magazine Fortune.
    • The fortune_2023.csv file is a collection of Fortune 1000 companies from 2023 (Source: Kaggle).
fortune1000 = pd.read_csv("https://bcdanl.github.io/data/fortune_2023.csv")

varlist = ['Company', 'Revenues_M', 'Profits_M', 'Number_of_employees', 'Sector', 'Industry']
fortune = fortune1000[varlist]
  • A sector can have many companies; An industry is a subcategory within a sector.
  • Let’s explore the fortune DataFrame.
    • How many unique sectors are in fortune?

Group Operations

Creating a GroupBy object from a dataset

  • Let’s pull out all companies with a Sector value of "Retailing".
    • Then, calculate the Retailing sector’s average revenue:
in_retailing = ( fortune["Sector"] == "Retailing" )
retail_companies = fortune[ in_retailing ]
retail_companies["Revenues"].mean()
  • Without a group operation, we may need to write a lot of additional code to apply the same logic to the other 20 sectors in fortune.
    • Pandas’ GroupBy object offers the best solution out of the box.

Group Operations

Creating a GroupBy object from a dataset

sectors = fortune.groupby("Sector")
len(sectors)    # fortune["Sector"].nunique()
sectors.size()
  • We can count the number of groups in sectors by passing the GroupBy object into the Python’s built-in len() function

  • The size() method on the GroupBy object returns a Series with an alphabetical list of the groups and their observation counts.

Group Operations

Attributes and methods of a GroupBy object

sectors.groups
  • The groups attribute stores a dictionary with associations of group-to-observations.

Group Operations

Attributes and methods of a GroupBy object

  • What if we want to find the highest-performing company (by revenue) within each sector?
sectors.first()
sectors.last()
  • The GroupBy object’s first()/last() method extracts the first/last observation listed for each group in a DataFrame.
    • Since our fortune DataFrame is sorted by Revenue_M, the first company pulled out for each sector will be the highest-performing company within that sector.

Group Operations

Attributes and methods of a GroupBy object

sectors.nth(0)
sectors.nth(1)
fortune[fortune["Sector"] == "Apparel"]
  • The nth() method is used with a GroupBy object to select the nth observation from each group.
    • Here we can confirm the output is correct by filtering for the “Apparel” observations in fortune.

Group Operations

Attributes and methods of a GroupBy object

sectors.head(2)
sectors.tail(2)
  • The head(n)/tail(n) method extracts the first/last n observations from each group.

Group Operations

Attributes and methods of a GroupBy object

sectors.get_group("Energy")
type( sectors.get_group("Energy") )
  • We can use the get_group() method to extract all observations in a given group.
    • The method returns a DataFrame containing the observations.

Group Operations

Aggregation

  • We can invoke methods on the GroupBy object to apply aggregate operations to every group.
    • Aggregation is the process of taking multiple values and returning a single value.
sectors.sum()
sectors.mean()

Group Operations

Aggregation

sectors["Revenues_M"]
sectors["Revenues"].sum()
sectors["Revenues"].mean()
sectors["Revenues"].max()
sectors["Revenues"].min()
  • We can target a single variable by passing its name inside square brackets after the GroupBy object.
    • Pandas returns a new object, a SeriesGroupBy.

Group Operations

Aggregation

sectors.agg(
  Revenues_M_min = ("Revenues_M", "min"),
  Profits_M_max = ("Profits_M", "max"),
  Number_of_employees_mean = ("Revenues_M", "mean")
)
  • The agg method applies multiple aggregate operations to different variables and can accept a tuple as its argument.

Group Operations

Aggregation

sectors["Revenues_M"].agg("mean")
  • The agg() method can also be used on a Series.

Group Operations

Aggregation

  • We pass in whatever aggregation we want.
    • Some common options are in the table above.

Group Operations

Applying a Custom Operation to All Groups

  • Suppose that we want to apply a custom operation to each group in a GroupBy object.
    • For practice, let’s now shuffle observations in fortune.
fortune = (
    fortune
    .sample( frac = 1 )
    .reset_index( drop = True )
)
  • sample(frac=1): This samples 100% of the observations from the DataFrame, effectively shuffling it.
    • The frac parameter specifies the fraction of observations to return in the random sample, so frac=1 means “return all observations”.

Group Operations

Applying a Custom Operation to All Groups

  • Suppose that we want to apply a custom operation to each group in a GroupBy object.
    • For practice, let’s now shuffle observations in fortune.
fortune_shuffled = (
    fortune
    .sample( frac = 1 )
    .reset_index( drop = True )
)
  • reset_index(): This resets the index of the DataFrame.
    • The drop = True option is used to prevent the old index from being added as a variable in the new DataFrame.

Group Operations

Let’s do Questions 1-6 in Part 2 of Classwork 5!

Group Operations

Applying a Custom Operation to All Groups

  • How can we identify top 5 companies in each sector?
    1. sort_values() with groupby()
    2. nlargest() with groupby()
  • However, DataFrameGroupBy has no methods sort_values() or nlargest().

Primer on Custom Functions

  • A function can take any number and type of input parameters and return any number and type of output results.

  • We can do two things with a function:

    • Define it
    • Call it

Primer on Custom Functions

  • To define a Python function, we type def, the function name, parentheses enclosing any input parameters to the function, and then finally, a colon (:).

  • Let’s define a very simple function my_half() that has a parameter x and returns a value x / 2.

def my_half(x):
    return x / 2
  • The values we pass into the function when we call it are known as arguments.

  • When we call a function with arguments, the values of those arguments are copied to their corresponding parameters inside the function.

Group Operations

Applying a Custom Operation to All Groups

  • How can we apply the nlargest() method to each group?
    • First, define a custom function that accepts a single argument: a DataFrame.
    • Second, pass the apply() method the custom function.
def get_largest_obs(df):
    return df.nlargest(1, "Revenues_M", keep="all")
    
sectors.apply(get_largest_obs)
  • We can use the apply() method when pandas does not support a custom aggregation we would like to apply to each group.

Group Operations

Grouping by Multiple Variables

sector_and_industry = fortune.groupby(["Sector", "Industry"])
  • We can create a GroupBy object with values from multiple variables.
    • This operation is optimal when a combination of variables serves as the best identifier for a group.

Group Operations

Grouping by Multiple Variables

(
    sector_and_industry
    .size()
    .reset_index(name = "n")
)
  • The GroupBy object’s size() method now returns a MultiIndex Series with a count of observations for each internal group.

  • The reset_index() method can be used to convert a Series into a DataFrame.

    • The name option renames a variable of index when resetting index.

Group Operations

Grouping by Multiple Variables

sector_and_industry.get_group(("Business Services", "Financial Data Services"))
  • The get_group() method requires a tuple of values to extract a nested DataFrame from the GroupBy collection.

Group Operations

Grouping by Multiple Variables

sector_and_industry.sum()
sector_and_industry["Revenues_M"].mean()

(
    sector_and_industry["Revenues_M"]
    .mean()
    .reset_index(name = "Revenues_mean")
)
  • For all aggregations, pandas returns a MultiIndex DataFrame with the calculations.

Group Operations

Let’s do Question 7 in Part 2 of Classwork 5!