Lecture 5

Dealing with Missing Values and Duplicates; Group Operations

Byeong-Hak Choe

SUNY Geneseo

March 5, 2024

Dealing with Missing Values

Pandas Basics

Let’s read employment.csv as emp.

import pandas as pd
# Below is for an interactive display of DataFrame in Colab
from google.colab import data_table
data_table.enable_dataframe_formatter()

emp = pd.read_csv("https://bcdanl.github.io/data/employment.csv")

Pandas Basics

Dealing with Missing Values

Pandas often marks (1) missing text values and (2) missing numeric values with a NaN (not a number);
- It also marks missing datetime values with a NaT (not a time).

Pandas Basics

Dealing with Missing Values: The `isna()` and `notna()` methods

emp["Team"].isna()
emp["Start Date"].isna()

The isna() method returns a Boolean Series in which True denotes that an observation’s value is missing.
- Is a value of a variable “XYZ” missing?

Pandas Basics

Dealing with Missing Values: The `isna()` and `notna()` methods

# Below two are equivalent.
emp["Team"].notna()
~emp["Team"].isna()

The notna() method returns the inverse Series, one in which True indicates that an observation’s value is present.
We use the tilde symbol (~) to invert a Boolean Series.
Q. How can we pull out employees with non-missing Team values?

Pandas Basics

Dealing with Missing Values: The `value_counts(dropna = False)` method

emp["Mgmt"].isna().sum()
emp["Mgmt"].value_counts()
emp["Mgmt"].value_counts(dropna = False)

One way to missing data counts is to use the isna().sum() on a Series.
- True is 1 and False is 0.
Another way to get missing data counts is to use the .value_counts() method on a Series.
- If we use the dropna=False option, we can also get a missing value count.

Pandas Basics

Dealing with Missing Values: The `dropna()` method

emp = pd.read_csv("https://bcdanl.github.io/data/employment.csv",
                  parse_dates = ["Start Date"])
emp.dropna()

The dropna() method removes observations that hold any NaN or NaT values.

Pandas Basics

Dealing with Missing Values: The `dropna()` method with `how`

emp.dropna(how = "all")

We can pass the how parameter an argument of "all" to remove observations in which all values are missing.
Note that the how parameter’s default argument is "any".

Pandas Basics

Dealing with Missing Values: The `dropna()` method with `subset`

emp.dropna(subset = ["Gender"])

We can use the subset parameter to target observations with a missing value in a specific variable.
- The above example removes observations that have a missing value in the Gender variable.

Pandas Basics

Dealing with Missing Values: The `dropna()` method with `subset`

emp.dropna(subset = ["Start Date", "Salary"])

We can also pass the subset parameter a list of variables.

Pandas Basics

Dealing with Missing Values: The `dropna()` method with `thresh`

emp.dropna(thresh = 4)

The thresh parameter specifies a minimum threshold of non-missing values that an observation must have for pandas to keep it.

Dealing with Duplicates

Pandas Basics

Dealing with Duplicates with the `duplicated()` method

Missing values are a common occurrence in messy data sets, and so are duplicate values.

emp["Team"].duplicated()

The duplicated() method returns a Boolean Series that identifies duplicates in a variable.

Pandas Basics

Dealing with Duplicates with the `duplicated()` method

emp["Team"].duplicated(keep = "first")
emp["Team"].duplicated(keep = "last")
~emp["Team"].duplicated()

The duplicated() method’s keep parameter informs pandas which duplicate occurrence to keep.
- Its default argument, "first", keeps the first occurrence of each duplicate value.
- Its argument, "last", keeps the last occurrence of each duplicate value.
Q. How can we keep observations with the first occurrences of a value in the Team variable?

Pandas Basics

Dealing with Duplicates with the `drop_duplicates()` method

emp.drop_duplicates()

The drop_duplicates() method removes observations in which all values are equal to those in a previously encountered observations.

Pandas Basics

Dealing with Duplicates with the `drop_duplicates()` method

Below is an example of the drop_duplicates() method:

# Sample DataFrame with duplicate observations
data = {
    'Name': ['John', 'Anna', 'John', 'Mike', 'Anna'],
    'Age': [28, 23, 28, 32, 23],
    'City': ['New York', 'Paris', 'New York', 'London', 'Paris']
}

# pd.DataFrame( Series, List, or Dict ) creates a DataFrame
df = pd.DataFrame(data)  
df_unique = df.drop_duplicates()

Pandas Basics

Dealing with Duplicates with the `drop_duplicates()` method

emp.drop_duplicates(subset = ["Team"])

We can pass the drop_duplicates() method a subset parameter with a list of variables that pandas should use to determine an observation’s uniqueness.
- The above example finds the first occurrence of each unique value in the ‘Team’ variable.

Pandas Basics

Dealing with Duplicates with the `drop_duplicates()` method

emp.drop_duplicates(subset = ["Gender", "Team"])

The above example uses a combination of values across the Gender and Team variables to identify duplicates.

Pandas Basics

Dealing with Duplicates with the `drop_duplicates()` method

emp.drop_duplicates(subset = ["Team"], keep = "last")
emp.drop_duplicates(subset = ["Team"], keep = False)

The drop_duplicates() method also accepts a keep parameter.
- We can pass it an argument of "last" to keep the observations with each duplicate value’s last occurrence.
- We can pass it an argument of False to exclude all observations with duplicate values.
Q. What does emp.drop_duplicates(subset = ["First Name"], keep = False) do?
Q. Find a subset of all employees with a First Name of “Douglas” and a Gender of “Male”. Then check which “Douglas” is in the DataFrame emp.drop_duplicates(subset = ["Gender", "Team"]).

Pandas Basics

Let’s do Questions 1-3 in Part 1 of Classwork 5!

Group Operations

The pandas library’s GroupBy object is a storage container for grouping observations into buckets.
It provides a set of methods to aggregate and analyze each independent group in the collection.
It allows us to extract observations at specific index positions within each group.
It also offers a convenient way to iterate over the groups of observations.

Group Operations

Creating a `GroupBy` Object

pd.DataFrame() converts a List/Dictionary/Series into a DataFrame ojbect

# Creating a dictionary
food_data = {
"Item": ["Apple", "Onion", "Orange", "Tomato", "Watermelon"],
"Type": ["Fruit", "Vegie", "Fruit", "Vegie", "Fruit"],
"Price": [1.05, 1.00, 1.25, 0.85, 4.15]
}

# Converting a dictionary into a DataFrame
supermarket = pd.DataFrame(data = food_data)

Group Operations

Creating a `GroupBy` Object

The Type variable identifies the group to which an Item belongs.
- There are two groups of items in the supermarket data set: “Fruit” and “Vegie”.
- We can use terms such as groups, buckets, and clusters interchangeably to describe the same idea.
- Multiple observations fall into the same category.

Group Operations

Creating a `GroupBy` Object

The GroupBy object implicitly organizes observations into buckets based on shared values in a categorical variable.

Group Operations

Creating a `GroupBy` Object

Suppose that we are interested in the average price of a fruit and the average price of a vegie.
- If we could isolate the “Fruit” observations and “Vegie” observations into separate groups, it would be easier to perform the calculations.

Group Operations

Creating a `GroupBy` Object

Let’s begin by invoking the groupby() method on the supermarket DataFrame.
- We need to pass groupby() a categorical variable whose values will be used to create the groups.

groups = supermarket.groupby("Type")
groups

The DataFrameGroupBy object is separate and distinct from a DataFrame

Group Operations

Creating a `GroupBy` Object

The “Type” variable has two unique values, so the GroupBy object will store two groups.

groups.get_group("Fruit")
groups.get_group("Vegie")

The get_group() method accepts a group name and returns a DataFrame with the corresponding observations.

Group Operations

Creating a `GroupBy` Object

Our original goal was to calculate the average price of the fruits and vegetables in supermarket.

groups.mean()

With a few lines of code, we have successfully split, aggregated, and analyzed a data set

Group Operations

Fortune 1000 dataset

The Fortune 1000 is a listing of the 1,000 largest companies in the United States by revenue.
- The list is updated annually by the business magazine Fortune.
- The fortune_2023.csv file is a collection of Fortune 1000 companies from 2023 (Source: Kaggle).

fortune1000 = pd.read_csv("https://bcdanl.github.io/data/fortune_2023.csv")

varlist = ['Company', 'Revenues_M', 'Profits_M', 'Number_of_employees', 'Sector', 'Industry']
fortune = fortune1000[varlist]

A sector can have many companies; An industry is a subcategory within a sector.
Let’s explore the fortune DataFrame.
- How many unique sectors are in fortune?

Group Operations

Creating a `GroupBy` object from a dataset

Let’s pull out all companies with a Sector value of "Retailing".
- Then, calculate the Retailing sector’s average revenue:

in_retailing = ( fortune["Sector"] == "Retailing" )
retail_companies = fortune[ in_retailing ]
retail_companies["Revenues"].mean()

Without a group operation, we may need to write a lot of additional code to apply the same logic to the other 20 sectors in fortune.
- Pandas’ GroupBy object offers the best solution out of the box.

Group Operations

Creating a `GroupBy` object from a dataset

sectors = fortune.groupby("Sector")
len(sectors)    # fortune["Sector"].nunique()
sectors.size()

We can count the number of groups in sectors by passing the GroupBy object into the Python’s built-in len() function
The size() method on the GroupBy object returns a Series with an alphabetical list of the groups and their observation counts.

Group Operations

Attributes and methods of a `GroupBy` object

sectors.groups

The groups attribute stores a dictionary with associations of group-to-observations.

Group Operations

Attributes and methods of a `GroupBy` object

What if we want to find the highest-performing company (by revenue) within each sector?

sectors.first()
sectors.last()

The GroupBy object’s first()/last() method extracts the first/last observation listed for each group in a DataFrame.
- Since our fortune DataFrame is sorted by Revenue_M, the first company pulled out for each sector will be the highest-performing company within that sector.

Group Operations

Attributes and methods of a `GroupBy` object

sectors.nth(0)
sectors.nth(1)
fortune[fortune["Sector"] == "Apparel"]

The nth() method is used with a GroupBy object to select the nth observation from each group.
- Here we can confirm the output is correct by filtering for the “Apparel” observations in fortune.

Group Operations

Attributes and methods of a `GroupBy` object

sectors.head(2)
sectors.tail(2)

The head(n)/tail(n) method extracts the first/last n observations from each group.

Group Operations

Attributes and methods of a `GroupBy` object

sectors.get_group("Energy")
type( sectors.get_group("Energy") )

We can use the get_group() method to extract all observations in a given group.
- The method returns a DataFrame containing the observations.

Group Operations

Aggregation

We can invoke methods on the GroupBy object to apply aggregate operations to every group.
- Aggregation is the process of taking multiple values and returning a single value.

sectors.sum()
sectors.mean()

Group Operations

Aggregation

sectors["Revenues_M"]
sectors["Revenues"].sum()
sectors["Revenues"].mean()
sectors["Revenues"].max()
sectors["Revenues"].min()

We can target a single variable by passing its name inside square brackets after the GroupBy object.
- Pandas returns a new object, a SeriesGroupBy.

Group Operations

Aggregation

sectors.agg(
  Revenues_M_min = ("Revenues_M", "min"),
  Profits_M_max = ("Profits_M", "max"),
  Number_of_employees_mean = ("Revenues_M", "mean")
)

The agg method applies multiple aggregate operations to different variables and can accept a tuple as its argument.

Group Operations

Aggregation

sectors["Revenues_M"].agg("mean")

The agg() method can also be used on a Series.

Group Operations

Aggregation

We pass in whatever aggregation we want.
- Some common options are in the table above.

Group Operations

Applying a Custom Operation to All Groups

Suppose that we want to apply a custom operation to each group in a GroupBy object.
- For practice, let’s now shuffle observations in fortune.

fortune = (
    fortune
    .sample( frac = 1 )
    .reset_index( drop = True )
)

sample(frac=1): This samples 100% of the observations from the DataFrame, effectively shuffling it.
- The frac parameter specifies the fraction of observations to return in the random sample, so frac=1 means “return all observations”.

Group Operations

Applying a Custom Operation to All Groups

Suppose that we want to apply a custom operation to each group in a GroupBy object.
- For practice, let’s now shuffle observations in fortune.

fortune_shuffled = (
    fortune
    .sample( frac = 1 )
    .reset_index( drop = True )
)

reset_index(): This resets the index of the DataFrame.
- The drop = True option is used to prevent the old index from being added as a variable in the new DataFrame.

Group Operations

Let’s do Questions 1-6 in Part 2 of Classwork 5!

Group Operations

Applying a Custom Operation to All Groups

How can we identify top 5 companies in each sector?
1. sort_values() with groupby()
2. nlargest() with groupby()
However, DataFrameGroupBy has no methods sort_values() or nlargest().

Primer on Custom Functions

A function can take any number and type of input parameters and return any number and type of output results.
We can do two things with a function:
- Define it
- Call it

Primer on Custom Functions

To define a Python function, we type def, the function name, parentheses enclosing any input parameters to the function, and then finally, a colon (:).
Let’s define a very simple function my_half() that has a parameter x and returns a value x / 2.

def my_half(x):
    return x / 2

The values we pass into the function when we call it are known as arguments.
When we call a function with arguments, the values of those arguments are copied to their corresponding parameters inside the function.

Group Operations

Applying a Custom Operation to All Groups

How can we apply the nlargest() method to each group?
- First, define a custom function that accepts a single argument: a DataFrame.
- Second, pass the apply() method the custom function.

def get_largest_obs(df):
    return df.nlargest(1, "Revenues_M", keep="all")
    
sectors.apply(get_largest_obs)

We can use the apply() method when pandas does not support a custom aggregation we would like to apply to each group.

Group Operations

Grouping by Multiple Variables

sector_and_industry = fortune.groupby(["Sector", "Industry"])

We can create a GroupBy object with values from multiple variables.
- This operation is optimal when a combination of variables serves as the best identifier for a group.

Group Operations

Grouping by Multiple Variables

(
    sector_and_industry
    .size()
    .reset_index(name = "n")
)

The GroupBy object’s size() method now returns a MultiIndex Series with a count of observations for each internal group.
The reset_index() method can be used to convert a Series into a DataFrame.
- The name option renames a variable of index when resetting index.

Group Operations

Grouping by Multiple Variables

sector_and_industry.get_group(("Business Services", "Financial Data Services"))

The get_group() method requires a tuple of values to extract a nested DataFrame from the GroupBy collection.

Group Operations

Grouping by Multiple Variables

sector_and_industry.sum()
sector_and_industry["Revenues_M"].mean()

(
    sector_and_industry["Revenues_M"]
    .mean()
    .reset_index(name = "Revenues_mean")
)

For all aggregations, pandas returns a MultiIndex DataFrame with the calculations.

Group Operations

Let’s do Question 7 in Part 2 of Classwork 5!

Lecture 5

Dealing with Missing Values

Pandas Basics

Pandas Basics

Dealing with Missing Values

Pandas Basics

Dealing with Missing Values: The isna() and notna() methods

Pandas Basics

Dealing with Missing Values: The isna() and notna() methods

Pandas Basics

Dealing with Missing Values: The value_counts(dropna = False) method

Pandas Basics

Dealing with Missing Values: The dropna() method

Pandas Basics

Dealing with Missing Values: The dropna() method with how

Pandas Basics

Dealing with Missing Values: The dropna() method with subset

Pandas Basics

Dealing with Missing Values: The dropna() method with subset

Pandas Basics

Dealing with Missing Values: The dropna() method with thresh

Dealing with Duplicates

Pandas Basics

Dealing with Duplicates with the duplicated() method

Pandas Basics

Dealing with Duplicates with the duplicated() method

Pandas Basics

Dealing with Duplicates with the drop_duplicates() method

Pandas Basics

Dealing with Duplicates with the drop_duplicates() method

Pandas Basics

Dealing with Duplicates with the drop_duplicates() method

Pandas Basics

Dealing with Duplicates with the drop_duplicates() method

Pandas Basics

Dealing with Duplicates with the drop_duplicates() method

Pandas Basics

Group Operations

Group Operations

Group Operations

Creating a GroupBy Object

Group Operations

Creating a GroupBy Object

Group Operations

Creating a GroupBy Object

Group Operations

Creating a GroupBy Object

Group Operations

Creating a GroupBy Object

Group Operations

Creating a GroupBy Object

Group Operations

Creating a GroupBy Object

Group Operations

Fortune 1000 dataset

Group Operations

Creating a GroupBy object from a dataset

Group Operations

Creating a GroupBy object from a dataset

Group Operations

Attributes and methods of a GroupBy object

Group Operations

Attributes and methods of a GroupBy object

Group Operations

Attributes and methods of a GroupBy object

Group Operations

Attributes and methods of a GroupBy object

Group Operations

Attributes and methods of a GroupBy object

Group Operations

Aggregation

Group Operations

Aggregation

Group Operations

Aggregation

Group Operations

Aggregation

Group Operations

Aggregation

Group Operations

Dealing with Missing Values: The `isna()` and `notna()` methods

Dealing with Missing Values: The `isna()` and `notna()` methods

Dealing with Missing Values: The `value_counts(dropna = False)` method

Dealing with Missing Values: The `dropna()` method

Dealing with Missing Values: The `dropna()` method with `how`

Dealing with Missing Values: The `dropna()` method with `subset`

Dealing with Missing Values: The `dropna()` method with `subset`

Dealing with Missing Values: The `dropna()` method with `thresh`

Dealing with Duplicates with the `duplicated()` method

Dealing with Duplicates with the `duplicated()` method

Dealing with Duplicates with the `drop_duplicates()` method

Dealing with Duplicates with the `drop_duplicates()` method

Dealing with Duplicates with the `drop_duplicates()` method

Dealing with Duplicates with the `drop_duplicates()` method

Dealing with Duplicates with the `drop_duplicates()` method

Creating a `GroupBy` Object

Creating a `GroupBy` Object

Creating a `GroupBy` Object

Creating a `GroupBy` Object

Creating a `GroupBy` Object

Creating a `GroupBy` Object

Creating a `GroupBy` Object

Creating a `GroupBy` object from a dataset

Creating a `GroupBy` object from a dataset

Attributes and methods of a `GroupBy` object

Attributes and methods of a `GroupBy` object

Attributes and methods of a `GroupBy` object

Attributes and methods of a `GroupBy` object

Attributes and methods of a `GroupBy` object