Lecture 11

pandas Basics - Missing Values; Duplicates

Byeong-Hak Choe

SUNY Geneseo

February 21, 2025

Dealing with Missing Values

Dealing with Missing Values

  • Let’s read employment.csv as emp.
import pandas as pd
# Below is for an interactive display of DataFrame in Colab
from google.colab import data_table
data_table.enable_dataframe_formatter()

emp = pd.read_csv("https://bcdanl.github.io/data/employment.csv")

Dealing with Missing Values

  • Pandas often marks (1) missing text values and (2) missing numeric values with a NaN (not a number);
    • It also marks missing datetime values with a NaT (not a time).

Dealing with Missing Values: The isna() and notna() methods

emp["Team"].isna()
emp["Start Date"].isna()
  • The isna() method returns a Boolean Series in which True denotes that an observation’s value is missing.
    • Is a value of a variable “XYZ” missing?

Dealing with Missing Values: The isna() and notna() methods

# Below two are equivalent.
emp["Team"].notna()
~emp["Team"].isna()
  • The notna() method returns the inverse Series, one in which True indicates that an observation’s value is present.

  • We use the tilde symbol (~) to invert a Boolean Series.

  • Q. How can we pull out employees with non-missing Team values?

Dealing with Missing Values: The value_counts(dropna = False) method

emp["Mgmt"].isna().sum()
emp["Mgmt"].value_counts()
emp["Mgmt"].value_counts(dropna = False)
  • One way to missing data counts is to use the isna().sum() on a Series.
    • True is 1 and False is 0.
  • Another way to get missing data counts is to use the .value_counts() method on a Series.
    • If we use the dropna = False option, we can also get a missing value count.

Dealing with Missing Values: The dropna() method

emp.dropna()
  • The dropna() method removes observations that hold any NaN or NaT values.

Dealing with Missing Values: The dropna() method with how

emp.dropna(how = "all")
  • We can pass the how parameter an argument of "all" to remove observations in which all values are missing.

  • Note that the how parameter’s default argument is "any".

Dealing with Missing Values: The dropna() method with subset

emp.dropna(subset = ["Gender"])
  • We can use the subset parameter to target observations with a missing value in a specific variable.
    • The above example removes observations that have a missing value in the Gender variable.

Dealing with Missing Values: The dropna() method with subset

emp.dropna(subset = ["Start Date", "Salary"])
  • We can also pass the subset parameter a list of variables.

Dealing with Duplicates

Dealing with Duplicates with the duplicated() method

  • Missing values are a common occurrence in messy data sets, and so are duplicate values.
emp["Team"].duplicated()
  • The duplicated() method returns a Boolean Series that identifies duplicates in a variable.

Dealing with Duplicates with the duplicated() method

emp["Team"].duplicated(keep = "first")
emp["Team"].duplicated(keep = "last")
~emp["Team"].duplicated()
  • The duplicated() method’s keep parameter informs pandas which duplicate occurrence to keep.
    • Its default argument, "first", keeps the first occurrence of each duplicate value.
    • Its argument, "last", keeps the last occurrence of each duplicate value.

Dealing with Duplicates with the drop_duplicates() method

emp.drop_duplicates()
  • The drop_duplicates() method removes observations in which all values are equal to those in a previously encountered observations.

Dealing with Duplicates with the drop_duplicates() method

  • Below is an example of the drop_duplicates() method:
# Sample DataFrame with duplicate observations
data = {
    'Name': ['John', 'Anna', 'John', 'Mike', 'Anna'],
    'Age': [28, 23, 28, 32, 23],
    'City': ['New York', 'Paris', 'New York', 'London', 'Paris']
}

# pd.DataFrame( Series, List, or Dict ) creates a DataFrame
df = pd.DataFrame(data)  
df_unique = df.drop_duplicates()

Dealing with Duplicates with the drop_duplicates() method

emp.drop_duplicates(subset = ["Team"])
  • We can pass the drop_duplicates() method a subset parameter with a list of columns that pandas should use to determine an observation’s uniqueness.

    • The above example finds the first occurrence of each unique value in the Team variable.

Dealing with Duplicates with the drop_duplicates() method

emp.drop_duplicates(subset = ["Gender", "Team"])
  • The above example uses a combination of values across the Gender and Team variables to identify duplicates.

Dealing with Duplicates with the drop_duplicates() method

emp.drop_duplicates(subset = ["Team"], keep = "last")
emp.drop_duplicates(subset = ["Team"], keep = False)
  • The drop_duplicates() method also accepts a keep parameter.

    • We can pass it an argument of "last" to keep the observations with each duplicate value’s last occurrence.
    • We can pass it an argument of False to exclude all observations with duplicate values.
  • Q. What does emp.drop_duplicates(subset = ["First Name"], keep = False) do?

  • Q. Find a subset of all employees with a First Name of “Douglas” and a Gender of “Male”. Then check which “Douglas” is in the DataFrame emp.drop_duplicates(subset = ["Gender", "Team"]).

Pandas Basics

Let’s do Questions 7-8 in Classwork 6!