Lecture 9

Pandas Fundamentals IV: Missing Values; Duplicate Values

Byeong-Hak Choe

SUNY Geneseo

April 6, 2026

📂 emp DataFrame

  • Let’s read employment.csv as emp.
import pandas as pd
import numpy as np

# Below is for an interactive display of DataFrame in Colab
from google.colab import data_table
data_table.enable_dataframe_formatter()

emp = pd.read_csv("https://bcdanl.github.io/data/employment.csv")

🕳️ Dealing with Missing Values

📂 Reloading emp for Missing-Value Examples

  • Let’s read employment.csv as emp.
import pandas as pd
# Below is for an interactive display of DataFrame in Colab
from google.colab import data_table
data_table.enable_dataframe_formatter()

emp = pd.read_csv("https://bcdanl.github.io/data/employment.csv")

❓ What NaN and NaT Mean in pandas

  • pandas often marks missing text and numeric values with NaN (not a number).
    • Missing datetime values are usually marked with NaT (not a time).

🔎 Detecting Missing Values with isna()

emp["Team"].isna()
emp["Start Date"].isna()
  • The isna() method returns a Boolean Series in which True denotes that an observation’s value is missing.
    • Is a value of a variable “XYZ” missing?

✅ Detecting Non-Missing Values with notna()

# Below two are equivalent.
emp["Team"].notna()
~emp["Team"].isna()
  • The notna() method returns the inverse Series, one in which True indicates that an observation’s value is present.

  • We use the tilde symbol (~) to invert a Boolean Series.

  • Q. How can we pull out employees with non-missing Team values?

🔢 Counting Missing Values in Different Ways

emp["Mgmt"].isna().sum()
emp["Mgmt"].value_counts()
emp["Mgmt"].value_counts(dropna = False)
  • One way to count missing values is to use isna().sum() on a Series.
    • Here, True is treated like 1 and False like 0.
  • Another way is to use .value_counts() on a Series.
    • With dropna = False, pandas also reports the number of missing values.

🗑️ Removing Rows with Any Missing Values

emp.dropna()
  • The dropna() method removes observations that hold any NaN or NaT values.

🧹 Removing Rows Only When Everything Is Missing

emp.dropna(how = "all")
  • We can pass the how parameter an argument of "all" to remove observations in which all values are missing.

  • Note that the how parameter’s default argument is "any".

🎯 Dropping Rows Based on One Key Column

emp.dropna(subset = ["Gender"])
  • We can use the subset parameter to target observations with a missing value in a specific variable.
    • The above example removes observations that have a missing value in the Gender variable.

🎯 Dropping Rows Based on Multiple Key Columns

emp.dropna(subset = ["Start Date", "Salary"])
  • We can also pass the subset parameter a list of variables.

👯 Dealing with Duplicates

🪞 Spotting Repeated Values with duplicated()

  • Missing values are a common occurrence in messy data sets, and so are duplicate values.
emp["Team"].duplicated()
  • The duplicated() method returns a Boolean Series that identifies whether a value has already appeared earlier in the variable.

🔁 Controlling Which Duplicate to Keep

emp["Team"].duplicated(keep = "first")
emp["Team"].duplicated(keep = "last")
~emp["Team"].duplicated()
  • The duplicated() method’s keep parameter informs pandas which duplicate occurrence to keep.
    • 'first' (default): Marks duplicates after the first as True
    • 'last': Marks duplicates before the last as True
    • False: Marks all duplicates as True

📌 Detecting Duplicates Using One Column

emp.duplicated(subset = ["Team"])
  • We can pass a subset argument to duplicated() so that pandas uses only selected columns to determine duplication.

    • The above example marks rows as True when the value in Team has already appeared earlier.

🧩 Detecting Duplicates Using a Column Combination

emp.duplicated(subset = ["Gender", "Team"])
  • The above example uses a combination of values across the Gender and Team variables to identify duplicates.

    • A row is marked as True only when the same combination of Gender and Team has appeared before.

✂️ Removing Fully Duplicated Rows

emp.drop_duplicates()
  • The drop_duplicates() method removes observations whose full rows are identical to rows that appeared earlier.

🧪 Small Example of drop_duplicates()

  • Below is an example of the drop_duplicates() method:
# Sample DataFrame with duplicate observations
data = {
    'Name': ['John', 'Anna', 'John', 'Mike', 'Anna'],
    'Age': [28, 23, 28, 32, 23],
    'City': ['New York', 'Paris', 'New York', 'London', 'Paris']
}

# pd.DataFrame( Series, List, or Dict ) creates a DataFrame
df = pd.DataFrame(data)  
df_unique = df.drop_duplicates()

📌 Dropping Duplicates Using One Column

emp.drop_duplicates(subset = ["Team"])
  • We can pass a subset argument to drop_duplicates() so that pandas uses only selected columns to determine uniqueness.

    • The above example keeps the first occurrence of each unique value in the Team variable.

🧩 Dropping Duplicates Using a Column Combination

emp.drop_duplicates(subset = ["Gender", "Team"])
  • The above example uses a combination of values across the Gender and Team variables to identify duplicates.

⚙️ Using keep Inside drop_duplicates()

emp.drop_duplicates(subset = ["Team"], keep = "last")
emp.drop_duplicates(subset = ["Team"], keep = False)
  • The drop_duplicates() method also accepts a keep parameter.

    • 'first' (default): Keeps the first occurrence, removes the rest
    • 'last': Keeps the last occurrence, removes the rest
    • False: Removes all duplicates, keeping only unique entries
  • Q. What does emp.drop_duplicates(subset = ["First Name"], keep = False) do?

  • Q. Find a subset of all employees with a First Name of “Douglas” and a Gender of “Male”. Then check which “Douglas” is in the DataFrame emp.drop_duplicates(subset = ["Gender", "Team"]).

🚀 Classwork 13: Pandas Fundamentals

Let’s do Classwork 13!