Lecture 9

Pandas Fundamentals IV: Missing Values; Duplicate Values

Byeong-Hak Choe

SUNY Geneseo

April 6, 2026

📂 `emp` DataFrame

Let’s read employment.csv as emp.

import pandas as pd
import numpy as np

# Below is for an interactive display of DataFrame in Colab
from google.colab import data_table
data_table.enable_dataframe_formatter()

emp = pd.read_csv("https://bcdanl.github.io/data/employment.csv")

🕳️ Dealing with Missing Values

📂 Reloading `emp` for Missing-Value Examples

Let’s read employment.csv as emp.

import pandas as pd
# Below is for an interactive display of DataFrame in Colab
from google.colab import data_table
data_table.enable_dataframe_formatter()

emp = pd.read_csv("https://bcdanl.github.io/data/employment.csv")

❓ What `NaN` and `NaT` Mean in pandas

pandas often marks missing text and numeric values with NaN (not a number).
- Missing datetime values are usually marked with NaT (not a time).

🔎 Detecting Missing Values with `isna()`

emp["Team"].isna()
emp["Start Date"].isna()

The isna() method returns a Boolean Series in which True denotes that an observation’s value is missing.
- Is a value of a variable “XYZ” missing?

✅ Detecting Non-Missing Values with `notna()`

# Below two are equivalent.
emp["Team"].notna()
~emp["Team"].isna()

The notna() method returns the inverse Series, one in which True indicates that an observation’s value is present.
We use the tilde symbol (~) to invert a Boolean Series.
Q. How can we pull out employees with non-missing Team values?

🔢 Counting Missing Values in Different Ways

emp["Mgmt"].isna().sum()
emp["Mgmt"].value_counts()
emp["Mgmt"].value_counts(dropna = False)

One way to count missing values is to use isna().sum() on a Series.
- Here, True is treated like 1 and False like 0.
Another way is to use .value_counts() on a Series.
- With dropna = False, pandas also reports the number of missing values.

🗑️ Removing Rows with Any Missing Values

emp.dropna()

The dropna() method removes observations that hold any NaN or NaT values.

🧹 Removing Rows Only When Everything Is Missing

emp.dropna(how = "all")

We can pass the how parameter an argument of "all" to remove observations in which all values are missing.
Note that the how parameter’s default argument is "any".

🎯 Dropping Rows Based on One Key Column

emp.dropna(subset = ["Gender"])

We can use the subset parameter to target observations with a missing value in a specific variable.
- The above example removes observations that have a missing value in the Gender variable.

🎯 Dropping Rows Based on Multiple Key Columns

emp.dropna(subset = ["Start Date", "Salary"])

We can also pass the subset parameter a list of variables.

👯 Dealing with Duplicates

🪞 Spotting Repeated Values with `duplicated()`

Missing values are a common occurrence in messy data sets, and so are duplicate values.

emp["Team"].duplicated()

The duplicated() method returns a Boolean Series that identifies whether a value has already appeared earlier in the variable.

🔁 Controlling Which Duplicate to Keep

emp["Team"].duplicated(keep = "first")
emp["Team"].duplicated(keep = "last")
~emp["Team"].duplicated()

The duplicated() method’s keep parameter informs pandas which duplicate occurrence to keep.
- 'first' (default): Marks duplicates after the first as True
- 'last': Marks duplicates before the last as True
- False: Marks all duplicates as True

📌 Detecting Duplicates Using One Column

emp.duplicated(subset = ["Team"])

We can pass a subset argument to duplicated() so that pandas uses only selected columns to determine duplication.
- The above example marks rows as True when the value in Team has already appeared earlier.

🧩 Detecting Duplicates Using a Column Combination

emp.duplicated(subset = ["Gender", "Team"])

The above example uses a combination of values across the Gender and Team variables to identify duplicates.
- A row is marked as True only when the same combination of Gender and Team has appeared before.

✂️ Removing Fully Duplicated Rows

emp.drop_duplicates()

The drop_duplicates() method removes observations whose full rows are identical to rows that appeared earlier.

🧪 Small Example of `drop_duplicates()`

Below is an example of the drop_duplicates() method:

# Sample DataFrame with duplicate observations
data = {
    'Name': ['John', 'Anna', 'John', 'Mike', 'Anna'],
    'Age': [28, 23, 28, 32, 23],
    'City': ['New York', 'Paris', 'New York', 'London', 'Paris']
}

# pd.DataFrame( Series, List, or Dict ) creates a DataFrame
df = pd.DataFrame(data)  
df_unique = df.drop_duplicates()

📌 Dropping Duplicates Using One Column

emp.drop_duplicates(subset = ["Team"])

We can pass a subset argument to drop_duplicates() so that pandas uses only selected columns to determine uniqueness.
- The above example keeps the first occurrence of each unique value in the Team variable.

🧩 Dropping Duplicates Using a Column Combination

emp.drop_duplicates(subset = ["Gender", "Team"])

The above example uses a combination of values across the Gender and Team variables to identify duplicates.

⚙️ Using `keep` Inside `drop_duplicates()`

emp.drop_duplicates(subset = ["Team"], keep = "last")
emp.drop_duplicates(subset = ["Team"], keep = False)

The drop_duplicates() method also accepts a keep parameter.
- 'first' (default): Keeps the first occurrence, removes the rest
- 'last': Keeps the last occurrence, removes the rest
- False: Removes all duplicates, keeping only unique entries
Q. What does emp.drop_duplicates(subset = ["First Name"], keep = False) do?
Q. Find a subset of all employees with a First Name of “Douglas” and a Gender of “Male”. Then check which “Douglas” is in the DataFrame emp.drop_duplicates(subset = ["Gender", "Team"]).

🚀 Classwork 13: Pandas Fundamentals

Let’s do Classwork 13!

Lecture 9

📂 emp DataFrame

🕳️ Dealing with Missing Values

📂 Reloading emp for Missing-Value Examples

❓ What NaN and NaT Mean in pandas

🔎 Detecting Missing Values with isna()

✅ Detecting Non-Missing Values with notna()

🔢 Counting Missing Values in Different Ways

🗑️ Removing Rows with Any Missing Values

🧹 Removing Rows Only When Everything Is Missing

🎯 Dropping Rows Based on One Key Column

🎯 Dropping Rows Based on Multiple Key Columns

👯 Dealing with Duplicates

🪞 Spotting Repeated Values with duplicated()

🔁 Controlling Which Duplicate to Keep

📌 Detecting Duplicates Using One Column

🧩 Detecting Duplicates Using a Column Combination

✂️ Removing Fully Duplicated Rows

🧪 Small Example of drop_duplicates()

📌 Dropping Duplicates Using One Column

🧩 Dropping Duplicates Using a Column Combination

⚙️ Using keep Inside drop_duplicates()

🚀 Classwork 13: Pandas Fundamentals

📂 `emp` DataFrame

📂 Reloading `emp` for Missing-Value Examples

❓ What `NaN` and `NaT` Mean in pandas

🔎 Detecting Missing Values with `isna()`

✅ Detecting Non-Missing Values with `notna()`

🪞 Spotting Repeated Values with `duplicated()`

🧪 Small Example of `drop_duplicates()`

⚙️ Using `keep` Inside `drop_duplicates()`