Pandas Fundamentals IV: Missing Values; Duplicate Values
April 6, 2026
emp DataFrame
emp for Missing-Value Examples

What NaN and NaT Mean in pandas
pandas marks a missing value as NaN (not a number).
For missing dates and times, pandas uses NaT (not a time).

isna()
The isna() method returns a Boolean Series in which True denotes that an observation’s value is missing.

notna()
The notna() method returns the inverse Series, one in which True indicates that an observation’s value is present.
We use the tilde symbol (~) to invert a Boolean Series.
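As a minimal sketch of isna(), notna(), and the tilde described above, the snippet below builds a small hypothetical DataFrame. The name demo, its columns, and its values are assumptions for illustration only and are not the course's emp data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data (not the course's emp DataFrame).
demo = pd.DataFrame({
    "First Name": ["Douglas", "Maria", None, "Jerry"],
    "Team": ["Marketing", np.nan, "Finance", np.nan],
    # A missing date is stored as NaT in a datetime column.
    "Start Date": pd.to_datetime(["2020-01-15", None, "2019-07-01", "2021-03-09"]),
})

team_missing = demo["Team"].isna()    # True where Team is missing
team_present = demo["Team"].notna()   # True where Team is present
inverted = ~team_missing              # the tilde flips True and False

print(team_missing.tolist())
print(team_present.tolist())
print(inverted.equals(team_present))       # the two approaches agree
print(demo["Start Date"].isna().tolist())  # isna() also detects NaT
```

Inverting the isna() result with the tilde gives the same Series as calling notna() directly, so either form can be used.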
Q. How can we pull out employees with non-missing Team values?
Calling isna().sum() on a Series counts its missing values.
True is treated like 1 and False like 0.
We can also call .value_counts() on a Series.
With dropna = False, pandas also reports the number of missing values.

The dropna() method removes observations that hold any NaN or NaT values.
We can pass the how parameter an argument of "all" to remove observations in which all values are missing.
Note that the how parameter’s default argument is "any".
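The counting and dropping behavior described above can be sketched on a small hypothetical DataFrame. The name demo and all of its values are made up for illustration. The last row is deliberately missing in every column so that how="any" and how="all" give different results.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the last row is missing in every column.
demo = pd.DataFrame({
    "First Name": ["Douglas", "Maria", "Jerry", None],
    "Team": ["Marketing", None, "Marketing", None],
    "Salary": [50000.0, np.nan, 60000.0, np.nan],
})

# True counts as 1 and False as 0, so the sum is the number of missing values.
n_missing_team = demo["Team"].isna().sum()

# dropna=False adds an entry for the missing values to the counts.
team_counts = demo["Team"].value_counts(dropna=False)

dropped_any = demo.dropna()            # how="any" is the default
dropped_all = demo.dropna(how="all")   # drops only rows with no values at all

print(n_missing_team)
print(team_counts)
print(len(dropped_any), len(dropped_all))
```

With this data, dropna() keeps only the two fully populated rows, while dropna(how="all") removes just the final, entirely empty row.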
We can use the subset parameter to target observations with a missing value in a specific variable.
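A minimal sketch of the subset parameter, assuming a small hypothetical DataFrame named demo whose columns and values are invented for illustration:

```python
import pandas as pd

# Hypothetical data with missing values in two different columns.
demo = pd.DataFrame({
    "First Name": ["Douglas", "Maria", "Jerry", "Lois"],
    "Gender": ["Male", None, "Male", "Female"],
    "Team": ["Marketing", "Finance", None, None],
})

# Only a missing Gender causes a row to be dropped;
# rows with a missing Team are left in place.
gender_known = demo.dropna(subset=["Gender"])

print(gender_known)
```

Only the row with the missing Gender is removed here; the two rows with a missing Team survive because Team is not listed in subset.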
Passing subset = ["Gender"], for instance, removes only the observations with a missing value in the Gender variable.
We can also pass the subset parameter a list of variables.

duplicated()
The duplicated() method returns a Boolean Series that identifies whether a value has already appeared earlier in the variable.
emp["Team"].duplicated(keep = "first")
emp["Team"].duplicated(keep = "last")
~emp["Team"].duplicated()
The duplicated() method’s keep parameter informs pandas which duplicate occurrence to keep.
'first' (default): Marks duplicates after the first as True
'last': Marks duplicates before the last as True
False: Marks all duplicates as True

When called on a DataFrame, duplicated() accepts a subset argument so that pandas uses only selected columns to determine duplication.
With subset = ["Team"], the result is True when the value in Team has already appeared earlier.
With subset = ["Gender", "Team"], pandas uses a combination of values across the Gender and Team variables to identify duplicates.
The result is then True only when the same combination of Gender and Team has appeared before.

The drop_duplicates() method removes observations whose full rows are identical to rows that appeared earlier.

drop_duplicates()
A small example of the drop_duplicates() method:
import pandas as pd
# Sample DataFrame with duplicate observations
data = {
    'Name': ['John', 'Anna', 'John', 'Mike', 'Anna'],
    'Age': [28, 23, 28, 32, 23],
    'City': ['New York', 'Paris', 'New York', 'London', 'Paris']
}
# pd.DataFrame( Series, List, or Dict ) creates a DataFrame
df = pd.DataFrame(data)
# The rows at index 2 and 4 repeat earlier rows, so three rows remain
df_unique = df.drop_duplicates()

We can pass a subset argument to drop_duplicates() so that pandas uses only selected columns to determine uniqueness.
With subset = ["Team"], pandas identifies duplicates using only the Team variable.
With subset = ["Gender", "Team"], pandas uses a combination of values across the Gender and Team variables to identify duplicates.

keep Inside drop_duplicates()
emp.drop_duplicates(subset = ["Team"], keep = "last")
emp.drop_duplicates(subset = ["Team"], keep = False)
The drop_duplicates() method also accepts a keep parameter.
'first' (default): Keeps the first occurrence, removes the rest
'last': Keeps the last occurrence, removes the rest
False: Removes every occurrence of a duplicated value, keeping only entries that appear exactly once

Q. What does emp.drop_duplicates(subset = ["First Name"], keep = False) do?
Q. Find a subset of all employees with a First Name of “Douglas” and a Gender of “Male”. Then check which “Douglas” is in the DataFrame emp.drop_duplicates(subset = ["Gender", "Team"]).
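Before the classwork, the keep and subset options of duplicated() and drop_duplicates() discussed above can be tried on a small hypothetical DataFrame. The name demo and every value in it are assumptions for illustration and are unrelated to the emp data used in the questions.

```python
import pandas as pd

# Hypothetical employees (not the course's emp data).
demo = pd.DataFrame({
    "First Name": ["Ann", "Bob", "Cara", "Dan", "Eve"],
    "Gender": ["Female", "Male", "Female", "Male", "Female"],
    "Team": ["Sales", "Sales", "Legal", "Sales", "Sales"],
})

# duplicated() on a Series: flag repeated Team values.
dup_first = demo["Team"].duplicated()              # keep="first" is the default
dup_last = demo["Team"].duplicated(keep="last")    # the last occurrence is not flagged
dup_all = demo["Team"].duplicated(keep=False)      # every repeated value is flagged

# duplicated() on a DataFrame: compare Gender and Team together.
dup_pair = demo.duplicated(subset=["Gender", "Team"])

# drop_duplicates() removes the rows that duplicated() would flag.
first_per_team = demo.drop_duplicates(subset=["Team"])
last_per_team = demo.drop_duplicates(subset=["Team"], keep="last")
unique_teams_only = demo.drop_duplicates(subset=["Team"], keep=False)
first_per_pair = demo.drop_duplicates(subset=["Gender", "Team"])

print(dup_first.tolist())
print(dup_last.tolist())
print(dup_all.tolist())
print(dup_pair.tolist())
print(first_per_team["First Name"].tolist())
print(last_per_team["First Name"].tolist())
print(unique_teams_only["First Name"].tolist())
print(first_per_pair["First Name"].tolist())
```

Comparing the printed Boolean lists with the surviving First Name values shows that each drop_duplicates() call keeps exactly the rows that the matching duplicated() call marks as False.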
Let’s do Classwork 13!