pandas Basics - Missing Values; Duplicates
February 21, 2025
pandas marks missing values with NaN (not a number);
missing datetime values are marked with NaT (not a time).

isna() and notna() methods
The isna() method returns a Boolean Series in which True denotes that an observation’s value is missing.
The notna() method returns the inverse Series, one in which True indicates that an observation’s value is present.
We use the tilde symbol (~) to invert a Boolean Series.
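As a minimal sketch of the three ideas above, the snippet below runs isna(), notna(), and the tilde on a small Series. The team and hired Series are hypothetical stand-ins, not the course's emp data.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value (not the course's emp data)
team = pd.Series(["Marketing", np.nan, "Finance"], name="Team")

is_missing = team.isna()    # True where the value is missing
is_present = team.notna()   # True where the value is present
inverted = ~is_missing      # the tilde flips every True/False

print(is_missing.tolist())          # [False, True, False]
print(inverted.equals(is_present))  # True

# Missing datetime values show up as NaT and are detected the same way
hired = pd.Series([pd.Timestamp("2025-02-21"), pd.NaT])
print(hired.isna().tolist())        # [False, True]
```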
Q. How can we pull out employees with non-missing Team values?
value_counts(dropna = False) method
One way to count missing values is to call isna().sum() on a Series: in the sum, True is 1 and False is 0.
Another way is to call the .value_counts() method on a Series. With the dropna = False option, we can also get a missing value count.

dropna() method
The dropna() method removes observations that hold any NaN or NaT values.

dropna() method with how
We can pass the how parameter an argument of "all" to remove observations in which all values are missing.
Note that the how parameter’s default argument is "any".
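The counting and dropping steps above can be sketched on a tiny DataFrame. The data here is hypothetical and chosen so that one observation is missing a single value and another is missing every value.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 is missing only Team, row 3 is missing everything
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy", np.nan],
    "Team": ["Sales", np.nan, "Sales", np.nan],
})

# Counting missing values: True counts as 1, False as 0
n_missing = df["Team"].isna().sum()
print(n_missing)  # 2

# With dropna=False, NaN gets its own entry in the counts
team_counts = df["Team"].value_counts(dropna=False)

# Default how="any": a row is dropped if any value is missing
drop_any = df.dropna()
print(drop_any.index.tolist())  # [0, 2]

# how="all": a row is dropped only if every value is missing
drop_all = df.dropna(how="all")
print(drop_all.index.tolist())  # [0, 1, 2]
```

Note that dropna() returns a new DataFrame and leaves df unchanged.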

dropna() method with subset
We can use the subset parameter to target observations with a missing value in a specific variable.
For example, emp.dropna(subset = ["Gender"]) removes observations with a missing value in the Gender variable.
We can also pass the subset parameter a list of variables.

duplicated() method
The duplicated() method returns a Boolean Series that identifies duplicates in a variable.
emp["Team"].duplicated(keep = "first")
emp["Team"].duplicated(keep = "last")
~emp["Team"].duplicated()
The duplicated() method’s keep parameter informs pandas which duplicate occurrence to keep, that is, which occurrence is marked False rather than flagged as a duplicate.
Its default argument, "first", keeps the first occurrence of each duplicate value.
An argument of "last" keeps the last occurrence of each duplicate value.

drop_duplicates() method
The drop_duplicates() method removes observations in which all values are equal to those in a previously encountered observation.
A simple example of the drop_duplicates() method:
import pandas as pd

# Sample DataFrame with duplicate observations
data = {
    'Name': ['John', 'Anna', 'John', 'Mike', 'Anna'],
    'Age': [28, 23, 28, 32, 23],
    'City': ['New York', 'Paris', 'New York', 'London', 'Paris']
}
# pd.DataFrame( Series, List, or Dict ) creates a DataFrame
df = pd.DataFrame(data)
# Rows 2 and 4 repeat rows 0 and 1, so only rows 0, 1, and 3 remain
df_unique = df.drop_duplicates()

We can pass the drop_duplicates() method a subset parameter with a list of columns that pandas should use to determine an observation’s uniqueness.
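To illustrate how subset changes what counts as a duplicate, here is a minimal sketch on hypothetical data in which no two observations match in full but the City values repeat.

```python
import pandas as pd

# Hypothetical data: every row is unique as a whole, but City values repeat
df = pd.DataFrame({
    "Name": ["John", "Anna", "Mike", "Lena"],
    "City": ["New York", "Paris", "New York", "Paris"],
})

# Without subset, all columns are compared, so nothing is removed
all_columns = df.drop_duplicates()
print(len(all_columns))  # 4

# With subset, only City is compared; the first row per City is kept
by_city = df.drop_duplicates(subset=["City"])
print(by_city["Name"].tolist())  # ['John', 'Anna']
```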
For example, emp.drop_duplicates(subset = ["Gender", "Team"]) uses the Gender and Team variables to identify duplicates.

emp.drop_duplicates(subset = ["Team"], keep = "last")
emp.drop_duplicates(subset = ["Team"], keep = False)
The drop_duplicates() method also accepts a keep parameter.
We can pass "last" to keep the observations with each duplicate value’s last occurrence.
We can pass False to exclude all observations with duplicate values.
Q. What does emp.drop_duplicates(subset = ["First Name"], keep = False) do?
Q. Find a subset of all employees with a First Name of “Douglas” and a Gender of “Male”. Then check which “Douglas” is in the DataFrame emp.drop_duplicates(subset = ["Gender", "Team"]).
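When checking answers to questions like these, it may help to see how the keep parameter links duplicated() and drop_duplicates(). The sketch below uses hypothetical data, not the emp DataFrame, and loops over the three keep arguments discussed earlier.

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy", "Dee"],
    "Team": ["Sales", "Sales", "Legal", "Sales"],
})

for keep in ["first", "last", False]:
    flagged = df.duplicated(subset=["Team"], keep=keep)
    kept = df.drop_duplicates(subset=["Team"], keep=keep)
    # drop_duplicates() keeps exactly the rows that duplicated() does not flag
    assert kept.equals(df[~flagged])
    print(keep, kept["Name"].tolist())
# first ['Ann', 'Cy']
# last ['Cy', 'Dee']
# False ['Cy']
```

With keep = False, every observation whose Team appears more than once is removed, so only the single Legal row survives.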
Let’s do Questions 7-8 in Classwork 6!