pandas
Basics - Missing Values; Duplicates
February 21, 2025
pandas marks a missing value with NaN (not a number) or, for datetime values, NaT (not a time).

isna() and notna() methods
The isna() method returns a Boolean Series in which True denotes that an observation’s value is missing.
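As a minimal sketch of isna(), the block below uses a small made-up DataFrame named demo as a stand-in for the course data. The names and the Start Date column are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the course's employee data (assumption)
demo = pd.DataFrame({
    "First Name": ["Ana", "Ben", "Cy", "Dee"],
    "Team": ["Sales", np.nan, "Legal", np.nan],
    "Start Date": pd.to_datetime(["2020-01-15", None, "2021-06-01", "2019-03-30"]),
})

team_missing = demo["Team"].isna()        # True where Team is NaN
date_missing = demo["Start Date"].isna()  # True where Start Date is NaT

print(team_missing.tolist())  # [False, True, False, True]
print(date_missing.tolist())  # [False, True, False, False]
```

Note that isna() treats NaN and NaT the same way: both count as missing.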
The notna() method returns the inverse Series, one in which True indicates that an observation’s value is present.
We use the tilde symbol (~) to invert a Boolean Series.
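A short sketch of the relationship between notna() and the tilde, using a made-up Team Series (the values are assumptions, not the course data):

```python
import numpy as np
import pandas as pd

# Made-up Team values (assumption)
team = pd.Series(["Sales", np.nan, "Legal", np.nan], name="Team")

present = team.notna()        # True where a value is present
inverted_isna = ~team.isna()  # the tilde flips every True/False

print(present.tolist())               # [True, False, True, False]
print(present.equals(inverted_isna))  # True
```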
Q. How can we pull out employees with non-missing Team values?

value_counts(dropna = False) method
One way to count missing values is to call isna().sum() on a Series, because True is 1 and False is 0.
Another way is to use the .value_counts() method on a Series.
With the dropna = False option, we can also get a missing value count.
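The two counting approaches can be sketched side by side. The Series below is made up for illustration; only isna(), sum(), and value_counts() come from the slides.

```python
import numpy as np
import pandas as pd

# Made-up Team values with three missing entries (assumption)
team = pd.Series(["Sales", np.nan, "Legal", "Sales", np.nan, np.nan], name="Team")

n_missing = team.isna().sum()                      # True counts as 1, False as 0
counts_default = team.value_counts()               # NaN is left out by default
counts_with_nan = team.value_counts(dropna=False)  # NaN gets its own row

print(n_missing)        # 3
print(counts_default)   # Sales 2, Legal 1
print(counts_with_nan)  # NaN 3, Sales 2, Legal 1
```

By default value_counts() leaves missing values out, so the two results differ by exactly the NaN row.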

dropna() method
The dropna() method removes observations that hold any NaN or NaT values.

dropna() method with how
We can pass the how parameter an argument of "all" to remove observations in which all values are missing.
Note that the how parameter’s default argument is "any".
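To compare the two how arguments, here is a small sketch on a made-up DataFrame (demo and its values are assumptions). One row is partly missing and one row is entirely missing.

```python
import numpy as np
import pandas as pd

# Made-up data: row 1 is partly missing, row 2 is entirely missing (assumption)
demo = pd.DataFrame({
    "First Name": ["Ana", "Ben", np.nan, "Dee"],
    "Team": ["Sales", np.nan, np.nan, "Legal"],
})

any_dropped = demo.dropna()           # default, same as how="any"
all_dropped = demo.dropna(how="all")  # drops only the all-missing row

print(any_dropped.index.tolist())  # [0, 3]
print(all_dropped.index.tolist())  # [0, 1, 3]
```

Both calls return a new DataFrame and leave demo unchanged.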

dropna() method with subset
We can use the subset parameter to target observations with a missing value in a specific variable.
For example, we can remove only the observations with a missing value in the Gender variable.
We can also pass the subset parameter a list of variables.
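A minimal sketch of subset with one variable and with a list of variables. The demo DataFrame is made up; its Gender and Team columns only mirror the variable names used in the slides.

```python
import numpy as np
import pandas as pd

# Made-up data (assumption): Gender is missing in row 1, Team in row 2
demo = pd.DataFrame({
    "First Name": ["Ana", "Ben", "Cy", "Dee"],
    "Gender": ["Female", np.nan, "Male", "Female"],
    "Team": ["Sales", "Legal", np.nan, "Legal"],
})

gender_known = demo.dropna(subset=["Gender"])        # only Gender is checked
both_known = demo.dropna(subset=["Gender", "Team"])  # Gender and Team are checked

print(gender_known.index.tolist())  # [0, 2, 3]
print(both_known.index.tolist())    # [0, 3]
```

Row 2 survives the first call even though its Team is missing, because only Gender is inspected.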

duplicated() method
The duplicated() method returns a Boolean Series that identifies duplicates in a variable.
emp["Team"].duplicated(keep = "first")
emp["Team"].duplicated(keep = "last")
~emp["Team"].duplicated()
The duplicated() method’s keep parameter informs pandas which duplicate occurrence to keep, that is, which occurrence to mark as False.
The default argument, "first", keeps the first occurrence of each duplicate value.
An argument of "last" keeps the last occurrence of each duplicate value.
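The effect of keep is easiest to see on a short Series. The sketch below uses made-up Team values (an assumption) in place of emp["Team"].

```python
import pandas as pd

# Made-up Team values (assumption): Sales appears three times, Legal twice
team = pd.Series(["Sales", "Legal", "Sales", "IT", "Legal", "Sales"], name="Team")

dup_first = team.duplicated()             # keep="first" is the default
dup_last = team.duplicated(keep="last")   # the last occurrence is marked False
first_occurrences = team[~team.duplicated()]

print(dup_first.tolist())          # [False, False, True, False, True, True]
print(dup_last.tolist())           # [True, True, True, False, False, False]
print(first_occurrences.tolist())  # ['Sales', 'Legal', 'IT']
```

The occurrence that is "kept" is the one marked False, so inverting the result with the tilde selects one row per distinct value.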

drop_duplicates() method
The drop_duplicates() method removes observations in which all values are equal to those in a previously encountered observation.
Here is an example of the drop_duplicates() method:
import pandas as pd

# Sample DataFrame with duplicate observations
data = {
    'Name': ['John', 'Anna', 'John', 'Mike', 'Anna'],
    'Age': [28, 23, 28, 32, 23],
    'City': ['New York', 'Paris', 'New York', 'London', 'Paris']
}
# pd.DataFrame( Series, List, or Dict ) creates a DataFrame
df = pd.DataFrame(data)
# Rows 2 and 4 repeat rows 0 and 1, so only rows 0, 1, and 3 remain
df_unique = df.drop_duplicates()
We can pass the drop_duplicates() method a subset parameter with a list of columns that pandas should use to determine an observation’s uniqueness.
For example, we can use the combination of values in the Gender and Team variables to identify duplicates.
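A small sketch of how subset changes which observations count as duplicates. The demo DataFrame and its names are made up; only the column names follow the slides.

```python
import pandas as pd

# Made-up data (assumption): no two rows are identical in every column
demo = pd.DataFrame({
    "First Name": ["Ana", "Ben", "Cy", "Dee", "Eli"],
    "Gender": ["Female", "Male", "Female", "Male", "Male"],
    "Team": ["Sales", "Sales", "Sales", "Legal", "Sales"],
})

all_columns = demo.drop_duplicates()                              # compares whole rows
by_team = demo.drop_duplicates(subset=["Team"])                   # one row per Team
by_gender_team = demo.drop_duplicates(subset=["Gender", "Team"])  # one row per pair

print(len(all_columns))                       # 5
print(by_team["First Name"].tolist())         # ['Ana', 'Dee']
print(by_gender_team["First Name"].tolist())  # ['Ana', 'Ben', 'Dee']
```

With a list of columns, an observation is dropped only when its whole combination of values has already been seen.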
emp.drop_duplicates(subset = ["Team"], keep = "last")
emp.drop_duplicates(subset = ["Team"], keep = False)
The drop_duplicates() method also accepts a keep parameter.
We can pass it "last" to keep the observations with each duplicate value’s last occurrence.
We can pass it False to exclude all observations with duplicate values.
Q. What does emp.drop_duplicates(subset = ["First Name"], keep = False) do?
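As a hedged illustration of the three keep arguments, the sketch below applies them to the Team column of a made-up DataFrame (demo and its values are assumptions).

```python
import pandas as pd

# Made-up data (assumption): Sales appears twice, Legal and IT once each
demo = pd.DataFrame({
    "First Name": ["Ana", "Ben", "Cy", "Dee"],
    "Team": ["Sales", "Legal", "Sales", "IT"],
})

keep_first = demo.drop_duplicates(subset=["Team"])               # default keep="first"
keep_last = demo.drop_duplicates(subset=["Team"], keep="last")
keep_none = demo.drop_duplicates(subset=["Team"], keep=False)

print(keep_first["First Name"].tolist())  # ['Ana', 'Ben', 'Dee']
print(keep_last["First Name"].tolist())   # ['Ben', 'Cy', 'Dee']
print(keep_none["First Name"].tolist())   # ['Ben', 'Dee']
```

With keep=False, only observations whose Team value appears exactly once remain.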
Q. Find a subset of all employees with a First Name of “Douglas” and a Gender of “Male”. Then check which “Douglas” is in the DataFrame emp.drop_duplicates(subset = ["Gender", "Team"]).
Let’s do Questions 7-8 in Classwork 6!