Pandas Fundamentals IV: Counting Options; Missing Values; Duplicate Values
April 13, 2026
.value_counts()
.value_counts() examples
import pandas as pd
import numpy as np
# Below is for an interactive display of DataFrame in Colab
from google.colab import data_table
data_table.enable_dataframe_formatter()
ncaa = pd.DataFrame({
'Team': ['Lakers', 'Celtics', 'Lakers', 'Bulls', 'Heat', 'Bulls', 'Lakers', 'Heat', np.nan, 'Celtics'],
'Position': ['PG', 'F', 'G', 'C', 'G', 'F', 'F', 'C', 'G', 'F'],
'Age': [25, 29, 25, 31, 27, 22, 30, 27, 24, 29],
'College': ['Duke', 'Kentucky', np.nan, 'UCLA', 'Duke', 'Gonzaga', 'Arizona', np.nan, 'Duke', 'Kentucky']
})

dropna = False
dropna = False lets us count NaN values too.

normalize = True

sort = False
.value_counts() sorts from the most frequent to the least frequent.
sort = False keeps the categories in the order they first appear.

ascending = True

bins =

Let’s do Classwork 13!
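The .value_counts() options listed above can be tried side by side. The sketch below is a minimal illustration, assuming a shortened, hypothetical version of the Team and Age columns rather than the full ncaa DataFrame; the variable names are placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical mini versions of the Team and Age columns, for illustration only
team = pd.Series(['Lakers', 'Celtics', 'Lakers', 'Bulls', np.nan, 'Lakers'])
age = pd.Series([25, 29, 25, 31, 27, 22])

counts = team.value_counts()                       # NaN excluded, most frequent first
counts_with_nan = team.value_counts(dropna=False)  # NaN gets its own row
shares = team.value_counts(normalize=True)         # proportions instead of counts
unsorted = team.value_counts(sort=False)           # not sorted by frequency
rarest_first = team.value_counts(ascending=True)   # least frequent first

# bins= groups numeric values into equal-width intervals before counting
age_bins = age.value_counts(bins=3, sort=False)

print(counts_with_nan)
print(age_bins)
```

Note that normalize = True divides by the number of non-missing values unless dropna = False is also passed.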
emp DataFrame
emp for Missing-Value Examples

What <NA>, NaN, and NaT Mean in pandas
In a DataFrame, special markers are used to represent missing values.
NaN indicates a missing numeric value.
NaT indicates a missing date or time value.
<NA> is pandas’ general missing-value marker.
pd.NA is pandas’ general missing-value object, which is typically displayed as <NA>.
np.nan is commonly used for missing numeric values and is typically displayed as NaN.
pd.NaT is used for missing date or time values and is typically displayed as NaT.

isna()
The isna() method returns a Boolean Series in which True denotes that an observation’s value is missing.
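To see the three markers next to each other, here is a small sketch. The three Series are made up for illustration and are not part of emp; the point is only that isna() flags each kind of marker.

```python
import numpy as np
import pandas as pd

# Made-up Series, one per kind of missing-value marker
numbers = pd.Series([1.5, np.nan])                        # float64: displayed as NaN
dates = pd.Series(pd.to_datetime(['2026-04-13', None]))   # datetime64: displayed as NaT
nullable = pd.Series([1, pd.NA], dtype='Int64')           # nullable integer: displayed as <NA>

for s in (numbers, dates, nullable):
    print(s)
    print(s.isna().tolist())   # the second value is flagged as missing in every case
```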
notna()
The notna() method returns the inverse Series, one in which True indicates that an observation’s value is present.
Both isna() and notna() treat all common pandas missing-value markers as missing.
We use the tilde symbol (~) to invert a Boolean Series.
Q. How can we pull out employees with non-missing Team values?
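One possible way to approach the question above is sketched below. The staff DataFrame is a hypothetical stand-in for emp with only two columns, so the names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for emp, with only the columns needed here
staff = pd.DataFrame({
    'First Name': ['Ana', 'Ben', 'Cal'],
    'Team': ['Sales', np.nan, 'Legal'],
})

present = staff['Team'].notna()       # True where Team has a value
also_present = ~staff['Team'].isna()  # the tilde inverts isna(), giving the same result

has_team = staff[present]             # keep only observations with a Team value
print(has_team)
```

Either Boolean Series can be used as the filter; notna() simply avoids the extra inversion step.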
We can use isna().sum() on a Series to count its missing values.
True is treated like 1 and False like 0.

We can also use .value_counts() on a Series.
With dropna = False, pandas also reports the number of missing values.

np.where()
Note
emp['high_salary'] = np.where(emp['Salary'] >= 100000, True, False) treats NaN values in Salary as not meeting the condition, so they are labeled False.
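The counting idioms and the np.where() note above can be checked on a small example. The salary and team Series below are made up for illustration and only mimic the Salary and Team columns of emp.

```python
import numpy as np
import pandas as pd

# Made-up values that mimic the Salary and Team columns
salary = pd.Series([120000, np.nan, 85000, np.nan])
team = pd.Series(['Sales', np.nan, 'Sales', 'Legal'])

n_missing = salary.isna().sum()    # True counts as 1, False as 0
n_present = salary.notna().sum()

team_counts = team.value_counts(dropna=False)   # includes a row for NaN

# NaN >= 100000 evaluates to False, so missing salaries are labeled False
high_salary = np.where(salary >= 100000, True, False)

print(n_missing, n_present)
print(team_counts)
print(high_salary)
```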
Q. How can we set the high_salary column value to <NA> whenever the Salary column value is missing (NaN)?

The dropna() method removes observations that hold any missing values, such as NaN or NaT.
We can pass the how parameter an argument of "all" to remove observations in which all values are missing.
Note that the how parameter’s default argument is "any".
We can use the subset parameter to target observations with a missing value in a specific variable.
For example, subset can target observations with a missing value in the Gender variable.
We can also pass the subset parameter a list of variables.

Let’s do Questions 1-6 in Classwork 14!
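The how and subset options of dropna() described above can be compared on a small table. This is a sketch only: the staff DataFrame is hypothetical and much smaller than emp.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for emp; the third row is missing everywhere
staff = pd.DataFrame({
    'First Name': ['Ana', 'Ben', np.nan, 'Dee'],
    'Gender': ['Female', np.nan, np.nan, 'Female'],
    'Team': ['Sales', 'Legal', np.nan, np.nan],
})

any_missing_removed = staff.dropna()            # how="any" is the default
all_missing_removed = staff.dropna(how='all')   # drops only the all-missing row
gender_known = staff.dropna(subset=['Gender'])  # checks the Gender variable only
gender_team_known = staff.dropna(subset=['Gender', 'Team'])  # checks both variables

print(len(any_missing_removed), len(all_missing_removed),
      len(gender_known), len(gender_team_known))
```

None of these calls changes staff itself; each returns a new DataFrame.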
duplicated()
The duplicated() method returns a Boolean Series that identifies whether a value has already appeared earlier in the variable.
emp["Team"].duplicated(keep = "first")
emp["Team"].duplicated(keep = "last")
~emp["Team"].duplicated()
The duplicated() method’s keep parameter tells pandas which duplicate occurrence to keep (that is, to mark False).
'first' (default): Marks duplicates after the first as True
'last': Marks duplicates before the last as True
False: Marks all duplicates as True

Note
emp["Team"].duplicated() treats NaN values as duplicates of one another.
The first NaN is marked False, and subsequent NaN values are marked True (with keep="first"), because pandas considers NaN values as matching when checking for duplicates.

We can pass a subset argument to duplicated() so that pandas uses only selected columns to determine duplication.
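The keep options, the NaN behavior, and the subset argument can all be seen in one small sketch. The staff DataFrame is a hypothetical stand-in for emp with only the Gender and Team variables.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for emp with only two variables
staff = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male', 'Male', 'Female'],
    'Team': ['Sales', 'Sales', 'Sales', np.nan, np.nan],
})

first = staff['Team'].duplicated()             # keep="first" is the default
last = staff['Team'].duplicated(keep='last')   # the last occurrence is marked False
every = staff['Team'].duplicated(keep=False)   # every repeated value is marked True

# Only the combination of Gender and Team decides what counts as a duplicate
by_pair = staff.duplicated(subset=['Gender', 'Team'])

print(first.tolist())
print(last.tolist())
print(every.tolist())
print(by_pair.tolist())
```

In this example the second NaN in Team is marked True under the default, matching the note above.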
When only Team is selected, an observation is marked True when the value in Team has already appeared earlier.
When both Gender and Team are selected, the combination of values across the two variables identifies duplicates.
An observation is marked True only when the same combination of Gender and Team has appeared before.

drop_duplicates()
The drop_duplicates() method removes observations whose full rows are identical to rows that appeared earlier.
An example of the drop_duplicates() method:
# Sample DataFrame with duplicate observations
data = {
'Name': ['John', 'Anna', 'John', 'Mike', 'Anna'],
'Age': [28, 23, 28, 32, 23],
'City': ['New York', 'Paris', 'New York', 'London', 'Paris']
}
# pd.DataFrame( Series, List, or Dict ) creates a DataFrame
df = pd.DataFrame(data)
df_unique = df.drop_duplicates()

We can pass a subset argument to drop_duplicates() so that pandas uses only selected columns to determine uniqueness.
For example, subset can use only the Team variable.
It can also use a combination of the Gender and Team variables to identify duplicates.

keep Inside drop_duplicates()
emp.drop_duplicates(subset = ["Team"], keep = "last")
emp.drop_duplicates(subset = ["Team"], keep = False)
The drop_duplicates() method also accepts a keep parameter.
'first' (default): Keeps the first occurrence, removes the rest
'last': Keeps the last occurrence, removes the rest
False: Removes every occurrence of a duplicated value, keeping only entries that appear exactly once

Q. What does emp.drop_duplicates(subset = ["First Name"], keep = False) do?
Q. Find a subset of all employees with a First Name of “Douglas” and a Gender of “Male”. Then check which “Douglas” is in the DataFrame emp.drop_duplicates(subset = ["Gender", "Team"]).
Let’s do Questions 7-12 in Classwork 14!
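As a recap of the subset and keep options of drop_duplicates(), the sketch below reuses the sample Name/Age/City data from the earlier example rather than emp, so it does not answer the classwork questions. The variable names are placeholders.

```python
import pandas as pd

# Same sample data as the earlier drop_duplicates() example
df = pd.DataFrame({
    'Name': ['John', 'Anna', 'John', 'Mike', 'Anna'],
    'Age': [28, 23, 28, 32, 23],
    'City': ['New York', 'Paris', 'New York', 'London', 'Paris'],
})

full_rows = df.drop_duplicates()                                # compares whole rows
keep_last = df.drop_duplicates(subset=['Name'], keep='last')    # last row per Name
unique_only = df.drop_duplicates(subset=['Name'], keep=False)   # Names that appear once

print(full_rows.index.tolist())
print(keep_last.index.tolist())
print(unique_only.index.tolist())
```

The original index labels are preserved, which makes it easy to see which occurrence survived.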