Lecture 9

Pandas Fundamentals IV: Counting Options; Missing Values; Duplicate Values

Byeong-Hak Choe

SUNY Geneseo

April 13, 2026

#️⃣🔢 Useful options for .value_counts()

A DataFrame for .value_counts() examples

import pandas as pd
import numpy as np

# Below is for an interactive display of DataFrame in Colab
from google.colab import data_table
data_table.enable_dataframe_formatter()

ncaa = pd.DataFrame({
    'Team': ['Lakers', 'Celtics', 'Lakers', 'Bulls', 'Heat', 'Bulls', 'Lakers', 'Heat', np.nan, 'Celtics'],
    'Position': ['PG', 'F', 'G', 'C', 'G', 'F', 'F', 'C', 'G', 'F'],
    'Age': [25, 29, 25, 31, 27, 22, 30, 27, 24, 29],
    'College': ['Duke', 'Kentucky', np.nan, 'UCLA', 'Duke', 'Gonzaga', 'Arizona', np.nan, 'Duke', 'Kentucky']
})

1. Include missing values with dropna = False

ncaa['College'].value_counts(dropna = False)
  • By default, missing values are excluded.
  • Setting dropna = False lets us count NaN values too.

2. Show proportions with normalize = True

ncaa['Team'].value_counts(normalize = True)
  • Instead of raw counts, this gives relative frequencies.
  • Each value shows the proportion of the total.

3. Keep the original order with sort = False

ncaa['Position'].value_counts(sort = False)
  • By default, .value_counts() sorts from the most frequent to the least frequent.
  • sort = False keeps the categories in the order they first appear.

4. Sort in ascending order with ascending = True

ncaa['Team'].value_counts(ascending = True)
  • This sorts counts from the smallest to the largest.
  • Helpful when we want to spot rare categories first.

5. Bin numeric values with bins =

ncaa['Age'].value_counts(bins = 4)
  • For numeric variables, we can group values into intervals.
  • This gives frequency counts by ranges instead of exact values.

6. Count combinations across multiple columns

ncaa[['Team', 'Position']].value_counts()
  • When we pass multiple columns, pandas counts unique combinations.
  • This is useful for two-way category summaries.

7. Normalize multi-column combinations

ncaa[['Team', 'Position']].value_counts(normalize = True)
  • This gives the proportion for each combination instead of the raw count.
  • Useful when comparing how common each pair is overall.

🚀 Classwork 13: Pandas Fundamentals

Let’s do Classwork 13!

📂 emp DataFrame

  • Let’s read employment.csv as emp.
import pandas as pd
import numpy as np

# Below is for an interactive display of DataFrame in Colab
from google.colab import data_table
data_table.enable_dataframe_formatter()

emp = pd.read_csv("https://bcdanl.github.io/data/employment.csv")

🕳️ Dealing with Missing Values

📂 Reloading emp for Missing-Value Examples

  • Let’s read employment.csv as emp.
import pandas as pd
# Below is for an interactive display of DataFrame in Colab
from google.colab import data_table
data_table.enable_dataframe_formatter()

emp = pd.read_csv("https://bcdanl.github.io/data/employment.csv")

❓ What <NA>, NaN, and NaT Mean in pandas

  • In a pandas DataFrame, special markers are used to represent missing values.
    • NaN indicates a missing numeric value.
    • NaT indicates a missing date or time value.
    • <NA> is pandas’ general missing-value marker.
  • More specifically:
    • pd.NA is pandas’ general missing-value object, which is typically displayed as <NA>.
    • np.nan is commonly used for missing numeric values and is typically displayed as NaN.
    • pd.NaT is used for missing date or time values and is typically displayed as NaT.

🔎 Detecting Missing Values with isna()

emp["Team"].isna()
emp["Start Date"].isna()
  • The isna() method returns a Boolean Series in which True denotes that an observation’s value is missing.
    • Is a value of a variable “XYZ” missing?

✅ Detecting Non-Missing Values with notna()

# Below two are equivalent.
emp["Team"].notna()
~emp["Team"].isna()
  • Both isna() and notna() treat all common pandas missing-value markers as missing.

  • The notna() method returns the inverse Series, one in which True indicates that an observation’s value is present.

  • We use the tilde symbol (~) to invert a Boolean Series.

  • Q. How can we pull out employees with non-missing Team values?

🔢 Counting Missing Values in Different Ways

emp["Mgmt"].isna().sum()
emp["Mgmt"].value_counts()
emp["Mgmt"].value_counts(dropna = False)
  • One way to count missing values is to use isna().sum() on a Series.
    • Here, True is treated like 1 and False like 0.
  • Another way is to use .value_counts() on a Series.
    • With dropna = False, pandas also reports the number of missing values.

🤔 Dealing with Missing Values in a Boolean Condition with np.where()

Note

emp['high_salary'] = np.where(emp['Salary'] >= 100000, True, False) treats NaN values in Salary as not meeting the condition, so they are labeled False.

  • Question: How can we set the high_salary column value to <NA> whenever the Salary column value is missing (NaN)?

🗑️ Removing Rows with Any Missing Values

emp.dropna()
  • The dropna() method removes observations that hold any NaN or NaT values.

🧹 Removing Rows Only When Everything Is Missing

emp.dropna(how = "all")
  • We can pass the how parameter an argument of "all" to remove observations in which all values are missing.

  • Note that the how parameter’s default argument is "any".

🎯 Dropping Rows Based on One Key Column

emp.dropna(subset = ["Gender"])
  • We can use the subset parameter to target observations with a missing value in a specific variable.
    • The above example removes observations that have a missing value in the Gender variable.

🎯 Dropping Rows Based on Multiple Key Columns

emp.dropna(subset = ["Start Date", "Salary"])
  • We can also pass the subset parameter a list of variables.

🚀 Classwork 14: Pandas Fundamentals

Let’s do Questions 1-6 in Classwork 14!

👯 Dealing with Duplicates

🪞 Spotting Repeated Values with duplicated()

  • Missing values are a common occurrence in messy data sets, and so are duplicate values.
emp["Team"].duplicated()
  • The duplicated() method returns a Boolean Series that identifies whether a value has already appeared earlier in the variable.

🔁 Controlling Which Duplicate to Keep

emp["Team"].duplicated(keep = "first")
emp["Team"].duplicated(keep = "last")
~emp["Team"].duplicated()
  • The duplicated() method’s keep parameter informs pandas which duplicate occurrence to keep.
    • 'first' (default): Marks duplicates after the first as True
    • 'last': Marks duplicates before the last as True
    • False: Marks all duplicates as True

Note

emp["Team"].duplicated() treats NaN values as duplicate-able.

  • The first NaN is marked False, and subsequent NaN values are marked True (with keep="first"), because pandas considers NaN values as matching when checking for duplicates.

📌 Detecting Duplicates Using One Column

emp.duplicated(subset = ["Team"])
  • We can pass a subset argument to duplicated() so that pandas uses only selected columns to determine duplication.

    • The above example marks rows as True when the value in Team has already appeared earlier.

🧩 Detecting Duplicates Using a Column Combination

emp.duplicated(subset = ["Gender", "Team"])
  • The above example uses a combination of values across the Gender and Team variables to identify duplicates.

    • A row is marked as True only when the same combination of Gender and Team has appeared before.

✂️ Removing Fully Duplicated Rows

emp.drop_duplicates()
  • The drop_duplicates() method removes observations whose full rows are identical to rows that appeared earlier.

🧪 Small Example of drop_duplicates()

  • Below is an example of the drop_duplicates() method:
# Sample DataFrame with duplicate observations
data = {
    'Name': ['John', 'Anna', 'John', 'Mike', 'Anna'],
    'Age': [28, 23, 28, 32, 23],
    'City': ['New York', 'Paris', 'New York', 'London', 'Paris']
}

# pd.DataFrame( Series, List, or Dict ) creates a DataFrame
df = pd.DataFrame(data)  
df_unique = df.drop_duplicates()

📌 Dropping Duplicates Using One Column

emp.drop_duplicates(subset = ["Team"])
  • We can pass a subset argument to drop_duplicates() so that pandas uses only selected columns to determine uniqueness.

    • The above example keeps the first occurrence of each unique value in the Team variable.

🧩 Dropping Duplicates Using a Column Combination

emp.drop_duplicates(subset = ["Gender", "Team"])
  • The above example uses a combination of values across the Gender and Team variables to identify duplicates.

⚙️ Using keep Inside drop_duplicates()

emp.drop_duplicates(subset = ["Team"], keep = "last")
emp.drop_duplicates(subset = ["Team"], keep = False)
  • The drop_duplicates() method also accepts a keep parameter.

    • 'first' (default): Keeps the first occurrence, removes the rest
    • 'last': Keeps the last occurrence, removes the rest
    • False: Removes all duplicates, keeping only unique entries
  • Q. What does emp.drop_duplicates(subset = ["First Name"], keep = False) do?

  • Q. Find a subset of all employees with a First Name of “Douglas” and a Gender of “Male”. Then check which “Douglas” is in the DataFrame emp.drop_duplicates(subset = ["Gender", "Team"]).

🚀 Classwork 14: Pandas Fundamentals

Let’s do Questions 7-12 in Classwork 14!