Lecture 9

Pandas Fundamentals IV: Counting Options; Missing Values; Duplicate Values

Byeong-Hak Choe

SUNY Geneseo

April 13, 2026

#️⃣🔢 Useful options for `.value_counts()`

A DataFrame for `.value_counts()` examples

import pandas as pd
import numpy as np

# Below is for an interactive display of DataFrame in Colab
from google.colab import data_table
data_table.enable_dataframe_formatter()

ncaa = pd.DataFrame({
    'Team': ['Lakers', 'Celtics', 'Lakers', 'Bulls', 'Heat', 'Bulls', 'Lakers', 'Heat', np.nan, 'Celtics'],
    'Position': ['PG', 'F', 'G', 'C', 'G', 'F', 'F', 'C', 'G', 'F'],
    'Age': [25, 29, 25, 31, 27, 22, 30, 27, 24, 29],
    'College': ['Duke', 'Kentucky', np.nan, 'UCLA', 'Duke', 'Gonzaga', 'Arizona', np.nan, 'Duke', 'Kentucky']
})

1. Include missing values with `dropna = False`

ncaa['College'].value_counts(dropna = False)

By default, missing values are excluded.
Setting dropna = False lets us count NaN values too.

2. Show proportions with `normalize = True`

ncaa['Team'].value_counts(normalize = True)

Instead of raw counts, this gives relative frequencies.
Each value shows the proportion of the total.

3. Keep the original order with `sort = False`

ncaa['Position'].value_counts(sort = False)

By default, .value_counts() sorts from the most frequent to the least frequent.
sort = False keeps the categories in the order they first appear.

4. Sort in ascending order with `ascending = True`

ncaa['Team'].value_counts(ascending = True)

This sorts counts from the smallest to the largest.
Helpful when we want to spot rare categories first.

5. Bin numeric values with `bins =`

ncaa['Age'].value_counts(bins = 4)

For numeric variables, we can group values into intervals.
This gives frequency counts by ranges instead of exact values.

6. Count combinations across multiple columns

ncaa[['Team', 'Position']].value_counts()

When we pass multiple columns, pandas counts unique combinations.
This is useful for two-way category summaries.

7. Normalize multi-column combinations

ncaa[['Team', 'Position']].value_counts(normalize = True)

This gives the proportion for each combination instead of the raw count.
Useful when comparing how common each pair is overall.

🚀 Classwork 13: Pandas Fundamentals

Let’s do Classwork 13!

📂 `emp` DataFrame

Let’s read employment.csv as emp.

import pandas as pd
import numpy as np

# Below is for an interactive display of DataFrame in Colab
from google.colab import data_table
data_table.enable_dataframe_formatter()

emp = pd.read_csv("https://bcdanl.github.io/data/employment.csv")

🕳️ Dealing with Missing Values

📂 Reloading `emp` for Missing-Value Examples

Let’s read employment.csv as emp.

import pandas as pd
# Below is for an interactive display of DataFrame in Colab
from google.colab import data_table
data_table.enable_dataframe_formatter()

emp = pd.read_csv("https://bcdanl.github.io/data/employment.csv")

❓ What `<NA>`, `NaN`, and `NaT` Mean in pandas

In a pandas DataFrame, special markers are used to represent missing values.
- NaN indicates a missing numeric value.
- NaT indicates a missing date or time value.
- <NA> is pandas’ general missing-value marker.
More specifically:
- pd.NA is pandas’ general missing-value object, which is typically displayed as <NA>.
- np.nan is commonly used for missing numeric values and is typically displayed as NaN.
- pd.NaT is used for missing date or time values and is typically displayed as NaT.

🔎 Detecting Missing Values with `isna()`

emp["Team"].isna()
emp["Start Date"].isna()

The isna() method returns a Boolean Series in which True denotes that an observation’s value is missing.
- Is a value of a variable “XYZ” missing?

✅ Detecting Non-Missing Values with `notna()`

# Below two are equivalent.
emp["Team"].notna()
~emp["Team"].isna()

Both isna() and notna() treat all common pandas missing-value markers as missing.
The notna() method returns the inverse Series, one in which True indicates that an observation’s value is present.
We use the tilde symbol (~) to invert a Boolean Series.
Q. How can we pull out employees with non-missing Team values?

🔢 Counting Missing Values in Different Ways

emp["Mgmt"].isna().sum()
emp["Mgmt"].value_counts()
emp["Mgmt"].value_counts(dropna = False)

One way to count missing values is to use isna().sum() on a Series.
- Here, True is treated like 1 and False like 0.
Another way is to use .value_counts() on a Series.
- With dropna = False, pandas also reports the number of missing values.

🤔 Dealing with Missing Values in a Boolean Condition with `np.where()`

Note

emp['high_salary'] = np.where(emp['Salary'] >= 100000, True, False) treats NaN values in Salary as not meeting the condition, so they are labeled False.

Question: How can we set the high_salary column value to <NA> whenever the Salary column value is missing (NaN)?

🗑️ Removing Rows with Any Missing Values

emp.dropna()

The dropna() method removes observations that hold any NaN or NaT values.

🧹 Removing Rows Only When Everything Is Missing

emp.dropna(how = "all")

We can pass the how parameter an argument of "all" to remove observations in which all values are missing.
Note that the how parameter’s default argument is "any".

🎯 Dropping Rows Based on One Key Column

emp.dropna(subset = ["Gender"])

We can use the subset parameter to target observations with a missing value in a specific variable.
- The above example removes observations that have a missing value in the Gender variable.

🎯 Dropping Rows Based on Multiple Key Columns

emp.dropna(subset = ["Start Date", "Salary"])

We can also pass the subset parameter a list of variables.

🚀 Classwork 14: Pandas Fundamentals

Let’s do Questions 1-6 in Classwork 14!

👯 Dealing with Duplicates

🪞 Spotting Repeated Values with `duplicated()`

Missing values are a common occurrence in messy data sets, and so are duplicate values.

emp["Team"].duplicated()

The duplicated() method returns a Boolean Series that identifies whether a value has already appeared earlier in the variable.

🔁 Controlling Which Duplicate to Keep

emp["Team"].duplicated(keep = "first")
emp["Team"].duplicated(keep = "last")
~emp["Team"].duplicated()

The duplicated() method’s keep parameter informs pandas which duplicate occurrence to keep.
- 'first' (default): Marks duplicates after the first as True
- 'last': Marks duplicates before the last as True
- False: Marks all duplicates as True

Note

emp["Team"].duplicated() treats NaN values as duplicate-able.

The first NaN is marked False, and subsequent NaN values are marked True (with keep="first"), because pandas considers NaN values as matching when checking for duplicates.

📌 Detecting Duplicates Using One Column

emp.duplicated(subset = ["Team"])

We can pass a subset argument to duplicated() so that pandas uses only selected columns to determine duplication.
- The above example marks rows as True when the value in Team has already appeared earlier.

🧩 Detecting Duplicates Using a Column Combination

emp.duplicated(subset = ["Gender", "Team"])

The above example uses a combination of values across the Gender and Team variables to identify duplicates.
- A row is marked as True only when the same combination of Gender and Team has appeared before.

✂️ Removing Fully Duplicated Rows

emp.drop_duplicates()

The drop_duplicates() method removes observations whose full rows are identical to rows that appeared earlier.

🧪 Small Example of `drop_duplicates()`

Below is an example of the drop_duplicates() method:

# Sample DataFrame with duplicate observations
data = {
    'Name': ['John', 'Anna', 'John', 'Mike', 'Anna'],
    'Age': [28, 23, 28, 32, 23],
    'City': ['New York', 'Paris', 'New York', 'London', 'Paris']
}

# pd.DataFrame( Series, List, or Dict ) creates a DataFrame
df = pd.DataFrame(data)  
df_unique = df.drop_duplicates()

📌 Dropping Duplicates Using One Column

emp.drop_duplicates(subset = ["Team"])

We can pass a subset argument to drop_duplicates() so that pandas uses only selected columns to determine uniqueness.
- The above example keeps the first occurrence of each unique value in the Team variable.

🧩 Dropping Duplicates Using a Column Combination

emp.drop_duplicates(subset = ["Gender", "Team"])

The above example uses a combination of values across the Gender and Team variables to identify duplicates.

⚙️ Using `keep` Inside `drop_duplicates()`

emp.drop_duplicates(subset = ["Team"], keep = "last")
emp.drop_duplicates(subset = ["Team"], keep = False)

The drop_duplicates() method also accepts a keep parameter.
- 'first' (default): Keeps the first occurrence, removes the rest
- 'last': Keeps the last occurrence, removes the rest
- False: Removes all duplicates, keeping only unique entries
Q. What does emp.drop_duplicates(subset = ["First Name"], keep = False) do?
Q. Find a subset of all employees with a First Name of “Douglas” and a Gender of “Male”. Then check which “Douglas” is in the DataFrame emp.drop_duplicates(subset = ["Gender", "Team"]).

🚀 Classwork 14: Pandas Fundamentals

Let’s do Questions 7-12 in Classwork 14!

Lecture 9

#️⃣🔢 Useful options for .value_counts()

A DataFrame for .value_counts() examples

1. Include missing values with dropna = False

2. Show proportions with normalize = True

3. Keep the original order with sort = False

4. Sort in ascending order with ascending = True

5. Bin numeric values with bins =

6. Count combinations across multiple columns

7. Normalize multi-column combinations

🚀 Classwork 13: Pandas Fundamentals

📂 emp DataFrame

🕳️ Dealing with Missing Values

📂 Reloading emp for Missing-Value Examples

❓ What <NA>, NaN, and NaT Mean in pandas

🔎 Detecting Missing Values with isna()

✅ Detecting Non-Missing Values with notna()

🔢 Counting Missing Values in Different Ways

🤔 Dealing with Missing Values in a Boolean Condition with np.where()

🗑️ Removing Rows with Any Missing Values

🧹 Removing Rows Only When Everything Is Missing

🎯 Dropping Rows Based on One Key Column

🎯 Dropping Rows Based on Multiple Key Columns

🚀 Classwork 14: Pandas Fundamentals

👯 Dealing with Duplicates

🪞 Spotting Repeated Values with duplicated()

🔁 Controlling Which Duplicate to Keep

📌 Detecting Duplicates Using One Column

🧩 Detecting Duplicates Using a Column Combination

✂️ Removing Fully Duplicated Rows

🧪 Small Example of drop_duplicates()

📌 Dropping Duplicates Using One Column

🧩 Dropping Duplicates Using a Column Combination

⚙️ Using keep Inside drop_duplicates()

🚀 Classwork 14: Pandas Fundamentals

#️⃣🔢 Useful options for `.value_counts()`

A DataFrame for `.value_counts()` examples

1. Include missing values with `dropna = False`

2. Show proportions with `normalize = True`

3. Keep the original order with `sort = False`

4. Sort in ascending order with `ascending = True`

5. Bin numeric values with `bins =`

📂 `emp` DataFrame

📂 Reloading `emp` for Missing-Value Examples

❓ What `<NA>`, `NaN`, and `NaT` Mean in pandas

🔎 Detecting Missing Values with `isna()`

✅ Detecting Non-Missing Values with `notna()`

🤔 Dealing with Missing Values in a Boolean Condition with `np.where()`

🪞 Spotting Repeated Values with `duplicated()`

🧪 Small Example of `drop_duplicates()`

⚙️ Using `keep` Inside `drop_duplicates()`