Lecture 8

Pandas Fundamentals III: Data Types; Filtering by a Condition

Byeong-Hak Choe

SUNY Geneseo

April 6, 2026

📂 `emp` DataFrame

Let’s read employment.csv as emp.

import pandas as pd
import numpy as np

# Below is for an interactive display of DataFrame in Colab
from google.colab import data_table
data_table.enable_dataframe_formatter()

emp = pd.read_csv("https://bcdanl.github.io/data/employment.csv")

🔄 Converting Data Types with the `astype()` Method

🔍 Inspecting `Mgmt` Before Conversion

What values are in the Mgmt variable?

emp["Mgmt"].astype(bool)

The astype() method converts a Series’ values to a different data type.
- It can accept a single argument: the new data type.

✅ Converting `Mgmt` to Boolean

emp["Mgmt"] = emp["Mgmt"].astype(bool)

The above code overwrites the Mgmt variable with our new Series of Booleans.

⚠️ Why `Salary` Cannot Yet Become Integer

emp["Salary"].astype(int)

The above code tries to convert the Salary variable to integers with the astype() method.
- pandas cannot directly convert NaN values to integers.

🩹 Filling Missing Values with the `fillna()` Method

emp["Salary"].fillna(0)

The fillna() method replaces a Series’ missing values with a value that we specify.
In this example, we use 0 only to illustrate the method.
- In real analysis, filling missing values with 0 may distort the data.

🔁 Converting `Salary` After Filling Missing Values

emp["Salary"] = emp["Salary"].fillna(0).astype(int)

The above code overwrites the Salary variable with our new Series of integers.

🏷️ Converting `Gender` to Category

emp["Gender"] = emp["Gender"].astype("category")
emp['Gender'] = pd.Categorical(emp["Gender"])

pandas includes a special data type called category.
- It is useful when a variable has only a small number of distinct values relative to the total number of observations.
- Examples include gender, weekdays, blood types, planets, and income groups.

🏷️ Setting a category order with `categories` option

pd.Categorical(emp["Gender"], categories = ['Male', 'Female']).sort_values()
pd.Categorical(emp["Gender"], categories = ['Female', 'Male']).sort_values()

Without categories option, sorting is alphabetical (incorrect for many cases).
Use categories option to define an order of categories.

📅 Converting Data Types with the `pd.to_datetime()` Method

# Below two are equivalent:
emp["Start Date"] = pd.to_datetime(emp["Start Date"])
emp["Start Date"] = emp["Start Date"].astype('datetime64[ns]')

The pd.to_datetime() function converts a Series, a DataFrame, or a single column into a proper datetime format.

🧰 Converting Multiple Columns at Once

emp = pd.read_csv("https://bcdanl.github.io/data/employment.csv")
emp.info()

emp["Salary"] = emp["Salary"].fillna(0)
emp = emp.astype({'Mgmt': 'bool', 
                  'Salary': 'int',
                  'Gender': 'category',
                  'Start Date': 'datetime64[ns]',
                  'Team': 'category'})

We can provide a dictionary of variable-type pairs to astype().

🚀 Classwork 12: Pandas Fundamentals

Let’s do Questions 1-4 in Classwork 12!

🔍 Filtering by a Condition

🧭 Why Conditional Filtering Matters

We often do not know the exact index labels or positions of the observations we want to extract.
Instead, we can target observations by a Boolean condition.

⚖️ Boolean Conditions with Equality and Inequality

Here, both V1 and V2 are variables or values, and the comparisons are applied element-wise (vectorized).

For boolean conditions using inequalities, we focus on cases where V1 and V2 are int or float.

👤 Building One Condition: Name Equals “Donna”

emp["First Name"] == "Donna"

To compare every value in Series with a constant value, we place the Series on one side of the equality operator (==) and the value on the other.
- Series == value
The above example compares each First Name value with “Donna”.
- pandas performs a vectorized operation, meaning it checks every value in the Series.

📥 Using a Boolean Series to Filter Rows

emp[ emp["First Name"] == "Donna" ]

To filter observations, we provide the Boolean Series between square brackets following the DataFrame.
- DataFrame[ Boolean_Series ]

💾 Saving the Condition in an Object

donnas = emp["First Name"] == "Donna"
emp[ donnas ]

If the use of multiple square brackets is confusing, we can assign the Boolean Series to an object and then pass it into the square brackets instead.

🚫 Filtering Out the Marketing Team

What if we want to extract a subset of employees who are not on the “Marketing” team?

non_marketing = emp["Team"] != "Marketing"  # != means "not equal to"
emp[ non_marketing ]

True denotes that the Team value for a given index is not “Marketing”, and False indicates the Team value is “Marketing”

👔 Filtering Rows Where `Mgmt` Is True

What if we want to retrieve all the managers in the company?
- Managers have a value of True in the Mgmt variable.

emp[ emp["Mgmt"] ]

We could execute emp["Mgmt"] == True, but we do not need to.

💵 Filtering High Earners with a Numeric Rule

high_earners = emp["Salary"] > 100000
emp[ high_earners ]

We can also use arithmetic operands to filter observations based on mathematical conditions.

🔍 Boolean Operators — AND, OR, NOT

Here, both x and y are boolean conditions/variables.

What boolean operations (&, |, ~) do is combining logical variables/conditions, which returns a boolean variable when executed.

🎯 Venn Diagram for Boolean Logic

x and y are boolean conditions.
- If x is TRUE, it highlights the left circle.
- If y is TRUE, it highlights the right circle.

The shaded regions show which parts each boolean operator selects.

🔀 Combining Multiple Conditions with `|`

sales = emp["Team"] == "Sales"
legal = emp["Team"] == "Legal"
fnce = emp["Team"] == "Finance"
emp[ sales | legal | fnce ] # '|' means 'or'

We could provide three separate Boolean Series inside the square brackets and add the | symbol to declare OR criteria.
What if our next report asked for employees from 30 teams instead of three?

📋 Filtering with the `isin()` Method

star_teams = ["Sales", "Legal", "Finance"]
on_star_teams = emp["Team"].isin(star_teams)
emp[ on_star_teams ]

A cleaner solution is the isin() method, which accepts an iterable such as a list, tuple, array, or Series and returns a Boolean Series.

📏 Combining Lower and Upper Bounds with `&`

When working with numbers or dates, we often want to extract values that fall within a range.
- E.g., Identify all employees with a salary between $90,000 and $100,000.

higher_than_90k = emp["Salary"] >= 90000
lower_than_100k = emp["Salary"] < 100000
emp[ higher_than_90k & lower_than_100k ] # '&' means 'and'

We can create two Boolean Series, one to declare the lower bound and one to declare the upper bound.
Then we can use the & operator to mandate that both conditions are True.

💰 Filtering Salary Ranges with `between()`

between_90k_and_100k = emp["Salary"].between(90000, 100000)
emp[ between_90k_and_100k ]

A slightly cleaner solution is to use a method called between().
- It returns a Boolean Series where True denotes that an observation’s value falls between the specified interval.
- The first argument, the lower bound, is inclusive, and the second argument, the upper bound, is also inclusive.

🔤 Using `between()` on Strings

name_starts_with_t = emp["First Name"].between("T", "U")
emp[ name_starts_with_t ]

We can also apply the between() method to string variables.
- The first argument, the lower bound, is inclusive, and the second argument, the upper bound, is exclusive.

🔎 Filtering by a Condition with the `query()` method!

emp.query("Salary >= 100000 & Team == 'Finance'")
emp.query("Salary >= 100000 & `First Name` == 'Douglas'")

The query() method filters observations using a concise, string-based syntax.
- query() accepts a string that describes the filtering condition(s).
If a column name contains spaces, wrap it in backticks (`).

✅ ❌ Adding a Variable Based on a Condition with `np.where()`

We can use np.where() from NumPy to add a new variable to a DataFrame based on a condition.

import numpy as np

# Using np.where to add the 'pass_fail' column
emp['high_salary'] = np.where(emp['Salary'] >= 100000, 'Yes', 'No')

We add a new variable called high_salary:
- “Yes” if Salary is greater than or equal to 100,000.
- “No” otherwise.

Note

np.where(emp['Salary'] >= 100000, 'Yes', 'No') treats NaN values in Salary as not satisfying the condition, so they are labeled “No”.

🚀 Classwork 12: Pandas Fundamentals

Let’s do Questions 5-13 in Classwork 12!

Lecture 8

📂 emp DataFrame

🔄 Converting Data Types with the astype() Method

🔍 Inspecting Mgmt Before Conversion

✅ Converting Mgmt to Boolean

⚠️ Why Salary Cannot Yet Become Integer

🩹 Filling Missing Values with the fillna() Method

🔁 Converting Salary After Filling Missing Values

🏷️ Converting Gender to Category

🏷️ Setting a category order with categories option

📅 Converting Data Types with the pd.to_datetime() Method

🧰 Converting Multiple Columns at Once

🚀 Classwork 12: Pandas Fundamentals

🔍 Filtering by a Condition

🧭 Why Conditional Filtering Matters

⚖️ Boolean Conditions with Equality and Inequality

👤 Building One Condition: Name Equals “Donna”

📥 Using a Boolean Series to Filter Rows

💾 Saving the Condition in an Object

🚫 Filtering Out the Marketing Team

👔 Filtering Rows Where Mgmt Is True

💵 Filtering High Earners with a Numeric Rule

🔍 Boolean Operators — AND, OR, NOT

🎯 Venn Diagram for Boolean Logic

🔀 Combining Multiple Conditions with |

📋 Filtering with the isin() Method

📏 Combining Lower and Upper Bounds with &

💰 Filtering Salary Ranges with between()

🔤 Using between() on Strings

🔎 Filtering by a Condition with the query() method!

✅ ❌ Adding a Variable Based on a Condition with np.where()

🚀 Classwork 12: Pandas Fundamentals

📂 `emp` DataFrame

🔄 Converting Data Types with the `astype()` Method

🔍 Inspecting `Mgmt` Before Conversion

✅ Converting `Mgmt` to Boolean

⚠️ Why `Salary` Cannot Yet Become Integer

🩹 Filling Missing Values with the `fillna()` Method

🔁 Converting `Salary` After Filling Missing Values

🏷️ Converting `Gender` to Category

🏷️ Setting a category order with `categories` option

📅 Converting Data Types with the `pd.to_datetime()` Method

👔 Filtering Rows Where `Mgmt` Is True

🔀 Combining Multiple Conditions with `|`

📋 Filtering with the `isin()` Method

📏 Combining Lower and Upper Bounds with `&`

💰 Filtering Salary Ranges with `between()`

🔤 Using `between()` on Strings

🔎 Filtering by a Condition with the `query()` method!

✅ ❌ Adding a Variable Based on a Condition with `np.where()`