Lecture 8

Pandas Fundamentals III: Data Types; Filtering by a Condition

Byeong-Hak Choe

SUNY Geneseo

April 3, 2026

πŸ“‚ emp DataFrame

  • Let’s read employment.csv as emp.
import pandas as pd
import numpy as np

# Below is for an interactive display of DataFrame in Colab
from google.colab import data_table
data_table.enable_dataframe_formatter()

emp = pd.read_csv("https://bcdanl.github.io/data/employment.csv")

πŸ”„ Converting Data Types with the astype() Method

πŸ” Inspecting Mgmt Before Conversion

  • What values are in the Mgmt variable?
emp["Mgmt"].astype(bool)
  • The astype() method converts a Series’ values to a different data type.
    • It can accept a single argument: the new data type.

βœ… Converting Mgmt to Boolean

emp["Mgmt"] = emp["Mgmt"].astype(bool)
  • The above code overwrites the Mgmt variable with our new Series of Booleans.

⚠️ Why Salary Cannot Yet Become Integer

emp["Salary"].astype(int)
  • The above code tries to convert the Salary variable to integers with the astype() method.
    • pandas cannot directly convert NaN values to integers.

🩹 Filling Missing Values with the fillna() Method

emp["Salary"].fillna(0)
  • The fillna() method replaces a Series’ missing values with a value that we specify.
  • In this example, we use 0 only to illustrate the method.
    • In real analysis, filling missing values with 0 may distort the data.

πŸ” Converting Salary After Filling Missing Values

emp["Salary"] = emp["Salary"].fillna(0).astype(int)
  • The above code overwrites the Salary variable with our new Series of integers.

🏷️ Converting Gender to Category

emp["Gender"] = emp["Gender"].astype("category")
  • pandas includes a special data type called category.
    • It is useful when a variable has only a small number of distinct values relative to the total number of observations.
    • Examples include gender, weekdays, blood types, planets, and income groups.

πŸ“… Converting Data Types with the pd.to_datetime() Method

# Below two are equivalent:
emp["Start Date"] = pd.to_datetime(emp["Start Date"])
emp["Start Date"] = emp["Start Date"].astype('datetime64[ns]')
  • The pd.to_datetime() function converts a Series, a DataFrame, or a single column into a proper datetime format.

🧰 Converting Multiple Columns at Once

emp = pd.read_csv("https://bcdanl.github.io/data/employment.csv")
emp.info()

emp["Salary"] = emp["Salary"].fillna(0)
emp = emp.astype({'Mgmt': 'bool', 
                  'Salary': 'int',
                  'Gender': 'category',
                  'Start Date': 'datetime64[ns]',
                  'Team': 'category'})
  • We can provide a dictionary of variable-type pairs to astype().

πŸ” Filtering by a Condition

🧭 Why Conditional Filtering Matters

  • We often do not know the exact index labels or positions of the observations we want to extract.

  • Instead, we can target observations by a Boolean condition.

πŸ‘€ Building One Condition: Name Equals β€œDonna”

emp["First Name"] == "Donna"
  • To compare every value in Series with a constant value, we place the Series on one side of the equality operator (==) and the value on the other.
    • Series == value
  • The above example compares each First Name value with β€œDonna”.
    • pandas performs a vectorized operation, meaning it checks every value in the Series.

πŸ“₯ Using a Boolean Series to Filter Rows

emp[ emp["First Name"] == "Donna" ]
  • To filter observations, we provide the Boolean Series between square brackets following the DataFrame.
    • DataFrame[ Boolean_Series ]

πŸ’Ύ Saving the Condition in an Object

donnas = emp["First Name"] == "Donna"
emp[ donnas ]
  • If the use of multiple square brackets is confusing, we can assign the Boolean Series to an object and then pass it into the square brackets instead.

🚫 Filtering Out the Marketing Team

  • What if we want to extract a subset of employees who are not on the β€œMarketing” team?
non_marketing = emp["Team"] != "Marketing"  # != means "not equal to"
emp[ non_marketing ]
  • True denotes that the Team value for a given index is not β€œMarketing”, and False indicates the Team value is β€œMarketing”

πŸ‘” Filtering Rows Where Mgmt Is True

  • What if we want to retrieve all the managers in the company?
    • Managers have a value of True in the Mgmt variable.
emp[ emp["Mgmt"] ]
  • We could execute emp["Mgmt"] == True, but we do not need to.

πŸ’΅ Filtering High Earners with a Numeric Rule

high_earners = emp["Salary"] > 100000
emp[ high_earners ]
  • We can also use arithmetic operands to filter observations based on mathematical conditions.

πŸ”€ Combining Multiple Conditions with |

sales = emp["Team"] == "Sales"
legal = emp["Team"] == "Legal"
fnce = emp["Team"] == "Finance"
emp[ sales | legal | fnce ] # '|' means 'or'
  • We could provide three separate Boolean Series inside the square brackets and add the | symbol to declare OR criteria.

  • What if our next report asked for employees from 30 teams instead of three?

πŸ“‹ Filtering with the isin() Method

star_teams = ["Sales", "Legal", "Finance"]
on_star_teams = emp["Team"].isin(star_teams)
emp[ on_star_teams ]
  • A cleaner solution is the isin() method, which accepts an iterable such as a list, tuple, array, or Series and returns a Boolean Series.

πŸ“ Combining Lower and Upper Bounds with &

  • When working with numbers or dates, we often want to extract values that fall within a range.
    • E.g., Identify all employees with a salary between $90,000 and $100,000.
higher_than_90k = emp["Salary"] >= 90000
lower_than_100k = emp["Salary"] < 100000
emp[ higher_than_90k & lower_than_100k ] # '&' means 'and'
  • We can create two Boolean Series, one to declare the lower bound and one to declare the upper bound.

  • Then we can use the & operator to mandate that both conditions are True.

πŸ’° Filtering Salary Ranges with between()

between_90k_and_100k = emp["Salary"].between(90000, 100000)
emp[ between_90k_and_100k ]
  • A slightly cleaner solution is to use a method called between().
    • It returns a Boolean Series where True denotes that an observation’s value falls between the specified interval.
    • The first argument, the lower bound, is inclusive, and the second argument, the upper bound, is also inclusive.

πŸ”€ Using between() on Strings

name_starts_with_t = emp["First Name"].between("T", "U")
emp[ name_starts_with_t ]
  • We can also apply the between() method to string variables.
    • The first argument, the lower bound, is inclusive, and the second argument, the upper bound, is exclusive.

πŸ”Ž Filtering by a Condition with the query() method!

emp.query("Salary >= 100000 & Team == 'Finance'")
emp.query("Salary >= 100000 & `First Name` == 'Douglas'")
  • The query() method filters observations using a concise, string-based syntax.
    • query() accepts a string that describes the filtering condition(s).
  • If a column name contains spaces, wrap it in backticks (`).

βœ… ❌ Adding a Variable Based on a Condition with np.where()

  • We can use np.where from NumPy to add a new variable to a DataFrame based on a condition.
import numpy as np

# Using np.where to add the 'pass_fail' column
emp['high_salary'] = np.where(emp['Salary'] >= 100000, 'Yes', 'No')
  • We add a new variable called high_salary:
    • β€œYes” if Salary is greater than or equal to 100,000.
    • β€œNo” otherwise.

πŸš€ Classwork 12: Pandas Fundamentals

Let’s do Classwork 12!