Lecture 10

pandas Basics - Filtering by a Condition

Byeong-Hak Choe

SUNY Geneseo

February 19, 2025

Employment Data

  • Let’s read employment.csv as emp.
import pandas as pd
# Below is for an interactive display of DataFrame in Colab
from google.colab import data_table
data_table.enable_dataframe_formatter()

emp = pd.read_csv("https://bcdanl.github.io/data/employment.csv")

Filtering by a Condition

Filtering by a Condition

  • We may often not know the index labels and positions of the observations we want to target.

  • We may want to target observations not by an index label but by a Boolean condition.

Filtering by a Single Condition

emp["First Name"] == "Donna"
  • To compare every value in Series with a constant value, we place the Series on one side of the equality operator (==) and the value on the other.
    • Series == value
  • The above example compares each First Name value with “Donna”.
    • pandas performs a vectorized operation (element-by-element operation) on Series.

Filtering by a Single Condition

emp[ emp["First Name"] == "Donna" ]
  • To filter observations, we provide the Boolean Series between square brackets following the DataFrame.
    • DataFrame[ Boolean_Series ]

Filtering by a Single Condition

donnas = emp["First Name"] == "Donna"
emp[ donnas ]
  • If the use of multiple square brackets is confusing, we can assign the Boolean Series to an object and then pass it into the square brackets instead.

Filtering by a Single Condition

  • What if we want to extract a subset of employees who are not on the “Marketing” team?
non_marketing = emp["Team"] != "Marketing"  # != means "not equal to"
emp[ non_marketing ]
  • True denotes that the Team value for a given index is not “Marketing”, and False indicates the Team value is “Marketing

Filtering by a Single Condition

  • What if we want to retrieve all the managers in the company?
    • Managers have a value of True in the Mgmt variable.
emp[ emp["Mgmt"] ]
  • We could execute emp["Mgmt"] == True, but we do not need to.

Filtering by a Single Condition

high_earners = emp["Salary"] > 100000
emp[ high_earners ]
  • We can also use arithmetic operands to filter observations based on mathematical conditions.

Filtering by a Condition

sales = emp["Team"] == "Sales"
legal = emp["Team"] == "Legal"
fnce = emp["Team"] == "Finance"
emp[ sales | legal | fnce ] # '|' is 'or' opeartor
  • We could provide three separate Boolean Series inside the square brackets and add the | symbol to declare OR criteria.

  • What if our next report asked for employees from 30 teams instead of three?

Filtering with the isin() method

star_teams = ["Sales", "Legal", "Finance"]
on_star_teams = emp["Team"].isin(star_teams)
emp[ on_star_teams ]
  • A better solution is the isin() method, which accepts an iterable (e.g., list, tuple, array, Series) and returns a Boolean Series.

Filtering by a Condition

  • When working with numbers or dates, we often want to extract values that fall within a range.
    • E.g., Identify all employees with a salary between $90,000 and $100,000.
higher_than_90k = emp["Salary"] >= 90000
lower_than_100k = emp["Salary"] < 100000
emp[ higher_than_90k & lower_than_100k ] # '&' is 'and' opeartor
  • We can create two Boolean Series, one to declare the lower bound and one to declare the upper bound.

  • Then we can use the & operator to mandate that both conditions are True.

Filtering with the between() method

between_90k_and_100k = emp["Salary"].between(90000, 100000)
emp[ between_90k_and_100k ]
  • A slightly cleaner solution is to use a method called between().
    • It returns a Boolean Series where True denotes that an observation’s value falls between the specified interval.
    • The first argument, the lower bound, is inclusive, and the second argument, the upper bound, is also inclusive.

Filtering with the between() method

name_starts_with_t = emp["First Name"].between("T", "U")
emp[ name_starts_with_t ]
  • We can also apply the between() method to string variables.
    • The first argument, the lower bound, is inclusive, and the second argument, the upper bound, is exclusive.

Filtering by a Condition with the query() method!

emp.query("Salary >= 100000 & Team == 'Finance'")
emp.query("Salary >= 100000 & `First Name` == 'Douglas'")
  • The query() method filters observations using a concise, string-based query syntax.
    • query() accepts a string value that describes filtering conditions.
  • When using the query() method, if we have variable names with spaces, we can wrap the variable names in backtick (`)

Adding a Variable based on a Condition using np.where()

  • We can use np.where from NumPy to add a new variable to a DataFrame based on a condition.
import numpy as np

# Using np.where to add the 'pass_fail' column
emp['high_salary'] = np.where(emp['Salary'] >= 100000, 'Yes', 'No')
  • We want to add a new variable high_salary:
    • “Yes” if score is greater than or equal to 100,000.
    • “No” otherwise.

Pandas Basics

Let’s do Questions 2-6 in Classwork 6!