Pandas Fundamentals III: Data Types; Filtering by a Condition
April 6, 2026
emp DataFrameastype() MethodMgmt Before ConversionMgmt variable?astype() method converts a Seriesβ values to a different data type.
Mgmt to BooleanMgmt variable with our new Series of Booleans.Salary Cannot Yet Become IntegerSalary variable to integers with the astype() method.
NaN values to integers.fillna() Methodfillna() method replaces a Seriesβ missing values with a value that we specify.0 only to illustrate the method.
0 may distort the data.Salary After Filling Missing ValuesSalary variable with our new Series of integers.Gender to Categorycategory.
categories optionpd.Categorical(emp["Gender"], categories = ['Male', 'Female']).sort_values()
pd.Categorical(emp["Gender"], categories = ['Female', 'Male']).sort_values()categories option, sorting is alphabetical (incorrect for many cases).categories option to define an order of categories.pd.to_datetime() Method# Below two are equivalent:
emp["Start Date"] = pd.to_datetime(emp["Start Date"])
emp["Start Date"] = emp["Start Date"].astype('datetime64[ns]')pd.to_datetime() function converts a Series, a DataFrame, or a single column into a proper datetime format.emp = pd.read_csv("https://bcdanl.github.io/data/employment.csv")
emp.info()
emp["Salary"] = emp["Salary"].fillna(0)
emp = emp.astype({'Mgmt': 'bool',
'Salary': 'int',
'Gender': 'category',
'Start Date': 'datetime64[ns]',
'Team': 'category'})astype().Letβs do Questions 1-4 in Classwork 12!
We often do not know the exact index labels or positions of the observations we want to extract.
Instead, we can target observations by a Boolean condition.
V1 and V2 are variables or values, and the comparisons are applied element-wise (vectorized).V1 and V2 are int or float.Series with a constant value, we place the Series on one side of the equality operator (==) and the value on the other.
Series == valueFirst Name value with βDonnaβ.
Series.Series between square brackets following the DataFrame.
DataFrame[ Boolean_Series ]Series to an object and then pass it into the square brackets instead.Marketingβ team?True denotes that the Team value for a given index is not βMarketingβ, and False indicates the Team value is βMarketingβMgmt Is TrueTrue in the Mgmt variable.emp["Mgmt"] == True, but we do not need to.x and y are boolean conditions/variables.&, |, ~) do is combining logical variables/conditions, which returns a boolean variable when executed.x and y are boolean conditions.
x is TRUE, it highlights the left circle.y is TRUE, it highlights the right circle.
|sales = emp["Team"] == "Sales"
legal = emp["Team"] == "Legal"
fnce = emp["Team"] == "Finance"
emp[ sales | legal | fnce ] # '|' means 'or'We could provide three separate Boolean Series inside the square brackets and add the | symbol to declare OR criteria.
What if our next report asked for employees from 30 teams instead of three?
isin() Methodstar_teams = ["Sales", "Legal", "Finance"]
on_star_teams = emp["Team"].isin(star_teams)
emp[ on_star_teams ]isin() method, which accepts an iterable such as a list, tuple, array, or Series and returns a Boolean Series.&higher_than_90k = emp["Salary"] >= 90000
lower_than_100k = emp["Salary"] < 100000
emp[ higher_than_90k & lower_than_100k ] # '&' means 'and'We can create two Boolean Series, one to declare the lower bound and one to declare the upper bound.
Then we can use the & operator to mandate that both conditions are True.
between()between().
True denotes that an observationβs value falls between the specified interval.between() on Stringsbetween() method to string variables.
query() method!emp.query("Salary >= 100000 & Team == 'Finance'")
emp.query("Salary >= 100000 & `First Name` == 'Douglas'")query() method filters observations using a concise, string-based syntax.
query() accepts a string that describes the filtering condition(s).`).np.where()np.where() from NumPy to add a new variable to a DataFrame based on a condition.import numpy as np
# Using np.where to add the 'pass_fail' column
emp['high_salary'] = np.where(emp['Salary'] >= 100000, 'Yes', 'No')high_salary:
Salary is greater than or equal to 100,000.Note
np.where(emp['Salary'] >= 100000, 'Yes', 'No') treats NaN values in Salary as not satisfying the condition, so they are labeled βNoβ.
Letβs do Questions 5-13 in Classwork 12!