Lecture 4

Converting Data Types; Filtering by Conditions; Dealing with Missing Values

Byeong-Hak Choe

SUNY Geneseo

February 27, 2024

Removing and Relocating Variables

Pandas Basics

  • Let’s read employment.csv as emp.
import pandas as pd
# Below is for an interactive display of DataFrame in Colab
from google.colab import data_table
data_table.enable_dataframe_formatter()

emp = pd.read_csv("https://bcdanl.github.io/data/employment.csv")

Pandas Basics

Adding and Removing Variables

  • Here we use [] to add variables:
emp['Salary_k'] = emp['Salary'] / 1000
emp['Salary_2x'] = emp['Salary'] + emp['Salary']
emp['Salary_3x'] = emp['Salary'] * 3

Pandas Basics

Removing Variables with drop(columns = ... )

  • We can use .drop(columns = ...) to drop variables:
emp.drop(columns = "Salary_k")
emp.drop(columns = ["Salary_2x", "Salary_3x"])

Pandas Basics

Relocating Variables with .columns.get_loc(), .pop(), and .insert()

ref_var = emp.columns.get_loc('Salary') 
var_to_move = emp.pop('Mgmt')
emp.insert(ref_var, 'Mgmt', var_to_move) # insert() directly alters 'emp'
  • Step 1. DataFrame.columns.get_loc('Reference_Var')
    • Get the integer position (right before the reference variable, ‘Reference_Var’)
  • Step 2. DataFrame.pop('Some_Var_To_Move')
    • Remove the variable we want to relocate from the DataFrame and store it in a Series
  • Step 3. DataFrame.insert(ref_var, 'Some_Var_To_Move', var_to_move)
    • Insert the variable back into the DataFrame right after the reference variable.

Converting Data Types with the astype() Method

Pandas Basics

Converting Data Types with the astype() method

  • What values are in the Mgmt variable?
emp["Mgmt"].astype(bool)
  • The astype() method converts a Series’ values to a different data type.
    • It can accept a single argument: the new data type.

Pandas Basics

Converting Data Types with the astype() method

emp["Mgmt"] = emp["Mgmt"].astype(bool)
  • The above code overwrites the Mgmt variable with our new Series of Booleans.

Pandas Basics

Converting Data Types with the astype() method

emp["Salary"].astype(int)
  • The above code tries to coerce the Salary variable’s values to integers with the astype() method.
    • Pandas is unable to convert the NaN values to integers.

Pandas Basics

Fill Missing Values with the fillna() method

emp["Salary"].fillna(0)
  • The fillna() method replaces a Series’ missing values with the argument we pass in.
  • The above example provides a fill value of 0.
    • Note that our choice of value can distort the data; 0 is passed solely for the sake of example.

Pandas Basics

Fill Missing Values with the fillna() method

emp["Salary"] = emp["Salary"].fillna(0).astype(int)
  • The above code overwrites the Salary variable with our new Series of integers.

Pandas Basics

Converting Data Types with the astype() method

emp["Gender"] = emp["Gender"].astype("category")
  • Pandas includes a special data type called a category,
    • It is ideal for a variable consisting of a small number of unique values relative to its total size.
    • E.g., gender, weekdays, blood types, planets, and income groups.

Pandas Basics

Converting Data Types with the pd.to_datetime() method

# Below two are equivalent:
emp["Start Date"] = pd.to_datetime(emp["Start Date"])
emp["Start Date"] = emp["Start Date"].astype('datetime64')
  • The pd.to_datetime() function is used to convert a Series, DataFrame, or a single variable of a DataFrame from its current data type into datetime format.

Pandas Basics

Converting Data Types with the astype() method

emp = pd.read_csv("https://bcdanl.github.io/data/employment.csv")

emp["Salary"] = emp["Salary"].fillna(0)
emp = emp.astype({'Mgmt': 'bool', 
                  'Salary': 'int',
                  'Gender': 'category',
                  'Start Date': 'datetime64'})
  • We can provide a dictionary of variable-type pairs to astype().

Pandas Basics

Let’s do Question 1 in Classwork 4!

Filtering by a Single Condition

Pandas Basics

Filtering by a Condition

  • We may often not know the index labels and positions of the observations we want to target.

  • We may want to target observations not by an index label but by a Boolean condition.

Pandas Basics

Filtering by a Single Condition

emp["First Name"] == "Donna"
  • To compare every value in Series with a constant value, we place the Series on one side of the equality operator (==) and the value on the other.
    • Series == value
  • The above example compares each First Name value with “Donna”.
    • pandas performs a vectorized operation (element-by-element operation) on Series.

Pandas Basics

Filtering by a Single Condition

emp[ emp["First Name"] == "Donna" ]
  • To filter observations, we provide the Boolean Series between square brackets following the DataFrame.
    • DataFrame[ Boolean_Series ]

Pandas Basics

Filtering by a Single Condition

donnas = emp["First Name"] == "Donna"
emp[ donnas ]
  • If the use of multiple square brackets is confusing, we can assign the Boolean Series to an object and then pass it into the square brackets instead.

Pandas Basics

Filtering by a Single Condition

  • What if we want to extract a subset of employees who are not on the “Marketing” team?
non_marketing = emp["Team"] != "Marketing"  # != means "not equal to"
emp[ non_marketing ]
  • True denotes that the Team value for a given index is not “Marketing”, and False indicates the Team value is “Marketing

Pandas Basics

Filtering by a Single Condition

  • What if we want to retrieve all the managers in the company?
    • Managers have a value of True in the Mgmt variable.
emp[ emp["Mgmt"] ]
  • We could execute emp["Mgmt"] == True, but we do not need to.

Pandas Basics

Filtering by a Single Condition

high_earners = emp["Salary"] > 100000
emp[ high_earners ]
  • We can also use arithmetic operands to filter observations based on mathematical conditions.

Filtering by a Condition

Pandas Basics

Filtering by a Condition

sales = emp["Team"] == "Sales"
legal = emp["Team"] == "Legal"
fnce = emp["Team"] == "Finance"
emp[ sales | legal | fnce ] # '|' is 'or' opeartor
  • We could provide three separate Boolean Series inside the square brackets and add the | symbol to declare OR criteria.

  • What if our next report asked for employees from 30 teams instead of three?

Pandas Basics

Filtering with the isin() method

star_teams = ["Sales", "Legal", "Finance"]
on_star_teams = emp["Team"].isin(star_teams)
emp[ on_star_teams ]
  • A better solution is the isin() method, which accepts an iterable (e.g., list, tuple, array, Series) and returns a Boolean Series.

Pandas Basics

Filtering by a Condition

  • When working with numbers or dates, we often want to extract values that fall within a range.
    • E.g., Identify all employees with a salary between $90,000 and $100,000.
higher_than_90k = emp["Salary"] >= 90000
lower_than_100k = emp["Salary"] < 100000
emp[ higher_than_90k & lower_than_100k ] # '&' is 'and' opeartor
  • We can create two Boolean Series, one to declare the lower bound and one to declare the upper bound.

  • Then we can use the & operator to mandate that both conditions are True.

Pandas Basics

Filtering with the between() method

between_90k_and_100k = emp["Salary"].between(90000, 100000)
emp[ between_90k_and_100k ]
  • A slightly cleaner solution is to use a method called between().
    • It returns a Boolean Series where True denotes that an observation’s value falls between the specified interval.
    • The first argument, the lower bound, is inclusive, and the second argument, the upper bound, is exclusive.

Pandas Basics

Filtering with the between() method

name_starts_with_t = emp["First Name"].between("T", "U")
emp[ name_starts_with_t ]
  • We can also apply the between() method to string variables.

Pandas Basics

Filtering by a Condition with the query() method!

emp.query("Salary >= 100000 & Team == 'Finance'")
emp.query("Salary >= 100000 & `First Name` == 'Douglas'")
  • The query() method filters observations using a concise, string-based query syntax.
    • query() accepts a string value that describes filtering conditions.
  • When using the query() method, if we have variable names with spaces, we can wrap the variable names in backtick (`).
    • Backtick (`) is the key located next to the number 1 in a keyboard.

Pandas Basics

Let’s do Questions 2-6 in Classwork 4!

Dealing with Missing Values

Pandas Basics

Dealing with Missing Values

  • Pandas often marks (1) missing text values and (2) missing numeric values with a NaN (not a number);
    • It also marks missing datetime values with a NaT (not a time).

Pandas Basics

Dealing with Missing Values: The isna() and notna() methods

emp["Team"].isna()
emp["Start Date"].isna()
  • The isna() method returns a Boolean Series in which True denotes that an observation’s value is missing.
    • Is a value of a variable “XYZ” missing?

Pandas Basics

Dealing with Missing Values: The isna() and notna() methods

# Below two are equivalent.
emp["Team"].notna()
~emp["Team"].isna()
  • The notna() method returns the inverse Series, one in which True indicates that an observation’s value is present.

  • We use the tilde symbol (~) to invert a Boolean Series.

  • Q. How can we pull out employees with non-missing Team values?

Pandas Basics

Dealing with Missing Values: The dropna() method

emp = pd.read_csv("https://bcdanl.github.io/data/employment.csv",
                  parse_dates = ["Start Date"])
emp.dropna()
  • The dropna() method removes observations that hold any NaN or NaT values.

Pandas Basics

Dealing with Missing Values: The dropna() method with how

emp.dropna(how = "all")
  • We can pass the how parameter an argument of "all" to remove observations in which all values are missing.

  • Note that the how parameter’s default argument is "any".

Pandas Basics

Dealing with Missing Values: The dropna() method with subset

emp.dropna(subset = ["Gender"])
  • We can use the subset parameter to target observations with a missing value in a specific variable.
    • The above example removes observations that have a missing value in the Gender variable.

Pandas Basics

Dealing with Missing Values: The dropna() method with subset

emp.dropna(subset = ["Start Date", "Salary"])
  • We can also pass the subset parameter a list of variables.

Pandas Basics

Dealing with Missing Values: The dropna() method with thresh

emp.dropna(thresh = 4)
  • The thresh parameter specifies a minimum threshold of non-missing values that an observation must have for pandas to keep it.

Dealing with Duplicates

Pandas Basics

Dealing with Duplicates with the duplicated() method

  • Missing values are a common occurrence in messy data sets, and so are duplicate values.
emp["Team"].duplicated()
  • The duplicated() method returns a Boolean Series that identifies duplicates in a variable.

Pandas Basics

Dealing with Duplicates with the duplicated() method

emp["Team"].duplicated(keep = "first")
emp["Team"].duplicated(keep = "last")
~emp["Team"].duplicated()
  • The duplicated() method’s keep parameter informs pandas which duplicate occurrence to keep.
    • Its default argument, "first", keeps the first occurrence of each duplicate value.
    • Its argument, "last", keeps the last occurrence of each duplicate value.
  • Q. How can we keep observations with the first occurrences of a value in the Team variable?

Pandas Basics

Dealing with Duplicates with the drop_duplicates() method

emp.drop_duplicates()
  • The drop_duplicates() method removes observations in which all values are equal to those in a previously encountered observations.

Pandas Basics

Dealing with Duplicates with the drop_duplicates() method

  • Below is an example of the drop_duplicates() method:
# Sample DataFrame with duplicate observations
data = {
    'Name': ['John', 'Anna', 'John', 'Mike', 'Anna'],
    'Age': [28, 23, 28, 32, 23],
    'City': ['New York', 'Paris', 'New York', 'London', 'Paris']
}

# pd.DataFrame( Series, List, or Dict ) creates a DataFrame
df = pd.DataFrame(data)  
df_unique = df.drop_duplicates()

Pandas Basics

Dealing with Duplicates with the drop_duplicates() method

emp.drop_duplicates(subset = ["Team"])
  • We can pass the drop_duplicates() method a subset parameter with a list of columns that pandas should use to determine an observation’s uniqueness.

    • The above example finds the first occurrence of each unique value in the Team variable.

Pandas Basics

Dealing with Duplicates with the drop_duplicates() method

emp.drop_duplicates(subset = ["Gender", "Team"])
  • The above example uses a combination of values across the Gender and Team variables to identify duplicates.

Pandas Basics

Dealing with Duplicates with the drop_duplicates() method

emp.drop_duplicates(subset = ["Team"], keep = "last")
emp.drop_duplicates(subset = ["Team"], keep = False)
  • The drop_duplicates() method also accepts a keep parameter.

    • We can pass it an argument of "last" to keep the observations with each duplicate value’s last occurrence.
    • We can pass it an argument of False to exclude all observations with duplicate values.
  • Q. What does emp.drop_duplicates(subset = ["First Name"], keep = False) do?

  • Q. Find a subset of all employees with a First Name of “Douglas” and a Gender of “Male”. Then check which “Douglas” is in the DataFrame emp.drop_duplicates(subset = ["Gender", "Team"]).

Pandas Basics

Let’s do Questions 7-8 in Classwork 4!