Converting Data Types; Filtering by Conditions; Dealing with Missing Values
February 27, 2024
drop(columns = ... )
.columns.get_loc()
, .pop()
, and .insert()
ref_var = emp.columns.get_loc('Salary')
var_to_move = emp.pop('Mgmt')
emp.insert(ref_var, 'Mgmt', var_to_move) # insert() directly alters 'emp'
DataFrame.columns.get_loc('Reference_Var')
DataFrame.pop('Some_Var_To_Move')
Series
DataFrame.insert(ref_var, 'Some_Var_To_Move', var_to_move)
astype()
Methodastype()
methodMgmt
variable?astype()
method converts a Series
’ values to a different data type.
astype()
methodMgmt
variable with our new Series
of Booleans.astype()
methodSalary
variable’s values to integers with the astype()
method.
NaN
values to integers.fillna()
methodfillna()
method replaces a Series
’ missing values with the argument we pass in.0
.
0
is passed solely for the sake of example.fillna()
methodSalary
variable with our new Series
of integers.astype()
methodcategory
,
pd.to_datetime()
method# Below two are equivalent:
emp["Start Date"] = pd.to_datetime(emp["Start Date"])
emp["Start Date"] = emp["Start Date"].astype('datetime64')
pd.to_datetime()
function is used to convert a Series
, DataFrame
, or a single variable of a DataFrame
from its current data type into datetime
format.astype()
methodemp = pd.read_csv("https://bcdanl.github.io/data/employment.csv")
emp["Salary"] = emp["Salary"].fillna(0)
emp = emp.astype({'Mgmt': 'bool',
'Salary': 'int',
'Gender': 'category',
'Start Date': 'datetime64'})
astype()
.Let’s do Question 1 in Classwork 4!
We may often not know the index labels and positions of the observations we want to target.
We may want to target observations not by an index label but by a Boolean condition.
Series
with a constant value, we place the Series
on one side of the equality operator (==
) and the value on the other.
Series == value
First Name
value with “Donna”.
Series
.Series
between square brackets following the DataFrame
.
DataFrame[ Boolean_Series ]
Series
to an object and then pass it into the square brackets instead.Marketing
” team?True
denotes that the Team
value for a given index is not “Marketing
”, and False
indicates the Team
value is “Marketing
”True
in the Mgmt
variable.emp["Mgmt"] == True
, but we do not need to.sales = emp["Team"] == "Sales"
legal = emp["Team"] == "Legal"
fnce = emp["Team"] == "Finance"
emp[ sales | legal | fnce ] # '|' is 'or' opeartor
We could provide three separate Boolean Series
inside the square brackets and add the |
symbol to declare OR
criteria.
What if our next report asked for employees from 30 teams instead of three?
isin()
methodstar_teams = ["Sales", "Legal", "Finance"]
on_star_teams = emp["Team"].isin(star_teams)
emp[ on_star_teams ]
isin()
method, which accepts an iterable (e.g., list
, tuple
, array
, Series
) and returns a Boolean Series
.higher_than_90k = emp["Salary"] >= 90000
lower_than_100k = emp["Salary"] < 100000
emp[ higher_than_90k & lower_than_100k ] # '&' is 'and' opeartor
We can create two Boolean Series
, one to declare the lower bound and one to declare the upper bound.
Then we can use the &
operator to mandate that both conditions are True
.
between()
methodbetween()
.
True
denotes that an observation’s value falls between the specified interval.between()
methodbetween()
method to string variables.query()
method!emp.query("Salary >= 100000 & Team == 'Finance'")
emp.query("Salary >= 100000 & `First Name` == 'Douglas'")
query()
method filters observations using a concise, string-based query syntax.
query()
accepts a string value that describes filtering conditions.query()
method, if we have variable names with spaces, we can wrap the variable names in backtick (`).
Let’s do Questions 2-6 in Classwork 4!
NaN
(not a number);
NaT
(not a time).isna()
and notna()
methodsisna()
method returns a Boolean Series
in which True
denotes that an observation’s value is missing.
isna()
and notna()
methodsThe notna()
method returns the inverse Series
, one in which True
indicates that an observation’s value is present.
We use the tilde symbol (~
) to invert a Boolean Series
.
Q. How can we pull out employees with non-missing Team
values?
dropna()
methodemp = pd.read_csv("https://bcdanl.github.io/data/employment.csv",
parse_dates = ["Start Date"])
emp.dropna()
dropna()
method removes observations that hold any NaN
or NaT
values.dropna()
method with how
We can pass the how
parameter an argument of "all"
to remove observations in which all values are missing.
Note that the how
parameter’s default argument is "any"
.
dropna()
method with subset
subset
parameter to target observations with a missing value in a specific variable.
Gender
variable.dropna()
method with subset
subset
parameter a list of variables.dropna()
method with thresh
thresh
parameter specifies a minimum threshold of non-missing values that an observation must have for pandas to keep it.duplicated()
methodduplicated()
method returns a Boolean Series
that identifies duplicates in a variable.duplicated()
methodemp["Team"].duplicated(keep = "first")
emp["Team"].duplicated(keep = "last")
~emp["Team"].duplicated()
duplicated()
method’s keep
parameter informs pandas which duplicate occurrence to keep.
"first"
, keeps the first occurrence of each duplicate value."last"
, keeps the last occurrence of each duplicate value.drop_duplicates()
methoddrop_duplicates()
method removes observations in which all values are equal to those in a previously encountered observations.drop_duplicates()
methoddrop_duplicates()
method:# Sample DataFrame with duplicate observations
data = {
'Name': ['John', 'Anna', 'John', 'Mike', 'Anna'],
'Age': [28, 23, 28, 32, 23],
'City': ['New York', 'Paris', 'New York', 'London', 'Paris']
}
# pd.DataFrame( Series, List, or Dict ) creates a DataFrame
df = pd.DataFrame(data)
df_unique = df.drop_duplicates()
drop_duplicates()
methodWe can pass the drop_duplicates()
method a subset
parameter with a list of columns that pandas should use to determine an observation’s uniqueness.
drop_duplicates()
methodGender
and Team
variables to identify duplicates.drop_duplicates()
methodemp.drop_duplicates(subset = ["Team"], keep = "last")
emp.drop_duplicates(subset = ["Team"], keep = False)
The drop_duplicates()
method also accepts a keep
parameter.
"last"
to keep the observations with each duplicate value’s last occurrence.False
to exclude all observations with duplicate values.Q. What does emp.drop_duplicates(subset = ["First Name"], keep = False)
do?
Q. Find a subset of all employees with a First Name of “Douglas” and a Gender of “Male”. Then check which “Douglas” is in the DataFrame emp.drop_duplicates(subset = ["Gender", "Team"])
.
Let’s do Questions 7-8 in Classwork 4!