Dealing with Missing Values and Duplicates; Group Operations
March 5, 2024
NaN (not a number); NaT (not a time).

isna() and notna() methods
The isna() method returns a Boolean Series in which True denotes that an observation's value is missing.
The notna() method returns the inverse Series, one in which True indicates that an observation's value is present.
We use the tilde symbol (~) to invert a Boolean Series.
Q. How can we pull out employees with non-missing Team values?
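One possible approach, sketched on an invented stand-in DataFrame (the real employees data is loaded later in this section):

```python
import pandas as pd

# Invented stand-in for the employees data; Team has a missing value
df = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cal"],
    "Team": ["Sales", None, "IT"],
})

# notna() marks present values True; use it as a Boolean filter
present = df["Team"].notna()
df[present]                  # keeps Ana and Cal

# Equivalently, invert isna() with the tilde operator
df[~df["Team"].isna()]
```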
value_counts(dropna = False) method
We can count missing values by calling isna().sum() on a Series; True is 1 and False is 0.
With the dropna = False option, the .value_counts() method on a Series also reports a missing value count.

dropna() method
import pandas as pd

emp = pd.read_csv("https://bcdanl.github.io/data/employment.csv",
                  parse_dates = ["Start Date"])
emp.dropna()
The dropna() method removes observations that hold any NaN or NaT values.

dropna() method with how
We can pass the how parameter an argument of "all" to remove observations in which all values are missing.
Note that the how parameter's default argument is "any".
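A minimal sketch of how="all" versus the default behavior, using a made-up frame (column names are illustrative):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan],
})

df.dropna()             # default how="any": keeps only the fully observed row
df.dropna(how="all")    # drops only the last row, where every value is missing
```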
dropna() method with subset
We can use the subset parameter to target observations with a missing value in a specific variable, such as the Gender variable.
We can also pass the subset parameter a list of variables.

dropna() method with thresh
The thresh parameter specifies a minimum threshold of non-missing values that an observation must have for pandas to keep it.
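A sketch of subset and thresh on an illustrative frame (the column names mirror the employment data, but the values are invented):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Name":   ["Ana", "Ben", "Cal"],
    "Gender": ["F", np.nan, "M"],
    "Team":   [np.nan, np.nan, "IT"],
})

df.dropna(subset=["Gender"])           # drops only rows missing Gender
df.dropna(subset=["Gender", "Team"])   # drops rows missing either variable
df.dropna(thresh=2)                    # keeps rows with at least 2 non-missing values
```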
duplicated() method
The duplicated() method returns a Boolean Series that identifies duplicates in a variable.
emp["Team"].duplicated(keep = "first")
emp["Team"].duplicated(keep = "last")
~emp["Team"].duplicated()
The duplicated() method's keep parameter informs pandas which duplicate occurrence to keep.
"first" keeps the first occurrence of each duplicate value.
"last" keeps the last occurrence of each duplicate value.

drop_duplicates() method
The drop_duplicates() method removes observations in which all values are equal to those in a previously encountered observation.
An example of the drop_duplicates() method:
import pandas as pd

# Sample DataFrame with duplicate observations
data = {
'Name': ['John', 'Anna', 'John', 'Mike', 'Anna'],
'Age': [28, 23, 28, 32, 23],
'City': ['New York', 'Paris', 'New York', 'London', 'Paris']
}
# pd.DataFrame( Series, List, or Dict ) creates a DataFrame
df = pd.DataFrame(data)
df_unique = df.drop_duplicates()
We can pass the drop_duplicates() method a subset parameter with a list of variables that pandas should use to determine an observation's uniqueness.
For example, we can use the Gender and Team variables to identify duplicates.
emp.drop_duplicates(subset = ["Team"], keep = "last")
emp.drop_duplicates(subset = ["Team"], keep = False)
The drop_duplicates() method also accepts a keep parameter.
We can pass it "last" to keep the observations with each duplicate value's last occurrence.
We can pass it False to exclude all observations with duplicate values.

Q. What does emp.drop_duplicates(subset = ["First Name"], keep = False) do?
Q. Find a subset of all employees with a First Name of "Douglas" and a Gender of "Male". Then check which "Douglas" is in the DataFrame emp.drop_duplicates(subset = ["Gender", "Team"]).
Let’s do Questions 1-3 in Part 1 of Classwork 5!
The pandas library’s GroupBy
object is a storage container for grouping observations into buckets.
It provides a set of methods to aggregate and analyze each independent group in the collection.
It allows us to extract observations at specific index positions within each group.
It also offers a convenient way to iterate over the groups of observations.
GroupBy Object
pd.DataFrame() converts a List/Dictionary/Series into a DataFrame object.
The Type variable identifies the group to which an Item belongs.
A GroupBy object implicitly organizes observations into buckets based on shared values in a categorical variable.
We can call the groupby() method on the supermarket DataFrame.
We pass groupby() a categorical variable whose values will be used to create the groups.
A DataFrameGroupBy object is separate and distinct from a DataFrame.
Here the GroupBy object will store two groups.
The get_group() method accepts a group name and returns a DataFrame with the corresponding observations.
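The supermarket example might look like the following sketch; the Item/Type values here are invented, since the original data is not shown:

```python
import pandas as pd

# Invented supermarket data; Type is the categorical grouping variable
supermarket = pd.DataFrame({
    "Item": ["Apple", "Broccoli", "Banana", "Carrot"],
    "Type": ["Fruit", "Vegetable", "Fruit", "Vegetable"],
})

groups = supermarket.groupby("Type")   # a DataFrameGroupBy with two groups
groups.get_group("Fruit")              # DataFrame of the Fruit observations
```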
fortune_2023.csv
The fortune_2023.csv file is a collection of Fortune 1000 companies from 2023 (Source: Kaggle).
import pandas as pd

fortune1000 = pd.read_csv("https://bcdanl.github.io/data/fortune_2023.csv")
varlist = ['Company', 'Revenues_M', 'Profits_M', 'Number_of_employees', 'Sector', 'Industry']
fortune = fortune1000[varlist]
The code above creates the fortune DataFrame.
GroupBy object from a dataset
Suppose we want the mean revenue of the companies in fortune with a Sector value of "Retailing".
in_retailing = ( fortune["Sector"] == "Retailing" )
retail_companies = fortune[ in_retailing ]
retail_companies["Revenues_M"].mean()
We could repeat this for every sector in fortune, but the GroupBy object offers the best solution out of the box.
sectors = fortune.groupby("Sector")
We can count the number of groups in sectors by passing the GroupBy object into Python's built-in len() function.
The size() method on the GroupBy object returns a Series with an alphabetical list of the groups and their observation counts.
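On a small stand-in for fortune (three invented companies, same column names), len() and size() behave like this:

```python
import pandas as pd

# Tiny invented stand-in for the fortune data
fortune_mini = pd.DataFrame({
    "Company":    ["A", "B", "C"],
    "Revenues_M": [100.0, 80.0, 60.0],
    "Sector":     ["Retailing", "Energy", "Retailing"],
})

sectors_mini = fortune_mini.groupby("Sector")
len(sectors_mini)      # number of groups: 2
sectors_mini.size()    # Series of observation counts, alphabetical by group
```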
The groups attribute stores a dictionary with associations of group-to-observations.
The GroupBy object's first()/last() method extracts the first/last observation listed for each group in a DataFrame.
Because the fortune DataFrame is sorted by Revenues_M, the first company pulled out for each sector will be the highest-performing company within that sector.
The nth() method is used with a GroupBy object to select the nth observation from each group in fortune.
The head(n)/tail(n) method extracts the first/last n observations from each group.
We can use the get_group() method to extract all observations in a given group; it returns a DataFrame containing the observations.
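These extraction methods can be sketched on the same kind of invented mini version of fortune (already sorted by Revenues_M, as the real data is):

```python
import pandas as pd

# Invented mini fortune, pre-sorted by Revenues_M
fortune_mini = pd.DataFrame({
    "Company":    ["A", "B", "C", "D"],
    "Revenues_M": [100.0, 90.0, 40.0, 30.0],
    "Sector":     ["Retailing", "Energy", "Retailing", "Energy"],
})
sectors_mini = fortune_mini.groupby("Sector")

sectors_mini.first()               # top company per sector (data is pre-sorted)
sectors_mini.nth(1)                # second observation of each group
sectors_mini.head(1)               # first 1 observation per group, original order
sectors_mini.get_group("Energy")   # all Energy observations as a DataFrame
```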
We can use the GroupBy object to apply aggregate operations to every group.
sectors["Revenues_M"]
sectors["Revenues_M"].sum()
sectors["Revenues_M"].mean()
sectors["Revenues_M"].max()
sectors["Revenues_M"].min()
Selecting a single variable from a GroupBy object returns a SeriesGroupBy.
sectors.agg(
    Revenues_M_min = ("Revenues_M", "min"),
    Profits_M_max = ("Profits_M", "max"),
    Number_of_employees_mean = ("Number_of_employees", "mean")
)
The agg() method applies multiple aggregate operations to different variables and can accept a tuple as its argument.
The agg() method can also be used on a SeriesGroupBy object.
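The named-aggregation call above can be mirrored in a self-contained sketch (data invented, column names matching fortune):

```python
import pandas as pd

df = pd.DataFrame({
    "Sector":     ["Retail", "Retail", "Energy"],
    "Revenues_M": [10.0, 30.0, 20.0],
    "Profits_M":  [1.0, 3.0, 2.0],
})

# Each keyword names an output column; each tuple is (variable, aggregation)
summary = df.groupby("Sector").agg(
    Revenues_M_min=("Revenues_M", "min"),
    Profits_M_max=("Profits_M", "max"),
)
```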
fortune.sample(frac=1): This samples 100% of the observations from the DataFrame, effectively shuffling it.
The frac parameter specifies the fraction of observations to return in the random sample, so frac=1 means "return all observations".
fortune.reset_index(): This resets the index of the DataFrame.
The drop = True option is used to prevent the old index from being added as a variable in the new DataFrame.
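The two steps are often chained; a sketch on a toy frame (random_state is added here only to make the shuffle reproducible):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4]})

# Shuffle all observations, then rebuild a clean 0..n-1 index,
# discarding the old index instead of keeping it as a variable
shuffled = df.sample(frac=1, random_state=0).reset_index(drop=True)
```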
.Let’s do Questions 1-6 in Part 2 of Classwork 5!
sort_values() with groupby(); nlargest() with groupby()
A DataFrameGroupBy object has no sort_values() or nlargest() method.

A function can take any number and type of input parameters and return any number and type of output results.
We can do two things with a function: define it, and call it.
To define a Python function, we type def, the function name, parentheses enclosing any input parameters to the function, and then finally, a colon (:).
Let's define a very simple function my_half() that has a parameter x and returns a value x / 2.
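The my_half() function described above:

```python
def my_half(x):
    # Return half of x
    return x / 2

my_half(10)  # returns 5.0
```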
The values we pass into the function when we call it are known as arguments.
When we call a function with arguments, the values of those arguments are copied to their corresponding parameters inside the function.
Q. How can we apply the nlargest() method to each group?
We can define a custom function and pass it to the .apply() method.
def get_largest_obs(df):
    return df.nlargest(1, "Revenues_M", keep="all")

sectors.apply(get_largest_obs)
We use the apply() method when pandas does not support a custom aggregation we would like to apply to each group.

We can also create a GroupBy object with values from multiple variables.
sector_and_industry = fortune.groupby(["Sector", "Industry"])
The GroupBy object's size() method now returns a MultiIndex Series with a count of observations for each internal group.
The reset_index() method can be used to convert a Series into a DataFrame.
The name option renames the variable created from the index when resetting the index.
The get_group() method now requires a tuple of values to extract a nested DataFrame from the GroupBy collection.
sector_and_industry.sum()
sector_and_industry["Revenues_M"].mean()
(
sector_and_industry["Revenues_M"]
.mean()
.reset_index(name = "Revenues_mean")
)
Pandas returns a MultiIndex DataFrame with the calculations.

Let's do Question 7 in Part 2 of Classwork 5!