Pandas Basics - Sorting & Indexing Data; Mathematical & Vectorized Operations
February 20, 2024
DataFrame
DataFrame
DataFrame
DataFrame
DataFrame
DataFrame
DataFrame
DataFrame
DataFrame
DataFrame
DataFrame
nba.csv
as nba
.# Below is to import the pandas library as pd
import pandas as pd
# Below is for an interactive display of DataFrame in Colab
from google.colab import data_table
data_table.enable_dataframe_formatter()
# Below is to read nba.csv as nba DataFrame
nba = pd.read_csv("https://bcdanl.github.io/data/nba.csv",
parse_dates = ["Birthday"])
# Below is to view the nba DataFrame and to get a summary of it
nba
nba.info()
nba.describe( include="all" )
sort_values()
sort_values()
method’s first parameter, by
, accepts the variables that pandas should use to sort the DataFrame
.sort_values()
sort_values()
method’s ascending
parameter determines the sort order.
ascending
has a default argument of True
.DataFrame
has various methods that modify the existing DataFrame
.nsmallest()
and nlargest()
nsmallest()
are useful to get the first n
observations ordered by a variable in ascending order.
nlargest()
are useful to get the first n
observations ordered by a variable in descending order.
nsmallest()
and nlargest()
keep = "all"
keeps all duplicates, even it means selecting more than n
observations.sort_values()
DataFrame
by multiple columns by passing a list to the by
parameter.sort_values()
ascending
parameter to apply the same sort order to each variable.sort_values()
ascending
parameter.sort_values()
Q. Which players on each team are paid the most?
sort_index()
How can we return it to its original form of DataFrame
?
Our nba
DataFrame
still has its numeric index labels.
sort_index()
sorts observations by their index labels (row names).
sort_index()
sort_index()
method can also be used to change the order of variables in an alphabetical order.
axis
parameter and pass it an argument of "columns"
or 1
.set_index()
method when we want to change the current index of a DataFrame
to one or more existing columns.
set_index()
set_index()
method returns a new DataFrame
with a given column set as the index.
keys
, accepts the column name.reset_index()
reset_index()
method:
DataFrame
column;inplace=True
, the operation alters the original DataFrame
directly.We can extract observations, variables, and values from a DataFrame
by using the loc[]
and iloc[]
accessors.
.loc[ Index Labels ]
nba
with the Name
index.# The two lines below are equivalent
nba = nba.set_index("Name")
nba.set_index("Name", inplace = True)
.loc
attribute extracts an observation by index label (row name)..loc[ Index Labels ]
.loc[ Index Labels ]
loc[:]
to pull rows:
DataFrame
to its end;DataFrame
to a specific index label..iloc[ Index Positions ]
.iloc
(index location) attribute locates rows by index position.
.iloc[:]
is similar to the slicing syntax with strings/lists.
Let’s do Questions 1-4 in Part 1 of Classwork 3!
loc[Rows, Columns]
or iloc[Rows, Columns]
.loc
and .iloc
attributes accept a second argument representing the column(s) to extract.
.loc
, we have to provide the column names.loc[Rows, Columns]
or iloc[Rows, Columns]
.loc
and .iloc
attributes accept a second argument representing the column(s) to extract.
.iloc
, we have to provide the column position.nba.columns
rename( columns = { "Existing One" : "New One" } )
rename()
method renames the variable Date of Birth to Birthday.rename( index = { "Existing One" : "New One" } )
rename()
method renames the observation LeBron James to LeBron Raymone James.max()
method returns a Series
with the maximum value from each variable.min()
method returns a Series
with the minimum value from each variable.sum()
/mean()
/median()
method returns a Series
with the sum/mean/median of the values in each variable.quantile()
method returns a Series
with the percentile value of the values in each variable (e.g., 25th, 75th, 90th percentile).std()
method returns a Series
with the standard deviation of the values in each variable.True
to the sum()
/mean()
/median()
/std()
method’s numeric_only
parameter.nba["Salary_2x"] = nba["Salary"] + nba["Salary"]
nba["Name_w_Position"] = nba["Name"] + " (" + nba["Position"] + ")"
nba["Salary_minus_Mean"] = nba["Salary"] - nba["Salary"].mean()
pandas
performs a vectorized operation on Series
or a variable in DataFrame
.
astype()
MethodMgmt
variable?astype()
methodastype()
method converts a Series
’ values to a different data type.
astype()
methodMgmt
variable with our new Series
of Booleans.astype()
methodSalary
variable’s values to integers with the astype()
method.
NaN
values to integers.fillna()
methodfillna()
method replaces a Series
’ missing values with the argument we pass in.0
.
0
is passed solely for the sake of example.fillna()
methodSalary
variable with our new Series
of integers.astype()
methodPandas includes a special data type called a category
, - It is ideal for a variable consisting of a small number of unique values relative to its total size. - E.g., gender, weekdays, blood types, planets, and income groups.
pd.to_datetime()
methodpd.to_datetime()
function is used to convert a Series
, DataFrame
, or a single variable of a DataFrame
from its current data type into datetime
format.Let’s do Question 1 in Part 2 of Classwork 3!
We may often not know the index labels and positions of the observations we want to target.
We may want to target observations not by an index label but by a Boolean condition.
Series
with a constant value, we place the Series
on one side of the equality operator (==
) and the value on the other.
Series == value
First Name
value with “Donna”.Series
between square brackets following the DataFrame
.
DataFrame[ Boolean_Series ]
Series
to an object and then pass it into the square brackets instead.What if we want to extract a subset of employees who are not on the “Marketing
” team?
True
denotes that the Team
value for a given index is not “Marketing
”, and False
indicates the Team
value is “Marketing
”What if we want to retrieve all the managers in the company? Managers have a value of True
in the Mgmt
variable.
emp["Mgmt"] == True
, but we do not need to.sales = emp["Team"] == "Sales"
legal = emp["Team"] == "Legal"
fnce = emp["Team"] == "Finance"
emp[ sales | legal | fnce ]
Series
inside the square brackets and add the |
symbol to declare OR
criteria.isin()
methodstar_teams = ["Sales", "Legal", "Finance"]
on_star_teams = emp["Team"].isin(star_teams)
emp[on_star_teams]
What if our next report asked for employees from 15 teams instead of three?
A better solution is the isin()
method, which accepts an iterable (e.g., list
, tuple
, Series
) and returns a Boolean Series
.
higher_than_90k = emp["Salary"] >= 90000
lower_than_100k = emp["Salary"] < 100000
emp[ higher_than_90k & lower_than_100k ]
We can create two Boolean Series
, one to declare the lower bound and one to declare the upper bound.
Then we can use the &
operator to mandate that both conditions are True
:
between()
methodbetween()
True
denotes that a row’s value falls between the specified interval.between()
methodbetween()
method to string variables.query()
method# Below two are equivalent
emp.query("Salary >= 100000 & Team == 'Finance'")
emp.query("Salary >= 100000 and Team == 'Finance'")
The query()
method filters observations using a concise, string-based query syntax.
The query()
method is often more readable and more concise, especially for complex conditions, compared to traditional Boolean filtering.
Let’s do Questions 2-6 in Part 2 of Classwork 3!