Lecture 3

Pandas Basics - Sorting & Indexing Data; Mathematical & Vectorized Operations

Byeong-Hak Choe

bchoe@geneseo.edu

SUNY Geneseo

February 20, 2024

Pandas Basics

Learning Objectives

Loading DataFrame
Getting a Summary of DataFrame
Selecting Variables in DataFrame
Counting Values in DataFrame

Sorting DataFrame
Indexing DataFrame
Locating Observations and Values in DataFrame
Renaming Variables in DataFrame
Mathematical & Vectorized Operations with DataFrame
Converting Data Types in DataFrame
Filtering Observations in DataFrame

Sorting Methods

Pandas Basics

Let’s read nba.csv as nba.

# Below is to import the pandas library as pd
import pandas as pd 

# Below is for an interactive display of DataFrame in Colab
from google.colab import data_table  
data_table.enable_dataframe_formatter()

# Below is to read nba.csv as nba DataFrame
nba = pd.read_csv("https://bcdanl.github.io/data/nba.csv",
                  parse_dates = ["Birthday"])
                  
# Below is to view the nba DataFrame and to get a summary of it
nba
nba.info()
nba.describe( include="all" )

Pandas Basics

Sorting by a Single Variable with `sort_values()`

# The two lines below are equivalent
nba.sort_values(["Name"])
nba.sort_values(by = ["Name"])

The sort_values() method’s first parameter, by, accepts the variables that pandas should use to sort the DataFrame.

Pandas Basics

Sorting by a Single Variable with `sort_values()`

nba.sort_values(["Name"], ascending = False)

The sort_values() method’s ascending parameter determines the sort order.
- ascending has a default argument of True.
- By default, pandas will sort:
  - A variable of numbers in increasing order;
  - A variable of strings in alphabetical order;
  - A variable of datetimes in chronological order.

Pandas Basics

Method Chaining

(
    nba
    .sort_values(['Salary'])
    .head(5)
)

DataFrame has various methods that modify the existing DataFrame.
Method Chaining: We can call methods sequentially without the need to store intermediate results.

Pandas Basics

Sorting by a Single Variable with `nsmallest()` and `nlargest()`

nba.nsmallest(5, 'Salary')
nba.nlargest(5, 'Salary')

nsmallest() are useful to get the first n observations ordered by a variable in ascending order.
nlargest() are useful to get the first n observations ordered by a variable in descending order.

Pandas Basics

Sorting by a Single Variable with `nsmallest()` and `nlargest()`

nba.nsmallest(4, 'Salary', keep = "all")
nba.nlargest(4, 'Salary', keep = "all")

keep = "all" keeps all duplicates, even it means selecting more than n observations.

Pandas Basics

Sorting by Multiple Variables with `sort_values()`

nba.sort_values(["Team", "Name"])
nba.sort_values(by = ["Team", "Name"])

We can sort a DataFrame by multiple columns by passing a list to the by parameter.

Pandas Basics

Sorting by Multiple Variables with `sort_values()`

nba.sort_values(by = ["Team", "Name"], 
                ascending = False)

We can pass a single Boolean to the ascending parameter to apply the same sort order to each variable.

Pandas Basics

Sorting by Multiple Variables with `sort_values()`

nba.sort_values(by = ["Team", "Name"], 
                ascending = [False, True])

If we want to sort each variable in a different order, we can pass a Boolean list to the ascending parameter.

Pandas Basics

Sorting by Multiple Variables with `sort_values()`

Q. Which players on each team are paid the most?

Pandas Basics

Sorting by Row Index with `sort_index()`

# Below lines are equivalent
nba.sort_index()
nba.sort_index(ascending = True)

nba.sort_index(ascending = False)

How can we return it to its original form of DataFrame?
Our nba DataFrame still has its numeric index labels.
sort_index() sorts observations by their index labels (row names).

Pandas Basics

Changing the Order of Variables with `sort_index()`

# The two lines below are equivalent
nba.sort_index(axis = "columns")
nba.sort_index(axis = 1)

The sort_index() method can also be used to change the order of variables in an alphabetical order.
- We need to add an axis parameter and pass it an argument of "columns" or 1.

Setting a New Index

Pandas Basics

Setting a New Index

We can use the set_index() method when we want to change the current index of a DataFrame to one or more existing columns.
- This is particularly useful when:
  - We have a column that uniquely identifies each observation (e.g., ID);
  - We sometimes want to use an unique identifier as the index for more efficient data wrangling.

Pandas Basics

Setting a New Index with `set_index()`

# The two lines below are equivalent
nba.set_index(keys = "Name")
nba.set_index("Name")

The set_index() method returns a new DataFrame with a given column set as the index.
- Its first parameter, keys, accepts the column name.

Pandas Basics

Re-setting an Index with `reset_index()`

nba2 = nba.set_index("Name")
nba2.reset_index(inplace=True)    # Useful for the method chaining

We use the reset_index() method:
- When we want to convert the index back into a DataFrame column;
- When we need to reset the index to the default integer index.
Note: With inplace=True, the operation alters the original DataFrame directly.

Locating Observations

Pandas Basics

Locating Observations/Values

We can extract observations, variables, and values from a DataFrame by using the loc[] and iloc[] accessors.
- These accessors work well when we know the index labels and positions of the observations/variables we want to target.

Pandas Basics

Locating Observations by `.loc[ Index Labels ]`

Let’s consider the nba with the Name index.

# The two lines below are equivalent
nba = nba.set_index("Name")
nba.set_index("Name", inplace = True)

Below extracts observations:

nba.loc[ "LeBron James" ]
nba.loc[ ["Kawhi Leonard", "Paul George"] ]

The .loc attribute extracts an observation by index label (row name).

Pandas Basics

Locating Observations by `.loc[ Index Labels ]`

(
    nba
    .sort_index()
    .loc["Otto Porter":"Patrick Beverley"]
)

What is the above code doing?
- Note: Both the starting value and the ending value are inclusive.

Pandas Basics

Locating Observations by `.loc[ Index Labels ]`

(
    nba
    .sort_index()
    .loc["Zach Collins":]
)

(
    nba
    .sort_index()
    .loc[:"Al Horford"]
)

We can use loc[:] to pull rows:
- From the middle of the DataFrame to its end;
- From the beginning of the DataFrame to a specific index label.

Pandas Basics

Locating Observations by `.iloc[ Index Positions ]`

nba.iloc[ 300 ]
nba.iloc[ [100, 200, 300, 400] ]

nba.iloc[400:404]
nba.iloc[:2]
nba.iloc[447:]
nba.iloc[-10:-6]
nba.iloc[0:10:2] # every other rows

The .iloc (index location) attribute locates rows by index position.
- This can be helpful when the position of rows has significance in our data set.
- We pass integers.
The .iloc[:] is similar to the slicing syntax with strings/lists.
- The end value is NOT inclusive.

Pandas Basics

Let’s do Questions 1-4 in Part 1 of Classwork 3!

Locating Values

Pandas Basics

Locating Values by `loc[Rows, Columns]` or `iloc[Rows, Columns]`

nba.loc[
    "LeBron James",
    "Team"
]

nba.loc[
     "James Harden", 
      ["Position", "Birthday"] 
]

nba.loc[
    ["Russell Westbrook", "Anthony Davis"],
     ["Team", "Salary"]
]

nba.loc[
    "Joel Embiid", 
    "Position":"Salary"
]

Both the .loc and .iloc attributes accept a second argument representing the column(s) to extract.
- If we are using .loc, we have to provide the column names.

Pandas Basics

Locating Values by `loc[Rows, Columns]` or `iloc[Rows, Columns]`

nba.iloc[
    57, 
    3
]

nba.iloc[
    100:104, 
    :3
]

Both the .loc and .iloc attributes accept a second argument representing the column(s) to extract.
- If we are using .iloc, we have to provide the column position.

Renaming columns or rows

Pandas Basics

Renaming columns with `nba.columns`

Do you recall the .columns attribute?

nba.columns

We can rename any or all of a DataFrame’s columns by assigning a list of new names to the attribute:

nba.columns = ["Team", "Position", "Date of Birth", "Income"]

Pandas Basics

Renaming columns with `rename( columns = { "Existing One" : "New One" } )`

nba.rename( columns = { "Date of Birth": "Birthday" } )

The above rename() method renames the variable Date of Birth to Birthday.

Pandas Basics

Renaming rows with `rename( index = { "Existing One" : "New One" } )`

nba = nba.rename(
    index = { "LeBron James": "LeBron Raymone James" }
)

The above rename() method renames the observation LeBron James to LeBron Raymone James.

Mathematical & Vectorized Operations

Pandas Basics

Mathematical Operations

nba.max()
nba.min()

The max() method returns a Series with the maximum value from each variable.
The min() method returns a Series with the minimum value from each variable.

Pandas Basics

Mathematical Operations

nba.sum()
nba.mean()
nba.median()
nba.quantile(0.75) # 0 to 1
nba.std()

nba.sum(numeric_only = True)
nba.mean(numeric_only = True)
nba.median(numeric_only = True)
nba.quantile(0.75, numeric_only=True)
nba.std(numeric_only = True)

The sum()/mean()/median() method returns a Series with the sum/mean/median of the values in each variable.
The quantile() method returns a Series with the percentile value of the values in each variable (e.g., 25th, 75th, 90th percentile).
The std() method returns a Series with the standard deviation of the values in each variable.
To limit the operation to numeric volumes, we can pass True to the sum()/mean()/median()/std() method’s numeric_only parameter.

Pandas Basics

Vectorized Operations

nba["Salary_2x"] = nba["Salary"] + nba["Salary"]
nba["Name_w_Position"] = nba["Name"] + " (" + nba["Position"] + ")"
nba["Salary_minus_Mean"] = nba["Salary"] - nba["Salary"].mean()

pandas performs a vectorized operation on Series or a variable in DataFrame.
- This means an element-by-element operation.
- This enables us to apply functions and perform operations on the data efficiently, without the need for explicit loops.

Converting Data Types with the `astype()` Method

Pandas Basics

Let’s read employment.csv as emp.

emp = pd.read_csv("https://bcdanl.github.io/data/employment.csv")

What values are in the Mgmt variable?

Pandas Basics

Converting Data Types with the `astype()` method

emp["Mgmt"].astype(bool)

The astype() method converts a Series’ values to a different data type.
- It accepts a single argument: the new data type.
- We can pass either the data type or a string with its name.

Pandas Basics

Converting Data Types with the `astype()` method

emp["Mgmt"] = emp["Mgmt"].astype(bool)

The above code overwrites the Mgmt variable with our new Series of Booleans.

Pandas Basics

Converting Data Types with the `astype()` method

emp["Salary"].astype(int)

The above code tries to coerce the Salary variable’s values to integers with the astype() method.
- Pandas is unable to convert the NaN values to integers.

Pandas Basics

Fill Missing Values with the `fillna()` method

emp["Salary"].fillna(0)

The fillna() method replaces a Series’ missing values with the argument we pass in.
The above example provides a fill value of 0.
- Note that your choice of value can distort the data; 0 is passed solely for the sake of example.

Pandas Basics

Fill Missing Values with the `fillna()` method

emp["Salary"] = emp["Salary"].fillna(0).astype(int)

The above code overwrites the Salary variable with our new Series of integers.

Pandas Basics

Converting Data Types with the `astype()` method

emp["Gender"] = emp["Gender"].astype("category")

Pandas includes a special data type called a category, - It is ideal for a variable consisting of a small number of unique values relative to its total size. - E.g., gender, weekdays, blood types, planets, and income groups.

Pandas Basics

Converting Data Types with the `pd.to_datetime()` method

emp["Start Date"] = pd.to_datetime(emp["Start Date"])

The pd.to_datetime() function is used to convert a Series, DataFrame, or a single variable of a DataFrame from its current data type into datetime format.

Pandas Basics

Let’s do Question 1 in Part 2 of Classwork 3!

Filtering by a Single Condition

Pandas Basics

Filtering by a Condition

We may often not know the index labels and positions of the observations we want to target.
We may want to target observations not by an index label but by a Boolean condition.

Pandas Basics

Filtering by a Single Condition

emp["First Name"] == "Donna"

To compare every value in Series with a constant value, we place the Series on one side of the equality operator (==) and the value on the other.
- Series == value
The above example compares each First Name value with “Donna”.

Pandas Basics

Filtering by a Single Condition

emp[ emp["First Name"] == "Donna" ]

To filter observations, we provide the Boolean Series between square brackets following the DataFrame.
- DataFrame[ Boolean_Series ]

Pandas Basics

Filtering by a Single Condition

donnas = emp["First Name"] == "Donna"
emp[donnas]

If the use of multiple square brackets is confusing, we can assign the Boolean Series to an object and then pass it into the square brackets instead.

Pandas Basics

Filtering by a Single Condition

What if we want to extract a subset of employees who are not on the “Marketing” team?

non_marketing = emp["Team"] != "Marketing"  # != means "not equal to"
emp[ non_marketing ]

True denotes that the Team value for a given index is not “Marketing”, and False indicates the Team value is “Marketing”

Pandas Basics

Filtering by a Single Condition

What if we want to retrieve all the managers in the company? Managers have a value of True in the Mgmt variable.

emp[ emp["Mgmt"] ]

We could execute emp["Mgmt"] == True, but we do not need to.

Pandas Basics

Filtering by a Single Condition

high_earners = emp["Salary"] > 100000
emp[ high_earners ]

We can also use arithmetic operands to filter observations based on mathematical conditions.

Filtering by a Condition

Pandas Basics

Filtering by a Condition

sales = emp["Team"] == "Sales"
legal = emp["Team"] == "Legal"
fnce = emp["Team"] == "Finance"
emp[ sales | legal | fnce ]

We could provide three separate Boolean Series inside the square brackets and add the | symbol to declare OR criteria.

Pandas Basics

Filtering with the `isin()` method

star_teams = ["Sales", "Legal", "Finance"]
on_star_teams = emp["Team"].isin(star_teams)
emp[on_star_teams]

What if our next report asked for employees from 15 teams instead of three?
A better solution is the isin() method, which accepts an iterable (e.g., list, tuple, Series) and returns a Boolean Series.

Pandas Basics

Filtering by a Condition

When working with numbers or dates, we often want to extract values that fall within a range.
- E.g., Identify all employees with a salary between $90,000 and $100,000.

Pandas Basics

Filtering by a Condition

higher_than_90k = emp["Salary"] >= 90000
lower_than_100k = emp["Salary"] < 100000
emp[ higher_than_90k & lower_than_100k ]

We can create two Boolean Series, one to declare the lower bound and one to declare the upper bound.
Then we can use the & operator to mandate that both conditions are True:

Pandas Basics

Filtering with the `between()` method

between_90k_and_100k = emp["Salary"].between(90000, 100000)
emp[between_90k_and_100k]

A slightly cleaner solution is to use a method called between()
- It returns a Boolean Series where True denotes that a row’s value falls between the specified interval.
- The first argument, the lower bound, is inclusive, and the second argument, the upper bound, is exclusive.

Pandas Basics

Filtering with the `between()` method

name_starts_with_t = emp["First Name"].between("T", "U")
emp[name_starts_with_t]

We can also apply the between() method to string variables.

Pandas Basics

Filtering by a Condition with the `query()` method

# Below two are equivalent
emp.query("Salary >= 100000 & Team == 'Finance'")
emp.query("Salary >= 100000 and Team == 'Finance'")

The query() method filters observations using a concise, string-based query syntax.
The query() method is often more readable and more concise, especially for complex conditions, compared to traditional Boolean filtering.

Lecture 3

Pandas Basics

Pandas Basics

Learning Objectives

Sorting Methods

Pandas Basics

Pandas Basics

Sorting by a Single Variable with sort_values()

Pandas Basics

Sorting by a Single Variable with sort_values()

Pandas Basics

Method Chaining

Pandas Basics

Sorting by a Single Variable with nsmallest() and nlargest()

Pandas Basics

Sorting by a Single Variable with nsmallest() and nlargest()

Pandas Basics

Sorting by Multiple Variables with sort_values()

Pandas Basics

Sorting by Multiple Variables with sort_values()

Pandas Basics

Sorting by Multiple Variables with sort_values()

Pandas Basics

Sorting by Multiple Variables with sort_values()

Pandas Basics

Sorting by Row Index with sort_index()

Pandas Basics

Changing the Order of Variables with sort_index()

Setting a New Index

Pandas Basics

Setting a New Index

Pandas Basics

Setting a New Index with set_index()

Pandas Basics

Re-setting an Index with reset_index()

Locating Observations

Pandas Basics

Locating Observations/Values

Pandas Basics

Locating Observations by .loc[ Index Labels ]

Pandas Basics

Locating Observations by .loc[ Index Labels ]

Pandas Basics

Locating Observations by .loc[ Index Labels ]

Pandas Basics

Locating Observations by .iloc[ Index Positions ]

Pandas Basics

Locating Values

Pandas Basics

Locating Values by loc[Rows, Columns] or iloc[Rows, Columns]

Pandas Basics

Locating Values by loc[Rows, Columns] or iloc[Rows, Columns]

Renaming columns or rows

Pandas Basics

Renaming columns with nba.columns

Pandas Basics

Renaming columns with rename( columns = { "Existing One" : "New One" } )

Pandas Basics

Renaming rows with rename( index = { "Existing One" : "New One" } )

Mathematical & Vectorized Operations

Pandas Basics

Mathematical Operations

Pandas Basics

Mathematical Operations

Pandas Basics

Vectorized Operations

Converting Data Types with the astype() Method

Pandas Basics

Pandas Basics

Converting Data Types with the astype() method

Pandas Basics

Converting Data Types with the astype() method

Pandas Basics

Converting Data Types with the astype() method

Pandas Basics

Fill Missing Values with the fillna() method

Pandas Basics

Fill Missing Values with the fillna() method

Pandas Basics

Converting Data Types with the astype() method

Sorting by a Single Variable with `sort_values()`

Sorting by a Single Variable with `sort_values()`

Sorting by a Single Variable with `nsmallest()` and `nlargest()`

Sorting by a Single Variable with `nsmallest()` and `nlargest()`

Sorting by Multiple Variables with `sort_values()`

Sorting by Multiple Variables with `sort_values()`

Sorting by Multiple Variables with `sort_values()`

Sorting by Multiple Variables with `sort_values()`

Sorting by Row Index with `sort_index()`

Changing the Order of Variables with `sort_index()`

Setting a New Index with `set_index()`

Re-setting an Index with `reset_index()`

Locating Observations by `.loc[ Index Labels ]`

Locating Observations by `.loc[ Index Labels ]`

Locating Observations by `.loc[ Index Labels ]`

Locating Observations by `.iloc[ Index Positions ]`

Locating Values by `loc[Rows, Columns]` or `iloc[Rows, Columns]`

Locating Values by `loc[Rows, Columns]` or `iloc[Rows, Columns]`

Renaming columns with `nba.columns`

Renaming columns with `rename( columns = { "Existing One" : "New One" } )`

Renaming rows with `rename( index = { "Existing One" : "New One" } )`

Converting Data Types with the `astype()` Method

Converting Data Types with the `astype()` method

Converting Data Types with the `astype()` method

Converting Data Types with the `astype()` method

Fill Missing Values with the `fillna()` method

Fill Missing Values with the `fillna()` method

Converting Data Types with the `astype()` method

Converting Data Types with the `pd.to_datetime()` method

Filtering with the `isin()` method

Filtering with the `between()` method

Filtering with the `between()` method

Filtering by a Condition with the `query()` method