Lecture 3

Pandas Basics - Sorting & Indexing Data; Mathematical & Vectorized Operations

Byeong-Hak Choe

SUNY Geneseo

February 20, 2024

Pandas Basics

Pandas Basics

Learning Objectives

  • Loading DataFrame
  • Getting a Summary of DataFrame
  • Selecting Variables in DataFrame
  • Counting Values in DataFrame
  • Sorting DataFrame
  • Indexing DataFrame
  • Locating Observations and Values in DataFrame
  • Renaming Variables in DataFrame
  • Mathematical & Vectorized Operations with DataFrame
  • Converting Data Types in DataFrame
  • Filtering Observations in DataFrame

Sorting Methods

Pandas Basics

  • Let’s read nba.csv as nba.
# Below is to import the pandas library as pd
import pandas as pd 

# Below is for an interactive display of DataFrame in Colab
from google.colab import data_table  
data_table.enable_dataframe_formatter()

# Below is to read nba.csv as nba DataFrame
nba = pd.read_csv("https://bcdanl.github.io/data/nba.csv",
                  parse_dates = ["Birthday"])
                  
# Below is to view the nba DataFrame and to get a summary of it
nba
nba.info()
nba.describe( include="all" )

Pandas Basics

Sorting by a Single Variable with sort_values()

# The two lines below are equivalent
nba.sort_values(["Name"])
nba.sort_values(by = ["Name"])
  • The sort_values() method’s first parameter, by, accepts the variables that pandas should use to sort the DataFrame.

Pandas Basics

Sorting by a Single Variable with sort_values()

nba.sort_values(["Name"], ascending = False)
  • The sort_values() method’s ascending parameter determines the sort order.
    • ascending has a default argument of True.
    • By default, pandas will sort:
      • A variable of numbers in increasing order;
      • A variable of strings in alphabetical order;
      • A variable of datetimes in chronological order.

Pandas Basics

Method Chaining

(
    nba
    .sort_values(['Salary'])
    .head(5)
)
  • DataFrame has various methods that modify the existing DataFrame.
  • Method Chaining: We can call methods sequentially without the need to store intermediate results.

Pandas Basics

Sorting by a Single Variable with nsmallest() and nlargest()

nba.nsmallest(5, 'Salary')
nba.nlargest(5, 'Salary')
  • nsmallest() are useful to get the first n observations ordered by a variable in ascending order.

  • nlargest() are useful to get the first n observations ordered by a variable in descending order.

Pandas Basics

Sorting by a Single Variable with nsmallest() and nlargest()

nba.nsmallest(4, 'Salary', keep = "all")
nba.nlargest(4, 'Salary', keep = "all")
  • keep = "all" keeps all duplicates, even it means selecting more than n observations.

Pandas Basics

Sorting by Multiple Variables with sort_values()

nba.sort_values(["Team", "Name"])
nba.sort_values(by = ["Team", "Name"])
  • We can sort a DataFrame by multiple columns by passing a list to the by parameter.

Pandas Basics

Sorting by Multiple Variables with sort_values()

nba.sort_values(by = ["Team", "Name"], 
                ascending = False)
  • We can pass a single Boolean to the ascending parameter to apply the same sort order to each variable.

Pandas Basics

Sorting by Multiple Variables with sort_values()

nba.sort_values(by = ["Team", "Name"], 
                ascending = [False, True])
  • If we want to sort each variable in a different order, we can pass a Boolean list to the ascending parameter.

Pandas Basics

Sorting by Multiple Variables with sort_values()

Q. Which players on each team are paid the most?

Pandas Basics

Sorting by Row Index with sort_index()

# Below lines are equivalent
nba.sort_index()
nba.sort_index(ascending = True)
nba.sort_index(ascending = False)
  • How can we return it to its original form of DataFrame?

  • Our nba DataFrame still has its numeric index labels.

  • sort_index() sorts observations by their index labels (row names).

Pandas Basics

Changing the Order of Variables with sort_index()

# The two lines below are equivalent
nba.sort_index(axis = "columns")
nba.sort_index(axis = 1)
  • The sort_index() method can also be used to change the order of variables in an alphabetical order.
    • We need to add an axis parameter and pass it an argument of "columns" or 1.

Setting a New Index

Pandas Basics

Setting a New Index

  • We can use the set_index() method when we want to change the current index of a DataFrame to one or more existing columns.
    • This is particularly useful when:
      • We have a column that uniquely identifies each observation (e.g., ID);
      • We sometimes want to use an unique identifier as the index for more efficient data wrangling.

Pandas Basics

Setting a New Index with set_index()

# The two lines below are equivalent
nba.set_index(keys = "Name")
nba.set_index("Name")
  • The set_index() method returns a new DataFrame with a given column set as the index.
    • Its first parameter, keys, accepts the column name.

Pandas Basics

Re-setting an Index with reset_index()

nba2 = nba.set_index("Name")
nba2.reset_index(inplace=True)    # Useful for the method chaining
  • We use the reset_index() method:
    • When we want to convert the index back into a DataFrame column;
    • When we need to reset the index to the default integer index.
  • Note: With inplace=True, the operation alters the original DataFrame directly.

Locating Observations

Pandas Basics

Locating Observations/Values

  • We can extract observations, variables, and values from a DataFrame by using the loc[] and iloc[] accessors.

    • These accessors work well when we know the index labels and positions of the observations/variables we want to target.

Pandas Basics

Locating Observations by .loc[ Index Labels ]

  • Let’s consider the nba with the Name index.
# The two lines below are equivalent
nba = nba.set_index("Name")
nba.set_index("Name", inplace = True)
  • Below extracts observations:
nba.loc[ "LeBron James" ]
nba.loc[ ["Kawhi Leonard", "Paul George"] ]
  • The .loc attribute extracts an observation by index label (row name).

Pandas Basics

Locating Observations by .loc[ Index Labels ]

(
    nba
    .sort_index()
    .loc["Otto Porter":"Patrick Beverley"]
)
  • What is the above code doing?
    • Note: Both the starting value and the ending value are inclusive.

Pandas Basics

Locating Observations by .loc[ Index Labels ]

(
    nba
    .sort_index()
    .loc["Zach Collins":]
)
(
    nba
    .sort_index()
    .loc[:"Al Horford"]
)
  • We can use loc[:] to pull rows:
    • From the middle of the DataFrame to its end;
    • From the beginning of the DataFrame to a specific index label.

Pandas Basics

Locating Observations by .iloc[ Index Positions ]

nba.iloc[ 300 ]
nba.iloc[ [100, 200, 300, 400] ]
nba.iloc[400:404]
nba.iloc[:2]
nba.iloc[447:]
nba.iloc[-10:-6]
nba.iloc[0:10:2] # every other rows
  • The .iloc (index location) attribute locates rows by index position.
    • This can be helpful when the position of rows has significance in our data set.
    • We pass integers.
  • The .iloc[:] is similar to the slicing syntax with strings/lists.
    • The end value is NOT inclusive.

Pandas Basics

Let’s do Questions 1-4 in Part 1 of Classwork 3!

Locating Values

Pandas Basics

Locating Values by loc[Rows, Columns] or iloc[Rows, Columns]

nba.loc[
    "LeBron James",
    "Team"
]

nba.loc[
     "James Harden", 
      ["Position", "Birthday"] 
]
nba.loc[
    ["Russell Westbrook", "Anthony Davis"],
     ["Team", "Salary"]
]

nba.loc[
    "Joel Embiid", 
    "Position":"Salary"
]
  • Both the .loc and .iloc attributes accept a second argument representing the column(s) to extract.
    • If we are using .loc, we have to provide the column names.

Pandas Basics

Locating Values by loc[Rows, Columns] or iloc[Rows, Columns]

nba.iloc[
    57, 
    3
]

nba.iloc[
    100:104, 
    :3
]
  • Both the .loc and .iloc attributes accept a second argument representing the column(s) to extract.
    • If we are using .iloc, we have to provide the column position.

Renaming columns or rows

Pandas Basics

Renaming columns with nba.columns

  • Do you recall the .columns attribute?
nba.columns
  • We can rename any or all of a DataFrame’s columns by assigning a list of new names to the attribute:
nba.columns = ["Team", "Position", "Date of Birth", "Income"]

Pandas Basics

Renaming columns with rename( columns = { "Existing One" : "New One" } )

nba.rename( columns = { "Date of Birth": "Birthday" } )
  • The above rename() method renames the variable Date of Birth to Birthday.

Pandas Basics

Renaming rows with rename( index = { "Existing One" : "New One" } )

nba = nba.rename(
    index = { "LeBron James": "LeBron Raymone James" }
)
  • The above rename() method renames the observation LeBron James to LeBron Raymone James.

Mathematical & Vectorized Operations

Pandas Basics

Mathematical Operations

nba.max()
nba.min()
  • The max() method returns a Series with the maximum value from each variable.
  • The min() method returns a Series with the minimum value from each variable.

Pandas Basics

Mathematical Operations

nba.sum()
nba.mean()
nba.median()
nba.quantile(0.75) # 0 to 1
nba.std()
nba.sum(numeric_only = True)
nba.mean(numeric_only = True)
nba.median(numeric_only = True)
nba.quantile(0.75, numeric_only=True)
nba.std(numeric_only = True)
  • The sum()/mean()/median() method returns a Series with the sum/mean/median of the values in each variable.
  • The quantile() method returns a Series with the percentile value of the values in each variable (e.g., 25th, 75th, 90th percentile).
  • The std() method returns a Series with the standard deviation of the values in each variable.
  • To limit the operation to numeric volumes, we can pass True to the sum()/mean()/median()/std() method’s numeric_only parameter.

Pandas Basics

Vectorized Operations

nba["Salary_2x"] = nba["Salary"] + nba["Salary"]
nba["Name_w_Position"] = nba["Name"] + " (" + nba["Position"] + ")"
nba["Salary_minus_Mean"] = nba["Salary"] - nba["Salary"].mean()
  • pandas performs a vectorized operation on Series or a variable in DataFrame.
    • This means an element-by-element operation.
    • This enables us to apply functions and perform operations on the data efficiently, without the need for explicit loops.

Converting Data Types with the astype() Method

Pandas Basics

  • Let’s read employment.csv as emp.
emp = pd.read_csv("https://bcdanl.github.io/data/employment.csv")
  • What values are in the Mgmt variable?

Pandas Basics

Converting Data Types with the astype() method

emp["Mgmt"].astype(bool)
  • The astype() method converts a Series’ values to a different data type.
    • It accepts a single argument: the new data type.
    • We can pass either the data type or a string with its name.

Pandas Basics

Converting Data Types with the astype() method

emp["Mgmt"] = emp["Mgmt"].astype(bool)
  • The above code overwrites the Mgmt variable with our new Series of Booleans.

Pandas Basics

Converting Data Types with the astype() method

emp["Salary"].astype(int)
  • The above code tries to coerce the Salary variable’s values to integers with the astype() method.
    • Pandas is unable to convert the NaN values to integers.

Pandas Basics

Fill Missing Values with the fillna() method

emp["Salary"].fillna(0)
  • The fillna() method replaces a Series’ missing values with the argument we pass in.
  • The above example provides a fill value of 0.
    • Note that your choice of value can distort the data; 0 is passed solely for the sake of example.

Pandas Basics

Fill Missing Values with the fillna() method

emp["Salary"] = emp["Salary"].fillna(0).astype(int)
  • The above code overwrites the Salary variable with our new Series of integers.

Pandas Basics

Converting Data Types with the astype() method

emp["Gender"] = emp["Gender"].astype("category")

Pandas includes a special data type called a category, - It is ideal for a variable consisting of a small number of unique values relative to its total size. - E.g., gender, weekdays, blood types, planets, and income groups.

Pandas Basics

Converting Data Types with the pd.to_datetime() method

emp["Start Date"] = pd.to_datetime(emp["Start Date"])
  • The pd.to_datetime() function is used to convert a Series, DataFrame, or a single variable of a DataFrame from its current data type into datetime format.

Pandas Basics

Let’s do Question 1 in Part 2 of Classwork 3!

Filtering by a Single Condition

Pandas Basics

Filtering by a Condition

  • We may often not know the index labels and positions of the observations we want to target.

  • We may want to target observations not by an index label but by a Boolean condition.

Pandas Basics

Filtering by a Single Condition

emp["First Name"] == "Donna"
  • To compare every value in Series with a constant value, we place the Series on one side of the equality operator (==) and the value on the other.
    • Series == value
  • The above example compares each First Name value with “Donna”.

Pandas Basics

Filtering by a Single Condition

emp[ emp["First Name"] == "Donna" ]
  • To filter observations, we provide the Boolean Series between square brackets following the DataFrame.
    • DataFrame[ Boolean_Series ]

Pandas Basics

Filtering by a Single Condition

donnas = emp["First Name"] == "Donna"
emp[donnas]
  • If the use of multiple square brackets is confusing, we can assign the Boolean Series to an object and then pass it into the square brackets instead.

Pandas Basics

Filtering by a Single Condition

What if we want to extract a subset of employees who are not on the “Marketing” team?

non_marketing = emp["Team"] != "Marketing"  # != means "not equal to"
emp[ non_marketing ]
  • True denotes that the Team value for a given index is not “Marketing”, and False indicates the Team value is “Marketing

Pandas Basics

Filtering by a Single Condition

What if we want to retrieve all the managers in the company? Managers have a value of True in the Mgmt variable.

emp[ emp["Mgmt"] ]
  • We could execute emp["Mgmt"] == True, but we do not need to.

Pandas Basics

Filtering by a Single Condition

high_earners = emp["Salary"] > 100000
emp[ high_earners ]
  • We can also use arithmetic operands to filter observations based on mathematical conditions.

Filtering by a Condition

Pandas Basics

Filtering by a Condition

sales = emp["Team"] == "Sales"
legal = emp["Team"] == "Legal"
fnce = emp["Team"] == "Finance"
emp[ sales | legal | fnce ]
  • We could provide three separate Boolean Series inside the square brackets and add the | symbol to declare OR criteria.

Pandas Basics

Filtering with the isin() method

star_teams = ["Sales", "Legal", "Finance"]
on_star_teams = emp["Team"].isin(star_teams)
emp[on_star_teams]
  • What if our next report asked for employees from 15 teams instead of three?

  • A better solution is the isin() method, which accepts an iterable (e.g., list, tuple, Series) and returns a Boolean Series.

Pandas Basics

Filtering by a Condition

  • When working with numbers or dates, we often want to extract values that fall within a range.
    • E.g., Identify all employees with a salary between $90,000 and $100,000.

Pandas Basics

Filtering by a Condition

higher_than_90k = emp["Salary"] >= 90000
lower_than_100k = emp["Salary"] < 100000
emp[ higher_than_90k & lower_than_100k ]
  • We can create two Boolean Series, one to declare the lower bound and one to declare the upper bound.

  • Then we can use the & operator to mandate that both conditions are True:

Pandas Basics

Filtering with the between() method

between_90k_and_100k = emp["Salary"].between(90000, 100000)
emp[between_90k_and_100k]
  • A slightly cleaner solution is to use a method called between()
    • It returns a Boolean Series where True denotes that a row’s value falls between the specified interval.
    • The first argument, the lower bound, is inclusive, and the second argument, the upper bound, is exclusive.

Pandas Basics

Filtering with the between() method

name_starts_with_t = emp["First Name"].between("T", "U")
emp[name_starts_with_t]
  • We can also apply the between() method to string variables.

Pandas Basics

Filtering by a Condition with the query() method

# Below two are equivalent
emp.query("Salary >= 100000 & Team == 'Finance'")
emp.query("Salary >= 100000 and Team == 'Finance'")
  • The query() method filters observations using a concise, string-based query syntax.

  • The query() method is often more readable and more concise, especially for complex conditions, compared to traditional Boolean filtering.

Pandas Basics

Let’s do Questions 2-6 in Part 2 of Classwork 3!

Pandas Basics

Referenes

  • Pandas in Action, Boris Paskhaver (Author), 2021, Manning