Lecture 8

pandas Basics - Sorting Methods; Setting a New Index; Locating Observations/Values

Byeong-Hak Choe

SUNY Geneseo

February 14, 2025

nba DataFrame

  • Let’s read the nba.csv file as nba:
# Below is to import the pandas library as pd
import pandas as pd 

# Below is for an interactive display of DataFrame in Colab
from google.colab import data_table  
data_table.enable_dataframe_formatter()

# Below is to read nba.csv as nba DataFrame
nba = pd.read_csv("https://bcdanl.github.io/data/nba.csv",
                  parse_dates = ["Birthday"])

Sorting Methods

Sorting by a Single Variable with sort_values()

# The two lines below are equivalent
nba.sort_values(["Name"])
nba.sort_values(by = ["Name"])
  • The sort_values() method’s first parameter, by, accepts the variables that pandas should use to sort observations.

Sorting by a Single Variable with sort_values()

nba.sort_values(["Name"], ascending = False)
  • The sort_values() method’s ascending parameter determines the sort order.
    • ascending has a default argument of True.
    • By default, pandas will sort:
      • A variable of numbers in increasing order;
      • A variable of strings in alphabetical order;
      • A variable of datetimes in chronological order.

Pandas Basics

Method Chaining

(
    nba
    .sort_values(['Salary'])
    .head(5)
)
  • DataFrame has various methods that modify the existing DataFrame.
  • Method Chaining: We can call methods sequentially without the need to store intermediate results.

Sorting by a Single Variable with nsmallest() and nlargest()

nba.nsmallest(5, 'Salary')
nba.nlargest(5, 'Salary')
  • nsmallest() are useful to get the first n observations ordered by a variable in ascending order.

  • nlargest() are useful to get the first n observations ordered by a variable in descending order.

Sorting by a Single Variable with nsmallest() and nlargest()

nba.nsmallest(4, 'Salary', keep = "all")
nba.nlargest(4, 'Salary', keep = "all")
  • keep = "all" keeps all duplicates, even it means selecting more than n observations.

Sorting by Multiple Variables with sort_values()

nba.sort_values(["Team", "Name"])
nba.sort_values(by = ["Team", "Name"])
  • We can sort a DataFrame by multiple columns by passing a list to the by parameter.

Sorting by Multiple Variables with sort_values()

nba.sort_values(by = ["Team", "Name"], 
                ascending = False)
  • We can pass a single Boolean to the ascending parameter to apply the same sort order to each variable.

Sorting by Multiple Variables with sort_values()

nba.sort_values(by = ["Team", "Name"], 
                ascending = [False, True])
  • If we want to sort each variable in a different order, we can pass a Boolean list to the ascending parameter.

Sorting by Multiple Variables with sort_values()

Q. Which players on each team are paid the most?

Sorting by Row Index with sort_index()

# Below lines are equivalent
nba.sort_index()
nba.sort_index(ascending = True)
nba.sort_index(ascending = False)
  • If we assigned nba to nba DataFrame sorted by “Name”, how can we return it to its original form of DataFrame?
    • Our nba DataFrame still has its numeric index labels.
    • sort_index() sorts observations by their index labels (row names).

Relocating Variables with sort_index()

# The two lines below are equivalent
nba.sort_index(axis = "columns")
nba.sort_index(axis = 1)
  • The sort_index() method can also be used to change the order of variables in an alphabetical order.
    • We need to add an axis parameter and pass it an argument of "columns" or 1.

Setting a New Index

Setting a New Index

  • We can use the set_index() method when we want to change the current index of a DataFrame to one or more existing columns.
    • This is particularly useful when:
      • We have a column that uniquely identifies each observation (e.g., ID);
      • We sometimes want to use an unique identifier as the index for more efficient data wrangling.

Setting a New Index with set_index()

# The two lines below are equivalent
nba.set_index(keys = "Name")
nba.set_index("Name")
  • The set_index() method returns a new DataFrame with a given column set as the index.
    • Its first parameter, keys, accepts the column name.

Re-setting an Index with reset_index()

nba2 = nba.set_index("Name")
nba2.reset_index(inplace=True)    # Useful for the method chaining
  • We use the reset_index() method:
    • When we want to convert the index back into a DataFrame column;
    • When we need to reset the index to the default integer index.
  • Note: With inplace=True, the operation alters the original DataFrame directly.

Locating Observations

Locating Observations

  • Let’s read nba.csv as nba.
import pandas as pd
# Below is for an interactive display of DataFrame in Colab
from google.colab import data_table
data_table.enable_dataframe_formatter()

nba = pd.read_csv("https://bcdanl.github.io/data/nba.csv")

Locating Observations/Values

  • We can extract observations, variables, and values from a DataFrame by using the loc[] and iloc[] accessors.

    • These accessors work well when we know the index labels and positions of the observations/variables we want to target.

Locating Observations by .loc[Index Labels]

  • Let’s consider the nba with the Name index.
# The two lines below are equivalent
nba = nba.set_index("Name")
nba.set_index("Name", inplace = True)
  • Below extracts observations:
nba.loc[ "LeBron James" ]
nba.loc[ ["Kawhi Leonard", "Paul George"] ]
  • The .loc attribute extracts an observation by index label (row name).

Locating Observations by .loc[Index Labels]

(
    nba
    .sort_index()
    .loc["Otto Porter":"Patrick Beverley"]
)
  • What is the above code doing?
    • Note: Both the starting value and the ending value are inclusive.

Locating Observations by .loc[Index Labels]

(
    nba
    .sort_index()
    .loc["Zach Collins":]
)
(
    nba
    .sort_index()
    .loc[:"Al Horford"]
)
  • We can use loc[:] to pull rows:
    • From the middle of the DataFrame to its end;
    • From the beginning of the DataFrame to a specific index label.

Locating Observations by .iloc[Index Positions]

nba.iloc[ 300 ]
nba.iloc[ [100, 200, 300, 400] ]
nba.iloc[400:404]
nba.iloc[:2]
nba.iloc[447:]
nba.iloc[-10:-6]
nba.iloc[0:10:2] # every other rows
  • The .iloc (index location) attribute locates rows by index position.
    • This can be helpful when the position of rows has significance in our data set.
    • We pass integers.
  • The .iloc[:] is similar to the slicing syntax with strings/lists.
    • The end value is NOT inclusive.

Pandas Basics

Let’s do Questions 4-7 in Classwork 5!

Locating Values

Locating Values by loc[Rows, Columns] or iloc[Rows, Columns]

nba.loc[
    "LeBron James",
    "Team"
]

nba.loc[
     "James Harden", 
      ["Position", "Birthday"] 
]
nba.loc[
    ["Russell Westbrook", "Anthony Davis"],
     ["Team", "Salary"]
]

nba.loc[
    "Joel Embiid", 
    "Position":"Salary"
]
  • Both the .loc and .iloc attributes accept a second argument representing the column(s) to extract.
    • If we are using .loc, we have to provide the column names.

Locating Values by loc[Rows, Columns] or iloc[Rows, Columns]

nba.iloc[
    57, 
    3
]

nba.iloc[
    100:104, 
    :3
]
  • Both the .loc and .iloc attributes accept a second argument representing the column(s) to extract.
    • If we are using .iloc, we have to provide the column position.