Lecture 8

pandas Basics - Sorting Methods; Setting a New Index; Locating Observations/Values

Byeong-Hak Choe

SUNY Geneseo

February 14, 2025

`nba` DataFrame

Let’s read the nba.csv file as nba:

# Below is to import the pandas library as pd
import pandas as pd 

# Below is for an interactive display of DataFrame in Colab
from google.colab import data_table  
data_table.enable_dataframe_formatter()

# Below is to read nba.csv as nba DataFrame
nba = pd.read_csv("https://bcdanl.github.io/data/nba.csv",
                  parse_dates = ["Birthday"])

Sorting Methods

Sorting by a Single Variable with `sort_values()`

# The two lines below are equivalent
nba.sort_values(["Name"])
nba.sort_values(by = ["Name"])

The sort_values() method’s first parameter, by, accepts the variables that pandas should use to sort observations.

Sorting by a Single Variable with `sort_values()`

nba.sort_values(["Name"], ascending = False)

The sort_values() method’s ascending parameter determines the sort order.
- ascending has a default argument of True.
- By default, pandas will sort:
  - A variable of numbers in increasing order;
  - A variable of strings in alphabetical order;
  - A variable of datetimes in chronological order.

Pandas Basics

Method Chaining

(
    nba
    .sort_values(['Salary'])
    .head(5)
)

DataFrame has various methods that modify the existing DataFrame.
Method Chaining: We can call methods sequentially without the need to store intermediate results.

Sorting by a Single Variable with `nsmallest()` and `nlargest()`

nba.nsmallest(5, 'Salary')
nba.nlargest(5, 'Salary')

nsmallest() are useful to get the first n observations ordered by a variable in ascending order.
nlargest() are useful to get the first n observations ordered by a variable in descending order.

Sorting by a Single Variable with `nsmallest()` and `nlargest()`

nba.nsmallest(4, 'Salary', keep = "all")
nba.nlargest(4, 'Salary', keep = "all")

keep = "all" keeps all duplicates, even it means selecting more than n observations.

Sorting by Multiple Variables with `sort_values()`

nba.sort_values(["Team", "Name"])
nba.sort_values(by = ["Team", "Name"])

We can sort a DataFrame by multiple columns by passing a list to the by parameter.

Sorting by Multiple Variables with `sort_values()`

nba.sort_values(by = ["Team", "Name"], 
                ascending = False)

We can pass a single Boolean to the ascending parameter to apply the same sort order to each variable.

Sorting by Multiple Variables with `sort_values()`

nba.sort_values(by = ["Team", "Name"], 
                ascending = [False, True])

If we want to sort each variable in a different order, we can pass a Boolean list to the ascending parameter.

Sorting by Multiple Variables with `sort_values()`

Q. Which players on each team are paid the most?

Sorting by Row Index with `sort_index()`

# Below lines are equivalent
nba.sort_index()
nba.sort_index(ascending = True)

nba.sort_index(ascending = False)

If we assigned nba to nba DataFrame sorted by “Name”, how can we return it to its original form of DataFrame?
- Our nba DataFrame still has its numeric index labels.
- sort_index() sorts observations by their index labels (row names).

Relocating Variables with `sort_index()`

# The two lines below are equivalent
nba.sort_index(axis = "columns")
nba.sort_index(axis = 1)

The sort_index() method can also be used to change the order of variables in an alphabetical order.
- We need to add an axis parameter and pass it an argument of "columns" or 1.

Setting a New Index

We can use the set_index() method when we want to change the current index of a DataFrame to one or more existing columns.
- This is particularly useful when:
  - We have a column that uniquely identifies each observation (e.g., ID);
  - We sometimes want to use an unique identifier as the index for more efficient data wrangling.

Setting a New Index with `set_index()`

# The two lines below are equivalent
nba.set_index(keys = "Name")
nba.set_index("Name")

The set_index() method returns a new DataFrame with a given column set as the index.
- Its first parameter, keys, accepts the column name.

Re-setting an Index with `reset_index()`

nba2 = nba.set_index("Name")
nba2.reset_index(inplace=True)    # Useful for the method chaining

We use the reset_index() method:
- When we want to convert the index back into a DataFrame column;
- When we need to reset the index to the default integer index.
Note: With inplace=True, the operation alters the original DataFrame directly.

Locating Observations

Let’s read nba.csv as nba.

import pandas as pd
# Below is for an interactive display of DataFrame in Colab
from google.colab import data_table
data_table.enable_dataframe_formatter()

nba = pd.read_csv("https://bcdanl.github.io/data/nba.csv")

Locating Observations/Values

We can extract observations, variables, and values from a DataFrame by using the loc[] and iloc[] accessors.
- These accessors work well when we know the index labels and positions of the observations/variables we want to target.

Locating Observations by `.loc[Index Labels]`

Let’s consider the nba with the Name index.

# The two lines below are equivalent
nba = nba.set_index("Name")
nba.set_index("Name", inplace = True)

Below extracts observations:

nba.loc[ "LeBron James" ]
nba.loc[ ["Kawhi Leonard", "Paul George"] ]

The .loc attribute extracts an observation by index label (row name).

Locating Observations by `.loc[Index Labels]`

(
    nba
    .sort_index()
    .loc["Otto Porter":"Patrick Beverley"]
)

What is the above code doing?
- Note: Both the starting value and the ending value are inclusive.

Locating Observations by `.loc[Index Labels]`

(
    nba
    .sort_index()
    .loc["Zach Collins":]
)

(
    nba
    .sort_index()
    .loc[:"Al Horford"]
)

We can use loc[:] to pull rows:
- From the middle of the DataFrame to its end;
- From the beginning of the DataFrame to a specific index label.

Locating Observations by `.iloc[Index Positions]`

nba.iloc[ 300 ]
nba.iloc[ [100, 200, 300, 400] ]

nba.iloc[400:404]
nba.iloc[:2]
nba.iloc[447:]
nba.iloc[-10:-6]
nba.iloc[0:10:2] # every other rows

The .iloc (index location) attribute locates rows by index position.
- This can be helpful when the position of rows has significance in our data set.
- We pass integers.
The .iloc[:] is similar to the slicing syntax with strings/lists.
- The end value is NOT inclusive.

Pandas Basics

Let’s do Questions 4-7 in Classwork 5!

Locating Values

Locating Values by `loc[Rows, Columns]` or `iloc[Rows, Columns]`

nba.loc[
    "LeBron James",
    "Team"
]

nba.loc[
     "James Harden", 
      ["Position", "Birthday"] 
]

nba.loc[
    ["Russell Westbrook", "Anthony Davis"],
     ["Team", "Salary"]
]

nba.loc[
    "Joel Embiid", 
    "Position":"Salary"
]

Both the .loc and .iloc attributes accept a second argument representing the column(s) to extract.
- If we are using .loc, we have to provide the column names.

Locating Values by `loc[Rows, Columns]` or `iloc[Rows, Columns]`

nba.iloc[
    57, 
    3
]

nba.iloc[
    100:104, 
    :3
]

Both the .loc and .iloc attributes accept a second argument representing the column(s) to extract.
- If we are using .iloc, we have to provide the column position.

Lecture 8

nba DataFrame

Sorting Methods

Sorting by a Single Variable with sort_values()

Sorting by a Single Variable with sort_values()

Pandas Basics

Method Chaining

Sorting by a Single Variable with nsmallest() and nlargest()

Sorting by a Single Variable with nsmallest() and nlargest()

Sorting by Multiple Variables with sort_values()

Sorting by Multiple Variables with sort_values()

Sorting by Multiple Variables with sort_values()

Sorting by Multiple Variables with sort_values()

Sorting by Row Index with sort_index()

Relocating Variables with sort_index()

Setting a New Index

Setting a New Index

Setting a New Index with set_index()

Re-setting an Index with reset_index()

Locating Observations

Locating Observations

Locating Observations/Values

Locating Observations by .loc[Index Labels]

Locating Observations by .loc[Index Labels]

Locating Observations by .loc[Index Labels]

Locating Observations by .iloc[Index Positions]

Pandas Basics

Locating Values

Locating Values by loc[Rows, Columns] or iloc[Rows, Columns]

Locating Values by loc[Rows, Columns] or iloc[Rows, Columns]

`nba` DataFrame

Sorting by a Single Variable with `sort_values()`

Sorting by a Single Variable with `sort_values()`

Sorting by a Single Variable with `nsmallest()` and `nlargest()`

Sorting by a Single Variable with `nsmallest()` and `nlargest()`

Sorting by Multiple Variables with `sort_values()`

Sorting by Multiple Variables with `sort_values()`

Sorting by Multiple Variables with `sort_values()`

Sorting by Multiple Variables with `sort_values()`

Sorting by Row Index with `sort_index()`

Relocating Variables with `sort_index()`

Setting a New Index with `set_index()`

Re-setting an Index with `reset_index()`

Locating Observations by `.loc[Index Labels]`

Locating Observations by `.loc[Index Labels]`

Locating Observations by `.loc[Index Labels]`

Locating Observations by `.iloc[Index Positions]`

Locating Values by `loc[Rows, Columns]` or `iloc[Rows, Columns]`

Locating Values by `loc[Rows, Columns]` or `iloc[Rows, Columns]`