Lecture 9

pandas Basics - Mathematical & Vectorized Operations; Adding, Removing, & Renaming Variables; Data Types

Byeong-Hak Choe

SUNY Geneseo

February 17, 2025

`nba` DataFrame

Let’s read the nba.csv file as nba:

# Below is to import the pandas library as pd
import pandas as pd 

# Below is for an interactive display of DataFrame in Colab
from google.colab import data_table  
data_table.enable_dataframe_formatter()

# Below is to read nba.csv as nba DataFrame
nba = pd.read_csv("https://bcdanl.github.io/data/nba.csv",
                  parse_dates = ["Birthday"])

Mathematical & Vectorized Operations

Mathematical Operations

nba.max()
nba.min()

The max() method returns a Series with the maximum value from each variable.
The min() method returns a Series with the minimum value from each variable.

Mathematical Operations

nba.sum()
nba.mean()
nba.median()
nba.quantile(0.75) # 0 to 1
nba.std()

nba.sum(numeric_only = True)
nba.mean(numeric_only = True)
nba.median(numeric_only = True)
nba.quantile(0.75, numeric_only=True)
nba.std(numeric_only = True)

The sum()/mean()/median() method returns a Series with the sum/mean/median of the values in each variable.
The quantile() method returns a Series with the percentile value of the values in each variable (e.g., 25th, 75th, 90th percentile).
The std() method returns a Series with the standard deviation of the values in each variable.
To limit the operation to numeric volumes, we can pass True to the sum()/mean()/median()/std() method’s numeric_only parameter.

Vectorized Operations

nba["Salary_2x"] = nba["Salary"] + nba["Salary"]
nba["Name_w_Position"] = nba["Name"] + " (" + nba["Position"] + ")"
nba["Salary_minus_Mean"] = nba["Salary"] - nba["Salary"].mean()

pandas performs a vectorized operation on Series or a variable in DataFrame.
- This means an element-by-element operation.
- This enables us to apply functions and perform operations on the data efficiently, without the need for explicit loops.

Adding, Removing, and Renaming Variables

Adding and Removing Variables

Here we use [] to add variables:

nba['Salary_k'] = nba['Salary'] / 1000
nba['Salary_2x'] = nba['Salary'] + nba['Salary']
nba['Salary_3x'] = nba['Salary'] * 3

Removing Variables with `drop(columns = ... )`

We can use .drop(columns = ...) to drop variables:

nba.drop(columns = "Salary_k")
nba.drop(columns = ["Salary_2x", "Salary_3x"])

Renaming Variables with `nba.columns`

Do you recall the .columns attribute?

nba.columns

We can rename any or all of a DataFrame’s columns by assigning a list of new names to the attribute:

nba.columns = ["Team", "Position", "Date of Birth", "Income"]

Renaming Variables with `rename( columns = { "Existing One" : "New One" } )`

nba.rename( columns = { "Date of Birth": "Birthday" } )

The above rename() method renames the variable Date of Birth to Birthday.

Renaming rows with `rename( index = { "Existing One" : "New One" } )`

nba = nba.rename(
    index = { "LeBron James": "LeBron Raymone James" }
)

The above rename() method renames the observation LeBron James to LeBron Raymone James.

Converting Data Types with the `astype()` Method

Let’s read employment.csv as emp.

import pandas as pd
# Below is for an interactive display of DataFrame in Colab
from google.colab import data_table
data_table.enable_dataframe_formatter()

emp = pd.read_csv("https://bcdanl.github.io/data/employment.csv")

Converting Data Types with the `astype()` method

What values are in the Mgmt variable?

emp["Mgmt"].astype(bool)

The astype() method converts a Series’ values to a different data type.
- It can accept a single argument: the new data type.

Converting Data Types with the `astype()` method

emp["Mgmt"] = emp["Mgmt"].astype(bool)

The above code overwrites the Mgmt variable with our new Series of Booleans.

Converting Data Types with the `astype()` method

emp["Salary"].astype(int)

The above code tries to coerce the Salary variable’s values to integers with the astype() method.
- Pandas is unable to convert the NaN values to integers.

Fill Missing Values with the `fillna()` method

emp["Salary"].fillna(0)

The fillna() method replaces a Series’ missing values with the argument we pass in.
The above example provides a fill value of 0.
- Note that our choice of value can distort the data; 0 is passed solely for the sake of example.

Converting Data Types with the `astype()` method

emp["Salary"] = emp["Salary"].fillna(0).astype(int)

The above code overwrites the Salary variable with our new Series of integers.

Converting Data Types with the `astype()` method

emp["Gender"] = emp["Gender"].astype("category")

Pandas includes a special data type called a category,
- It is ideal for a variable consisting of a small number of unique values relative to its total number of values.
- E.g., gender, weekdays, blood types, planets, and income groups.

Converting Data Types with the `pd.to_datetime()` method

# Below two are equivalent:
emp["Start Date"] = pd.to_datetime(emp["Start Date"])
emp["Start Date"] = emp["Start Date"].astype('datetime64[ns]')

The pd.to_datetime() function is used to convert a Series, DataFrame, or a single variable of a DataFrame from its current data type into datetime format.

Converting Data Types with the `astype()` method

emp = pd.read_csv("https://bcdanl.github.io/data/employment.csv")

emp["Salary"] = emp["Salary"].fillna(0)
emp = emp.astype({'Mgmt': 'bool', 
                  'Salary': 'int',
                  'Gender': 'category',
                  'Start Date': 'datetime64[ns]',
                  'Team': 'category'})

We can provide a dictionary of variable-type pairs to astype().

Pandas Basics

Let’s do Question 1 in Classwork 6!

Lecture 9

nba DataFrame

Mathematical & Vectorized Operations

Mathematical Operations

Mathematical Operations

Vectorized Operations

Adding, Removing, and Renaming Variables

Adding and Removing Variables

Removing Variables with drop(columns = ... )

Renaming Variables with nba.columns

Renaming Variables with rename( columns = { "Existing One" : "New One" } )

Renaming rows with rename( index = { "Existing One" : "New One" } )

Converting Data Types with the astype() Method

Converting Data Types with the astype() Method

Converting Data Types with the astype() method

Converting Data Types with the astype() method

Converting Data Types with the astype() method

Fill Missing Values with the fillna() method

Converting Data Types with the astype() method

Converting Data Types with the astype() method

Converting Data Types with the pd.to_datetime() method

Converting Data Types with the astype() method

Pandas Basics

`nba` DataFrame

Removing Variables with `drop(columns = ... )`

Renaming Variables with `nba.columns`

Renaming Variables with `rename( columns = { "Existing One" : "New One" } )`

Renaming rows with `rename( index = { "Existing One" : "New One" } )`

Converting Data Types with the `astype()` Method

Converting Data Types with the `astype()` Method

Converting Data Types with the `astype()` method

Converting Data Types with the `astype()` method

Converting Data Types with the `astype()` method

Fill Missing Values with the `fillna()` method

Converting Data Types with the `astype()` method

Converting Data Types with the `astype()` method

Converting Data Types with the `pd.to_datetime()` method

Converting Data Types with the `astype()` method