Lecture 9

pandas Basics - Mathematical & Vectorized Operations; Adding, Removing, & Renaming Variables; Data Types

Byeong-Hak Choe

SUNY Geneseo

February 17, 2025

nba DataFrame

  • Let’s read the nba.csv file as nba:
# Below is to import the pandas library as pd
import pandas as pd 

# Below is for an interactive display of DataFrame in Colab
from google.colab import data_table  
data_table.enable_dataframe_formatter()

# Below is to read nba.csv as nba DataFrame
nba = pd.read_csv("https://bcdanl.github.io/data/nba.csv",
                  parse_dates = ["Birthday"])

Mathematical & Vectorized Operations

Mathematical Operations

nba.max()
nba.min()
  • The max() method returns a Series with the maximum value from each variable.
  • The min() method returns a Series with the minimum value from each variable.

Mathematical Operations

nba.sum()
nba.mean()
nba.median()
nba.quantile(0.75) # 0 to 1
nba.std()
nba.sum(numeric_only = True)
nba.mean(numeric_only = True)
nba.median(numeric_only = True)
nba.quantile(0.75, numeric_only=True)
nba.std(numeric_only = True)
  • The sum()/mean()/median() method returns a Series with the sum/mean/median of the values in each variable.
  • The quantile() method returns a Series with the percentile value of the values in each variable (e.g., 25th, 75th, 90th percentile).
  • The std() method returns a Series with the standard deviation of the values in each variable.
  • To limit the operation to numeric volumes, we can pass True to the sum()/mean()/median()/std() method’s numeric_only parameter.

Vectorized Operations

nba["Salary_2x"] = nba["Salary"] + nba["Salary"]
nba["Name_w_Position"] = nba["Name"] + " (" + nba["Position"] + ")"
nba["Salary_minus_Mean"] = nba["Salary"] - nba["Salary"].mean()
  • pandas performs a vectorized operation on Series or a variable in DataFrame.
    • This means an element-by-element operation.
    • This enables us to apply functions and perform operations on the data efficiently, without the need for explicit loops.

Adding, Removing, and Renaming Variables

Adding and Removing Variables

  • Here we use [] to add variables:
nba['Salary_k'] = nba['Salary'] / 1000
nba['Salary_2x'] = nba['Salary'] + nba['Salary']
nba['Salary_3x'] = nba['Salary'] * 3

Removing Variables with drop(columns = ... )

  • We can use .drop(columns = ...) to drop variables:
nba.drop(columns = "Salary_k")
nba.drop(columns = ["Salary_2x", "Salary_3x"])

Renaming Variables with nba.columns

  • Do you recall the .columns attribute?
nba.columns
  • We can rename any or all of a DataFrame’s columns by assigning a list of new names to the attribute:
nba.columns = ["Team", "Position", "Date of Birth", "Income"]

Renaming Variables with rename( columns = { "Existing One" : "New One" } )

nba.rename( columns = { "Date of Birth": "Birthday" } )
  • The above rename() method renames the variable Date of Birth to Birthday.

Renaming rows with rename( index = { "Existing One" : "New One" } )

nba = nba.rename(
    index = { "LeBron James": "LeBron Raymone James" }
)
  • The above rename() method renames the observation LeBron James to LeBron Raymone James.

Converting Data Types with the astype() Method

Converting Data Types with the astype() Method

  • Let’s read employment.csv as emp.
import pandas as pd
# Below is for an interactive display of DataFrame in Colab
from google.colab import data_table
data_table.enable_dataframe_formatter()

emp = pd.read_csv("https://bcdanl.github.io/data/employment.csv")

Converting Data Types with the astype() method

  • What values are in the Mgmt variable?
emp["Mgmt"].astype(bool)
  • The astype() method converts a Series’ values to a different data type.
    • It can accept a single argument: the new data type.

Converting Data Types with the astype() method

emp["Mgmt"] = emp["Mgmt"].astype(bool)
  • The above code overwrites the Mgmt variable with our new Series of Booleans.

Converting Data Types with the astype() method

emp["Salary"].astype(int)
  • The above code tries to coerce the Salary variable’s values to integers with the astype() method.
    • Pandas is unable to convert the NaN values to integers.

Fill Missing Values with the fillna() method

emp["Salary"].fillna(0)
  • The fillna() method replaces a Series’ missing values with the argument we pass in.
  • The above example provides a fill value of 0.
    • Note that our choice of value can distort the data; 0 is passed solely for the sake of example.

Converting Data Types with the astype() method

emp["Salary"] = emp["Salary"].fillna(0).astype(int)
  • The above code overwrites the Salary variable with our new Series of integers.

Converting Data Types with the astype() method

emp["Gender"] = emp["Gender"].astype("category")
  • Pandas includes a special data type called a category,
    • It is ideal for a variable consisting of a small number of unique values relative to its total number of values.
    • E.g., gender, weekdays, blood types, planets, and income groups.

Converting Data Types with the pd.to_datetime() method

# Below two are equivalent:
emp["Start Date"] = pd.to_datetime(emp["Start Date"])
emp["Start Date"] = emp["Start Date"].astype('datetime64[ns]')
  • The pd.to_datetime() function is used to convert a Series, DataFrame, or a single variable of a DataFrame from its current data type into datetime format.

Converting Data Types with the astype() method

emp = pd.read_csv("https://bcdanl.github.io/data/employment.csv")

emp["Salary"] = emp["Salary"].fillna(0)
emp = emp.astype({'Mgmt': 'bool', 
                  'Salary': 'int',
                  'Gender': 'category',
                  'Start Date': 'datetime64[ns]',
                  'Team': 'category'})
  • We can provide a dictionary of variable-type pairs to astype().

Pandas Basics

Let’s do Question 1 in Classwork 6!