Lecture 6

Pandas Fundamentals I: Loading, Summarizing, and Counting Data

Byeong-Hak Choe

SUNY Geneseo

March 23, 2026

Learning Objectives

  • Loading DataFrame with read_csv()
  • Getting a Summary with info() and describe()
  • Mathematical & Vectorized Operations
  • Adding, Removing, and Renaming Variables
  • Selecting and Relocating Variables with []
  • Counting Values with value_counts(), nunique(), and count()
  • Sorting with sort_values() and sort_index()
  • Indexing with set_index() and reset_index()
  • Locating Observations and Values with loc[] and iloc[]
  • Converting Data Types with .astype()
  • Filtering Observations

  • Dealing with Missing Values
  • Dealing with Duplicates
  • Reshaping DataFrames with .melt() and .pivot()
  • Joining DataFrames with .merge()

🐼 Pandas Series and DataFrame

  • Series: A one-dimensional object containing a sequence of values (like a list).

  • DataFrame: A two-dimensional table made of multiple Series columns sharing a common index.
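
As a minimal sketch of the two objects, the snippet below builds a Series and a DataFrame by hand. The names and values are made up for illustration and do not come from any course data set.

```python
import pandas as pd

# Hypothetical values, for illustration only
ages = pd.Series([20, 22, 21], name="Age")

students = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cho"],
    "Age": [20, 22, 21],
})

print(type(ages))              # a pandas Series
print(type(students))          # a pandas DataFrame
print(type(students["Age"]))   # one column of a DataFrame is itself a Series

# Every column shares the DataFrame's index
print(students.index.equals(students["Name"].index))  # True
```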

🧐 Observations in DataFrame

  • Rows in a DataFrame represent individual units or entities for which data is collected.
  • Examples:
    • Student Information: Each row = one student
    • Employee Information: Each row = one employee
    • Daily S&P 500 Index Data: Each row = one trading day
    • Household Survey Data: Each row = one household

🏷️ Variables in DataFrame

  • Columns in a DataFrame represent attributes or characteristics measured across multiple observations.
  • Examples:
    • Student Data: Name, Age, Grade, Major
    • Employee Data: EmployeeID, Name, Age, Department
    • Customer Data: CustomerID, Name, Age, Income, HousingType

Note

  • In a DataFrame, a variable is a column of data.
  • In general programming, a variable is the name of an object.

✨ Tidy DataFrame

Variables, Observations, and Values

  • A DataFrame is tidy if it follows three rules:

    1. Each variable has its own column.
    2. Each observation has its own row.
    3. Each value has its own cell.
  • A tidy DataFrame keeps your data organized, making it easier to understand, analyze, and share in any data analysis.
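
To make the three rules concrete, here is a small sketch with a made-up table in which the variable Year is spread across column names, so it is not tidy. The .melt() method (listed in the learning objectives) is one possible way to reshape it; the column names Student, Year, and GPA are assumptions for this example.

```python
import pandas as pd

# Hypothetical untidy table: the years are column names, not values
untidy = pd.DataFrame({
    "Student": ["Ana", "Ben"],
    "2024": [3.5, 3.1],
    "2025": [3.7, 3.4],
})

# Tidy version: one row per (Student, Year) observation,
# one column per variable, one value per cell
tidy = untidy.melt(id_vars="Student", var_name="Year", value_name="GPA")
print(tidy)
```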

📥 Loading Data

Importing a data set with read_csv()

  • A CSV (comma-separated values) file is a plain-text file that uses commas to separate values (e.g., nba.csv).

  • The CSV format is widely used for storing data, and we will use it throughout the module.

  • We use the read_csv() function to load a CSV data file.

import pandas as pd
nba = pd.read_csv("https://bcdanl.github.io/data/nba.csv")
type(nba)
nba

read_csv() with parse_dates

nba = pd.read_csv("https://bcdanl.github.io/data/nba.csv",
                  parse_dates = ["Birthday"])
nba
  • We can use the parse_dates parameter to coerce the values of the listed variables (here, Birthday) into datetimes.
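
The effect of parse_dates can be checked without a network connection. The sketch below reads a small, made-up CSV from a string (a stand-in for a real file; the actual nba.csv contents may differ) with and without parse_dates and compares the resulting data types.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = """Name,Team,Birthday
Player A,Team X,1995-01-02
Player B,Team Y,1998-12-30
"""

raw = pd.read_csv(io.StringIO(csv_text))
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["Birthday"])

print(raw["Birthday"].dtype)      # read as text, not as dates
print(parsed["Birthday"].dtype)   # a datetime dtype

# Datetime values expose date parts through the .dt accessor
print(parsed["Birthday"].dt.year.tolist())  # [1995, 1998]
```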

Mounting Google Drive on Google Colab

from google.colab import drive, files
drive.mount('/content/drive')
files.upload()
  • drive.mount('/content/drive')
    • Mounts your Google Drive on Google Colab.
  • files.upload()
    • Opens a dialog to upload a file from your computer to the Colab session’s working directory (not to Google Drive).
  • To find a pathname of a CSV file in Google Drive:
    • Click 📁 from the sidebar menu
    • drive ➡️ MyDrive
    • Hover the mouse cursor over the CSV file
    • Click the vertical dots
    • Click “Copy path”

Colab’s Interactive DataFrame Display

from google.colab import data_table
data_table.enable_dataframe_formatter()  # Enabling an interactive DataFrame display

nba
  • Colab includes an extension that renders pandas DataFrames into interactive displays.

nba DataFrame

  • Let’s read the nba.csv file as nba:
# Import the pandas library as pd
import pandas as pd

# Import the numpy library as np
import numpy as np

# Enable the interactive DataFrame display in Colab
from google.colab import data_table
data_table.enable_dataframe_formatter()

# Read nba.csv into the nba DataFrame
nba = pd.read_csv("https://bcdanl.github.io/data/nba.csv",
                  parse_dates = ["Birthday"])

📋 Getting a Summary of Data

⚙️ Dot Operators, Methods, and Attributes

⚫ Dot operator

  • The dot operator (DataFrame.) is used to access an attribute or call a method on an object.

🛠️ Method

  • A method (DataFrame.METHOD()) is a function that we can call on a DataFrame to perform operations, modify data, or derive insights.
    • e.g., df.info()

🏷️ Attribute

  • An attribute (DataFrame.ATTRIBUTE) is a property that provides information about the DataFrame’s structure or content without modifying it.
    • e.g., df.columns
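
The difference between the two shows up in the parentheses. A minimal sketch, using a made-up DataFrame named df:

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({"Name": ["Ana", "Ben"], "Age": [20, 22]})

print(df.shape)     # attribute: accessed without parentheses
print(df.columns)   # attribute
print(df.head(1))   # method: called with parentheses

# A method is callable; an attribute such as .shape is just a stored value
print(callable(df.head))    # True
print(callable(df.shape))   # False
```

Writing df.head without parentheses does not run the method; it only refers to the method object.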

📑 Getting a Summary of a DataFrame

nba.info()    # method
nba.count()   # method
nba.shape     # attribute
nba.columns   # attribute
  • Every DataFrame object has a .info() method that provides a summary of a DataFrame:
    • Variable names (.columns)
    • Number of observations and variables (.shape)
    • Number of non-missing values in each variable (.count())
      • Pandas often displays missing values as NaN.
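
The pieces that .info() reports can also be retrieved one at a time. The sketch below assumes a small made-up table with one missing Salary; it is not the real nba data.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in Salary
players = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "Salary": [1_000_000, np.nan, 2_500_000],
})

players.info()                  # prints the full summary
print(players.shape)            # (3, 2): 3 observations, 2 variables
print(list(players.columns))    # ['Name', 'Salary']
print(players.count())          # Name 3, Salary 2 (the NaN is not counted)
```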

Getting Descriptive Statistics of a DataFrame with .describe()

nba.describe()
nba.describe(include='all')
  • The .describe() method generates descriptive statistics that summarize the central tendency, dispersion, and distribution of each numeric variable.
    • It can also process string-type variables if specified explicitly (include='all').
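
A short sketch of the two calls on a made-up table with one string variable and one numeric variable (the values are hypothetical):

```python
import pandas as pd

# Hypothetical data: Team is text, Salary is numeric
players = pd.DataFrame({
    "Team": ["X", "X", "Y"],
    "Salary": [1.0, 2.0, 3.0],
})

numeric_summary = players.describe()              # numeric variables only
full_summary = players.describe(include="all")    # also summarizes Team

print(numeric_summary)   # count, mean, std, min, quartiles, max for Salary
print(full_summary)      # adds unique, top, and freq rows for Team
```

For string variables, the unique, top, and freq rows report the number of distinct values, the most frequent value, and how often it occurs.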

⚙️ Mathematical & Vectorized Operations

Mathematical Operations

nba.max()
nba.min()
  • The max() method returns a Series with the maximum value from each variable.
  • The min() method returns a Series with the minimum value from each variable.

Mathematical Operations

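# Without numeric_only = True, these can raise a TypeError in pandas 2.0+
# when nba contains non-numeric variables such as Name or Team
nba.sum()
nba.mean()
nba.median()
nba.quantile(0.75) # 0 to 1
nba.std()

# Restrict each operation to the numeric variables
nba.sum(numeric_only = True)
nba.mean(numeric_only = True)
nba.median(numeric_only = True)
nba.quantile(0.75, numeric_only = True)
nba.std(numeric_only = True)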
  • The sum()/mean()/median() method returns a Series with the sum/mean/median of the values in each variable.
  • The quantile() method returns a Series with the requested quantile of each variable; the argument is a fraction between 0 and 1 (e.g., 0.25, 0.75, and 0.90 give the 25th, 75th, and 90th percentiles).
  • The std() method returns a Series with the standard deviation of the values in each variable.
  • To limit the operation to numeric variables, we can pass True to the numeric_only parameter of the sum()/mean()/median()/quantile()/std() method.
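
A minimal sketch of numeric_only on a made-up table that mixes a string variable with a numeric one (the values are chosen so the results are easy to verify by hand):

```python
import pandas as pd

# Hypothetical data: Name is text, Salary is numeric
players = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C", "Player D"],
    "Salary": [1.0, 2.0, 3.0, 10.0],
})

means = players.mean(numeric_only=True)
medians = players.median(numeric_only=True)
q75 = players.quantile(0.75, numeric_only=True)

print(means)     # Salary 4.0
print(medians)   # Salary 2.5
print(q75)       # Salary 4.75 (linear interpolation between 3.0 and 10.0)
```

Each result is a Series indexed by variable name, and Name is skipped because it is not numeric.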

Vectorized Operations

nba["Salary"] + nba["Salary"]
nba["Name"] + " (" + nba["Position"] + ")"
nba["Salary"] - nba["Salary"].mean()
  • pandas performs vectorized operations on a Series, including a variable (column) in a DataFrame.
    • This means an element-by-element operation.
    • This enables us to apply functions and perform operations on the data efficiently, without the need for explicit loops.
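
To see what "without explicit loops" means, the sketch below compares a vectorized addition with an equivalent list-comprehension loop on a made-up Series.

```python
import pandas as pd

# Hypothetical salaries, for illustration only
salary = pd.Series([100, 200, 300])

doubled = salary + salary            # element-by-element addition
centered = salary - salary.mean()    # the single mean value is applied to every element

# The same doubling written as an explicit loop
looped = pd.Series([s + s for s in salary])

print(doubled.tolist())          # [200, 400, 600]
print(centered.tolist())         # [-100.0, 0.0, 100.0]
print(doubled.equals(looped))    # True
```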

✏️️ Adding, Removing, and Renaming Variables

Adding Variables with []

  • Here we use [] to add variables:
nba['Salary_k'] = nba['Salary'] / 1000
nba['Salary_2x'] = nba['Salary'] + nba['Salary']
nba['Salary_3x'] = nba['Salary'] * 3

Removing Variables with drop(columns = ...)

  • We can use .drop(columns = ...) to drop variables. It returns a new DataFrame and leaves nba unchanged unless we assign the result:
nba.drop(columns = "Salary_k")
nba.drop(columns = ["Salary_2x", "Salary_3x"])
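
The sketch below, on a made-up table, shows that .drop() leaves the original DataFrame untouched until the result is assigned back to a name.

```python
import pandas as pd

# Hypothetical data, for illustration only
players = pd.DataFrame({"Name": ["A", "B"], "Salary": [1000, 2000]})
players["Salary_k"] = players["Salary"] / 1000   # add a variable with []

dropped = players.drop(columns="Salary_k")       # a new DataFrame is returned
print("Salary_k" in players.columns)   # True: the original still has it
print("Salary_k" in dropped.columns)   # False

# Reassign to make the removal stick
players = players.drop(columns="Salary_k")
print(list(players.columns))           # ['Name', 'Salary']
```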

Renaming Variables with nba.columns

  • Do you recall the .columns attribute?
nba.columns
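  • We can rename all of a DataFrame’s columns by assigning a list of new names to the attribute. The list must contain exactly one name per column, in the existing column order (here assuming nba has its five original variables):
nba.columns = ["Name", "Team", "Position", "Date of Birth", "Income"]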

Renaming Variables with rename( columns = { "Existing One" : "New One" } )

nba.rename( columns = { "Date of Birth": "Birthday" } )
  • The above rename() method renames the variable Date of Birth to Birthday. It returns a new DataFrame, so assign the result (e.g., back to nba) to keep the change.
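
A minimal sketch of rename() on a made-up table whose column names mirror the ones above. Only the columns named in the dictionary change, and the original DataFrame is not modified.

```python
import pandas as pd

# Hypothetical data, for illustration only
players = pd.DataFrame({
    "Name": ["Player A"],
    "Date of Birth": ["1995-01-02"],
    "Income": [1000],
})

# Rename one variable; the others keep their names
renamed = players.rename(columns={"Date of Birth": "Birthday"})

# Rename several variables at once with a larger dictionary
both = players.rename(columns={"Date of Birth": "Birthday", "Income": "Salary"})

print(list(players.columns))   # ['Name', 'Date of Birth', 'Income']
print(list(renamed.columns))   # ['Name', 'Birthday', 'Income']
print(list(both.columns))      # ['Name', 'Birthday', 'Salary']
```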

🛠️ Selecting and Relocating Variables

Selecting a Variable by its Name

Series

nba_player_name_s = nba['Name']
nba_player_name_s

DataFrame

nba_player_name_df = nba[ ['Name'] ]
nba_player_name_df
  • If we want only a specific variable from a DataFrame, we can access the variable by its name using square brackets, [ ]. A single pair of brackets returns a Series; a list inside the brackets returns a DataFrame.
    • DataFrame[ 'var_1' ]
    • DataFrame[ ['var_1'] ]
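
The two bracket forms can be told apart by type and shape. A short sketch on a made-up table:

```python
import pandas as pd

# Hypothetical data, for illustration only
players = pd.DataFrame({"Name": ["A", "B"], "Team": ["X", "Y"]})

name_s = players["Name"]       # single brackets: a Series
name_df = players[["Name"]]    # a list inside the brackets: a DataFrame

print(type(name_s).__name__, name_s.shape)     # Series (2,)
print(type(name_df).__name__, name_df.shape)   # DataFrame (2, 1)
```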

Selecting & Relocating Multiple Variables by their Names

nba_player_name_team = nba[ ['Name', 'Team'] ]
nba_player_team_name = nba[ ['Team', 'Name'] ]
  • In order to specify multiple variables by their names, we need to pass in a Python list between the square brackets.
    • DataFrame[ ['var_1', 'var_2', ... ] ]
    • This is also how we can relocate variables by the order specified in the list.
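
When a DataFrame has many variables, typing every name just to move one is tedious. The sketch below shows one possible approach on a made-up table: build the list of names programmatically, then select with it. The helper names front and rest are hypothetical.

```python
import pandas as pd

# Hypothetical data, for illustration only
players = pd.DataFrame({
    "Name": ["A", "B"],
    "Team": ["X", "Y"],
    "Salary": [1000, 2000],
})

# Relocate by listing the variables in the desired order
reordered = players[["Team", "Name", "Salary"]]

# Move Salary to the front while keeping the other variables in their order
front = ["Salary"]
rest = [col for col in players.columns if col not in front]
salary_first = players[front + rest]

print(list(reordered.columns))      # ['Team', 'Name', 'Salary']
print(list(salary_first.columns))   # ['Salary', 'Name', 'Team']
```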

🔢 Counting Methods

Counting with .count()

nba['Salary'].count()
nba[['Salary']].count()
  • Series.count() returns the number of non-missing values in a Series as a single value.
  • DataFrame.count() counts the number of non-missing values in each variable, returning a Series.
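
A brief sketch of the two return types, using a made-up table with one missing Salary:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value
players = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Salary": [1000.0, np.nan, 3000.0],
})

n_salary = players["Salary"].count()       # a single number
per_column = players[["Salary"]].count()   # a Series with one entry per variable

print(n_salary)      # 2
print(per_column)    # Salary 2
print(len(players))  # 3: the number of rows, including the missing one
```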

Counting with .value_counts()

nba['Team'].value_counts()
nba[['Team']].value_counts()
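  • Series.value_counts() counts the number of occurrences of each unique value in a Series, sorted from most to least frequent.
  • DataFrame.value_counts() counts the occurrences of each unique combination of values across the selected variables.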
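
The sketch below applies value_counts() to a made-up table. The normalize=True argument, which returns proportions instead of counts, is an optional extra not covered above.

```python
import pandas as pd

# Hypothetical data, for illustration only
players = pd.DataFrame({
    "Team": ["X", "Y", "X", "X", "Y", "Z"],
    "Position": ["G", "F", "G", "C", "F", "C"],
})

team_counts = players["Team"].value_counts()
team_shares = players["Team"].value_counts(normalize=True)
pair_counts = players[["Team", "Position"]].value_counts()

print(team_counts)   # X 3, Y 2, Z 1
print(team_shares)   # X 0.5, then Y and Z
print(pair_counts)   # one count per (Team, Position) combination
```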

Counting with .nunique()

nba['Team'].nunique()
nba[['Team']].nunique()
nba.nunique()
  • Series.nunique() returns the number of unique values in a Series as a single integer.
  • The DataFrame.nunique() counts the number of unique values in each variable in a DataFrame, returning a Series.
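
A minimal sketch of nunique() on a made-up table. By default, missing values are not counted as a unique value.

```python
import pandas as pd

# Hypothetical data; the last Position is missing
players = pd.DataFrame({
    "Team": ["X", "Y", "X", "X", "Y", "Z"],
    "Position": ["G", "F", "G", "C", "F", None],
})

n_teams = players["Team"].nunique()   # a single integer
per_column = players.nunique()        # a Series with one entry per variable

print(n_teams)      # 3
print(per_column)   # Team 3, Position 3 (the missing value is excluded)
```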

🚀 Classwork 10: Pandas Fundamentals

Let’s do Classwork 10!