Lecture 6

Pandas Fundamentals I: Loading, Summarizing, and Counting Data

Byeong-Hak Choe

bchoe@geneseo.edu

SUNY Geneseo

March 23, 2026

✅ Learning Objectives

Loading DataFrame with read_csv()
Getting a Summary with info() and describe()
Mathematical & Vectorized Operations
Adding, Removing, and Renaming Variables
Selecting and Relocating Variables with []
Counting Values with value_counts(), nunique(), and count()
Sorting with sort_values() and sort_index()
Indexing with set_index() and reset_index()
Locating Observations and Values with loc[] and iloc[]
Converting Data Types with .astype()
Filtering Observations

Dealing with Missing Values
Dealing with Duplicates
Reshaping DataFrames with .melt() and .pivot()
Joining DataFrames with .merge()

🐼 Pandas `Series` and `DataFrame`

Series: A one-dimensional object containing a sequence of values (like a list).
DataFrame: A two-dimensional table made of multiple Series columns sharing a common index.
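
As a minimal sketch of the two objects, the following builds a Series and a DataFrame by hand. The names and values are made up for illustration and do not come from nba.csv.

```python
import pandas as pd

# A Series: a one-dimensional sequence of values with an index
ages = pd.Series([21, 19, 22], name="Age")

# A DataFrame: several columns (each one a Series) sharing a common index
students = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cy"],   # hypothetical values
    "Age": [21, 19, 22],
})

print(type(ages))              # a pandas Series
print(type(students))          # a pandas DataFrame
print(type(students["Age"]))   # a single column is itself a Series

# The column and the DataFrame share the same index
print(students.index.equals(students["Age"].index))
```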

🧐 Observations in `DataFrame`

Rows in a DataFrame represent individual units or entities for which data is collected.
Examples:
- Student Information: Each row = one student
- Employee Information: Each row = one employee
- Daily S&P 500 Index Data: Each row = one trading day
- Household Survey Data: Each row = one household

🏷️ Variables in `DataFrame`

Columns in a DataFrame represent attributes or characteristics measured across multiple observations.
Examples:
- Student Data: Name, Age, Grade, Major
- Employee Data: EmployeeID, Name, Age, Department
- Customer Data: CustomerID, Name, Age, Income, HousingType

Note

In a DataFrame, a variable is a column of data.
In general programming, a variable is the name of an object.

✨ Tidy `DataFrame`

Variables, Observations, and Values

A DataFrame is tidy if it follows three rules:
1. Each variable has its own column.
2. Each observation has its own row.
3. Each value has its own cell.
A tidy DataFrame keeps your data organized, making it easier to understand, analyze, and share in any data analysis.
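
To make the three rules concrete, here is a small sketch with invented data. The wide table stores the variable Year in its column names, so it is not tidy; .melt() (covered later in the course) is one way to reshape it so that each variable has its own column.

```python
import pandas as pd

# Not tidy: the values of the variable "Year" are used as column names
wide = pd.DataFrame({
    "Team": ["A", "B"],      # hypothetical teams
    "2023": [10, 12],        # wins in 2023
    "2024": [11, 15],        # wins in 2024
})

# Tidy: one column per variable (Team, Year, Wins),
# one row per observation (a team in a given year)
tidy = wide.melt(id_vars="Team", var_name="Year", value_name="Wins")
print(tidy)
```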

📥 Loading Data

Importing a data set with `read_csv()`

A CSV (comma-separated values) file is a plain-text file that uses commas to separate values (e.g., nba.csv).
CSV files are widely used for storing data, and we will use them throughout the module.
We use the read_csv() function to load a CSV data file into a DataFrame.

import pandas as pd
nba = pd.read_csv("https://bcdanl.github.io/data/nba.csv")
type(nba)
nba

`read_csv()` with `parse_dates`

nba = pd.read_csv("https://bcdanl.github.io/data/nba.csv",
                  parse_dates = ["Birthday"])
nba

We can use the parse_dates parameter to coerce the values into datetimes.
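
The effect of parse_dates can be checked offline with a small sketch. The CSV text below is a made-up stand-in for a file such as nba.csv, read from memory with io.StringIO instead of a URL.

```python
import io
import pandas as pd

# Hypothetical CSV content (not the real nba.csv)
csv_text = """Name,Team,Birthday
Player A,Team X,1994-01-25
Player B,Team Y,1990-11-03
"""

# Without parse_dates, Birthday is read as text
raw = pd.read_csv(io.StringIO(csv_text))

# With parse_dates, Birthday is converted to datetimes
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["Birthday"])

print(raw["Birthday"].dtype)
print(parsed["Birthday"].dtype)

# Datetime columns support date-specific accessors such as .dt.year
print(parsed["Birthday"].dt.year.tolist())
```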

Mounting Google Drive on Google Colab

from google.colab import drive, files
drive.mount('/content/drive')
files.upload()

- To mount your Google Drive on Google Colab:
  drive.mount('/content/drive')
- To upload a file from your computer to the Colab session:
  files.upload()

To find a pathname of a CSV file in Google Drive:
- Click 📁 from the sidebar menu
- drive ➡️ MyDrive …
- Hover the mouse cursor over the CSV file
- Click the vertical dots
- Click “Copy path”

Colab’s Interactive DataFrame Display

from google.colab import data_table
data_table.enable_dataframe_formatter()  # Enabling an interactive DataFrame display

nba

Colab includes an extension that renders pandas DataFrames into interactive displays.

`nba` DataFrame

Let’s read the nba.csv file as nba:

# Import the pandas library as pd
import pandas as pd

# Import the numpy library as np
import numpy as np

# Enable the interactive DataFrame display in Colab
from google.colab import data_table
data_table.enable_dataframe_formatter()

# Read nba.csv into the nba DataFrame, parsing Birthday as datetimes
nba = pd.read_csv("https://bcdanl.github.io/data/nba.csv",
                  parse_dates = ["Birthday"])

📋 Getting a Summary of Data

⚙️ Dot Operators, Methods, and Attributes

⚫ Dot operator

The dot operator (DataFrame.) is used to access an attribute or call a method on an object.

🛠️ Method

A method (DataFrame.METHOD()) is a function that we can call on a DataFrame to perform operations, modify data, or derive insights.
- e.g., df.info()

🏷️ Attribute

An attribute (DataFrame.ATTRIBUTE) is a property that provides information about the DataFrame’s structure or content without modifying it.
- e.g., df.columns
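
The difference shows up in the syntax: an attribute is read without parentheses, while a method is called with them. A short sketch on a made-up DataFrame named df:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Ana", "Ben"], "Age": [21, 19]})  # made-up data

# Attributes: no parentheses, they report stored information
print(df.shape)           # (2, 2)
print(list(df.columns))   # ['Name', 'Age']

# Methods: parentheses, they run a computation and return a result
print(df.count())         # non-missing values in each variable

# A method is a callable object; an attribute such as shape is not
print(callable(df.count), callable(df.shape))
```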

📑 Getting a Summary of a `DataFrame`

nba.info()    # method
nba.count()   # method

nba.shape     # attribute
nba.columns   # attribute

Every DataFrame object has a .info() method that provides a summary of a DataFrame:
- Variable names (.columns)
- Number of observations and variables (.shape)
- Number of non-missing values in each variable (.count())
  - Pandas often displays missing values as NaN.
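
The pieces that .info() reports can also be retrieved one at a time. The sketch below uses a small invented DataFrame with one missing Salary to show how .shape and .count() differ.

```python
import numpy as np
import pandas as pd

players = pd.DataFrame({
    "Name": ["A", "B", "C"],            # hypothetical players
    "Salary": [100.0, np.nan, 300.0],   # one missing value (NaN)
})

n_rows, n_cols = players.shape    # observations and variables
non_missing = players.count()     # non-missing values per variable

print(n_rows, n_cols)
print(non_missing)
print(n_rows - non_missing)       # missing values per variable

players.info()                    # prints the combined summary
```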

Getting Descriptive Statistics of a `DataFrame` with `.describe()`

nba.describe()

nba.describe(include='all')

The .describe() method generates descriptive statistics that summarize the central tendency, dispersion, and distribution of each numeric variable.
- It can also process string-type variables if specified explicitly (include='all').
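
A small sketch of both calls on invented data. By default only the numeric variable is summarized; with include="all", extra rows such as unique, top, and freq describe the string variable.

```python
import pandas as pd

players = pd.DataFrame({
    "Team": ["X", "X", "Y", "Z"],       # hypothetical teams
    "Salary": [100, 200, 300, 400],
})

numeric_summary = players.describe()             # numeric variables only
full_summary = players.describe(include="all")   # all variables

print(numeric_summary)
print(full_summary)
```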

⚙️ Mathematical & Vectorized Operations

Mathematical Operations

nba.max()
nba.min()

The max() method returns a Series with the maximum value from each variable.
The min() method returns a Series with the minimum value from each variable.

Mathematical Operations

nba.sum()
nba.mean()
nba.median()
nba.quantile(0.75) # 0 to 1
nba.std()

nba.sum(numeric_only = True)
nba.mean(numeric_only = True)
nba.median(numeric_only = True)
nba.quantile(0.75, numeric_only=True)
nba.std(numeric_only = True)

The sum()/mean()/median() method returns a Series with the sum/mean/median of the values in each variable.
The quantile() method returns a Series with the requested quantile of each variable; the argument is a fraction between 0 and 1 (e.g., 0.25, 0.75, or 0.90 for the 25th, 75th, or 90th percentile).
The std() method returns a Series with the standard deviation of the values in each variable.
To limit the operation to numeric variables, we can pass True to the numeric_only parameter of sum()/mean()/median()/quantile()/std(). Without it, recent versions of pandas may raise a TypeError when a variable is non-numeric.
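
A sketch of numeric_only on a made-up DataFrame that mixes a string variable with two numeric ones. The Name column is left out of every result.

```python
import pandas as pd

players = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],       # string variable, skipped below
    "Salary": [100, 200, 300, 400],
    "Age": [20, 30, 40, 50],
})

totals = players.sum(numeric_only=True)
means = players.mean(numeric_only=True)
medians = players.median(numeric_only=True)
q75 = players.quantile(0.75, numeric_only=True)   # 75th percentile
stds = players.std(numeric_only=True)             # sample standard deviation

print(means)
print(q75)
```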

Vectorized Operations

nba["Salary"] + nba["Salary"]
nba["Name"] + " (" + nba["Position"] + ")"
nba["Salary"] - nba["Salary"].mean()

pandas performs vectorized operations on a Series or on a variable in a DataFrame.
- A vectorized operation is applied element by element.
- This lets us apply functions and perform operations on the data efficiently, without explicit loops.
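
To see what "without explicit loops" means, the sketch below computes the same deviations from the mean twice, once with a loop and once vectorized. The values are invented.

```python
import pandas as pd

salary = pd.Series([100, 200, 300], name="Salary")   # made-up salaries

# Explicit loop: handle one element at a time
loop_result = []
for value in salary:
    loop_result.append(value - salary.mean())

# Vectorized: one expression applied to every element
vectorized = salary - salary.mean()

print(loop_result)
print(vectorized.tolist())

# Vectorized operations also work on strings
names = pd.Series(["A", "B", "C"])
positions = pd.Series(["PG", "C", "SF"])
labels = names + " (" + positions + ")"
print(labels.tolist())
```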

✏️️ Adding, Removing, and Renaming Variables

Adding and Removing Variables

Here we use [] to add variables:

nba['Salary_k'] = nba['Salary'] / 1000
nba['Salary_2x'] = nba['Salary'] + nba['Salary']
nba['Salary_3x'] = nba['Salary'] * 3

Removing Variables with `drop(columns = ...)`

We can use .drop(columns = ...) to drop variables. It returns a new DataFrame and leaves nba unchanged unless we assign the result back:

nba.drop(columns = "Salary_k")
nba.drop(columns = ["Salary_2x", "Salary_3x"])
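
The sketch below, on a small invented DataFrame, shows that [] assignment changes the DataFrame in place, whereas .drop() returns a new DataFrame that must be assigned to keep the change.

```python
import pandas as pd

players = pd.DataFrame({"Name": ["A", "B"], "Salary": [1000, 2500]})  # made-up data

# Adding a variable with [] modifies players directly
players["Salary_k"] = players["Salary"] / 1000

# drop() returns a new DataFrame; players itself is not modified
dropped = players.drop(columns="Salary_k")
cols_before = list(players.columns)
print(cols_before)              # Salary_k is still there
print(list(dropped.columns))    # Salary_k is gone in the returned copy

# Assign the result back to make the removal stick
players = players.drop(columns=["Salary_k"])
print(list(players.columns))
```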

Renaming Variables with `nba.columns`

Do you recall the .columns attribute?

nba.columns

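We can rename any or all of a DataFrame’s columns by assigning a list of new names to the attribute. The list must have one name per column, in the existing order:

nba.columns = ["Name", "Team", "Position", "Date of Birth", "Income"]  # assumes nba has its original five columns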

Renaming Variables with `rename( columns = { "Existing One" : "New One" } )`

nba.rename( columns = { "Date of Birth": "Birthday" } )

The above rename() method renames the variable Date of Birth to Birthday. It returns a new DataFrame, so we assign the result to keep the change.
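
Both renaming approaches can be tried on a small invented DataFrame. Assigning to .columns replaces every name at once, while rename() changes only the names listed in the dictionary.

```python
import pandas as pd

players = pd.DataFrame({
    "Name": ["A"],                 # hypothetical player
    "Birthday": ["1994-01-25"],
    "Salary": [1000],
})

# Approach 1: replace all names; the list length must match the number of columns
players.columns = ["Name", "Date of Birth", "Income"]
print(list(players.columns))

# Approach 2: rename selected columns with {existing name: new name}
renamed = players.rename(columns={"Date of Birth": "Birthday"})
print(list(renamed.columns))
```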

🛠️ Selecting and Relocating Variables

Selecting a Variable by its Name

`Series`

nba_player_name_s = nba['Name']
nba_player_name_s

`DataFrame`

nba_player_name_df = nba[ ['Name'] ]
nba_player_name_df

If we want only a specific variable from a DataFrame, we can access it by name using square brackets, [ ].
- DataFrame[ 'var_1' ] returns a Series.
- DataFrame[ ['var_1'] ] returns a one-column DataFrame.
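
A quick check of the two forms on invented data: single brackets give a Series, double brackets give a DataFrame.

```python
import pandas as pd

players = pd.DataFrame({"Name": ["A", "B"], "Team": ["X", "Y"]})  # made-up data

name_s = players["Name"]       # string inside [] -> Series
name_df = players[["Name"]]    # list inside []   -> DataFrame

print(type(name_s), name_s.shape)     # one-dimensional
print(type(name_df), name_df.shape)   # two-dimensional, one column
```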

Selecting & Relocating Multiple Variables by their Names

nba_player_name_team = nba[ ['Name', 'Team'] ]
nba_player_team_name = nba[ ['Team', 'Name'] ]

In order to specify multiple variables by their names, we need to pass in a Python list between the square brackets.
- DataFrame[ ['var_1', 'var_2', ... ] ]
- This is also how we can relocate variables by the order specified in the list.
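
Relocating is just selecting every variable in a new order. A minimal sketch with invented data:

```python
import pandas as pd

players = pd.DataFrame({
    "Name": ["A", "B"],        # hypothetical values
    "Team": ["X", "Y"],
    "Salary": [100, 200],
})

# Select a subset of variables
name_team = players[["Name", "Team"]]

# Relocate: list all variables in the desired order
team_first = players[["Team", "Name", "Salary"]]

print(list(name_team.columns))
print(list(team_first.columns))
```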

🔢 Counting Methods

Counting with `.count()`

nba['Salary'].count()
nba[['Salary']].count()

The Series.count() counts the number of non-missing values in a Series, returning a single value.
The DataFrame.count() counts the number of non-missing values in each variable of a DataFrame, returning a Series.
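
The sketch below uses an invented DataFrame with missing values to show the two return types.

```python
import numpy as np
import pandas as pd

players = pd.DataFrame({
    "Team": ["X", "Y", None],             # one missing team
    "Salary": [100.0, np.nan, np.nan],    # two missing salaries
})

series_count = players["Salary"].count()    # a single number
frame_count = players[["Salary"]].count()   # a Series with one entry
all_counts = players.count()                # a Series, one entry per variable

print(series_count)
print(frame_count)
print(all_counts)
```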

Counting with `.nunique()`

nba['Team'].nunique()
nba[['Team']].nunique()

nba.nunique()

The Series.nunique() counts the number of unique values in a Series, returning a single integer.
The DataFrame.nunique() counts the number of unique values in each variable in a DataFrame, returning a Series.
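
A small sketch on invented data. By default nunique() ignores missing values; the dropna parameter controls this.

```python
import pandas as pd

players = pd.DataFrame({
    "Team": ["X", "X", "Y", None],       # two distinct teams plus one missing
    "Salary": [100, 100, 200, 300],
})

team_unique = players["Team"].nunique()                    # single integer
team_unique_with_na = players["Team"].nunique(dropna=False)  # counts the missing value too
per_variable = players.nunique()                           # Series, one entry per variable

print(team_unique, team_unique_with_na)
print(per_variable)
```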

Counting with `.value_counts()`

nba['Team'].value_counts()

nba[['Team']].value_counts()

The .value_counts() method counts the number of occurrences of each unique value in a Series. Called on a DataFrame, it counts each unique row (combination of values).
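
A sketch of both forms on invented data. Results are sorted from most to least frequent, and normalize=True returns shares instead of counts.

```python
import pandas as pd

players = pd.DataFrame({"Team": ["X", "Y", "X", "X", "Y", "Z"]})  # made-up teams

team_counts = players["Team"].value_counts()                 # Series method
team_share = players["Team"].value_counts(normalize=True)    # proportions
frame_counts = players[["Team"]].value_counts()              # DataFrame method

print(team_counts)
print(team_share)
print(frame_counts)
```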

A Quick Comparison

Function	What it does
`.count()`	Counts non-missing values
`.nunique()`	Counts distinct values
`.value_counts()`	Counts frequencies of unique values

🚀 Classwork 10: Pandas Fundamentals

Let’s do Classwork 10!
