Lecture 2

Pandas Basics - Loading, Summarizing, Selecting, Counting, Sorting, and Indexing Data

Byeong-Hak Choe

bchoe@geneseo.edu

SUNY Geneseo

February 13, 2024

Pandas Basics

Learning Objectives

Loading DataFrame
Getting a Summary of DataFrame
Selecting Columns in a DataFrame
Counting Values in a DataFrame
Sorting DataFrame
Indexing DataFrame

Loading Data

Pandas Basics

`Series` and `DataFrame`

Series: a collection of a one-dimensional object containing a sequence of values.
DataFrame: a collection of Series columns with an index.

Pandas Basics

Importing a data set with `read_csv()`

A CSV (comma-separated values) is a plain-text file that uses a comma to separate values (e.g., nba.csv).
The CSV is widely used for storing data, and we will use this throughout the module.
We use the read_csv() function to load a CSV data file.

import pandas as pd
nba = pd.read_csv("https://bcdanl.github.io/data/nba.csv")
type(nba)
nba

The DataFrame is the workhorse of the pandas library and the data structure.

Pandas Basics

Importing a data set with `read_csv()`

nba = pd.read_csv("https://bcdanl.github.io/data/nba.csv",
                  parse_dates = ["Birthday"])
nba

We can use the parse_dates parameter to coerce the values into datetimes.

Pandas Basics

Colab’s Interactive DataFrame Display

from google.colab import data_table
data_table.enable_dataframe_formatter()  # Enabling an interactive DataFrame display
nba

We can use the from keyword when specifying Python package from which we want to import something (e.g., functions).
Colab includes an extension that renders pandas DataFrames into interactive tables.

Getting a Summary of Data

Pandas Basics

DataFrame Terminologies: Variables, Observations, and Values

Each variable is a column.
Each observation is a row.
Each value is a cell.

Pandas Basics

DataFrame Terminologies: Dot Operators, Methods and Attributes

The dot operator (DataFrame.) is used for an attribute or a method on DataFrame.
A method (DataFrame.METHOD()) is a function that we can call on a DataFrame to perform operations, modify data, or derive insights.
- e.g., nba.info()
An attribute (DataFrame.ATTRIBUTE) is a property that provides information about the DataFrame’s structure or content without modifying it.
- e.g., nba.dtype

Pandas Basics

Getting a Summary of a `DataFrame` with `.info()`

nba.info()    # method

nba.shape     # attribute
nba.dtypes    # attribute
nba.columns   # attribute
nba.count()   # method

Every DataFrame object has a .info() method that provides a summary of a DataFrame:
- Variable names (.columns)
- Number of variables/observations (.shape)
- Data type of each variable (.dtypes)
- Number of non-missing values in each variable (.count())
  - Pandas often displays missing values as NaN.

Pandas Basics

Getting a Summary of a `DataFrame` with `.describe()`

nba.describe()
nba.describe(include='all')

.describe() method generates descriptive statistics that summarize the central tendency, dispersion, and distribution of each variable.
- It can also process string-type variables if specified explicitly (include='all').

Selecting Variables

Pandas Basics

Selecting a Variable by its Name

nba_player_name_1 = nba['Name']   # Series
nba_player_name_1

nba_player_name_2 = nba[ ['Name'] ]   # DataFrame
nba_player_name_2

If we want only a specific variable from a DataFrame, we can access the variable with its name using squared brackets, [ ].
- DataFrame[ 'var_1' ]
- DataFrame[ ['var_1'] ]

Pandas Basics

Selecting Multiple Variables by their Names

nba_player_name_team = nba[ ['Name', 'Team'] ]
nba_player_name_team

In order to specify multiple variables by their names, we need to pass in a Python list between the square brackets.
- DataFrame[ ['var_1', 'var_2', ... ] ]

Pandas Basics

Selecting Multiple Variables with `select_dtypes()`

# To include only string variables
nba.select_dtypes(include = "object")

# To exclude string and integer variables
nba.select_dtypes(exclude = ["object", "int"])

We can use the select_dtypes() method to select columns based on their data types.
- The method accepts two parameters, include and exclude.

Counting Methods

Pandas Basics

Counting with `.count()`

nba['Salary'].count()
nba[['Salary']].count()

The .count() counts the number of non-missing values in a Series/DataFrame.

Pandas Basics

Counting with `.value_counts()`

nba['Team'].value_counts()

nba[['Team']].value_counts()

The .value_counts() counts the number of occurrences of each unique value in a Series/DataFrame.

Pandas Basics

Counting with `.nunique()`

nba[['Team']].nunique()

nba.nunique()

The .nunique() counts the number of unique values in each variable in a DataFrame.

Pandas Basics

Let’s do Questions 1-4 in Classwork 2!

Sorting Methods

Pandas Basics

Selecting the first/last `n` observations with `.head()` & `.tail()`

nba.head()
nba.head(10)

nba.tail()
nba.tail(10)

We can use the .head()/.tail() method of a DataFrame to keep only the first/last n observations.
- This can be useful with sorting methods.

Pandas Basics

Sorting by a Single Variable with `sort_values()`

# The two lines below are equivalent
nba.sort_values(["Name"])
nba.sort_values(by = ["Name"])

The sort_values() method’s first parameter, by, accepts the variables that pandas should use to sort the DataFrame.

Pandas Basics

Sorting by a Single Variable with `sort_values()`

nba.sort_values(["Name"], ascending = False)

The sort_values() method’s ascending parameter determines the sort order.
- ascending has a default argument of True.
- By default, pandas will sort:
  - A variable of numbers in increasing order;
  - A variable of strings in alphabetical order;
  - A variable of datetimes in chronological order.

Pandas Basics

Sorting by a Single Variable with `sort_values()` and `head()` & `tail()`

df = nba.sort_values(["Salary"])
df.head(5)

df = nba.sort_values(["Salary"])
df.tail(5)

sort_values() with .head() or .tail() can be useful to find the observations with the n smallest/largest values in a variable.

Pandas Basics

Method Chaining

(
    nba
    .sort_values(['Salary'])
    .head(5)
)

DataFrame has various methods that modify the existing DataFrame.
Method Chaining: We can call methods sequentially without the need to store intermediate results.

Pandas Basics

Sorting by a Single Variable with `nsmallest()` and `nlargest()`

nba.nsmallest(5, 'Salary')
nba.nlargest(5, 'Salary')

nsmallest() are useful to get the first n observations ordered by a variable in ascending order.
nlargest() are useful to get the first n observations ordered by a variable in descending order.

Pandas Basics

Sorting by a Single Variable with `nsmallest()` and `nlargest()`

nba.nsmallest(4, 'Salary', keep = "all")
nba.nlargest(4, 'Salary', keep = "all")

keep = "all" keeps all duplicates, even it means selecting more than n observations.

Pandas Basics

Sorting by Multiple Variables with `sort_values()`

nba.sort_values(["Team", "Name"])
nba.sort_values(by = ["Team", "Name"])

We can sort a DataFrame by multiple columns by passing a list to the by parameter.

Pandas Basics

Sorting by Multiple Variables with `sort_values()`

nba.sort_values(by = ["Team", "Name"], 
                ascending = False)

We can pass a single Boolean to the ascending parameter to apply the same sort order to each column.

Pandas Basics

Sorting by Multiple Variables with `sort_values()`

nba.sort_values(by = ["Team", "Name"], 
                ascending = [False, True])

If we want to sort each variable in a different order, we can pass a Boolean list to the ascending parameter.

Pandas Basics

Sorting by Multiple Variables with `sort_values()`

Q. Which players on each team are paid the most?

Pandas Basics

Sorting by Multiple Variables with `sort_index()`

# The two lines below are equivalent
nba.sort_index()
nba.sort_index(ascending = True).

nba.sort_index(ascending = False).

How can we return it to its original form of DataFrame?
Our nba DataFrame still has its numeric index.
If we could sort the data set by index positions rather than by column values, we could return it to its original shape.

Pandas Basics

Changing the Order of Variables with `sort_index()`

# The two lines below are equivalent
nba.sort_index(axis = "columns")
nba.sort_index(axis = 1)

The sort_index() method can also be used to change the order of variables in an alphabetical order.
- We need to add an axis parameter and pass it an argument of "columns" or 1.

Setting a New Index

Pandas Basics

Setting a New Index

We can use the set_index() method when we want to change the current index of a DataFrame to one or more existing columns.
- This is particularly useful when:
  - We have a column that uniquely identifies each observation (e.g., ID);
  - We sometimes want to use an unique identifier as the index for more efficient data wrangling.

Pandas Basics

Setting a New Index with `set_index()`

# The two lines below are equivalent
nba.set_index(keys = "Name")
nba.set_index("Name")

The set_index() method returns a new DataFrame with a given column set as the index.
- Its first parameter, keys, accepts the column name.

Pandas Basics

Re-setting an Index with `reset_index()`

nba2 = nba.set_index("Name")
nba2.reset_index(inplace=True)    # Useful for the chain of method operations

We use the reset_index() method:
- When we want to convert the index back into a DataFrame column;
- When we need to reset the index to the default integer index.
Note: With inplace=True, the operation alters the original DataFrame directly.

Lecture 2

Pandas Basics

Pandas Basics

Learning Objectives

Loading Data

Pandas Basics

Series and DataFrame

Pandas Basics

Importing a data set with read_csv()

Pandas Basics

Importing a data set with read_csv()

Pandas Basics

Colab’s Interactive DataFrame Display

Getting a Summary of Data

Pandas Basics

DataFrame Terminologies: Variables, Observations, and Values

Pandas Basics

DataFrame Terminologies: Dot Operators, Methods and Attributes

Pandas Basics

Getting a Summary of a DataFrame with .info()

Pandas Basics

Getting a Summary of a DataFrame with .describe()

Selecting Variables

Pandas Basics

Selecting a Variable by its Name

Pandas Basics

Selecting Multiple Variables by their Names

Pandas Basics

Selecting Multiple Variables with select_dtypes()

Counting Methods

Pandas Basics

Counting with .count()

Pandas Basics

Counting with .value_counts()

Pandas Basics

Counting with .nunique()

Pandas Basics

Sorting Methods

Pandas Basics

Selecting the first/last n observations with .head() & .tail()

Pandas Basics

Sorting by a Single Variable with sort_values()

Pandas Basics

Sorting by a Single Variable with sort_values()

Pandas Basics

Sorting by a Single Variable with sort_values() and head() & tail()

Pandas Basics

Method Chaining

Pandas Basics

Sorting by a Single Variable with nsmallest() and nlargest()

Pandas Basics

Sorting by a Single Variable with nsmallest() and nlargest()

Pandas Basics

Sorting by Multiple Variables with sort_values()

Pandas Basics

Sorting by Multiple Variables with sort_values()

Pandas Basics

Sorting by Multiple Variables with sort_values()

Pandas Basics

Sorting by Multiple Variables with sort_values()

Pandas Basics

Sorting by Multiple Variables with sort_index()

Pandas Basics

Changing the Order of Variables with sort_index()

Setting a New Index

Pandas Basics

Setting a New Index

Pandas Basics

Setting a New Index with set_index()

Pandas Basics

Re-setting an Index with reset_index()

Pandas Basics

Pandas Basics

Referenes

`Series` and `DataFrame`

Importing a data set with `read_csv()`

Importing a data set with `read_csv()`

Getting a Summary of a `DataFrame` with `.info()`

Getting a Summary of a `DataFrame` with `.describe()`

Selecting Multiple Variables with `select_dtypes()`

Counting with `.count()`

Counting with `.value_counts()`

Counting with `.nunique()`

Selecting the first/last `n` observations with `.head()` & `.tail()`

Sorting by a Single Variable with `sort_values()`

Sorting by a Single Variable with `sort_values()`

Sorting by a Single Variable with `sort_values()` and `head()` & `tail()`

Sorting by a Single Variable with `nsmallest()` and `nlargest()`

Sorting by a Single Variable with `nsmallest()` and `nlargest()`

Sorting by Multiple Variables with `sort_values()`

Sorting by Multiple Variables with `sort_values()`

Sorting by Multiple Variables with `sort_values()`

Sorting by Multiple Variables with `sort_values()`

Sorting by Multiple Variables with `sort_index()`

Changing the Order of Variables with `sort_index()`

Setting a New Index with `set_index()`

Re-setting an Index with `reset_index()`