Lecture 2

Pandas Basics - Loading, Summarizing, Selecting, Counting, Sorting, and Indexing Data

Byeong-Hak Choe

SUNY Geneseo

February 13, 2024

Pandas Basics

Pandas Basics

Learning Objectives

  • Loading DataFrame
  • Getting a Summary of DataFrame
  • Selecting Columns in a DataFrame
  • Counting Values in a DataFrame
  • Sorting DataFrame
  • Indexing DataFrame

Loading Data

Pandas Basics

Series and DataFrame

  • Series: a collection of a one-dimensional object containing a sequence of values.

  • DataFrame: a collection of Series columns with an index.

Pandas Basics

Importing a data set with read_csv()

  • A CSV (comma-separated values) is a plain-text file that uses a comma to separate values (e.g., nba.csv).

  • The CSV is widely used for storing data, and we will use this throughout the module.

  • We use the read_csv() function to load a CSV data file.

import pandas as pd
nba = pd.read_csv("https://bcdanl.github.io/data/nba.csv")
type(nba)
nba
  • The DataFrame is the workhorse of the pandas library and the data structure.

Pandas Basics

Importing a data set with read_csv()

nba = pd.read_csv("https://bcdanl.github.io/data/nba.csv",
                  parse_dates = ["Birthday"])
nba
  • We can use the parse_dates parameter to coerce the values into datetimes.

Pandas Basics

Colab’s Interactive DataFrame Display

from google.colab import data_table
data_table.enable_dataframe_formatter()  # Enabling an interactive DataFrame display
nba
  • We can use the from keyword when specifying Python package from which we want to import something (e.g., functions).

  • Colab includes an extension that renders pandas DataFrames into interactive tables.

Getting a Summary of Data

Pandas Basics

DataFrame Terminologies: Variables, Observations, and Values

  1. Each variable is a column.
  2. Each observation is a row.
  3. Each value is a cell.

Pandas Basics

DataFrame Terminologies: Dot Operators, Methods and Attributes

  • The dot operator (DataFrame.) is used for an attribute or a method on DataFrame.

  • A method (DataFrame.METHOD()) is a function that we can call on a DataFrame to perform operations, modify data, or derive insights.

    • e.g., nba.info()
  • An attribute (DataFrame.ATTRIBUTE) is a property that provides information about the DataFrame’s structure or content without modifying it.

    • e.g., nba.dtype

Pandas Basics

Getting a Summary of a DataFrame with .info()

nba.info()    # method
nba.shape     # attribute
nba.dtypes    # attribute
nba.columns   # attribute
nba.count()   # method
  • Every DataFrame object has a .info() method that provides a summary of a DataFrame:
    • Variable names (.columns)
    • Number of variables/observations (.shape)
    • Data type of each variable (.dtypes)
    • Number of non-missing values in each variable (.count())
      • Pandas often displays missing values as NaN.

Pandas Basics

Getting a Summary of a DataFrame with .describe()

nba.describe()
nba.describe(include='all')
  • .describe() method generates descriptive statistics that summarize the central tendency, dispersion, and distribution of each variable.
    • It can also process string-type variables if specified explicitly (include='all').

Selecting Variables

Pandas Basics

Selecting a Variable by its Name

nba_player_name_1 = nba['Name']   # Series
nba_player_name_1

nba_player_name_2 = nba[ ['Name'] ]   # DataFrame
nba_player_name_2
  • If we want only a specific variable from a DataFrame, we can access the variable with its name using squared brackets, [ ].
    • DataFrame[ 'var_1' ]
    • DataFrame[ ['var_1'] ]

Pandas Basics

Selecting Multiple Variables by their Names

nba_player_name_team = nba[ ['Name', 'Team'] ]
nba_player_name_team
  • In order to specify multiple variables by their names, we need to pass in a Python list between the square brackets.
    • DataFrame[ ['var_1', 'var_2', ... ] ]

Pandas Basics

Selecting Multiple Variables with select_dtypes()

# To include only string variables
nba.select_dtypes(include = "object")

# To exclude string and integer variables
nba.select_dtypes(exclude = ["object", "int"])
  • We can use the select_dtypes() method to select columns based on their data types.
    • The method accepts two parameters, include and exclude.

Counting Methods

Pandas Basics

Counting with .count()

nba['Salary'].count()
nba[['Salary']].count()
  • The .count() counts the number of non-missing values in a Series/DataFrame.

Pandas Basics

Counting with .value_counts()

nba['Team'].value_counts()
nba[['Team']].value_counts()
  • The .value_counts() counts the number of occurrences of each unique value in a Series/DataFrame.

Pandas Basics

Counting with .nunique()

nba[['Team']].nunique()
nba.nunique()
  • The .nunique() counts the number of unique values in each variable in a DataFrame.

Pandas Basics

Let’s do Questions 1-4 in Classwork 2!

Sorting Methods

Pandas Basics

Selecting the first/last n observations with .head() & .tail()

nba.head()
nba.head(10)
nba.tail()
nba.tail(10)
  • We can use the .head()/.tail() method of a DataFrame to keep only the first/last n observations.
    • This can be useful with sorting methods.

Pandas Basics

Sorting by a Single Variable with sort_values()

# The two lines below are equivalent
nba.sort_values(["Name"])
nba.sort_values(by = ["Name"])
  • The sort_values() method’s first parameter, by, accepts the variables that pandas should use to sort the DataFrame.

Pandas Basics

Sorting by a Single Variable with sort_values()

nba.sort_values(["Name"], ascending = False)
  • The sort_values() method’s ascending parameter determines the sort order.
    • ascending has a default argument of True.
    • By default, pandas will sort:
      • A variable of numbers in increasing order;
      • A variable of strings in alphabetical order;
      • A variable of datetimes in chronological order.

Pandas Basics

Sorting by a Single Variable with sort_values() and head() & tail()

df = nba.sort_values(["Salary"])
df.head(5)
df = nba.sort_values(["Salary"])
df.tail(5)
  • sort_values() with .head() or .tail() can be useful to find the observations with the n smallest/largest values in a variable.

Pandas Basics

Method Chaining

(
    nba
    .sort_values(['Salary'])
    .head(5)
)
  • DataFrame has various methods that modify the existing DataFrame.
  • Method Chaining: We can call methods sequentially without the need to store intermediate results.

Pandas Basics

Sorting by a Single Variable with nsmallest() and nlargest()

nba.nsmallest(5, 'Salary')
nba.nlargest(5, 'Salary')
  • nsmallest() are useful to get the first n observations ordered by a variable in ascending order.

  • nlargest() are useful to get the first n observations ordered by a variable in descending order.

Pandas Basics

Sorting by a Single Variable with nsmallest() and nlargest()

nba.nsmallest(4, 'Salary', keep = "all")
nba.nlargest(4, 'Salary', keep = "all")
  • keep = "all" keeps all duplicates, even it means selecting more than n observations.

Pandas Basics

Sorting by Multiple Variables with sort_values()

nba.sort_values(["Team", "Name"])
nba.sort_values(by = ["Team", "Name"])
  • We can sort a DataFrame by multiple columns by passing a list to the by parameter.

Pandas Basics

Sorting by Multiple Variables with sort_values()

nba.sort_values(by = ["Team", "Name"], 
                ascending = False)
  • We can pass a single Boolean to the ascending parameter to apply the same sort order to each column.

Pandas Basics

Sorting by Multiple Variables with sort_values()

nba.sort_values(by = ["Team", "Name"], 
                ascending = [False, True])
  • If we want to sort each variable in a different order, we can pass a Boolean list to the ascending parameter.

Pandas Basics

Sorting by Multiple Variables with sort_values()

Q. Which players on each team are paid the most?

Pandas Basics

Sorting by Multiple Variables with sort_index()

# The two lines below are equivalent
nba.sort_index()
nba.sort_index(ascending = True).
nba.sort_index(ascending = False).
  • How can we return it to its original form of DataFrame?

  • Our nba DataFrame still has its numeric index.

  • If we could sort the data set by index positions rather than by column values, we could return it to its original shape.

Pandas Basics

Changing the Order of Variables with sort_index()

# The two lines below are equivalent
nba.sort_index(axis = "columns")
nba.sort_index(axis = 1)
  • The sort_index() method can also be used to change the order of variables in an alphabetical order.
    • We need to add an axis parameter and pass it an argument of "columns" or 1.

Setting a New Index

Pandas Basics

Setting a New Index

  • We can use the set_index() method when we want to change the current index of a DataFrame to one or more existing columns.
    • This is particularly useful when:
      • We have a column that uniquely identifies each observation (e.g., ID);
      • We sometimes want to use an unique identifier as the index for more efficient data wrangling.

Pandas Basics

Setting a New Index with set_index()

# The two lines below are equivalent
nba.set_index(keys = "Name")
nba.set_index("Name")
  • The set_index() method returns a new DataFrame with a given column set as the index.
    • Its first parameter, keys, accepts the column name.

Pandas Basics

Re-setting an Index with reset_index()

nba2 = nba.set_index("Name")
nba2.reset_index(inplace=True)    # Useful for the chain of method operations
  • We use the reset_index() method:
    • When we want to convert the index back into a DataFrame column;
    • When we need to reset the index to the default integer index.
  • Note: With inplace=True, the operation alters the original DataFrame directly.

Pandas Basics

Let’s do Questions 5-7 in Classwork 2!

Pandas Basics

Referenes

  • Pandas in Action, Boris Paskhaver (Author), 2021, Manning