Lecture 7

pandas Basics - Getting a Summary of Data; Selecting Variables; Counting Methods; Sorting Methods

Byeong-Hak Choe

SUNY Geneseo

February 12, 2025

nba DataFrame

  • Let’s read the nba.csv file as nba:
# Below is to import the pandas library as pd
import pandas as pd 

# Below is for an interactive display of DataFrame in Colab
from google.colab import data_table  
data_table.enable_dataframe_formatter()

# Below is to read nba.csv as nba DataFrame
nba = pd.read_csv("https://bcdanl.github.io/data/nba.csv",
                  parse_dates = ["Birthday"])

Getting a Summary of Data

DataFrame Terminologies: Variables, Observations, and Values

  1. Each variable is a column.
  2. Each observation is a row.
  3. Each value is a cell.
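
As a minimal sketch of these three terms, the toy DataFrame below uses made-up values (it is not taken from nba.csv). The .iloc and .loc accessors appear here only to pull out one row and one cell for illustration.

```python
import pandas as pd

# Hypothetical toy data, not taken from nba.csv
toy = pd.DataFrame({
    "Name": ["Player A", "Player B"],
    "Team": ["Team X", "Team Y"],
    "Salary": [1000000, 2500000],
})

variable = toy["Salary"]        # a variable is a column
observation = toy.iloc[0]       # an observation is a row (here, the first one)
value = toy.loc[0, "Salary"]    # a value is a single cell
```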


Dot Operators, Methods, and Attributes

Dot operator

  • The dot operator (DataFrame.) is used to access an attribute or to call a method on an object.

Method

  • A method (DataFrame.METHOD()) is a function that we can call on a DataFrame to perform operations, modify data, or derive insights.
    • e.g., nba.info()

Attribute

  • An attribute (DataFrame.ATTRIBUTE) is a property that provides information about the DataFrame’s structure or content without modifying it.
    • e.g., nba.dtypes
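
The difference in syntax can be sketched on a small, hypothetical DataFrame (the names and values below are made up): an attribute is written without parentheses, while a method is called with them.

```python
import pandas as pd

# Hypothetical toy data, not taken from nba.csv
toy = pd.DataFrame({
    "Name": ["Player A", "Player B"],
    "Salary": [1000000, 2500000],
})

n_rows_cols = toy.shape      # attribute: no parentheses
non_missing = toy.count()    # method: called with parentheses
```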

Getting a Summary of a DataFrame with .info()

nba.info()    # method
nba.shape     # attribute
nba.dtypes    # attribute
nba.columns   # attribute
nba.count()   # method
  • Every DataFrame object has a .info() method that provides a summary of a DataFrame:
    • Variable names (.columns)
    • Number of variables/observations (.shape)
    • Data type of each variable (.dtypes)
    • Number of non-missing values in each variable (.count())
      • Pandas often displays missing values as NaN.
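
To see how the pieces of .info() line up with the individual attributes and methods, here is a small sketch on a hypothetical DataFrame with one missing Salary value (the data is made up for illustration).

```python
import pandas as pd

# Hypothetical toy data with one missing value, not taken from nba.csv
toy = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "Salary": [1000000.0, float("nan"), 2500000.0],
})

toy.info()                     # prints the full summary

columns = list(toy.columns)    # variable names
shape = toy.shape              # (number of observations, number of variables)
dtypes = toy.dtypes            # data type of each variable
counts = toy.count()           # non-missing values in each variable
```

Here Salary has one fewer non-missing value than Name because of the NaN.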

Getting a Summary of a DataFrame with .describe()

nba.describe()
nba.describe(include='all')
  • The .describe() method generates descriptive statistics that summarize the central tendency, dispersion, and distribution of each variable.
    • It also summarizes string-type variables when include='all' is specified explicitly.
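
A minimal sketch of the two calls on a hypothetical DataFrame (made-up values) with one string variable and one numeric variable:

```python
import pandas as pd

# Hypothetical toy data, not taken from nba.csv
toy = pd.DataFrame({
    "Team": ["Team X", "Team X", "Team Y"],
    "Salary": [1.0, 2.0, 3.0],
})

numeric_summary = toy.describe()             # Salary only
full_summary = toy.describe(include="all")   # Team and Salary

mean_salary = numeric_summary.loc["mean", "Salary"]
most_common_team = full_summary.loc["top", "Team"]
```

With include="all", the string variable gets rows such as unique, top, and freq, while the rows that only apply to numbers are left as NaN for it.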

Selecting Variables

Selecting a Variable by its Name

nba_player_name_s = nba['Name']   # Series
nba_player_name_s

nba_player_name_df = nba[ ['Name'] ]   # DataFrame
nba_player_name_df
  • If we want only a specific variable from a DataFrame, we can access the variable by its name using square brackets, [ ].
    • DataFrame[ 'var_1' ] returns a Series.
    • DataFrame[ ['var_1'] ] returns a DataFrame with one variable.
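
The two forms can be compared side by side on a hypothetical DataFrame (made-up values): single brackets give a Series, double brackets give a one-column DataFrame.

```python
import pandas as pd

# Hypothetical toy data, not taken from nba.csv
toy = pd.DataFrame({
    "Name": ["Player A", "Player B"],
    "Team": ["Team X", "Team Y"],
})

name_s = toy["Name"]       # single brackets: a Series
name_df = toy[["Name"]]    # double brackets: a DataFrame with one column

print(type(name_s).__name__, name_s.shape)
print(type(name_df).__name__, name_df.shape)
```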

Selecting Multiple Variables by their Names

nba_player_name_team = nba[ ['Name', 'Team'] ]
nba_player_name_team
  • To select multiple variables by their names, we pass a Python list between the square brackets.
    • DataFrame[ ['var_1', 'var_2', ... ] ]
    • The result keeps the order given in the list, so this is also how we can reorder variables.
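
As a small sketch of reordering, the hypothetical DataFrame below (made-up values) is selected with its variables listed in a different order. The original DataFrame is left unchanged.

```python
import pandas as pd

# Hypothetical toy data, not taken from nba.csv
toy = pd.DataFrame({
    "Name": ["Player A", "Player B"],
    "Team": ["Team X", "Team Y"],
    "Salary": [1000000, 2500000],
})

# The list order determines the column order of the result
reordered = toy[["Salary", "Name"]]

print(list(reordered.columns))
print(list(toy.columns))
```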

Selecting Multiple Variables with select_dtypes()

# To include only string variables
nba.select_dtypes(include = "object")

# To exclude string and integer variables
nba.select_dtypes(exclude = ["object", "int"])
  • We can use the select_dtypes() method to select variables based on their data types.
    • The method accepts two parameters, include and exclude, each taking a data type or a list of data types.
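
A minimal sketch of include and exclude on a hypothetical DataFrame (made-up values). It uses the "number" selector, which pandas documents as matching all numeric types; that selector is an addition for illustration and does not appear in the lecture code.

```python
import pandas as pd

# Hypothetical toy data, not taken from nba.csv
toy = pd.DataFrame({
    "Name": ["Player A", "Player B"],
    "Age": [25, 31],
    "Salary": [1000000.0, 2500000.0],
})

# Keep only the numeric variables
numeric_vars = toy.select_dtypes(include="number")

# Drop the numeric variables and keep everything else
non_numeric_vars = toy.select_dtypes(exclude="number")

print(list(numeric_vars.columns))
print(list(non_numeric_vars.columns))
```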

Counting Methods

Counting with .count()

nba['Salary'].count()
nba[['Salary']].count()
  • The .count() method counts the number of non-missing values in a Series/DataFrame.
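
The sketch below, on a hypothetical DataFrame with one missing value (made-up data), shows what each form returns: a single number for a Series, and one count per variable for a DataFrame.

```python
import pandas as pd

# Hypothetical toy data with one missing Salary, not taken from nba.csv
toy = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "Salary": [1000000.0, float("nan"), 2500000.0],
})

salary_count = toy["Salary"].count()         # Series: a single number
salary_count_df = toy[["Salary"]].count()    # DataFrame: one count per variable
n_rows = len(toy)                            # all rows, missing or not

print(salary_count, n_rows)
```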

Counting with .value_counts()

nba['Team'].value_counts()
nba[['Team']].value_counts()
  • The .value_counts() method counts the number of occurrences of each unique value in a Series/DataFrame.
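
A small sketch on a hypothetical Team variable (made-up values). By default, the counts come back sorted from most to least frequent.

```python
import pandas as pd

# Hypothetical toy data, not taken from nba.csv
toy = pd.DataFrame({
    "Team": ["Team X", "Team Y", "Team X"],
})

by_series = toy["Team"].value_counts()     # counts per unique Team value
by_frame = toy[["Team"]].value_counts()    # same counts, from a DataFrame

print(by_series)
print(by_frame)
```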

Counting with .nunique()

nba[['Team']].nunique()
nba.nunique()
  • The .nunique() method counts the number of unique values in each variable in a DataFrame.
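
As a minimal sketch, the hypothetical DataFrame below (made-up values) has two distinct teams and three distinct positions.

```python
import pandas as pd

# Hypothetical toy data, not taken from nba.csv
toy = pd.DataFrame({
    "Team": ["Team X", "Team Y", "Team X"],
    "Position": ["G", "F", "C"],
})

team_unique = toy[["Team"]].nunique()    # unique values in Team only
all_unique = toy.nunique()               # unique values in every variable

print(all_unique)
```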

Pandas Basics

Let’s do Questions 1-3 in Classwork 5!