Lecture 7

Reshaping DataFrames; Joining DataFrames; Visualizing DataFrames

Byeong-Hak Choe

bchoe@geneseo.edu

SUNY Geneseo

March 19, 2024

Learning Objectives

Reshaping DataFrames
Joining DataFrames
Visualizing DataFrames

Reshaping `DataFrames`

Tidy `DataFrames`

A variable is a quantity, quality, or property that we can measure/count.
- It could be anything from a person’s age, the temperature on a given day, or the sales figures for a product.
An observation represents a unit of analysis.
- It could be a set of measurements made under similar conditions (e.g, similar unit of entity, time, and/or geography).
- E.g., In a DataFrame of patient information, each observation could correspond to a single patient’s data record.
A value is the actual data point corresponding to the variable and observation.

Reshaping `DataFrames`

Tidy `DataFrames`

There are three interrelated rules that make a DataFrame tidy:
- Each variable is a column; each column is a variable.
- Each observation is a row; each row is an observation.
- Each value is a cell; each cell is a single value.

Reshaping `DataFrames`

import pandas as pd
# Below is for an interactive display of DataFrame in Colab
from google.colab import data_table
data_table.enable_dataframe_formatter()

A DataFrame can be given in a format unsuited for the analysis that we would like to perform on it.
- A DataFrame may have larger structural problems that extend beyond the data.
- Perhaps the DataFrame stores its values in a format that makes it easy to extract a single row but difficult to aggregate the data.
Reshaping a DataFrame means manipulating it into a different shape.
In this section, we will discuss pandas techniques for molding DataFrame into the shapes we desire.

Reshaping `DataFrames`

Long vs. Wide `DataFrames`

The following DataFrames measure temperatures in two cities over two days.

df_wide = pd.DataFrame({
    'Weekday': ['Tuesday', 'Wednesday'],
    'Miami': [80, 83],
    'Rochester': [57, 62],
    'St. Louis': [71, 75]
})

df_long = pd.DataFrame({
    'Weekday': ['Tuesday', 'Wednesday', 'Tuesday', 'Wednesday', 'Tuesday', 'Wednesday'],
    'City': ['Miami', 'Miami', 'Rochester', 'Rochester', 'St. Louis', 'St. Louis'],
    'Temperature': [80, 83, 57, 62, 71, 75]
})

Reshaping `DataFrames`

Long vs. Wide `DataFrames`

A DataFrame can store its values in wide or long format.
These names reflect the direction in which the data set expands as we add more values to it.
- A long DataFrame increases in height.
- A wide DataFrame increases in width.

Reshaping `DataFrames`

Long vs. Wide `DataFrames`

The optimal storage format for a DataFrame depends on the insight we are trying to glean from it.
- We consider making DataFrames longer if one variable is spread across multiple columns.
- We consider making DataFrames wider if one observation is spread across multiple rows.

Reshaping `DataFrames`

`melt()` and `pivot()`

melt() makes DataFrame longer.
pivot() and pivot_table() make DataFrame wider.

Reshaping `DataFrames`

Make `DataFrame` Longer with `melt()`

df_wide_to_long = (
    df_wide
    .melt()
)

Reshaping `DataFrames`

Make `DataFrame` Longer with `melt()`

df_wide_to_long = (
    df_wide
    .melt(id_vars = "Weekday")
)

melt() can take a few parameters:
- id_vars is a container (string, list, tuple, or array) that represents the variables that will remain as is.
- id_vars can indicate which column should be the “identifier”.

Reshaping `DataFrames`

Make `DataFrame` Longer with `melt()`

df_wide_to_long = (
    df_wide
    .melt(id_vars = "Weekday",
          var_name = "City",
          value_name = "Temperature")
)

melt() can take a few parameters:
- var_name is a string for the new column name for the variable.
- value_name is a string for the new column name that represents the values for the var_name.

Reshaping `DataFrames`

Make `DataFrame` Longer with `melt()`

df_wide_to_long = (
    df_wide
    .melt(id_vars = "Weekday",
          var_name = "City",
          value_name = "Temperature",
          value_vars = ['Miami', 'Rochester'])
)

melt() can take a few parameters:
- value_vars parameter allows us to select which specific columns we want to “melt”.
- By default, it will melt all the columns not specified in the id_vars parameter.

Reshaping `DataFrames`

Make `DataFrame` Wider with `pivot()`

df_long_to_wide = (
    df_long
    .pivot(index = "Weekday",
           columns = "City",
           values = "Temperature"  # To avoid having MultiIndex
        )
    .reset_index()
    )

When using pivot(), we need to specify a few parameters:
- index that takes the column to pivot on;
- columns that takes the column to be used to make the new variable names of the wider DataFrame;
- values that takes the column that provides the values of the variables in the wider DataFrame.

Reshaping `DataFrames`

Let’s consider the following wide-form DataFrame, df, containing information about the number of courses each student took from each department in each year.

dict_data = {"Name": ["Donna", "Donna", "Mike", "Mike"],
             "Department": ["ECON", "DANL", "ECON", "DANL"],
             "2018": [1, 2, 3, 1],
             "2019": [2, 3, 4, 2],
             "2020": [5, 1, 2, 2]}
df = pd.DataFrame(dict_data)

df_longer = df.melt(id_vars=["Name", "Department"], 
                    var_name="Year", 
                    value_name="Number of Courses")

The pivot_table() method can take both a string and a list of variables for the index parameter.
- The pivot() can take only a string for index.

Reshaping `DataFrames`

Let’s consider the following wide-form DataFrame, df, containing information about the number of courses each student took from each department in each year.

dict_data = {"Name": ["Donna", "Donna", "Mike", "Mike"],
             "Department": ["ECON", "DANL", "ECON", "DANL"],
             "2018": [1, 2, 3, 1],
             "2019": [2, 3, 4, 2],
             "2020": [5, 1, 2, 2]}
df = pd.DataFrame(dict_data)

df_longer = df.melt(id_vars=["Name", "Department"], 
                    var_name="Year", 
                    value_name="Number of Courses")

Q. How can we use the df_longer to create the wide-form DataFrame, df_wider, which is equivalent to the df?

Reshaping `DataFrames`

Let’s do Part 1 of Classwork 7!

Joining `DataFrames`

Relational Data

Sometimes, one data set is scattered across multiple files.
- The size of the files can be huge.
- The data collection process can be scattered across time and space.
- E.g., DataFrame for county-level data and DataFrame for geographic information, such as longitude and latitude.
Sometimes we want to combine two or more DataFrames based on common data values in those DataFrames.
- This task is known in the database world as performing a “join.”
- We can do this with the merge() method in Pandas.

Joining `DataFrames`

Relational Data

The variables that are used to connect each pair of tables are called keys.

Joining `DataFrames` with `merge()`

x = pd.DataFrame({
    'key': [1, 2, 3],
    'val_x': ['x1', 'x2', 'x3']
})

y = pd.DataFrame({
    'key': [1, 2, 4],
    'val_y': ['y1', 'y2', 'y3']
})

The colored column represents the “key” variable.
The grey column represents the “value” column.

An inner join matches pairs of observations whenever their keys are equal:

# the default value for 'how' is 'inner'
# so it doesn't actually need to be specified
merge_inner = pd.merge(x, y, on='key', how='inner')
merge_inner_x = x.merge(y, on='key', how='inner')
merge_inner_x_how = x.merge(y, on='key')

A left join keeps all observations in x.

merge_left = pd.merge(x, y, on='key', how='left')
merge_left_x = x.merge(y, on='key', how='left')

The most commonly used join is the left join.

A right join keeps all observations in y.

merge_right = pd.merge(x, y, on='key', how='right')
merge_right_x = x.merge(y, on='key', how='right')

A full join keeps all observations in x and y.

merge_outer = pd.merge(x, y, on='key', how='outer')
merge_outer_x = x.merge(y, on='key', how='outer')

Joining `DataFrames` with `merge()`

Duplicate keys

one-to-many
many-to-many

One DataFrame has duplicate keys (a one-to-many relationship).

x = pd.DataFrame({
    'key':[1, 2, 2, 3],
    'val_x':['x1', 'x2', 'x3', 'x4']})

y = pd.DataFrame({
    'key':[1, 2],
    'val_y':['y1', 'y2'] })
one_to_many = x.merge(y, on='key', 
                         how='left')

Both DataFrames have duplicate keys (many-to-many relationship).

x = pd.DataFrame({
  'key':[1, 2, 2, 3],
  'val_x':['x1','x2','x3','x4']})

y = pd.DataFrame({
  'key': [1, 2, 2, 3],
  'val_y': ['y1', 'y2', 'y3', 'y4'] })
many_to_many = x.merge(y, on='key', 
                          how='left')

Joining `DataFrames` with `merge()`

Defining the key columns

If the left and right columns do not have the same name for the key variables, we can use the left_on and right_on parameters instead.

x = pd.DataFrame({
  'key_x': [1, 2, 3],
  'val_x': ['x1', 'x2', 'x3']
})

y = pd.DataFrame({
  'key_y': [1, 2],
  'val_y': ['y1', 'y2'] })

keys_xy = 
  x.merge(y, left_on = 'key_x', 
             right_on = 'key_y', 
             how = 'left')

Joining `DataFrames` with `merge()`

Let’s do Part 2 of Classwork 7!

Visualizing `DataFrames` with `seaborn`

Graphs and charts let us explore and learn about the structure of the information we have in DataFrame.
Good data visualizations make it easier to communicate our ideas and findings to other people.

Visualizing `DataFrames` with `seaborn`

Key Points in Data Visualization

A graphic should display as much information as it can, with the lowest possible cognitive strain to the viewer.
Strive for clarity.
- Make the data stand out.
  - Avoid too many superimposed elements (e.g., too many curves in the same graphing space).
  - Avoid having the data all skewed to one side or the other of your graph.
Visualization is an iterative process.
- We should try making data visualization informative as much as we can.

Visualizing `DataFrames` with `seaborn`

Getting started with `seaborn`

seaborn is a Python data visualization library based on matplotlib.
- It allows us to easily create beautiful but complex graphics using a simple interface.

import seaborn as sns
sns.set_theme(rc={'figure.dpi': 600, 
                  'figure.figsize': (5, 3.75)})   # better quality

Visualizing `DataFrames` with `seaborn`

Categorical vs. Continuous Variables

Categorical
Continuous
Integer

A categorical variable is a variable whose value is obtained by counting.
- Number of red marbles in a jar
- Number of heads when flipping three coins
- Students’ letter grade
- US state/county
Categorical variables should be category or string (object in DataFrame).

A continuous variable is a variable whose value is obtained by measuring and can have a decimal or fractional value.
- Height/weight of students
- Time it takes to get to school
- Fuel efficiency of a vehicle (e.g., miles per gallon)
Continuous variables should be float.
- We can use astype('float') to make a variable float.

For data visualization, integer-type variables could be treated as either categorical or continuous, depending on the context of analysis.
If the values of an integer-type variable means an intensity or an order, the integer variable could be continuous.
- A variable of age integers (18, 19, 20, 21, …) could be continuous.
- A variable of integer-valued MPG (27, 28, 29, 30, …) could be continuous.
If not, the integer variable is categorical.
- A variable of month integers (1, ,2, …, 12) is categorical.
We can use astype('int') to make a variable int.

Visualizing `DataFrames` with `seaborn`

Making Discoveries from a Data Set

Distribution
Relationship
Types
Summary Stat.

From the distribution plots, we want to see the variation, the tendency of the values of a variable to change.
- Which values are the most common? Why?
- How normal is the data? How many peaks are there in the distribution?
- Which values are rare? Why? Does that match our expectations?
- Can we see any unusual patterns? What might explain them?

From the plots with two or more variables, we want to see co-variation, the tendency for the values of two or more variables to vary together in a related way.
What type of co-variation occurs between variables?
- Are they positively associated?
- Are they negatively associated?
- Are there no association between them?

A distribution of a categorical variable (e.g., bar charts and more)
A distribution of a continuous variable (e.g., histograms and more)
A relationship between two categorical variables (e.g., bar charts and more)
A relationship between two continuous variables (e.g., scatter plots and more)
A relationship between a categorical variable and a continuous variable (e.g., boxplots and more)
A time trend of a variable (e.g., line plots and more)

Use DataFrame.describe() or DataFrameGroupBy.describe() to know:
- Mean (Average, Expected Value);
- Standard Deviation (SD)
- Minimum, First Quartile (Q1), Median (Q2), Third Quartile (Q3), and Maximum.

Visualizing `DataFrames` with `seaborn`

Data Visualization Workflow

Step 1
Step 2
Step 3
Step 4

Read the descriptions of variables in a DataFrame if available.
Check the unit of an observation: Are all values in one single observation measured for:
- One organization (e.g., university, company)?
- One geographic unit (e.g., zip code, county, state, country)?
- One person?
- One person in a specific point of time (e.g., year, month, week)?
- One geographic unit in a specific point of time?
- One organization in a specific point of time?

Figure out whether variables of interests are categorical or continuous.
- Check what types of the variables are (e.g., float, int, datetime64, object, category).
- Use astype(DTYPE) if needed.

Choose appropriate
- Geometric objects (e.g., sns.histplot, sns.scatterplot)
- Aesthetic mappings ((x = , y = , color = , hue = ))
- Faceting (FacetGrid(DATA, row = , col = ).map(GEOMETRIC_OBJECT, VARIABLES) or col/row in the function of geometric object)

Pay attention to the unit of x andy axes.
If needed, transform a given DataFrame (e.g., subset of observations, new variables, summarized DataFrame).

Visualizing `DataFrames` with `seaborn`

Getting started with `seaborn`

Let’s get the names of DataFrames provided by the seaborn library:

import seaborn as sns
sns.get_dataset_names()

Let’s use the titanic and tips DataFrames:

titanic = sns.load_dataset('titanic')
tips = sns.load_dataset('tips')

Visualizing `DataFrames` with `seaborn`

Bar Chart

sns.countplot(data = titanic,
              x =  'sex')

Mapping
- data: DataFrame.
- x: Name of a categorical variable (column) in DataFrame

A bar chart is used to plot the frequency of the different categories.
- It is useful to visualize how values of a categorical variable are distributed.
- A variable is categorical if it can only take one of a small set of values.
We use sns.countplot() function to plot a bar chart

Visualizing `DataFrames` with `seaborn`

Bar Chart

sns.countplot(data = titanic,
              x = 'sex',
              hue = 'survived')

Mapping
- hue: Name of a categorical variable

We can further break up the bars in the bar chart based on another categorical variable.
- This is useful to visualize the relationship between the two categorical variables.

Visualizing `DataFrames` with `seaborn`

Histogram

sns.histplot(data = titanic,
             x =  'age', 
             bins = 5)

Mapping
- bins: Number of bins
- binwidth: Width of each bin

A histogram is a continuous version of a bar chart.
- It is used to plot the frequency of the different values.
- It is useful to visualize how values of a continuous variable are distributed.
We use sns.histplot() function to plot a histogram

Visualizing `DataFrames` with `seaborn`

Histogram

sns.boxplot(data = tips,
            y = 'total_bill')

sns.boxplot(data = tips,
            x = 'time', 
            y = 'total_bill')

A boxplot computes a summary of the distribution and then display a specially formatted box.
- It is useful to visualize how values of a continuous variable are distributed across different values of another (categorical) variable.
We use sns.histplot() function to plot a histogram.

Visualizing `DataFrames` with `seaborn`

Scatter plot

sns.scatterplot(data = tips,
                x = 'total_bill', 
                y = 'tip')

Mapping
- x: Name of a continuous variable on the horizontal axis
- y: Name of a continuous variable on the vertical axis

A scatter plot is used to display the relationship between two continuous variables.
- We can see co-variation as a pattern in the scattered points.
We use sns.scatterplot() function to plot a scatter plot.

Visualizing `DataFrames` with `seaborn`

Scatter plot

sns.scatterplot(data = tips,
                x = 'total_bill', 
                y = 'tip',
                hue = 'smoker')

To the scatter plot, we can add a hue-VARIABLE mapping to display how the relationship between two continuous variables varies by VARIABLE.
Suppose we are interested in the following question:
- Q. Does a smoker and a non-smoker have a difference in tipping behavior?

Visualizing `DataFrames` with `seaborn`

Fitted line

sns.lmplot(data = tips,
           x = 'total_bill', 
           y = 'tip')

From the scatter plot, it is often difficult to clearly see the relationship between two continuous variables.
- sns.lmplot() adds a line that fits well into the scattered points.
- On average, the fitted line describes the relationship between two continuous variables.

Visualizing `DataFrames` with `seaborn`

Transparency with `alpha`

sns.scatterplot(data = tips,
                x = 'total_bill', 
                y = 'tip',
                alpha = .25)

sns.lmplot(data = tips,
           x = 'total_bill', 
           y = 'tip',
           scatter_kws = {'alpha' : 0.2})

In a scatter plot, adding transparency with alpha helps address many data points on the same location.
- We can map alpha to number between 0 and 1.

Visualizing `DataFrames` with `seaborn`

Scatter plot

sns.lmplot(data = tips,
           x = 'total_bill', 
           y = 'tip',
           hue = 'smoker',
           scatter_kws = {'alpha' : 0.5})

To the scatter plot, we can add a hue-VARIABLE mapping to display how the relationship between two continuous variables varies by VARIABLE.
Using the fitted lines, let’s answer the following question:
- Q. Does a smoker and a non-smoker have a difference in tipping behavior?

Visualizing `DataFrames` with `seaborn`

Line cahrt

path_csv = 'https://bcdanl.github.io/data/dji.csv'
dow = pd.read_csv(path_csv)
dow['Date'] = dow['Date'].astype('datetime64')

sns.lineplot(data = dow,
             x =  'Date', 
             y =  'Close')

Mapping
- x: Name of a continuous variable (often time variable) on the horizontal axis
- y: Name of a continuous variable on the vertical axis

A line chart is used to display the trend in a continuous variable or the change in a continuous variable over other variable.
- It draws a line by connecting the scattered points in order of the variable on the x-axis, so that it highlights exactly when changes occur.
We use sns.lineplot() function to plot a line plot.

Visualizing `DataFrames` with `seaborn`

Line cahrt

healthexp = ( sns.load_dataset("healthexp")
             .sort_values(["Country", "Year"])
             .query("Year <= 2020") )
healthexp.head()

sns.lineplot(data = healthexp,
             x =  'Year', 
             y =  'Life_Expectancy',
             hue = 'Country')

For line charts, we often need to group or connect observations to visualize the number of distinct lines.

Visualizing `DataFrames` with `seaborn`

Faceting

Faceting allows us to plot subsets (facets) of our data across subplots.

Faceting
FacetGrid().map()
Alternative?

First, we create a .FacetGrid() object with the data we will be using and define how it will be subset with the row and col arguments:
Second, we then use the .map() method to run a plotting function on each of the subsets, passing along any necessary arguments.

(
 sns.FacetGrid(
       data = titanic,
       row='class',
       col='sex')
 .map(sns.histplot, 'age')
 )

Some geometric object, such as sns.lmplot, does not work well with sns.FacetGrid().
- Instead, sns.lmplot supports mapping, col = 'VAR' and row = 'VAR' to facet the plot.

sns.lmplot(data = tips, x = 'total_bill', y = 'tip', hue = 'smoker',
           scatter_kws = {'alpha' : 0.5},
*          col = 'smoker')

Lecture 7

Learning Objectives

Reshaping DataFrames

Reshaping DataFrames

Tidy DataFrames

Reshaping DataFrames

Tidy DataFrames

Reshaping DataFrames

Reshaping DataFrames

Long vs. Wide DataFrames

Reshaping DataFrames

Long vs. Wide DataFrames

Reshaping DataFrames

Long vs. Wide DataFrames

Reshaping DataFrames

melt() and pivot()

Reshaping DataFrames

Make DataFrame Longer with melt()

Reshaping DataFrames

Make DataFrame Longer with melt()

Reshaping DataFrames

Make DataFrame Longer with melt()

Reshaping DataFrames

Make DataFrame Longer with melt()

Reshaping DataFrames

Make DataFrame Wider with pivot()

Reshaping DataFrames

Reshaping DataFrames

Reshaping DataFrames

Joining DataFrames

Joining DataFrames

Relational Data

Joining DataFrames

Relational Data

Joining DataFrames with merge()

Joining DataFrames with merge()

Joining DataFrames with merge()

Duplicate keys

Joining DataFrames with merge()

Defining the key columns

Joining DataFrames with merge()

Visualizing DataFrames with seaborn

Visualizing DataFrames with seaborn

Visualizing DataFrames with seaborn

Key Points in Data Visualization

Visualizing DataFrames with seaborn

Getting started with seaborn

Visualizing DataFrames with seaborn

Categorical vs. Continuous Variables

Visualizing DataFrames with seaborn

Making Discoveries from a Data Set

Visualizing DataFrames with seaborn

Data Visualization Workflow

Visualizing DataFrames with seaborn

Getting started with seaborn

Visualizing DataFrames with seaborn

Bar Chart

Visualizing DataFrames with seaborn

Bar Chart

Visualizing DataFrames with seaborn

Histogram

Visualizing DataFrames with seaborn

Histogram

Visualizing DataFrames with seaborn

Scatter plot

Visualizing DataFrames with seaborn

Scatter plot

Visualizing DataFrames with seaborn

Fitted line

Visualizing DataFrames with seaborn

Transparency with alpha

Visualizing DataFrames with seaborn

Scatter plot

Visualizing DataFrames with seaborn

Line cahrt

Visualizing DataFrames with seaborn

Line cahrt

Visualizing DataFrames with seaborn

Faceting

Reshaping `DataFrames`

Reshaping `DataFrames`

Tidy `DataFrames`

Reshaping `DataFrames`

Tidy `DataFrames`

Reshaping `DataFrames`

Reshaping `DataFrames`

Long vs. Wide `DataFrames`

Reshaping `DataFrames`

Long vs. Wide `DataFrames`

Reshaping `DataFrames`

Long vs. Wide `DataFrames`

Reshaping `DataFrames`

`melt()` and `pivot()`

Reshaping `DataFrames`

Make `DataFrame` Longer with `melt()`

Reshaping `DataFrames`

Make `DataFrame` Longer with `melt()`

Reshaping `DataFrames`

Make `DataFrame` Longer with `melt()`

Reshaping `DataFrames`

Make `DataFrame` Longer with `melt()`

Reshaping `DataFrames`

Make `DataFrame` Wider with `pivot()`

Reshaping `DataFrames`

Reshaping `DataFrames`

Reshaping `DataFrames`

Joining `DataFrames`

Joining `DataFrames`

Joining `DataFrames`

Joining `DataFrames` with `merge()`

Joining `DataFrames` with `merge()`

Joining `DataFrames` with `merge()`

Joining `DataFrames` with `merge()`

Joining `DataFrames` with `merge()`

Visualizing `DataFrames` with `seaborn`

Visualizing `DataFrames` with `seaborn`

Visualizing `DataFrames` with `seaborn`

Visualizing `DataFrames` with `seaborn`

Getting started with `seaborn`

Visualizing `DataFrames` with `seaborn`

Visualizing `DataFrames` with `seaborn`

Visualizing `DataFrames` with `seaborn`

Visualizing `DataFrames` with `seaborn`

Getting started with `seaborn`

Visualizing `DataFrames` with `seaborn`

Visualizing `DataFrames` with `seaborn`

Visualizing `DataFrames` with `seaborn`

Visualizing `DataFrames` with `seaborn`

Visualizing `DataFrames` with `seaborn`

Visualizing `DataFrames` with `seaborn`

Visualizing `DataFrames` with `seaborn`

Visualizing `DataFrames` with `seaborn`

Transparency with `alpha`

Visualizing `DataFrames` with `seaborn`

Visualizing `DataFrames` with `seaborn`

Visualizing `DataFrames` with `seaborn`

Visualizing `DataFrames` with `seaborn`