Lecture 7

Reshaping DataFrames; Joining DataFrames; Visualizing DataFrames

Byeong-Hak Choe

SUNY Geneseo

March 19, 2024

Learning Objectives

  • Reshaping DataFrames
  • Joining DataFrames
  • Visualizing DataFrames

Reshaping DataFrames

Reshaping DataFrames

Tidy DataFrames

  • A variable is a quantity, quality, or property that we can measure/count.
    • It could be anything from a person’s age, the temperature on a given day, or the sales figures for a product.
  • An observation represents a unit of analysis.
    • It could be a set of measurements made under similar conditions (e.g, similar unit of entity, time, and/or geography).
    • E.g., In a DataFrame of patient information, each observation could correspond to a single patient’s data record.
  • A value is the actual data point corresponding to the variable and observation.

Reshaping DataFrames

Tidy DataFrames

  • There are three interrelated rules that make a DataFrame tidy:
    • Each variable is a column; each column is a variable.
    • Each observation is a row; each row is an observation.
    • Each value is a cell; each cell is a single value.

Reshaping DataFrames

import pandas as pd
# Below is for an interactive display of DataFrame in Colab
from google.colab import data_table
data_table.enable_dataframe_formatter()
  • A DataFrame can be given in a format unsuited for the analysis that we would like to perform on it.

    • A DataFrame may have larger structural problems that extend beyond the data.
    • Perhaps the DataFrame stores its values in a format that makes it easy to extract a single row but difficult to aggregate the data.
  • Reshaping a DataFrame means manipulating it into a different shape.

  • In this section, we will discuss pandas techniques for molding DataFrame into the shapes we desire.

Reshaping DataFrames

Long vs. Wide DataFrames

  • The following DataFrames measure temperatures in two cities over two days.
df_wide = pd.DataFrame({
    'Weekday': ['Tuesday', 'Wednesday'],
    'Miami': [80, 83],
    'Rochester': [57, 62],
    'St. Louis': [71, 75]
})

df_long = pd.DataFrame({
    'Weekday': ['Tuesday', 'Wednesday', 'Tuesday', 'Wednesday', 'Tuesday', 'Wednesday'],
    'City': ['Miami', 'Miami', 'Rochester', 'Rochester', 'St. Louis', 'St. Louis'],
    'Temperature': [80, 83, 57, 62, 71, 75]
})

Reshaping DataFrames

Long vs. Wide DataFrames

  • A DataFrame can store its values in wide or long format.
  • These names reflect the direction in which the data set expands as we add more values to it.
    • A long DataFrame increases in height.
    • A wide DataFrame increases in width.

Reshaping DataFrames

Long vs. Wide DataFrames

  • The optimal storage format for a DataFrame depends on the insight we are trying to glean from it.
    • We consider making DataFrames longer if one variable is spread across multiple columns.
    • We consider making DataFrames wider if one observation is spread across multiple rows.

Reshaping DataFrames

melt() and pivot()

  • melt() makes DataFrame longer.
  • pivot() and pivot_table() make DataFrame wider.

Reshaping DataFrames

Make DataFrame Longer with melt()

df_wide_to_long = (
    df_wide
    .melt()
)


Reshaping DataFrames

Make DataFrame Longer with melt()

df_wide_to_long = (
    df_wide
    .melt(id_vars = "Weekday")
)


  • melt() can take a few parameters:
    • id_vars is a container (string, list, tuple, or array) that represents the variables that will remain as is.
    • id_vars can indicate which column should be the “identifier”.

Reshaping DataFrames

Make DataFrame Longer with melt()

df_wide_to_long = (
    df_wide
    .melt(id_vars = "Weekday",
          var_name = "City",
          value_name = "Temperature")
)
  • melt() can take a few parameters:
    • var_name is a string for the new column name for the variable.
    • value_name is a string for the new column name that represents the values for the var_name.

Reshaping DataFrames

Make DataFrame Longer with melt()

df_wide_to_long = (
    df_wide
    .melt(id_vars = "Weekday",
          var_name = "City",
          value_name = "Temperature",
          value_vars = ['Miami', 'Rochester'])
)
  • melt() can take a few parameters:
    • value_vars parameter allows us to select which specific columns we want to “melt”.
    • By default, it will melt all the columns not specified in the id_vars parameter.

Reshaping DataFrames

Make DataFrame Wider with pivot()

df_long_to_wide = (
    df_long
    .pivot(index = "Weekday",
           columns = "City",
           values = "Temperature"  # To avoid having MultiIndex
        )
    .reset_index()
    )
  • When using pivot(), we need to specify a few parameters:
    • index that takes the column to pivot on;
    • columns that takes the column to be used to make the new variable names of the wider DataFrame;
    • values that takes the column that provides the values of the variables in the wider DataFrame.

Reshaping DataFrames

  • Let’s consider the following wide-form DataFrame, df, containing information about the number of courses each student took from each department in each year.
dict_data = {"Name": ["Donna", "Donna", "Mike", "Mike"],
             "Department": ["ECON", "DANL", "ECON", "DANL"],
             "2018": [1, 2, 3, 1],
             "2019": [2, 3, 4, 2],
             "2020": [5, 1, 2, 2]}
df = pd.DataFrame(dict_data)

df_longer = df.melt(id_vars=["Name", "Department"], 
                    var_name="Year", 
                    value_name="Number of Courses")
  • The pivot_table() method can take both a string and a list of variables for the index parameter.
    • The pivot() can take only a string for index.

Reshaping DataFrames

  • Let’s consider the following wide-form DataFrame, df, containing information about the number of courses each student took from each department in each year.
dict_data = {"Name": ["Donna", "Donna", "Mike", "Mike"],
             "Department": ["ECON", "DANL", "ECON", "DANL"],
             "2018": [1, 2, 3, 1],
             "2019": [2, 3, 4, 2],
             "2020": [5, 1, 2, 2]}
df = pd.DataFrame(dict_data)

df_longer = df.melt(id_vars=["Name", "Department"], 
                    var_name="Year", 
                    value_name="Number of Courses")

Q. How can we use the df_longer to create the wide-form DataFrame, df_wider, which is equivalent to the df?

Reshaping DataFrames

Let’s do Part 1 of Classwork 7!

Joining DataFrames

Joining DataFrames

Relational Data

  • Sometimes, one data set is scattered across multiple files.
    • The size of the files can be huge.
    • The data collection process can be scattered across time and space.
    • E.g., DataFrame for county-level data and DataFrame for geographic information, such as longitude and latitude.
  • Sometimes we want to combine two or more DataFrames based on common data values in those DataFrames.
    • This task is known in the database world as performing a “join.”
    • We can do this with the merge() method in Pandas.

Joining DataFrames

Relational Data

  • The variables that are used to connect each pair of tables are called keys.

Joining DataFrames with merge()

x = pd.DataFrame({
    'key': [1, 2, 3],
    'val_x': ['x1', 'x2', 'x3']
})
y = pd.DataFrame({
    'key': [1, 2, 4],
    'val_y': ['y1', 'y2', 'y3']
})
  • The colored column represents the “key” variable.
  • The grey column represents the “value” column.

Joining DataFrames with merge()

  • An inner join matches pairs of observations whenever their keys are equal:

# the default value for 'how' is 'inner'
# so it doesn't actually need to be specified
merge_inner = pd.merge(x, y, on='key', how='inner')
merge_inner_x = x.merge(y, on='key', how='inner')
merge_inner_x_how = x.merge(y, on='key')
  • A left join keeps all observations in x.

merge_left = pd.merge(x, y, on='key', how='left')
merge_left_x = x.merge(y, on='key', how='left')
  • The most commonly used join is the left join.
  • A right join keeps all observations in y.

merge_right = pd.merge(x, y, on='key', how='right')
merge_right_x = x.merge(y, on='key', how='right')
  • A full join keeps all observations in x and y.

merge_outer = pd.merge(x, y, on='key', how='outer')
merge_outer_x = x.merge(y, on='key', how='outer')

Joining DataFrames with merge()

Duplicate keys

  • One DataFrame has duplicate keys (a one-to-many relationship).

x = pd.DataFrame({
    'key':[1, 2, 2, 3],
    'val_x':['x1', 'x2', 'x3', 'x4']})
y = pd.DataFrame({
    'key':[1, 2],
    'val_y':['y1', 'y2'] })
one_to_many = x.merge(y, on='key', 
                         how='left')
  • Both DataFrames have duplicate keys (many-to-many relationship).

x = pd.DataFrame({
  'key':[1, 2, 2, 3],
  'val_x':['x1','x2','x3','x4']})
y = pd.DataFrame({
  'key': [1, 2, 2, 3],
  'val_y': ['y1', 'y2', 'y3', 'y4'] })
many_to_many = x.merge(y, on='key', 
                          how='left')

Joining DataFrames with merge()

Defining the key columns

  • If the left and right columns do not have the same name for the key variables, we can use the left_on and right_on parameters instead.
x = pd.DataFrame({
  'key_x': [1, 2, 3],
  'val_x': ['x1', 'x2', 'x3']
})
y = pd.DataFrame({
  'key_y': [1, 2],
  'val_y': ['y1', 'y2'] })

keys_xy = 
  x.merge(y, left_on = 'key_x', 
             right_on = 'key_y', 
             how = 'left')

Joining DataFrames with merge()

Let’s do Part 2 of Classwork 7!

Visualizing DataFrames with seaborn

Visualizing DataFrames with seaborn

  • Graphs and charts let us explore and learn about the structure of the information we have in DataFrame.

  • Good data visualizations make it easier to communicate our ideas and findings to other people.

Visualizing DataFrames with seaborn

Key Points in Data Visualization

  • A graphic should display as much information as it can, with the lowest possible cognitive strain to the viewer.

  • Strive for clarity.

    • Make the data stand out.
      • Avoid too many superimposed elements (e.g., too many curves in the same graphing space).
      • Avoid having the data all skewed to one side or the other of your graph.
  • Visualization is an iterative process.

    • We should try making data visualization informative as much as we can.

Visualizing DataFrames with seaborn

Getting started with seaborn

  • seaborn is a Python data visualization library based on matplotlib.
    • It allows us to easily create beautiful but complex graphics using a simple interface.
import seaborn as sns
sns.set_theme(rc={'figure.dpi': 600, 
                  'figure.figsize': (5, 3.75)})   # better quality

Visualizing DataFrames with seaborn

Categorical vs. Continuous Variables

  • A categorical variable is a variable whose value is obtained by counting.
    • Number of red marbles in a jar
    • Number of heads when flipping three coins
    • Students’ letter grade
    • US state/county
  • Categorical variables should be category or string (object in DataFrame).
  • A continuous variable is a variable whose value is obtained by measuring and can have a decimal or fractional value.
    • Height/weight of students
    • Time it takes to get to school
    • Fuel efficiency of a vehicle (e.g., miles per gallon)
  • Continuous variables should be float.
    • We can use astype('float') to make a variable float.
  • For data visualization, integer-type variables could be treated as either categorical or continuous, depending on the context of analysis.

  • If the values of an integer-type variable means an intensity or an order, the integer variable could be continuous.

    • A variable of age integers (18, 19, 20, 21, …) could be continuous.
    • A variable of integer-valued MPG (27, 28, 29, 30, …) could be continuous.
  • If not, the integer variable is categorical.

    • A variable of month integers (1, ,2, …, 12) is categorical.
  • We can use astype('int') to make a variable int.

Visualizing DataFrames with seaborn

Making Discoveries from a Data Set

  • From the distribution plots, we want to see the variation, the tendency of the values of a variable to change.
    • Which values are the most common? Why?
    • How normal is the data? How many peaks are there in the distribution?
    • Which values are rare? Why? Does that match our expectations?
    • Can we see any unusual patterns? What might explain them?
  • From the plots with two or more variables, we want to see co-variation, the tendency for the values of two or more variables to vary together in a related way.

  • What type of co-variation occurs between variables?

    • Are they positively associated?
    • Are they negatively associated?
    • Are there no association between them?
  • A distribution of a categorical variable (e.g., bar charts and more)
  • A distribution of a continuous variable (e.g., histograms and more)
  • A relationship between two categorical variables (e.g., bar charts and more)
  • A relationship between two continuous variables (e.g., scatter plots and more)
  • A relationship between a categorical variable and a continuous variable (e.g., boxplots and more)
  • A time trend of a variable (e.g., line plots and more)
  • Use DataFrame.describe() or DataFrameGroupBy.describe() to know:
    • Mean (Average, Expected Value);
    • Standard Deviation (SD)
    • Minimum, First Quartile (Q1), Median (Q2), Third Quartile (Q3), and Maximum.

Visualizing DataFrames with seaborn

Data Visualization Workflow

  • Read the descriptions of variables in a DataFrame if available.

  • Check the unit of an observation: Are all values in one single observation measured for:

    • One organization (e.g., university, company)?
    • One geographic unit (e.g., zip code, county, state, country)?
    • One person?
    • One person in a specific point of time (e.g., year, month, week)?
    • One geographic unit in a specific point of time?
    • One organization in a specific point of time?
  • Figure out whether variables of interests are categorical or continuous.
    • Check what types of the variables are (e.g., float, int, datetime64, object, category).
    • Use astype(DTYPE) if needed.
  • Choose appropriate
    • Geometric objects (e.g., sns.histplot, sns.scatterplot)
    • Aesthetic mappings ((x = , y = , color = , hue = ))
    • Faceting (FacetGrid(DATA, row = , col = ).map(GEOMETRIC_OBJECT, VARIABLES) or col/row in the function of geometric object)
  • Pay attention to the unit of x andy axes.

  • If needed, transform a given DataFrame (e.g., subset of observations, new variables, summarized DataFrame).

Visualizing DataFrames with seaborn

Getting started with seaborn

  • Let’s get the names of DataFrames provided by the seaborn library:
import seaborn as sns
sns.get_dataset_names() 
  • Let’s use the titanic and tips DataFrames:
titanic = sns.load_dataset('titanic')
tips = sns.load_dataset('tips')

Visualizing DataFrames with seaborn

Bar Chart

sns.countplot(data = titanic,
              x =  'sex')
  • Mapping
    • data: DataFrame.
    • x: Name of a categorical variable (column) in DataFrame
  • A bar chart is used to plot the frequency of the different categories.
    • It is useful to visualize how values of a categorical variable are distributed.
    • A variable is categorical if it can only take one of a small set of values.
  • We use sns.countplot() function to plot a bar chart

Visualizing DataFrames with seaborn

Bar Chart

sns.countplot(data = titanic,
              x = 'sex',
              hue = 'survived')
  • Mapping
    • hue: Name of a categorical variable
  • We can further break up the bars in the bar chart based on another categorical variable.

    • This is useful to visualize the relationship between the two categorical variables.

Visualizing DataFrames with seaborn

Histogram

sns.histplot(data = titanic,
             x =  'age', 
             bins = 5)
  • Mapping
    • bins: Number of bins
    • binwidth: Width of each bin
  • A histogram is a continuous version of a bar chart.
    • It is used to plot the frequency of the different values.
    • It is useful to visualize how values of a continuous variable are distributed.
  • We use sns.histplot() function to plot a histogram

Visualizing DataFrames with seaborn

Histogram

sns.boxplot(data = tips,
            y = 'total_bill')
sns.boxplot(data = tips,
            x = 'time', 
            y = 'total_bill')
  • A boxplot computes a summary of the distribution and then display a specially formatted box.
    • It is useful to visualize how values of a continuous variable are distributed across different values of another (categorical) variable.
  • We use sns.histplot() function to plot a histogram.

Visualizing DataFrames with seaborn

Scatter plot

sns.scatterplot(data = tips,
                x = 'total_bill', 
                y = 'tip')
  • Mapping
    • x: Name of a continuous variable on the horizontal axis
    • y: Name of a continuous variable on the vertical axis
  • A scatter plot is used to display the relationship between two continuous variables.

    • We can see co-variation as a pattern in the scattered points.
  • We use sns.scatterplot() function to plot a scatter plot.

Visualizing DataFrames with seaborn

Scatter plot

sns.scatterplot(data = tips,
                x = 'total_bill', 
                y = 'tip',
                hue = 'smoker')
  • To the scatter plot, we can add a hue-VARIABLE mapping to display how the relationship between two continuous variables varies by VARIABLE.

  • Suppose we are interested in the following question:

    • Q. Does a smoker and a non-smoker have a difference in tipping behavior?

Visualizing DataFrames with seaborn

Fitted line

sns.lmplot(data = tips,
           x = 'total_bill', 
           y = 'tip')
  • From the scatter plot, it is often difficult to clearly see the relationship between two continuous variables.

    • sns.lmplot() adds a line that fits well into the scattered points.

    • On average, the fitted line describes the relationship between two continuous variables.

Visualizing DataFrames with seaborn

Transparency with alpha

sns.scatterplot(data = tips,
                x = 'total_bill', 
                y = 'tip',
                alpha = .25)
sns.lmplot(data = tips,
           x = 'total_bill', 
           y = 'tip',
           scatter_kws = {'alpha' : 0.2})
  • In a scatter plot, adding transparency with alpha helps address many data points on the same location.
    • We can map alpha to number between 0 and 1.

Visualizing DataFrames with seaborn

Scatter plot

sns.lmplot(data = tips,
           x = 'total_bill', 
           y = 'tip',
           hue = 'smoker',
           scatter_kws = {'alpha' : 0.5})
  • To the scatter plot, we can add a hue-VARIABLE mapping to display how the relationship between two continuous variables varies by VARIABLE.

  • Using the fitted lines, let’s answer the following question:

    • Q. Does a smoker and a non-smoker have a difference in tipping behavior?

Visualizing DataFrames with seaborn

Line cahrt

path_csv = 'https://bcdanl.github.io/data/dji.csv'
dow = pd.read_csv(path_csv)
dow['Date'] = dow['Date'].astype('datetime64')

sns.lineplot(data = dow,
             x =  'Date', 
             y =  'Close')
  • Mapping
    • x: Name of a continuous variable (often time variable) on the horizontal axis
    • y: Name of a continuous variable on the vertical axis
  • A line chart is used to display the trend in a continuous variable or the change in a continuous variable over other variable.
    • It draws a line by connecting the scattered points in order of the variable on the x-axis, so that it highlights exactly when changes occur.
  • We use sns.lineplot() function to plot a line plot.

Visualizing DataFrames with seaborn

Line cahrt

healthexp = ( sns.load_dataset("healthexp")
             .sort_values(["Country", "Year"])
             .query("Year <= 2020") )
healthexp.head()

sns.lineplot(data = healthexp,
             x =  'Year', 
             y =  'Life_Expectancy',
             hue = 'Country')
  • For line charts, we often need to group or connect observations to visualize the number of distinct lines.

Visualizing DataFrames with seaborn

Faceting

  • Faceting allows us to plot subsets (facets) of our data across subplots.
  • First, we create a .FacetGrid() object with the data we will be using and define how it will be subset with the row and col arguments:

  • Second, we then use the .map() method to run a plotting function on each of the subsets, passing along any necessary arguments.

(
 sns.FacetGrid(
       data = titanic,
       row='class',
       col='sex')
 .map(sns.histplot, 'age')
 )
  • Some geometric object, such as sns.lmplot, does not work well with sns.FacetGrid().
    • Instead, sns.lmplot supports mapping, col = 'VAR' and row = 'VAR' to facet the plot.
sns.lmplot(data = tips, x = 'total_bill', y = 'tip', hue = 'smoker',
           scatter_kws = {'alpha' : 0.5},
*          col = 'smoker')