Reshaping DataFrames; Joining DataFrames; Visualizing DataFrames
March 19, 2024
In a DataFrame of patient information, each observation could correspond to a single patient’s data record.
A DataFrame is tidy when each variable has its own column, each observation has its own row, and each value has its own cell.
import pandas as pd
# Below is for an interactive display of DataFrame in Colab
from google.colab import data_table
data_table.enable_dataframe_formatter()
A DataFrame can be given in a format unsuited for the analysis that we would like to perform on it.
A DataFrame may have larger structural problems that extend beyond the data.
For example, a DataFrame may store its values in a format that makes it easy to extract a single row but difficult to aggregate the data.
Reshaping a DataFrame means manipulating it into a different shape.
In this section, we will discuss pandas techniques for molding a DataFrame into the shape we desire.
These DataFrames measure temperatures in three cities over two days.
df_wide = pd.DataFrame({
    'Weekday': ['Tuesday', 'Wednesday'],
    'Miami': [80, 83],
    'Rochester': [57, 62],
    'St. Louis': [71, 75]
})
df_long = pd.DataFrame({
    'Weekday': ['Tuesday', 'Wednesday', 'Tuesday', 'Wednesday', 'Tuesday', 'Wednesday'],
    'City': ['Miami', 'Miami', 'Rochester', 'Rochester', 'St. Louis', 'St. Louis'],
    'Temperature': [80, 83, 57, 62, 71, 75]
})
A DataFrame can store its values in wide or long format.
As values are added, a long-format DataFrame increases in height.
As values are added, a wide-format DataFrame increases in width.
The best storage format for a DataFrame depends on the insight we are trying to glean from it.
We make DataFrames longer if one variable is spread across multiple columns.
We make DataFrames wider if one observation is spread across multiple rows.
melt() and pivot()
melt() makes a DataFrame longer.
pivot() and pivot_table() make a DataFrame wider.
Make a DataFrame Longer with melt()
melt() can take a few parameters:
id_vars is a container (string, list, tuple, or array) that represents the variables that will remain as is.
id_vars can indicate which column should be the “identifier”.
df_wide_to_long = (
    df_wide
    .melt(id_vars = "Weekday",
          var_name = "City",
          value_name = "Temperature")
)
var_name is a string for the new column name for the variable.
value_name is a string for the new column name that represents the values for the var_name.
df_wide_to_long = (
    df_wide
    .melt(id_vars = "Weekday",
          var_name = "City",
          value_name = "Temperature",
          value_vars = ['Miami', 'Rochester'])
)
The value_vars parameter allows us to select which specific columns we want to “melt”.
By default, melt() melts all the columns that are not listed in the id_vars parameter.
Make a DataFrame Wider with pivot()
df_long_to_wide = (
    df_long
    .pivot(index = "Weekday",
           columns = "City",
           values = "Temperature"  # To avoid having MultiIndex
           )
    .reset_index()
)
When using pivot(), we need to specify a few parameters:
index takes the column to pivot on;
columns takes the column to be used to make the new variable names of the wider DataFrame;
values takes the column that provides the values of the variables in the wider DataFrame.
Consider the following DataFrame, df, containing information about the number of courses each student took from each department in each year.
dict_data = {"Name": ["Donna", "Donna", "Mike", "Mike"],
             "Department": ["ECON", "DANL", "ECON", "DANL"],
             "2018": [1, 2, 3, 1],
             "2019": [2, 3, 4, 2],
             "2020": [5, 1, 2, 2]}
df = pd.DataFrame(dict_data)
df_longer = df.melt(id_vars=["Name", "Department"],
                    var_name="Year",
                    value_name="Number of Courses")
The pivot_table() method can take either a string or a list of variables for the index parameter.
In pandas versions before 1.1, pivot() could take only a string for index; newer versions also accept a list.
Q. How can we use df_longer to create the wide-form DataFrame, df_wider, which is equivalent to df?
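One possible answer is sketched below. It is a minimal sketch and not necessarily the intended classwork solution: it assumes that each (Name, Department, Year) combination appears only once, so that aggregating with "sum" simply returns the original value. The resulting rows come back sorted by Name and Department, so df_wider matches df up to row order.

```python
import pandas as pd

# Same data as the df defined above
df = pd.DataFrame({
    "Name": ["Donna", "Donna", "Mike", "Mike"],
    "Department": ["ECON", "DANL", "ECON", "DANL"],
    "2018": [1, 2, 3, 1],
    "2019": [2, 3, 4, 2],
    "2020": [5, 1, 2, 2],
})
df_longer = df.melt(id_vars=["Name", "Department"],
                    var_name="Year",
                    value_name="Number of Courses")

df_wider = (
    df_longer
    .pivot_table(index=["Name", "Department"],   # a list of identifier columns
                 columns="Year",
                 values="Number of Courses",
                 aggfunc="sum")                   # assumption: one value per cell, so sum keeps it unchanged
    .reset_index()                                # turn Name and Department back into columns
    .rename_axis(None, axis="columns")            # drop the leftover "Year" label on the columns
)
print(df_wider)
```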
Let’s do Part 1 of Classwork 7!
Joining DataFrames
We may have one DataFrame for county-level data and another DataFrame for geographic information, such as longitude and latitude.
We can join DataFrames based on common data values in those DataFrames.
This can be done with the merge() method in pandas.
Joining DataFrames with merge()
The two DataFrames being joined are referred to as x and y.
One DataFrame may have duplicate keys (a one-to-many relationship).
If the key columns have different names in the two DataFrames, we can use the left_on and right_on parameters instead of on.
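The following is a minimal sketch of these ideas. The DataFrames x and y, their column names, and their values are hypothetical toy data, not the data used in class. The key 2 appears twice in y, which illustrates a one-to-many relationship.

```python
import pandas as pd

# Hypothetical toy data: key 2 is duplicated in y (one-to-many)
x = pd.DataFrame({"key": [1, 2, 3], "val_x": ["x1", "x2", "x3"]})
y = pd.DataFrame({"key": [1, 2, 2, 4], "val_y": ["y1", "y2", "y3", "y4"]})

inner = x.merge(y, on="key", how="inner")  # keeps keys found in both x and y
left = x.merge(y, on="key", how="left")    # keeps every row of x; unmatched val_y is NaN
outer = x.merge(y, on="key", how="outer")  # keeps keys found in either x or y

# When the key columns have different names, use left_on and right_on
y_renamed = y.rename(columns={"key": "id"})
left_diff = x.merge(y_renamed, left_on="key", right_on="id", how="left")

print(inner)
print(left)
print(outer)
print(left_diff)
```

Note that with left_on and right_on, both key columns (here key and id) are kept in the result.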
Let’s do Part 2 of Classwork 7!
Visualizing DataFrames with seaborn
Graphs and charts let us explore and learn about the structure of the information we have in a DataFrame.
Good data visualizations make it easier to communicate our ideas and findings to other people.
A graphic should display as much information as it can, with the lowest possible cognitive strain to the viewer.
Strive for clarity.
Visualization is an iterative process.
seaborn is a Python data visualization library based on matplotlib.
A categorical variable is typically of type category or string (object in a DataFrame).
A continuous variable is typically of type float.
We can use astype('float') to make a variable float.
For data visualization, integer-type variables could be treated as either categorical or continuous, depending on the context of analysis.
If the values of an integer-type variable mean an intensity or an order, the integer variable could be continuous.
If not, the integer variable is categorical.
We can use astype('int') to make a variable int.
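As a small illustration of these conversions, here is a sketch using a hypothetical DataFrame (the column names and values are made up). The store_id column holds integer codes with no order or intensity, so it is converted to category rather than kept as a number.

```python
import pandas as pd

# Hypothetical data: numbers stored as strings, and integer codes
df_types = pd.DataFrame({
    "rating": ["1", "3", "2", "3"],
    "price": ["9.5", "12.0", "7.25", "8.0"],
    "store_id": [101, 102, 101, 103],
})

df_types["price"] = df_types["price"].astype("float")           # continuous
df_types["rating"] = df_types["rating"].astype("int")           # ordered integer
df_types["store_id"] = df_types["store_id"].astype("category")  # categorical codes

print(df_types.dtypes)
```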
From the plots with two or more variables, we want to see co-variation, the tendency for the values of two or more variables to vary together in a related way.
What type of co-variation occurs between variables?
We can use DataFrame.describe() or DataFrameGroupBy.describe() to obtain summary statistics of the variables.
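A minimal sketch of both calls, using a hypothetical DataFrame with one grouping variable and one numeric variable:

```python
import pandas as pd

# Hypothetical data
df_scores = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "score": [70.0, 80.0, 60.0, 90.0, 78.0],
})

overall = df_scores.describe()                    # summary statistics for numeric columns
by_group = df_scores.groupby("group").describe()  # the same statistics within each group

print(overall)
print(by_group)
```

For numeric columns, describe() reports the count, mean, standard deviation, minimum, quartiles, and maximum. The grouped version returns columns with two levels, such as ("score", "mean").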
Read the descriptions of variables in a DataFrame if available.
Check the unit of an observation, that is, what all the values in one single observation are measured for.
Check the types of variables (float, int, datetime64, object, category) and convert them with .astype(DTYPE) if needed.
Choose a geometric object (e.g., sns.histplot, sns.scatterplot).
Map variables to the arguments of the plot (x = , y = , color = , hue = ).
Consider faceting (FacetGrid(DATA, row = , col = ).map(GEOMETRIC_OBJECT, VARIABLES), or col/row in the function of the geometric object).
Pay attention to the units of the x and y axes.
If needed, transform a given DataFrame (e.g., subset of observations, new variables, summarized DataFrame).
We use the sns.countplot() function to plot a bar chart.
data: DataFrame.
x: Name of a categorical variable (column) in the DataFrame.
hue: Name of a categorical variable.
We can further break up the bars in the bar chart based on another categorical variable.
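A minimal sketch of a bar chart with a hue mapping. The DataFrame df_cars and its columns are hypothetical, and the Agg backend line is only there so that the code runs without a display; it can be dropped in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import pandas as pd
import seaborn as sns

# Hypothetical data: two categorical variables
df_cars = pd.DataFrame({
    "body": ["suv", "suv", "sedan", "sedan", "sedan", "truck"],
    "drive": ["4wd", "fwd", "fwd", "fwd", "4wd", "4wd"],
})

# One bar per body type, split further by drive type
ax = sns.countplot(data=df_cars, x="body", hue="drive")
ax.figure.savefig("countplot.png")
```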
We use the sns.histplot() function to plot a histogram.
bins: Number of bins.
binwidth: Width of each bin.
x: Name of a continuous variable on the horizontal axis.
y: Name of a continuous variable on the vertical axis.
A scatter plot is used to display the relationship between two continuous variables.
We use the sns.scatterplot() function to plot a scatter plot.
To the scatter plot, we can add a hue-VARIABLE mapping to display how the relationship between two continuous variables varies by VARIABLE.
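A minimal sketch of a scatter plot with a hue mapping, assuming a hypothetical DataFrame df_study with two continuous variables and one categorical variable:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import pandas as pd
import seaborn as sns

# Hypothetical data
df_study = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 6],
    "score": [52, 58, 61, 70, 74, 83],
    "section": ["A", "B", "A", "B", "A", "B"],
})

# Points are colored by section, so the two groups can be compared
ax = sns.scatterplot(data=df_study, x="hours", y="score", hue="section")
ax.figure.savefig("scatterplot.png")
```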
Suppose we are interested in the following question:
From the scatter plot, it is often difficult to clearly see the relationship between two continuous variables.
sns.lmplot() draws the scatter plot together with a fitted line (a linear regression line by default).
On average, the fitted line describes the relationship between two continuous variables.
Transparency with alpha
alpha helps when many data points fall on the same location (overplotting).
We can set alpha to a number between 0 (fully transparent) and 1 (fully opaque).
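A minimal sketch combining a fitted line with transparency. The data are hypothetical and contain repeated points on purpose. In sns.lmplot(), alpha for the points is passed through the scatter_kws dictionary; ci=None is an optional choice here that turns off the confidence band.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import pandas as pd
import seaborn as sns

# Hypothetical data with several points on the same location
df_study = pd.DataFrame({
    "hours": [1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 5, 5],
    "score": [50, 50, 55, 60, 60, 65, 65, 70, 75, 75, 80, 85],
})

g = sns.lmplot(data=df_study, x="hours", y="score",
               ci=None,                    # no confidence band around the line
               scatter_kws={"alpha": 0.3}) # overlapping points look darker
g.savefig("lmplot.png")
```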
To the scatter plot, we can add a hue-VARIABLE mapping to display how the relationship between two continuous variables varies by VARIABLE.
Using the fitted lines, let’s answer the following question:
x: Name of a continuous variable (often a time variable) on the horizontal axis.
y: Name of a continuous variable on the vertical axis.
We use the sns.lineplot() function to plot a line plot.
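import seaborn as sns

# Load the example dataset, order it by country and year, and keep years up to 2020
healthexp = (
    sns.load_dataset("healthexp")
    .sort_values(["Country", "Year"])
    .query("Year <= 2020")
)
healthexp.head()
# One line per country
sns.lineplot(data = healthexp,
             x = 'Year',
             y = 'Life_Expectancy',
             hue = 'Country')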
First, we create a sns.FacetGrid() object with the data we will be using and define how it will be subset with the row and col arguments.
Second, we use the .map() method to run a plotting function on each of the subsets, passing along any necessary arguments.