pandas
Basics - Reshaping DataFrames; Joining DataFrames; Data Concatenation
February 24, 2025
DataFrames
DataFrames
DataFrames
DataFrame
tidy:
DataFrames
A DataFrame
can be given in a format unsuited for the analysis that we would like to perform on it.
DataFrame
may have larger structural problems that extend beyond the data.DataFrame
stores its values in a format that makes it easy to extract a single row but difficult to aggregate the data.Reshaping a DataFrame
means manipulating it into a different shape.
In this section, we will discuss pandas techniques for molding a DataFrame
into the shape we desire.
DataFrames
DataFrames
measure temperatures in two cities over two days.import pandas as pd
from google.colab import data_table
data_table.enable_dataframe_formatter()
df_wide = pd.DataFrame({
'Weekday': ['Tuesday', 'Wednesday'],
'Miami': [80, 83],
'Rochester': [57, 62],
'St. Louis': [71, 75]
})
df_long = pd.DataFrame({
'Weekday': ['Tuesday', 'Wednesday', 'Tuesday', 'Wednesday', 'Tuesday', 'Wednesday'],
'City': ['Miami', 'Miami', 'Rochester', 'Rochester', 'St. Louis', 'St. Louis'],
'Temperature': [80, 83, 57, 62, 71, 75]
})
DataFrames
DataFrame
can store its values in wide or long format.DataFrame
increases in height.DataFrame
increases in width.DataFrames
DataFrame
depends on the insight we are trying to glean from it.
DataFrames
longer if one variable is spread across multiple columns.DataFrames
wider if one observation is spread across multiple rows.DataFrames
melt()
and pivot()
melt()
makes DataFrame
longer.pivot()
makes DataFrame
wider.DataFrame
Longer with melt()
DataFrame
Longer with melt()
melt()
can take a few parameters:
id_vars
is a container (string
, list
, tuple
, or array
) that represents the variables that will remain as is.id_vars
can indicate which column should be the “identifier”.DataFrame
Longer with melt()
df_wide_to_long = (
df_wide
.melt(id_vars = "Weekday",
var_name = "City",
value_name = "Temperature")
)
melt()
can take a few parameters:
var_name
is a string for the name of the variable whose values are taken from column names in a given wide-form DataFrame.value_name
is a string for the name of the variable whose values are taken from the values in a given wide-form DataFrame.DataFrame
Wider with pivot()
df_long_to_wide = (
df_long
.pivot(index = "Weekday",
columns = "City",
values = "Temperature"
)
.reset_index()
)
pivot()
, we need to specify a few parameters:
index
that takes the column to pivot on;columns
that takes the column to be used to make the variable names of the wider DataFrame
;values
that takes the column that provides the values of the variables in the wider DataFrame
.DataFrames
DataFrame
, df
, containing information about the number of courses each student took from each department in each year.dict_data = {"Name": ["Donna", "Donna", "Mike", "Mike"],
"Department": ["ECON", "DANL", "ECON", "DANL"],
"2018": [1, 2, 3, 1],
"2019": [2, 3, 4, 2],
"2020": [5, 1, 2, 2]}
df = pd.DataFrame(dict_data)
df_longer = df.melt(id_vars=["Name", "Department"],
var_name="Year",
value_name="Number of Courses")
pivot()
method can also take a list
of variable names for the index
parameter.DataFrames
DataFrame
, df
, containing information about the number of courses each student took from each department in each year.dict_data = {"Name": ["Donna", "Donna", "Mike", "Mike"],
"Department": ["ECON", "DANL", "ECON", "DANL"],
"2022": [1, 2, 3, 1],
"2023": [2, 3, 4, 2],
"2024": [5, 1, 2, 2]}
df = pd.DataFrame(dict_data)
df_longer = df.melt(id_vars=["Name", "Department"],
var_name="Year",
value_name="Number of Courses")
Q. How can we use the df_longer
to create the wide-form DataFrame
, df_wider
, which is equivalent to the df
?
DataFrames
Let’s do Part 1 of Classwork 7!
DataFrames
DataFrames
DataFrame
for county-level data and DataFrame
for geographic information, such as longitude and latitude.DataFrames
based on common data values in those DataFrames
.
merge()
method in Pandas.DataFrames
DataFrames
with merge()
DataFrames
with merge()
DataFrames
with merge()
x
.
DataFrames
with merge()
y
.
DataFrames
with merge()
x
and y
.
DataFrames
with merge()
DataFrame
has duplicate keys (a one-to-many relationship).
DataFrames
with merge()
DataFrames
have duplicate keys (many-to-many relationship).
DataFrames
with merge()
left_on
and right_on
parameters instead.DataFrames
with merge()
Let’s do Part 2 of Classwork 7!
df1 = pd.read_csv('https://bcdanl.github.io/data/concat_1.csv')
df2 = pd.read_csv('https://bcdanl.github.io/data/concat_2.csv')
df3 = pd.read_csv('https://bcdanl.github.io/data/concat_3.csv')
.index
and .columns
in this Section.df1
:# create a new row of data
new_row_series = pd.Series(['n1', 'n2', 'n3', 'n4'])
new_row_series
# attempt to add the new row to a dataframe
df = pd.concat([df1, new_row_series])
df
Series
into a DataFrame
.
axis
parameter in the concat()
method.axis
is 0
(or axis = "index"
), so it will concatenate data in a row-wise fashion.axis = 1
(or axis = "columns"
) to the function, it will concatenate data in a column-wise manner.What would happen when the row and column indices are not aligned?
Let’s modify our DataFrames for the next few examples.
# rename the columns of our dataframes
df1.columns = ['A', 'B', 'C', 'D']
df2.columns = ['E', 'F', 'G', 'H']
df3.columns = ['A', 'C', 'F', 'H']
axis="columns"
(axis=1
), the new DataFrames will be added in a column-wise fashion and matched against their respective row indices.join="inner"
.Let’s do Part 3 of Classwork 7!