Lecture 13

pandas Basics - Joining DataFrames; Concatenating DataFrames

Byeong-Hak Choe

SUNY Geneseo

February 26, 2025

Joining DataFrames

Joining DataFrames

Relational Data

  • Sometimes, one data set is scattered across multiple files.
    • The size of the files can be huge.
    • The data collection process can be scattered across time and space.
    • E.g., DataFrame for county-level data and DataFrame for geographic information, such as longitude and latitude.
  • Sometimes we want to combine two or more DataFrames based on common data values in those DataFrames.
    • This task is known in the database world as performing a “join.”
    • We can do this with the merge() method in Pandas.

Joining DataFrames

Join, Relational Data, and Keys

  • The variables that are used to connect each pair of DataFrames are called keys.
  • Each observation in a DataFrame is often uniquely identified by key variable(s).
  • The key variable enables relationships between the DataFrames to be defined.

Joining DataFrames with merge()

x = pd.DataFrame({
    'key': [1, 2, 3],
    'val_x': ['x1', 'x2', 'x3']
})
y = pd.DataFrame({
    'key': [1, 2, 4],
    'val_y': ['y1', 'y2', 'y3']
})
  • The colored column represents the “key” variable.
  • The grey column represents the “value” variable.

Joining DataFrames with merge()

Inner Join

  • An inner join matches pairs of observations whenever their keys are equal:

# the default value for 'how' is 'inner'
# so it doesn't actually need to be specified
merge_inner_1 = pd.merge(x, y, on='key')
merge_inner_2 = pd.merge(x, y, on='key', how='inner')
merge_inner_3 = x.merge(y, on='key')
merge_inner_4 = x.merge(y, on='key', how='inner')

Joining DataFrames with merge()

Left Join

  • A left join keeps all observations in x.

merge_left = pd.merge(x, y, on='key', how='left')
merge_left_x = x.merge(y, on='key', how='left')
  • The most commonly used join is the left join.

Joining DataFrames with merge()

Right Join

  • A right join keeps all observations in y.

merge_right = pd.merge(x, y, on='key', how='right')
merge_right_x = x.merge(y, on='key', how='right')

Joining DataFrames with merge()

Outer (Full) Join

  • A full join keeps all observations in x and y.

merge_outer = pd.merge(x, y, on='key', how='outer')
merge_outer_x = x.merge(y, on='key', how='outer')

Joining DataFrames with merge()

Duplicate keys: one-to-many

  • One DataFrame has duplicate keys (a one-to-many relationship).

x = pd.DataFrame({
    'key':[1, 2, 2, 3],
    'val_x':['x1', 'x2', 'x3', 'x4']})
y = pd.DataFrame({
    'key':[1, 2],
    'val_y':['y1', 'y2'] })
one_to_many = x.merge(y, on='key', 
                         how='left')

Joining DataFrames with merge()

Duplicate keys: many-to-many

  • Both DataFrames have duplicate keys (many-to-many relationship).

x = pd.DataFrame({
  'key':[1, 2, 2, 3],
  'val_x':['x1','x2','x3','x4']})
y = pd.DataFrame({
  'key': [1, 2, 2, 3],
  'val_y': ['y1', 'y2', 'y3', 'y4'] })
many_to_many = x.merge(y, on='key', 
                          how='left')
  • In practice, it is better to avoid the many-to-many join.

Joining DataFrames with merge()

Defining the key columns

  • If the left and right columns do not have the same name for the key variables, we can use the left_on and right_on parameters instead.
x = pd.DataFrame({
  'key_x': [1, 2, 3],
  'val_x': ['x1', 'x2', 'x3']
})
y = pd.DataFrame({
  'key_y': [1, 2],
  'val_y': ['y1', 'y2'] })

keys_xy = 
  x.merge(y, left_on = 'key_x', 
             right_on = 'key_y', 
             how = 'left')

Joining DataFrames with merge()

Let’s do Part 2 of Classwork 7!

Data Concatenation

Data Concatenation

  • Concatenation involves combining multiple DataFrames by adding rows or columns. This method is useful:
    • When merging datasets that were split into parts;
    • When appending new data to an existing dataset.
  • Let’s consider the following example DataFrames:
df1 = pd.read_csv('https://bcdanl.github.io/data/concat_1.csv')
df2 = pd.read_csv('https://bcdanl.github.io/data/concat_2.csv')
df3 = pd.read_csv('https://bcdanl.github.io/data/concat_3.csv')
  • We will be working with .index and .columns in this Section.
df1.index
df1.columns

Data Concatenation

Add Rows

  • Concatenating the DataFrames on top of each other uses the concat() method.
row_concat = pd.concat([df1, df2, df3])
  • All of the DataFrames to be concatenated are passed in a list.
    • pd.concat( [DataFrame_1, DataFrame_2, ... , DataFrame_N] )

Data Concatenation

Add Columns

  • Concatenating columns is very similar to concatenating rows.
    • The main difference is the axis parameter in the concat() method.
    • The default value of axis is 0 (or axis = "index"), so it will concatenate data in a row-wise fashion.
    • If we pass axis = 1 (or axis = "columns") to the function, it will concatenate data in a column-wise manner.
pd.concat([df1, df2, df3], axis = "columns")
pd.concat([df1, df2, df3], axis = "columns", ignore_index = True)  
  • ignore_index=True to reset the column indices

Data Concatenation

Add a Series?

  • Let’s consider a new Series and concatenate it with df1:
# create a new row of data
new_row_series = pd.Series(['n1', 'n2', 'n3', 'n4'])

# attempt to add the new row to a DataFrame
df = pd.concat([df1, new_row_series])
  • Not only did our code not append the values as a row, but it also created a new column completely misaligned with everything else.
  • Why?

Data Concatenation

Add a Series?

  • To fix the problem, we need turn our Series into a DataFrame.
    • This DataFrame contains one row of data, and the column names are the ones the data will bind to.
new_row_df = pd.DataFrame(
  # note the double brackets to create a "row" of data
  data =[ ["n1", "n2", "n3", "n4"] ],
  columns = df1.columns,
)

df = pd.concat([df1, new_row_df])
  • How about this?
    • pd.concat([df1, new_row_series], axis = 1)

:::

:::

Data Concatenation

Let’s do Part 3 of Classwork 7!