Lecture 5

Data Wrangling with tidyr, forcats, and stringr

Byeong-Hak Choe

bchoe@geneseo.edu

SUNY Geneseo

March 2, 2026

Tidy Data

What Are Tidy Data?

There are three rules which make a dataset tidy:
1. Each variable must have its own column.
2. Each observation must have its own row.
3. Each value must have its own cell.

Tidy data

library(tidyverse)
table1
table2
table3

## Spread across two tibbles
table4a  ## cases
table4b

We can represent the same underlying data in multiple ways.

🔄 Pivoting

Pivoting

The first step is always to figure out what form of the variables and observations we need.
The second step is to resolve one of two common problems:
- One variable might be spread across multiple columns.
- One observation might be scattered across multiple rows.
To fix these problems, we may need the two most important functions in the tidyr package: pivot_longer() and pivot_wider().

Longer

table4a

Some of the column names are not names of variables, but values of a variable.
We use pivot_longer() when a variable might be spread across multiple columns.

Longer

table4a |> 
  pivot_longer(cols = c(`1999`, `2000`), 
               names_to = "year", 
               values_to = "cases")

To tidy a dataset like table4a, we need to pivot the offending columns into a new pair of variables.
To use pivot_longer(), we need to the following three parameters:
- The set of columns (cols) whose names are values, not variables.
- The name of the variable to move the column names to (names_to).
- The name of the variable to move the column values to (values_to).

Longer

Wider

table2

One observation might be scattered across multiple rows.
We use pivot_wider() when an observation is scattered across multiple rows.

Wider

table2
table2 |>
    pivot_wider(names_from = type, values_from = count)

To use pivot_wider(), we need two parameters:
- The column to take variable names from.
- The column to take values from.
We can add the optimal parameter, names_prefix, for the prefix of column names.

Wider

✂️ Separating and Uniting

Separating a Column with `separate()`

table3

table3 |> 
  separate(rate, into = c("cases", "population"))

table3 has one column (rate) that contains two variables (cases and population).
separate() takes the name of the column to separate, and the names of the columns to separate into.

Separating a Column with `separate()`

table3 |> 
  separate(rate, into = c("cases", "population"), sep = "/")

If we want to use a specific chracter to seprate a column, we can pass the character to the sep parameter.

Separating a Column with `separate()`

table3 |> 
  separate(rate, into = c("cases", "population"))

By default, separate() leaves the type of the column as is.

Separating a Column with `separate()`

table3 |> 
  separate(year, into = c("century", "year"), sep = 2)

separate() will interpret the integers as positions to split at.
- We can also pass a vector of integers to sep.
- separate() will interpret the integer as positions to split at.

Uniting Columns into a Single Column with `unite()`

table5 |> 
  unite(new, century, year)
  
table5 |> 
  unite(new, century, year, sep = "")

unite() combines multiple columns into a single column.
The default will place an underscore (_) between the values from different columns.

🔗 Relational Data

Relational data

It’s rare that a data analysis involves only a single data frame.
Collectively, multiple data frames are called relational data.
To work with relational data, we need verbs that work with pairs of data frames.
join methods add new variables to one data frame from matching observations in another data frame.

`nycflights13`

library(nycflights13)
flights
airlines
airports
planes
weather

nycflights13 contains four data frames that are related to the data frame, flights, that we used in data transformation.

`nycflights13`

flights connects to …
planes via a single variable, tailnum.
airlines through the carrier variable.
airports in two ways: via the origin and dest variables.
weather via origin (the location), and year, month, day and hour (the time).

Relations in `nycflights13`

Key Variables

A key variable (or a set of key variables) is a variable (or a set of variables) that uniquely identifies an observation.
- So, a key variable (or a set of key variables) is used to connect relational data.frames.
- The name of a key variable can be different across relational data.frames.

Relational Data

A join allows us to combine two data frames via key variables.
- It first matches observations by their keys, then copies across variables from one data.frame to the other.

Tidyverse	SQL	Description
left	left outer	Keep all the observations from the left
right	right outer	Keep all the observations from the right
full	full outer	Keep all the observations from both left and right
inner	inner	Keep only the observations whose key values exist in both

Joins

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     3, "x3" )

y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     4, "y3" )

The colored column represents the “key” variable.
The grey column represents the “value” column.

Joining Tables with `inner_join()`

x |>
  inner_join(y)

An inner join matches pairs of observations whenever their keys are equal:

Joining Tables with `left_join()`

x |>
  left_join(y)

A left join keeps all rows from x and adds matching information from y.
Among the different join types, left_join() is the most commonly used join.
- It does not lose information from your main data.frame (x) and simply attaches extra information (y) when it exists.

Joining Tables with `right_join()`

x |>
  right_join(y)

A right join keeps all observations in y.

Joining Tables with `full_join()`

x |> 
  full_join(y)

A full join keeps all observations in x and y.

Duplicate Keys

Relations between data.frames could be …
- One-to-one;
- One-to-many (e.g., airlines and airports);
- Many-to-one;
- Many-to-many (e.g., flights and airplanes).
One-to-many relation is such that one data.frame has duplicate keys
Many-to-many relation is such that both data.frames have duplicate keys.
- In practice, it is recommended to avoid joining data.frames with many-to-many relation.

Duplicate Keys: One-to-many Join

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     2, "x3",
     1, "x4")

y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2")

x |> 
  left_join(y)

Duplicate Keys: Many-to-many Join

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     2, "x3",
     3, "x4" )

y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     2, "y3",
     3, "y4" )
x |> 
  left_join(y)

Joins: Defining the Key Columns

flights2 |> 
  left_join(weather)
  
flights2 |> 
  left_join(planes, 
            by = "tailnum")

flights2 |> 
  left_join(airports, 
            c("dest" = "faa"))
  
flights2 |> 
  left_join(airports, 
            c("origin" = "faa"))

by = "a": uses only variable a.
by = c("a" = "b"): match variable a in data frame x to variable b in data frame y.

🧩 Data Assembly

df1 <- data.frame(id = 1:3, 
                  name = c("Alice", "Bob", "Charlie")
                  )
df2 <- data.frame(id = 4:6, 
                  name = c("Dave", "Eve", "Frank")
                  )

df_rbind <- rbind(df1, df2)
df_rbind <- bind_rows(df1, df2)
df_cbind <- cbind(df1, df2) 
df_cbind <- bind_cols(df1, df2) ## better to rename id before cbind()

Row-bind (stack rows): `rbind()` vs. `bind_rows()`

rbind() (base R) requires the same columns in the same order (and compatible types). If names/order don’t match, it typically errors.
bind_rows() (dplyr) matches columns by name, and will fill missing columns with NA when the inputs have different sets of columns.
Tip: use bind_rows() when combining files that may have slightly different columns (common in real data).

Column-bind (add columns side-by-side): `cbind()` vs. `bind_cols()`

df1 <- data.frame(id = 1:3, 
                  name = c("Alice", "Bob", "Charlie"))
df2 <- data.frame(id = 4:6, 
                  name = c("Dave", "Eve", "Frank"))
df_cbind <- cbind(df1, df2) 
df_cbind2 <- bind_cols(df1, df2)

cbind() (base R) binds by row position and can silently coerce to a matrix (e.g., mixing numbers + strings) and may recycle shorter vectors in some cases.
bind_cols() (dplyr) also binds by row position, but is stricter about row counts and returns a tibble with safer name handling.
Tip: column-binding is best when the rows are already in the same order; if you need to match by a key (e.g., id), use a join (left_join, inner_join) instead.

Data Assembly: `rbind()` vs. `bind_rows()`

df1 <- data.frame(id1 = 1:3, 
                  name = c("Alice", "Bob", "Charlie")
                  )
df2 <- data.frame(id2 = 4:6, 
                  name = c("Dave", "Eve", "Frank")
                  )

df_rbind <- rbind(df1, df2)
df_rbind2 <- bind_rows(df1, df2)

rbind() typically fails when the two data.frames have different column names.
bind_rows() can still work because it matches columns by name and fills the rest with NA.

Data Assembly: `cbind()` vs. `bind_cols()`

df1 <- data.frame(id1 = 1:5, 
                  name = c("Alice", "Bob", "Charlie", "Dan", "Eva")
                  )
df2 <- data.frame(id2 = 6:8, 
                  name = c("Dave", "Eve", "Frank")
                  )

df_cbind <- cbind(df1, df2)
df_cbind2 <- bind_cols(df1, df2)

Both cbind() and bind_cols() require compatible row counts; otherwise it errors.

Factors

Creating Factors

In R, factors are categorical variables, variables that have a fixed and known set of possible values.

x1 <- c("Dec", "Apr", "Jan", "Mar")

Using a string to record variable x1 has two problems:
1. There are only twelve possible months, and there’s nothing saving us from typos.
2. It doesn’t sort in a useful way.

x2 <- c("Dec", "Apr", "Jam", "Mar")
sort(x1)

Creating Factors with `factor()`

We can fix both of these problems with factor().
To create a factor, we must start by creating a list of the valid levels.
Any values not in the set will be silently converted to NA.
If we omit the levels, they’ll be taken from the data in alphabetical order:

months <- c(
  "Jan", "Feb", "Mar", "Apr", 
  "May", "Jun", "Jul", "Aug", 
  "Sep", "Oct", "Nov", "Dec")
x1 <- 
  c("Dec", "Apr", "Jan", "Mar")
y1 <- factor(x1, 
             levels = months)
sort(y1)

x2 <- 
  c("Dec", "Apr", "Jam", "Mar")

y2 <- factor(x2, 
             levels = months)
y2

factor(x1)

Creating Factors with `factor()`

x1

f1 <- factor(x1, levels = unique(x1))
f1

levels(f1)

Sometimes we’d prefer that the order of the levels match the order of the first appearance in the data.
We can do that when creating the factor by setting levels to unique().
If we ever need to access the set of valid levels directly, we can do so with levels().

Creating Factors with `factor()`: `labels` Option

The labels option in factor() allows us to assign custom display names to the levels.
The labels must be in the same order as the levels.

x1 <- c("Dec", "Apr", "Jan", "Mar")

y3 <- 
  factor(x1,
         levels = c("Jan", "Apr", 
                    "Mar", "Dec"),
         labels = c("January", "April",
                    "March", "December"))
y3

# labels are useful for recoding
x3 <- c(1, 2, 2, 1, 3)

y4 <- factor(x3,
             levels = c(1, 2, 3),
             labels = c("Low", 
                        "Medium", 
                        "High"))
y4
levels(y4)

Without labels, factor levels display as-is; with labels, we control how they appear in outputs and plots.

Data: General Social Survey

gss_cat

gss_cat |>
  [?](race)

We’re going to focus on the data frame, forcats::gss_cat.which is a sample of data from the General Social Survey.
When factors are stored in a data frame, we can see them with count().

Modifying Factor Order

relig_summary <- gss_cat |>
  group_by([?]) |>
  summarize(
    age = [?](age, [?]),
    tvhours = [?](tvhours, [?]),
    n = [?]
  )

ggplot(relig_summary, 
       aes(tvhours, relig)) + 
  [?]()

It’s often useful to change the order of the factor levels in a visualization.
Imagine we want to explore the average number of hours spent watching TV per day across relig.

Modifying Factor Order with `fct_reorder(f, x, fun)`

relig_summary |>
  mutate(relig = [?](relig, tvhours)) |>
  ggplot(aes(tvhours, relig)) + [?]()

We can reorder the levels using fct_reorder(f, x, fun), which can take three arguments.
f: the factor whose levels we want to modify.
x: a numeric vector that we want to use to reorder the levels.
Optionally, fun: a function that’s used if there are multiple values of x for each value of f. The default value is median.

Modifying Factor Order with `fct_reorder2(f, x, y)` 🔀

by_age <- gss_cat |>
  filter(!is.na(age)) |>
  count(age, marital) |>
  group_by(age) |>
  mutate(prop = n / sum(n))

# Default legend order (alphabetical)
ggplot(by_age, aes(x = age, y = prop, color = marital)) +
  geom_line(linewidth = 1) +
  scale_color_brewer(palette = "Set1")

# Legend order follows the line endings at the largest age
ggplot(by_age, aes(x = age, y = prop,
                   color = fct_reorder2(marital, age, prop))) +
  geom_line(linewidth = 1) +
  scale_color_brewer(palette = "Set1") +
  labs(color = "marital")

Modifying Factor Order with `fct_reorder2(f, x, y)` 🔀

fct_reorder2(f, x, y) (from forcats) is useful when you have multiple lines (one per factor level) and you want the legend order to follow the right-end (or left-end) of the lines.

f: the factor you want to reorder (e.g., marital)
x: the horizontal variable (e.g., age)
y: the numeric outcome used to rank levels (e.g., prop)
It orders factor levels by the y value at the largest x (i.e., near the right edge of the plot).
- This makes the legend easier to read because the legend order matches where lines end up on the right.

Modifying Factor Order with `fct_relevel(x, ref = ...)`

rincome_summary <- gss_cat |>
  group_by([?]) |>
  summarize(
    age = [?](age, [?]),
    tvhours = [?](tvhours, [?]),
    n = [?]
  )

ggplot(rincome_summary, 
       aes(age, [?](rincome, age) ) )  + 
  geom_point()
ggplot(rincome_summary, 
       aes(age, 
           [?](rincome, 
               "Not applicable") ) ) +
  geom_point()

We can use fct_relevel() to set the first level (reference level).
fct_relevel(x, ref = ...) takes at least the two arguments:
- x: factor variable
- ref: reference level or first level

🔤 Strings

Combining Strings with `str_c()`

str_length(c("a", "R for data science", NA))
str_c("x", "y", "z")
str_c(c("x", "y", "z"), sep = ", ")
str_c("prefix-", c("a", "b", "c"), "-suffix")
str_c(c("x", "y", "z"), collapse = ", ")

To count the length of string, use str_length().
To combine two or more strings, use str_c():
To control how strings are separated, add the sep.
To collapse a vector of strings into a single string, add the collapse.

Subsetting Strings `str_sub()`

x <- c("Apple", "Banana", "Pear")
str_sub(x, 1, 3)

## negative numbers count backwards from end
str_sub(x, -3, -1)

We can extract parts of a string using str_sub():
str_sub() takes start and end arguments which give the position of the substring.

Matching a Pattern of Strings: `str_detect()` and `str_replace_all()`

x <- c("apple", "banana", "pear")

str_detect(x, "e")
str_replace(x, "a", "-")
str_replace_all(x, "a", "-")

To determine if a character vector matches a pattern, use str_detect().
str_replace() and str_replace_all() allow us to replace matches with new strings.
str_replace_all() can perform multiple replacements by supplying a named vector.

Splitting Strings with `str_split()`

sentences |>
  head(5) |> 
  str_split(" ")

Use str_split() to split a string up into pieces.

Lecture 5

Tidy Data

What Are Tidy Data?

Tidy data

🔄 Pivoting

Pivoting

Longer

Longer

Longer

Wider

Wider

Wider

✂️ Separating and Uniting

Separating a Column with separate()

Separating a Column with separate()

Separating a Column with separate()

Separating a Column with separate()

Uniting Columns into a Single Column with unite()

🔗 Relational Data

Relational data

nycflights13

nycflights13

Relations in nycflights13

Key Variables

Relational Data

Joins

Joining Tables with inner_join()

Joining Tables with left_join()

Joining Tables with right_join()

Joining Tables with full_join()

Duplicate Keys

Duplicate Keys: One-to-many Join

Duplicate Keys: Many-to-many Join

Joins: Defining the Key Columns

🧩 Data Assembly

🧩 Data Assembly

Row-bind (stack rows): rbind() vs. bind_rows()

Column-bind (add columns side-by-side): cbind() vs. bind_cols()

Data Assembly: rbind() vs. bind_rows()

Data Assembly: cbind() vs. bind_cols()

Factors

Creating Factors

Creating Factors with factor()

Creating Factors with factor()

Creating Factors with factor(): labels Option

Data: General Social Survey

Modifying Factor Order

Modifying Factor Order with fct_reorder(f, x, fun)

Modifying Factor Order with fct_reorder2(f, x, y) 🔀

Modifying Factor Order with fct_reorder2(f, x, y) 🔀

Modifying Factor Order with fct_relevel(x, ref = ...)

🔤 Strings

Combining Strings with str_c()

Subsetting Strings str_sub()

Matching a Pattern of Strings: str_detect() and str_replace_all()

Splitting Strings with str_split()

Separating a Column with `separate()`

Separating a Column with `separate()`

Separating a Column with `separate()`

Separating a Column with `separate()`

Uniting Columns into a Single Column with `unite()`

`nycflights13`

`nycflights13`

Relations in `nycflights13`

Joining Tables with `inner_join()`

Joining Tables with `left_join()`

Joining Tables with `right_join()`

Joining Tables with `full_join()`

Row-bind (stack rows): `rbind()` vs. `bind_rows()`

Column-bind (add columns side-by-side): `cbind()` vs. `bind_cols()`

Data Assembly: `rbind()` vs. `bind_rows()`

Data Assembly: `cbind()` vs. `bind_cols()`

Creating Factors with `factor()`

Creating Factors with `factor()`

Creating Factors with `factor()`: `labels` Option

Modifying Factor Order with `fct_reorder(f, x, fun)`

Modifying Factor Order with `fct_reorder2(f, x, y)` 🔀

Modifying Factor Order with `fct_reorder2(f, x, y)` 🔀

Modifying Factor Order with `fct_relevel(x, ref = ...)`

Combining Strings with `str_c()`

Subsetting Strings `str_sub()`

Matching a Pattern of Strings: `str_detect()` and `str_replace_all()`

Splitting Strings with `str_split()`