Lecture 5

Data Wrangling with tidyr, forcats, and stringr

Byeong-Hak Choe

SUNY Geneseo

March 2, 2026

Tidy Data

What Are Tidy Data?

  • There are three rules which make a dataset tidy:

    1. Each variable must have its own column.
    2. Each observation must have its own row.
    3. Each value must have its own cell.

Tidy data

library(tidyverse)
table1
table2
table3

## Spread across two tibbles
table4a  ## cases
table4b
  • We can represent the same underlying data in multiple ways.

🔄 Pivoting

Pivoting

  • The first step is always to figure out what form of the variables and observations we need.
  • The second step is to resolve one of two common problems:
    • One variable might be spread across multiple columns.
    • One observation might be scattered across multiple rows.
  • To fix these problems, we may need the two most important functions in the tidyr package: pivot_longer() and pivot_wider().

Longer

table4a
  • Some of the column names are not names of variables, but values of a variable.

  • We use pivot_longer() when a variable might be spread across multiple columns.

Longer

table4a |> 
  pivot_longer(cols = c(`1999`, `2000`), 
               names_to = "year", 
               values_to = "cases")
  • To tidy a dataset like table4a, we need to pivot the offending columns into a new pair of variables.

  • To use pivot_longer(), we need to the following three parameters:

    • The set of columns (cols) whose names are values, not variables.
    • The name of the variable to move the column names to (names_to).
    • The name of the variable to move the column values to (values_to).

Longer

Wider

table2
  • One observation might be scattered across multiple rows.

  • We use pivot_wider() when an observation is scattered across multiple rows.

Wider

table2
table2 |>
    pivot_wider(names_from = type, values_from = count)
  • To use pivot_wider(), we need two parameters:
    • The column to take variable names from.
    • The column to take values from.
  • We can add the optimal parameter, names_prefix, for the prefix of column names.

Wider

✂️ Separating and Uniting

Separating a Column with separate()

table3

table3 |> 
  separate(rate, into = c("cases", "population"))
  • table3 has one column (rate) that contains two variables (cases and population).

  • separate() takes the name of the column to separate, and the names of the columns to separate into.

Separating a Column with separate()

table3 |> 
  separate(rate, into = c("cases", "population"), sep = "/")
  • If we want to use a specific chracter to seprate a column, we can pass the character to the sep parameter.

Separating a Column with separate()

table3 |> 
  separate(rate, into = c("cases", "population"))
  • By default, separate() leaves the type of the column as is.

Separating a Column with separate()

table3 |> 
  separate(year, into = c("century", "year"), sep = 2)
  • separate() will interpret the integers as positions to split at.

    • We can also pass a vector of integers to sep.

    • separate() will interpret the integer as positions to split at.

Uniting Columns into a Single Column with unite()

table5 |> 
  unite(new, century, year)
  
table5 |> 
  unite(new, century, year, sep = "")
  • unite() combines multiple columns into a single column.

  • The default will place an underscore (_) between the values from different columns.

🔗 Relational Data

Relational data

  • It’s rare that a data analysis involves only a single data frame.

  • Collectively, multiple data frames are called relational data.

  • To work with relational data, we need verbs that work with pairs of data frames.

  • join methods add new variables to one data frame from matching observations in another data frame.

nycflights13

library(nycflights13)
flights
airlines
airports
planes
weather
  • nycflights13 contains four data frames that are related to the data frame, flights, that we used in data transformation.

nycflights13

  • flights connects to …

  • planes via a single variable, tailnum.

  • airlines through the carrier variable.

  • airports in two ways: via the origin and dest variables.

  • weather via origin (the location), and year, month, day and hour (the time).

Relations in nycflights13

Key Variables

  • A key variable (or a set of key variables) is a variable (or a set of variables) that uniquely identifies an observation.

    • So, a key variable (or a set of key variables) is used to connect relational data.frames.

    • The name of a key variable can be different across relational data.frames.

Relational Data

  • A join allows us to combine two data frames via key variables.
    • It first matches observations by their keys, then copies across variables from one data.frame to the other.
Tidyverse SQL Description
left left outer Keep all the observations from the left
right right outer Keep all the observations from the right
full full outer Keep all the observations from both left and right
inner inner Keep only the observations whose key values exist in both

Joins

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     3, "x3" )
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     4, "y3" )
  • The colored column represents the “key” variable.
  • The grey column represents the “value” column.

Joining Tables with inner_join()

x |>
  inner_join(y)
  • An inner join matches pairs of observations whenever their keys are equal:

Joining Tables with left_join()

x |>
  left_join(y)
  • A left join keeps all rows from x and adds matching information from y.
  • Among the different join types, left_join() is the most commonly used join.
    • It does not lose information from your main data.frame (x) and simply attaches extra information (y) when it exists.

Joining Tables with right_join()

x |>
  right_join(y)
  • A right join keeps all observations in y.

Joining Tables with full_join()

x |> 
  full_join(y)
  • A full join keeps all observations in x and y.

Duplicate Keys

  • Relations between data.frames could be …
    • One-to-one;
    • One-to-many (e.g., airlines and airports);
    • Many-to-one;
    • Many-to-many (e.g., flights and airplanes).
  • One-to-many relation is such that one data.frame has duplicate keys
  • Many-to-many relation is such that both data.frames have duplicate keys.
    • In practice, it is recommended to avoid joining data.frames with many-to-many relation.

Duplicate Keys: One-to-many Join

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     2, "x3",
     1, "x4")
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2")

x |> 
  left_join(y)

Duplicate Keys: Many-to-many Join

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     2, "x3",
     3, "x4" )
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     2, "y3",
     3, "y4" )
x |> 
  left_join(y)

Joins: Defining the Key Columns

flights2 |> 
  left_join(weather)
  
flights2 |> 
  left_join(planes, 
            by = "tailnum")
flights2 |> 
  left_join(airports, 
            c("dest" = "faa"))
  
flights2 |> 
  left_join(airports, 
            c("origin" = "faa"))
  • by = "a": uses only variable a.
  • by = c("a" = "b"): match variable a in data frame x to variable b in data frame y.

🧩 Data Assembly

🧩 Data Assembly

df1 <- data.frame(id = 1:3, 
                  name = c("Alice", "Bob", "Charlie")
                  )
df2 <- data.frame(id = 4:6, 
                  name = c("Dave", "Eve", "Frank")
                  )

df_rbind <- rbind(df1, df2)
df_rbind <- bind_rows(df1, df2)
df_cbind <- cbind(df1, df2) 
df_cbind <- bind_cols(df1, df2) ## better to rename id before cbind()

Row-bind (stack rows): rbind() vs. bind_rows()

  • rbind() (base R) requires the same columns in the same order (and compatible types). If names/order don’t match, it typically errors.
  • bind_rows() (dplyr) matches columns by name, and will fill missing columns with NA when the inputs have different sets of columns.
  • Tip: use bind_rows() when combining files that may have slightly different columns (common in real data).

Column-bind (add columns side-by-side): cbind() vs. bind_cols()

df1 <- data.frame(id = 1:3, 
                  name = c("Alice", "Bob", "Charlie"))
df2 <- data.frame(id = 4:6, 
                  name = c("Dave", "Eve", "Frank"))
df_cbind <- cbind(df1, df2) 
df_cbind2 <- bind_cols(df1, df2) 
  • cbind() (base R) binds by row position and can silently coerce to a matrix (e.g., mixing numbers + strings) and may recycle shorter vectors in some cases.
  • bind_cols() (dplyr) also binds by row position, but is stricter about row counts and returns a tibble with safer name handling.
  • Tip: column-binding is best when the rows are already in the same order; if you need to match by a key (e.g., id), use a join (left_join, inner_join) instead.

Data Assembly: rbind() vs. bind_rows()

df1 <- data.frame(id1 = 1:3, 
                  name = c("Alice", "Bob", "Charlie")
                  )
df2 <- data.frame(id2 = 4:6, 
                  name = c("Dave", "Eve", "Frank")
                  )

df_rbind <- rbind(df1, df2)
df_rbind2 <- bind_rows(df1, df2)
  • rbind() typically fails when the two data.frames have different column names.
  • bind_rows() can still work because it matches columns by name and fills the rest with NA.

Data Assembly: cbind() vs. bind_cols()

df1 <- data.frame(id1 = 1:5, 
                  name = c("Alice", "Bob", "Charlie", "Dan", "Eva")
                  )
df2 <- data.frame(id2 = 6:8, 
                  name = c("Dave", "Eve", "Frank")
                  )

df_cbind <- cbind(df1, df2)
df_cbind2 <- bind_cols(df1, df2)
  • Both cbind() and bind_cols() require compatible row counts; otherwise it errors.

Factors

Creating Factors

  • In R, factors are categorical variables, variables that have a fixed and known set of possible values.
x1 <- c("Dec", "Apr", "Jan", "Mar")
  • Using a string to record variable x1 has two problems:
    1. There are only twelve possible months, and there’s nothing saving us from typos.
    2. It doesn’t sort in a useful way.
x2 <- c("Dec", "Apr", "Jam", "Mar")
sort(x1)

Creating Factors with factor()

  • We can fix both of these problems with factor().
  • To create a factor, we must start by creating a list of the valid levels.
  • Any values not in the set will be silently converted to NA.
  • If we omit the levels, they’ll be taken from the data in alphabetical order:
months <- c(
  "Jan", "Feb", "Mar", "Apr", 
  "May", "Jun", "Jul", "Aug", 
  "Sep", "Oct", "Nov", "Dec")
x1 <- 
  c("Dec", "Apr", "Jan", "Mar")
y1 <- factor(x1, 
             levels = months)
sort(y1)
x2 <- 
  c("Dec", "Apr", "Jam", "Mar")

y2 <- factor(x2, 
             levels = months)
y2

factor(x1)

Creating Factors with factor()

x1

f1 <- factor(x1, levels = unique(x1))
f1

levels(f1)
  • Sometimes we’d prefer that the order of the levels match the order of the first appearance in the data.

  • We can do that when creating the factor by setting levels to unique().

  • If we ever need to access the set of valid levels directly, we can do so with levels().

Creating Factors with factor(): labels Option

  • The labels option in factor() allows us to assign custom display names to the levels.
  • The labels must be in the same order as the levels.
x1 <- c("Dec", "Apr", "Jan", "Mar")

y3 <- 
  factor(x1,
         levels = c("Jan", "Apr", 
                    "Mar", "Dec"),
         labels = c("January", "April",
                    "March", "December"))
y3
# labels are useful for recoding
x3 <- c(1, 2, 2, 1, 3)

y4 <- factor(x3,
             levels = c(1, 2, 3),
             labels = c("Low", 
                        "Medium", 
                        "High"))
y4
levels(y4)
  • Without labels, factor levels display as-is; with labels, we control how they appear in outputs and plots.

Data: General Social Survey

gss_cat

gss_cat |>
  [?](race)
  • We’re going to focus on the data frame, forcats::gss_cat.which is a sample of data from the General Social Survey.

  • When factors are stored in a data frame, we can see them with count().

Modifying Factor Order

relig_summary <- gss_cat |>
  group_by([?]) |>
  summarize(
    age = [?](age, [?]),
    tvhours = [?](tvhours, [?]),
    n = [?]
  )
ggplot(relig_summary, 
       aes(tvhours, relig)) + 
  [?]()
  • It’s often useful to change the order of the factor levels in a visualization.

  • Imagine we want to explore the average number of hours spent watching TV per day across relig.

Modifying Factor Order with fct_reorder(f, x, fun)

relig_summary |>
  mutate(relig = [?](relig, tvhours)) |>
  ggplot(aes(tvhours, relig)) + [?]()
  • We can reorder the levels using fct_reorder(f, x, fun), which can take three arguments.

  • f: the factor whose levels we want to modify.

  • x: a numeric vector that we want to use to reorder the levels.

  • Optionally, fun: a function that’s used if there are multiple values of x for each value of f. The default value is median.

Modifying Factor Order with fct_reorder2(f, x, y) 🔀

by_age <- gss_cat |>
  filter(!is.na(age)) |>
  count(age, marital) |>
  group_by(age) |>
  mutate(prop = n / sum(n))

# Default legend order (alphabetical)
ggplot(by_age, aes(x = age, y = prop, color = marital)) +
  geom_line(linewidth = 1) +
  scale_color_brewer(palette = "Set1")

# Legend order follows the line endings at the largest age
ggplot(by_age, aes(x = age, y = prop,
                   color = fct_reorder2(marital, age, prop))) +
  geom_line(linewidth = 1) +
  scale_color_brewer(palette = "Set1") +
  labs(color = "marital")

Modifying Factor Order with fct_reorder2(f, x, y) 🔀

fct_reorder2(f, x, y) (from forcats) is useful when you have multiple lines (one per factor level) and you want the legend order to follow the right-end (or left-end) of the lines.

  • f: the factor you want to reorder (e.g., marital)
  • x: the horizontal variable (e.g., age)
  • y: the numeric outcome used to rank levels (e.g., prop)
  • It orders factor levels by the y value at the largest x (i.e., near the right edge of the plot).
    • This makes the legend easier to read because the legend order matches where lines end up on the right.

Modifying Factor Order with fct_relevel(x, ref = ...)

rincome_summary <- gss_cat |>
  group_by([?]) |>
  summarize(
    age = [?](age, [?]),
    tvhours = [?](tvhours, [?]),
    n = [?]
  )
ggplot(rincome_summary, 
       aes(age, [?](rincome, age) ) )  + 
  geom_point()
ggplot(rincome_summary, 
       aes(age, 
           [?](rincome, 
               "Not applicable") ) ) +
  geom_point()
  • We can use fct_relevel() to set the first level (reference level).
  • fct_relevel(x, ref = ...) takes at least the two arguments:
    • x: factor variable
    • ref: reference level or first level

🔤 Strings

Combining Strings with str_c()

str_length(c("a", "R for data science", NA))
str_c("x", "y", "z")
str_c(c("x", "y", "z"), sep = ", ")
str_c("prefix-", c("a", "b", "c"), "-suffix")
str_c(c("x", "y", "z"), collapse = ", ")
  • To count the length of string, use str_length().

  • To combine two or more strings, use str_c():

  • To control how strings are separated, add the sep.

  • To collapse a vector of strings into a single string, add the collapse.

Subsetting Strings str_sub()

x <- c("Apple", "Banana", "Pear")
str_sub(x, 1, 3)

## negative numbers count backwards from end
str_sub(x, -3, -1)
  • We can extract parts of a string using str_sub():

  • str_sub() takes start and end arguments which give the position of the substring.

Matching a Pattern of Strings: str_detect() and str_replace_all()

x <- c("apple", "banana", "pear")

str_detect(x, "e")
str_replace(x, "a", "-")
str_replace_all(x, "a", "-")
  • To determine if a character vector matches a pattern, use str_detect().
  • str_replace() and str_replace_all() allow us to replace matches with new strings.
  • str_replace_all() can perform multiple replacements by supplying a named vector.

Splitting Strings with str_split()

sentences |>
  head(5) |> 
  str_split(" ")
  • Use str_split() to split a string up into pieces.