Data Wrangling with tidyr, forcats, and stringr
March 2, 2026
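There are three rules which make a dataset tidy:
Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.
We can tidy a dataset with two functions from the tidyr package: pivot_longer() and pivot_wider().
A common problem is that some of the column names are not names of variables, but values of a variable.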
We use pivot_longer() when a variable might be spread across multiple columns.
To tidy a dataset like table4a, we need to pivot the offending columns into a new pair of variables.
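As a minimal sketch of that pivot, the call below reshapes table4a (which ships with tidyr) using the cols, names_to, and values_to parameters of pivot_longer(). The result name tidy4a is only an illustrative choice.

```r
library(tidyr)

# table4a has the columns country, `1999`, and `2000`.
# `1999` and `2000` are values of a "year" variable, not variables.
tidy4a <- table4a |>
  pivot_longer(
    cols = c(`1999`, `2000`),  # columns whose names are values
    names_to = "year",         # new variable that receives the old column names
    values_to = "cases"        # new variable that receives the cell values
  )

tidy4a
```

Each country now has one row per year, so the three original rows become six.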
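To use pivot_longer(), we need to specify the following three parameters:
The set of columns (cols) whose names are values, not variables.
The name of the variable to move the column names to (names_to).
The name of the variable to move the column values to (values_to).
One observation might be scattered across multiple rows.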
We use pivot_wider() when an observation is scattered across multiple rows.
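A small sketch of this case, assuming table2 from tidyr as the example data: each country-year observation occupies two rows there, one for cases and one for population. The result name tidy2 is illustrative.

```r
library(tidyr)

# table2 has the columns country, year, type, and count.
# `type` holds variable names and `count` holds their values.
tidy2 <- table2 |>
  pivot_wider(
    names_from = type,    # column whose values become new column names
    values_from = count   # column whose values fill the new columns
  )

tidy2
```

The twelve rows of table2 collapse to six, with cases and population as separate columns.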
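To use pivot_wider(), we need two parameters:
names_from, the column to take variable names from.
values_from, the column to take values from.
Optionally, names_prefix, for the prefix of column names.
separate()
table3 has one column (rate) that contains two variables (cases and population).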
separate() takes the name of the column to separate, and the names of the columns to separate into.
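As a rough illustration, the sketch below applies separate() to table3 from tidyr. With no other arguments, separate() splits wherever it finds a non-alphanumeric character, which here is the slash inside rate. The result name tidy3 is illustrative.

```r
library(tidyr)

# table3$rate looks like "745/19987071"
tidy3 <- table3 |>
  separate(rate, into = c("cases", "population"))

tidy3
```

Both new columns are character columns, because the original rate column was a character column.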
To split on a specific character, pass it to the sep parameter.
By default, separate() leaves the type of the column as is.
We can also pass a vector of integers to sep.
separate() will interpret the integers as positions to split at.
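The two uses of sep can be sketched on table3 as follows. The convert = TRUE argument is an addition not mentioned in the slides; it asks separate() to guess better column types for the new columns. The names by_char and by_pos are illustrative.

```r
library(tidyr)

# Character sep: split rate at the slash, then convert the pieces to numbers
by_char <- table3 |>
  separate(rate, into = c("cases", "population"), sep = "/", convert = TRUE)

# Integer sep: split year after its second character
by_pos <- table3 |>
  separate(year, into = c("century", "year"), sep = 2)

by_char
by_pos
```

In by_pos the first row has century "19" and year "99", both stored as character strings.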
unite()
unite() combines multiple columns into a single column.
The default will place an underscore (_) between the values from different columns.
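A minimal sketch with a made-up two-column data frame (dates, century, and year are hypothetical names chosen for illustration) shows the default separator and how to override it:

```r
library(tidyr)

# Hypothetical data: a year stored in two pieces
dates <- data.frame(century = c("19", "20"), year = c("99", "00"))

# Default: values are joined with an underscore
with_default <- dates |> unite(full_year, century, year)

# sep = "" joins the values with nothing in between
no_sep <- dates |> unite(full_year, century, year, sep = "")

with_default$full_year  # "19_99" "20_00"
no_sep$full_year        # "1999" "2000"
```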
It’s rare that a data analysis involves only a single data frame.
Collectively, multiple data frames are called relational data.
To work with relational data, we need verbs that work with pairs of data frames.
Join methods add new variables to one data frame from matching observations in another data frame.
nycflights13
nycflights13 contains four data frames that are related to the data frame, flights, that we used in data transformation.
flights connects to …
planes via a single variable, tailnum.
airlines through the carrier variable.
airports in two ways: via the origin and dest variables.
weather via origin (the location), and year, month, day and hour (the time).
A key variable (or a set of key variables) is a variable (or a set of variables) that uniquely identifies an observation.
So, a key variable (or a set of key variables) is used to connect related data frames.
The name of a key variable can differ across related data frames.
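The sketch below uses two small made-up tables (planes_toy and flights_toy are hypothetical stand-ins, not the nycflights13 data) to check that a column really is a key and to join on key columns that have different names.

```r
library(dplyr)

# Hypothetical toy tables
planes_toy <- tibble(
  tailnum = c("N100", "N200", "N300"),
  seats = c(150, 180, 200)
)
flights_toy <- tibble(
  flight = c(1, 2, 3, 4),
  tail_number = c("N100", "N100", "N200", "N999")
)

# A key uniquely identifies each row, so no value may appear twice
dup_keys <- planes_toy |>
  count(tailnum) |>
  filter(n > 1)
nrow(dup_keys)  # 0 rows means tailnum is a valid key for planes_toy

# The key is called tail_number in one table and tailnum in the other
joined <- flights_toy |>
  left_join(planes_toy, by = c("tail_number" = "tailnum"))

joined
```

The flight with tail number "N999" has no match in planes_toy, so its seats value is NA.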
| Tidyverse | SQL | Description |
|---|---|---|
| left | left outer | Keep all the observations from the left |
| right | right outer | Keep all the observations from the right |
| full | full outer | Keep all the observations from both left and right |
| inner | inner | Keep only the observations whose key values exist in both |
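The four joins in the table can be compared on two tiny made-up tibbles. The names x, y, key, val_x, and val_y are hypothetical and chosen only for this sketch.

```r
library(dplyr)

# Keys 1 and 2 exist in both tables; 3 only in x; 4 only in y
x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y4"))

inner <- inner_join(x, y, by = "key")  # keys 1, 2
left  <- left_join(x, y, by = "key")   # keys 1, 2, 3
right <- right_join(x, y, by = "key")  # keys 1, 2, 4
full  <- full_join(x, y, by = "key")   # keys 1, 2, 3, 4

full
```

Rows without a match on the other side are kept by the outer joins and filled with NA.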
inner_join()
An inner join keeps only the observations whose key values exist in both x and y.
left_join()
A left join keeps all observations in x and adds matching information from y.
left_join() is the most commonly used join.
It keeps the original data (x) and simply attaches extra information (y) when it exists.
right_join()
A right join keeps all observations in y.
full_join()
A full join keeps all observations in x and y.
… (airlines and airports);
… (flights and airplanes).
by = "a": uses only variable a.
by = c("a" = "b"): match variable a in data frame x to variable b in data frame y.
rbind() vs. bind_rows()
rbind() (base R) requires the inputs to have the same set of columns (and compatible types). For data frames it matches columns by name, so it errors when the column names or the number of columns differ.
bind_rows() (dplyr) matches columns by name, and will fill missing columns with NA when the inputs have different sets of columns.
Prefer bind_rows() when combining files that may have slightly different columns (common in real data).
cbind() vs. bind_cols()
cbind() (base R) binds by row position and can silently coerce to a matrix (e.g., mixing numbers + strings) and may recycle shorter vectors in some cases.
bind_cols() (dplyr) also binds by row position, but is stricter about row counts and returns a data frame with safer name handling.
Both cbind() and bind_cols() require compatible row counts; otherwise they error.
If rows should be matched by a key (e.g., id), use a join (left_join, inner_join) instead.
In R, factors are categorical variables, variables that have a fixed and known set of possible values.
factor()
We can create a factor with factor().
The set of valid values is given by levels.
Any values not in levels are silently converted to NA.
If we omit levels, they’ll be taken from the data in alphabetical order.
Sometimes we’d prefer that the order of the levels match the order of the first appearance in the data.
We can do that when creating the factor by setting levels to unique().
If we ever need to access the set of valid levels directly, we can do so with levels().
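The behaviours described for factor() can be sketched in base R with a short vector of month abbreviations. The vector names are illustrative.

```r
x <- c("Dec", "Apr", "Jan", "Mar")
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)

# Explicit levels fix both the valid values and their order
f_months <- factor(x, levels = month_levels)

# A value outside the levels (the typo "Jam") becomes NA without a warning
f_typo <- factor(c("Dec", "Apr", "Jam", "Mar"), levels = month_levels)

# No levels given: levels are taken from the data in alphabetical order
f_alpha <- factor(x)
levels(f_alpha)  # "Apr" "Dec" "Jan" "Mar"

# Levels in order of first appearance
f_first <- factor(x, levels = unique(x))
levels(f_first)  # "Dec" "Apr" "Jan" "Mar"
```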
factor(): labels Option
The labels option in factor() allows us to assign custom display names to the levels.
labels must be in the same order as the levels.
Without labels, factor levels display as-is; with labels, we control how they appear in outputs and plots.
We’re going to focus on the data frame forcats::gss_cat, which is a sample of data from the General Social Survey.
When factors are stored in a data frame, we can see them with count().
It’s often useful to change the order of the factor levels in a visualization.
Imagine we want to explore the average number of hours spent watching TV per day across relig.
fct_reorder(f, x, fun)
We can reorder the levels using fct_reorder(f, x, fun), which can take three arguments.
f: the factor whose levels we want to modify.
x: a numeric vector that we want to use to reorder the levels.
Optionally, fun: a function that’s used if there are multiple values of x for each value of f. The default value is median.
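A possible sketch of the TV-hours example with gss_cat: summarise the mean tvhours per relig, then reorder relig by that mean. The names relig_summary and relig_ordered are illustrative, and this is one way to set it up rather than the only one.

```r
library(dplyr)
library(forcats)

# Average hours of TV per day for each religion
relig_summary <- gss_cat |>
  group_by(relig) |>
  summarise(tvhours = mean(tvhours, na.rm = TRUE))

# Reorder the levels of relig by tvhours (ascending)
relig_ordered <- relig_summary |>
  mutate(relig = fct_reorder(relig, tvhours))

levels(relig_ordered$relig)
```

Because the summary has one tvhours value per level, the default fun (median) simply returns that value. A plot that maps relig_ordered$relig to an axis then shows the religions sorted by average TV hours.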
fct_reorder2(f, x, y) 🔀
library(dplyr)
library(ggplot2)
library(forcats)

# Proportion of each marital status within each age
by_age <- gss_cat |>
  filter(!is.na(age)) |>
  count(age, marital) |>
  group_by(age) |>
  mutate(prop = n / sum(n))

# Default legend order (follows the existing level order of marital)
ggplot(by_age, aes(x = age, y = prop, color = marital)) +
  geom_line(linewidth = 1) +
  scale_color_brewer(palette = "Set1")

# Legend order follows the line endings at the largest age
ggplot(by_age, aes(x = age, y = prop,
                   color = fct_reorder2(marital, age, prop))) +
  geom_line(linewidth = 1) +
  scale_color_brewer(palette = "Set1") +
  labs(color = "marital")

fct_reorder2(f, x, y) (from forcats) is useful when you have multiple lines (one per factor level) and you want the legend order to follow the right end of the lines (or, with .fun = first2, the left end).
f: the factor you want to reorder (e.g., marital)
x: the horizontal variable (e.g., age)
y: the numeric outcome used to rank levels (e.g., prop)
By default, it orders the levels of f by the y value at the largest x (i.e., near the right edge of the plot).
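fct_relevel(f, ...)
Use fct_relevel() to set the first level (reference level).
fct_relevel(f, ...) takes at least two arguments:
f: the factor variable
...: the level (or levels) to move to the front; the first one becomes the reference level
Base R's relevel(x, ref = ...) is the function that takes a ref argument.
str_c()
To count the length of a string, use str_length().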
To combine two or more strings, use str_c().
To control how strings are separated, add the sep argument.
To collapse a vector of strings into a single string, add the collapse argument.
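A short sketch of these stringr calls, using made-up strings and illustrative variable names:

```r
library(stringr)

# Number of characters in each string; NA stays NA
lens <- str_length(c("a", "R for data science", NA))
lens  # 1 18 NA

# Combine strings element by element
joined <- str_c("x", "y")
joined  # "xy"

# sep goes between the arguments
with_sep <- str_c("x", "y", sep = ", ")
with_sep  # "x, y"

# collapse turns a vector into one string
collapsed <- str_c(c("x", "y", "z"), collapse = ", ")
collapsed  # "x, y, z"
```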
str_sub()
We can extract parts of a string using str_sub():
str_sub() takes start and end arguments which give the position of the substring.
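For instance, with a small made-up vector of fruit names (fruits is a hypothetical name), the start and end positions work like this. Negative positions, which count backwards from the end of the string, are an addition not covered in the slides.

```r
library(stringr)

fruits <- c("Apple", "Banana", "Pear")

# Characters 1 through 3 of each string
first_three <- str_sub(fruits, 1, 3)
first_three  # "App" "Ban" "Pea"

# Negative positions count from the end: the last three characters
last_three <- str_sub(fruits, -3, -1)
last_three  # "ple" "ana" "ear"
```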
str_detect() and str_replace_all()
To determine whether a character vector matches a pattern, use str_detect().
str_replace() and str_replace_all() allow us to replace matches with new strings.
str_replace_all() can perform multiple replacements by supplying a named vector.
str_split()
Use str_split() to split a string up into pieces.
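The pattern functions above can be tried on a few made-up strings. All variable names here are illustrative.

```r
library(stringr)

fruit_names <- c("apple", "banana", "pear")

# Does each string contain an "e"?
has_e <- str_detect(fruit_names, "e")
has_e  # TRUE FALSE TRUE

# Replace only the first match in each string
first_only <- str_replace(fruit_names, "a", "-")
first_only  # "-pple" "b-nana" "pe-r"

# Replace every match in each string
every_match <- str_replace_all(fruit_names, "a", "-")
every_match  # "-pple" "b-n-n-" "pe-r"

# Named vector: several pattern = replacement pairs in one call
multi <- str_replace_all("1 house, 2 cars", c("1" = "one", "2" = "two"))
multi  # "one house, two cars"

# str_split() returns a list with one character vector per input string
pieces <- str_split("a,b,c", ",")
pieces[[1]]  # "a" "b" "c"
```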