Data Transformation with dplyr
February 9, 2026
data.frameand your file is stored at:
data.framedata.frame$ operator extracts a single column from a data.frame as a vector (e.g., custdata$age).nrow()/ncol() returns the number of rows/columns.colnames() returns a character vector of column names.skimr::skim()skimr::skim() quickly summarizes types, missing values, and key descriptive stats, and it works well with pipes and grouped data (e.g., custdata |> dplyr::group_by(state_or_res) |> skimr::skim()).data.frame
data.frame is tidy when it follows three rules:
dplyr
dplyr is a core tidyverse package for data manipulation — tasks like filtering, sorting, selecting, renaming, and many more. Common dplyr functions with data.frame DF:filter(DF, LOGICAL_COND)arrange(DF, VARIABLES)distinct(DF, VARIABLES)select(DF, VARIABLES)relocate(DF, VARIABLES)rename(DF, NEW_NAME = CURRENT_NAME)mutate(DF, NEW_NAME = CURRENT_NAME)group_by(DF, VARIABLES)summarize(DF, NEW_NAME = CURRENT_NAME)slice_*(DF, VARIABLE, n = ..)dplyr functions take a data.frame as the first argument, and return a data.frame as output.dplyr Code Flow with the Pipe Operator|> or %>%) makes code easier to read by connecting steps in order.f(x, y) is the same as x |> f(y).DF:
filter(DF, logical_condition)DF |> filter(logical_condition)DF |> filter(LOGICAL_CONDITIONS)
DF |> arrange(VARIABLES)
DF |> distinct(VARIABLES)
DF |> select(VARIABLES)
DF |> relocate(VARIABLES)
DF |> rename(NEW_NAME = CURRENT_NAME)
DF |> mutate(NEW_NAME = CURRENT_NAME)
DF |> group_by(VARIABLES)
DF |> summarize(NEW_NAME = CURRENT_NAME)
DF |> slice_*(VARIABLE, n = ..)
This allows you to chain together multiple steps naturally:
Start with data, then filter, then arrange
DF |> filter(...) |> arrange(...)
|>) in RStudio:
|>).filter()filter()install.packages("nycflights13") # Install once
library(nycflights13)
library(tidyverse)
flights <- nycflights13::flights
flights$month == 1 # A logical test returns TRUE or FALSE
class(flights$month == 12) filter() keeps only the observations that meet one or more logical conditions.
TRUE (keep the observation) or FALSE (drop the observation).V1 and V2 are variables, and the comparisons are applied element-wise (vectorized).V1 and V2 are integer or numeric.x and y are logical conditions/variables.&, |, !) do is combining logical variables/conditions, which returns a logical variable when executed.x and y are logical conditions.
x is TRUE, it highlights the left circle.y is TRUE, it highlights the right circle.
filter(), separating conditions with a comma is equivalent to combining them with the & operator.NA)NA (not available) represents a missing or unknown value in R.NA).v1 is Mary’s age (unknown) and v2 is John’s age (unknown).NA.is.na()is.na() to test whether a value is missing (NA).filter(), you can:
is.na() to keep observations with missing values.!is.na() to remove observations with missing values.arrange()arrange()arrange() sorts out observations.arrange() sorts by the first variable, and then uses the next variable(s) to break ties.desc()desc(VARIABLE) to sort in descending order.
arrange() - Exampledistinct()distinct()distinct() removes duplicate observations.
distinct().
select()select()It’s not uncommon to get datasets with hundreds or thousands of variables.
select() allows us to narrow in on the variables we’re actually interested in.
We can select variables by their names.
select()select(-VARIABLES), we can remove variables.select()select():
starts_with("abc"): matches names that begin with “abc”.ends_with("xyz"): matches names that end with “xyz”.contains("ijk"): matches names that contain “ijk”.num_range("x", 1:3): matches x1, x2 and x3.rename()rename()rename() can be used to rename variables:
DF |> rename(NEW_NAME = CURRENT_NAME)relocate()relocate()relocate() to move variables around.relocate() with .before or .afterrelocate() to move variables around.
.before and .after arguments to choose where to put variables:mutate()mutate()mutate() to add variables to the data.frame:mutate() with .before or .after.before or .after argument to add the variables to the position of a column. is a sign that .before is an argument to the function, not the name of variable..before and .after, we can use the variable name instead of a position number.mutate() with Offsets and Cumulative FunctionsOffsets let you reference the previous or next value (useful for changes over time). Cumulative functions build running totals (or running mins/maxes).
mutate() with Ranking FunctionsRanking helpers are great for creating order, percentiles, or within-group standings.
rank_me <- data.frame(x = c(10, 5, 1, 5, 5, NA))
rank_me |>
mutate(
id = row_number(), # row id (1, 2, 3, ...)
x_row_number = row_number(x), # tie-breaking rank (arbitrary but stable)
x_min_rank = min_rank(x), # ranks with gaps when ties occur
x_dense_rank = dense_rank(x), # ranks without gaps when ties occur
x_percent = percent_rank(x),# 0 to 1 (excluding 1 unless max)
x_cume = cume_dist(x) # 0 to 1 (including 1 for max)
)mutate() with if_else() or ifelse()Use conditional functions to create a new variable based on a rule.
if_else()if_else() vs. ifelse()
if_else() when you want type safety (both outcomes must match).ifelse() only when you intentionally accept coercionmutate() with case_when()case_when() when you have 3+ conditions (an “if-else” chain). TRUE ~ ...
TRUE ~ ..., any observation that doesn’t match your listed conditions becomes NA.summarize()summarize() in one sentencesummarize() reduces rows by computing summary statistics.group_by(), you get one row for the entire dataset.group_by(...), you get one row per group.Mental model
group_by() splits → summarize() reduces → output combines
Key idea
group_by(cat2) splits the data into groups (one group per cat2 value).summarize(...) collapses each group down to one row (one row per group).Goal: For each destination (dest), compute
(1) number of flights, (2) average distance, (3) average arrival delay.
na.rm = TRUE MattersIf you don’t remove missing values, many summaries become NA.
A common approach is to create a dataset of not-cancelled flights first.
When you compute an average, include a count so you don’t trust tiny groups.
mean(x)na.rm = TRUE if there are missing valuessd(x)sd → more variabilitymin(x), max(x)quantile(x, p)p (e.g., 25th percentile)range(x)first(x), last(x)nth(x, n)arr_delay[arr_delay > 0][]arr_delay[arr_delay > 0] means:arr_delay that are greater than 0avg_delay can be pulled down by early flights (negative delays)avg_pos_delay answers: “When flights are late, how late are they on average?”n() counts rows in each groupsum(!is.na(x)) counts non-missing valuesn_distinct(x) counts distinct valuescount(x) is a shortcut for group_by(x) |> summarize(n = n())A histogram is basically “count by bins”.
sum(condition) counts how many are TRUEEach summarize() removes one layer of grouping by default.
Use ungroup() when you want to stop doing per-group operations.
mutate() and filter()mutate() and filter()Grouping is most common with summarize(), but it can also help with:
Goal: find (about) the 10 worst arrival delays within each day.
Goal: keep only destinations with lots of flights.
mutate()Goal: within each destination-day group, compute each flight’s share of total delay.