Lecture 3

Data Transformation with dplyr

Byeong-Hak Choe

SUNY Geneseo

February 9, 2026

Reading a Data Table as a data.frame

📍 Absolute Pathnames

  • An absolute pathname tells the computer the full, exact location of a file, starting from the top-level folder of the computer.
    • Because it starts from the top, it does not depend on your current working directory.
  • In R, you can check your working directory (the folder R is currently “looking in” by default) by running:
getwd()

âś… Examples (absolute path to custdata_rev.csv)

  • Mac
/Users/yourname/Documents/data/custdata_rev.csv
  • Windows
C:\\Users\\yourname\\Documents\\data\\custdata_rev.csv

📎 Relative Pathnames

  • A relative pathname gives a file location relative to the working directory.
    • In many web-related projects (and collaborative projects), relative paths are almost necessary.
  • Suppose your working directory is
`/Users/yourname/Documents/my_project/`

and your file is stored at:

`/Users/yourname/Documents/my_project/data/custdata_rev.csv`
  • Then the relative path for the file custdata_rev.csv is:
`data/custdata_rev.csv`

🔎 Getting to Know a data.frame

custdata <- read_csv('https://bcdanl.github.io/data/custdata_rev.csv')

Basic structure + a single column

class(custdata)
custdata$age

Size of the data.frame

nrow(custdata)  # rows only
ncol(custdata)  # columns only

Column names

colnames(custdata)
  • The $ operator extracts a single column from a data.frame as a vector (e.g., custdata$age).
  • nrow()/ncol() returns the number of rows/columns.
  • colnames() returns a character vector of column names.

Quick Summary with skimr::skim()

# Load and summarize
library(skimr)
skim(custdata)
  • skimr::skim() quickly summarizes types, missing values, and key descriptive stats, and it works well with pipes and grouped data (e.g., custdata |> dplyr::group_by(state_or_res) |> skimr::skim()).

Tidy data.frame

table1  # a sample data.frame from the tidyr package (part of the tidyverse)

  • Rows represent individual units (observations) where data is collected.
  • Columns represent variables (attributes) measured for each observation.
  • Tidy data is easier to read, analyze, and share.
  • A data.frame is tidy when it follows three rules:
    1. Each variable is a column.
    2. Each observation is a row.
    3. Each value is a cell.

Data Transformation

Data Transformation with dplyr

  • dplyr is a core tidyverse package for data manipulation — tasks like filtering, sorting, selecting, renaming, and many more. Common dplyr functions with data.frame DF:
  • filter(DF, LOGICAL_COND)
  • arrange(DF, VARIABLES)
  • distinct(DF, VARIABLES)
  • select(DF, VARIABLES)
  • relocate(DF, VARIABLES)
  • rename(DF, NEW_NAME = CURRENT_NAME)
  • mutate(DF, NEW_NAME = CURRENT_NAME)
  • group_by(DF, VARIABLES)
  • summarize(DF, NEW_NAME = CURRENT_NAME)
  • slice_*(DF, VARIABLE, n = ..)
  • dplyr functions take a data.frame as the first argument, and return a data.frame as output.

Making dplyr Code Flow with the Pipe Operator

  • The pipe operator (|> or %>%) makes code easier to read by connecting steps in order.
  • How it works:
    • f(x, y) is the same as x |> f(y).
    • Example with a data frame DF:
      • filter(DF, logical_condition)
      • is the same as
        DF |> filter(logical_condition)
  • You can read the pipe as “then”:
    • Example: Take the data, then filter it

Data Transformation with the Pipe

  • DF |> filter(LOGICAL_CONDITIONS)

  • DF |> arrange(VARIABLES)

  • DF |> distinct(VARIABLES)

  • DF |> select(VARIABLES)

  • DF |> relocate(VARIABLES)

  • DF |> rename(NEW_NAME = CURRENT_NAME)

  • DF |> mutate(NEW_NAME = CURRENT_NAME)

  • DF |> group_by(VARIABLES)

  • DF |> summarize(NEW_NAME = CURRENT_NAME)

  • DF |> slice_*(VARIABLE, n = ..)

  • This allows you to chain together multiple steps naturally:

Start with data, then filter, then arrange

DF |> filter(...) |> arrange(...)

Using the Pipe in RStudio

  • To enable the native pipe operator (|>) in RStudio:
    • Go to Tools > Global Options > Code (side menu).
    • Under Pipe operator, choose Use native pipe operator (|>).
  • Keyboard shortcut for inserting a pipe:
    • Windows: Ctrl + Shift + M
    • Mac: Cmd + Shift + M

Filter observations with filter()

Filter Observations with filter()

install.packages("nycflights13")  # Install once
library(nycflights13)
library(tidyverse)

flights <- nycflights13::flights

flights$month == 1  # A logical test returns TRUE or FALSE
class(flights$month == 12)  
# Flights in January or December
jan <- flights |> 
  filter(month == 1)

dec <- flights |> 
  filter(month == 12)
# Flights on January 1st or December 25th
jan1 <- flights |> 
  filter(month == 1, day == 1)

dec25 <- flights |> 
  filter(month == 12, day == 25)
  • filter() keeps only the observations that meet one or more logical conditions.
    • A logical condition evaluates to TRUE (keep the observation) or FALSE (drop the observation).

Logical Conditions with Equality and Inequality

  • Here, both V1 and V2 are variables, and the comparisons are applied element-wise (vectorized).
  • For logical conditions using inequalities, we focus on cases where V1 and V2 are integer or numeric.

Logical Conditions - Example 1

df <- data.frame(
  num = c(8, 9, 10, 11),
  chr = c("A", "C", "B", "A")
  )
df |> 
  filter(num > 8)
df |> 
  filter(chr == "A")

Logical Operators — AND, OR, NOT

  • Here, both x and y are logical conditions/variables.
  • What logical operations (&, |, !) do is combining logical variables/conditions, which returns a logical variable when executed.

Logical Operations

  • x and y are logical conditions.
    • If x is TRUE, it highlights the left circle.
    • If y is TRUE, it highlights the right circle.

  • The shaded regions show which parts each logical operator selects.

Logical Conditions - Example 2

df <- data.frame(
  num = c(8, 9, 10, 11),
  chr = c("A", "C", "B", "A")
  )
df |> 
  filter(num > 8 & 
         num < 11)
                
df |> 
  filter(num > 8,
         num < 11)
  • In filter(), separating conditions with a comma is equivalent to combining them with the & operator.

Logical Conditions - Example 3

df <- data.frame(
  num = c(8, 9, 10, 11),
  chr = c("A", "C", "B", "A")
  )
df |> 
  filter(num < 10 & 
         chr == "A")

df |> 
  filter(num < 10, 
         chr == "A")

Logical Conditions - Example 4

df <- data.frame(
  num = c(8, 9, 10, 11),
  chr = c("A", "C", "B", "A")
  )
df |> 
  filter(num < 10 | 
         chr == "A")

Missing Values (NA)

NA > 5
10 == NA
NA + 10
NA / 2
(1 + NA + 3) / 3

mean( c(1, NA, 3) )
sd( c(1, NA, 3) )
  • NA (not available) represents a missing or unknown value in R.
  • In most calculations, if one value is unknown, the result will also be unknown (NA).

Comparing Missing Values

v1 <- NA
v2 <- NA
v1 == v2
  • Suppose v1 is Mary’s age (unknown) and v2 is John’s age (unknown).
  • Can we say they are the same age?
  • Since both values are missing, R cannot know — so the result is also NA.

Checking for Missing Values with is.na()

num_missing <- NA
is.na(num_missing)  # is num_missing NA?

text_missing <- "missing"
is.na(text_missing) # is text_missing NA?

V1 <- c(1, NA, 3)
is.na(V1)  # is V1 NA?
!is.na(V1)  # is V1 not NA?
df <- data.frame(
  v1 = c(1, NA, 3),
  v2 = c(1, 2, 3)
  )

df |> 
  filter( is.na(v1) )

df |> 
  filter( !is.na(v1) )
  • Use is.na() to test whether a value is missing (NA).
  • In filter(), you can:
    • Use is.na() to keep observations with missing values.
    • Use !is.na() to remove observations with missing values.

Arrange Observations with arrange()

Arrange Observations with arrange()

# Sort observations 
  # by dep_delay (ascending)
flights |> 
  arrange(dep_delay)
# Sort by dep_delay, 
  # then by sched_dep_time
flights |> 
  arrange(dep_delay, sched_dep_time)
  • arrange() sorts out observations.
  • When you provide multiple variables, arrange() sorts by the first variable, and then uses the next variable(s) to break ties.

Descending Order with desc()

# Sort by departure delay 
  # in descending order
flights |> 
  arrange(desc(dep_delay))
# Equivalent shortcut 
  # for numeric variables
flights |> 
  arrange(-dep_delay)
  • Use desc(VARIABLE) to sort in descending order.
    • For numeric variables, you can also use a leading minus sign.

Arrange Observations with arrange() - Example

df <- data.frame(
  year = c(2024, 2021, 2024, 2024),
  month = c(7, 10, 7, 4),
  day = c(20, 19, 15, 9)
)
df |> 
  arrange(year, month, day)
  • If we provide more than one variable name, each additional variable will be used to break ties in the values of preceding variables.

Find All Unique Observations with distinct()

Find All Unique Observations with distinct()

df <- data.frame(
  country = c("USA", "Korea", "USA"),
  city = c("D.C.", "Seoul", "D.C."),
  subregion = c("Georgetown", 
                "Gangnam", 
                "Georgetown") 
  )
# Remove duplicate observations
df |> 
  distinct()
  • distinct() removes duplicate observations.
    • By default, it checks all variables to find unique observations.

Find All Unique Combinations of Selected Variables

# Remove duplicate observations 
  # from the flights data.frame
flights |> 
  distinct()
# Find all unique 
  #  origin-destination pairs
flights |> 
  distinct(origin, dest)
  • You can pass one or more variables to distinct().
    • This returns unique combinations of just those variables (while ignoring the rest).

Select Variables with select()

Select Variables with select()

  • It’s not uncommon to get datasets with hundreds or thousands of variables.

  • select() allows us to narrow in on the variables we’re actually interested in.

  • We can select variables by their names.

flights |> 
  select(year, month, day)

Remove Variables with select()

flights |> 
  select(-year)
  • With select(-VARIABLES), we can remove variables.

Helper Functions with select()

  • There are a number of helper functions we can use within select():
    • starts_with("abc"): matches names that begin with “abc”.
    • ends_with("xyz"): matches names that end with “xyz”.
    • contains("ijk"): matches names that contain “ijk”.
    • num_range("x", 1:3): matches x1, x2 and x3.

Rename Variables with rename()

Rename Variables with rename()

flights |> 
  rename( tail_num = tailnum )
  • rename() can be used to rename variables:

    • DF |> rename(NEW_NAME = CURRENT_NAME)

Relocate Variables with relocate()

Relocate Variables with relocate()

flights |> 
  relocate(time_hour, air_time)
  • We can use relocate() to move variables around.

relocate() with .before or .after

flights |> 
  relocate(year:dep_time, .after = time_hour)

flights |> 
  relocate(starts_with("arr"), .before = dep_time)
  • We can also use relocate() to move variables around.
    • We can specify .before and .after arguments to choose where to put variables:

Add New Variables with mutate()

Add New Variables with mutate()

flights |> 
  mutate(
    gain = dep_delay - arr_delay,
    speed = distance / air_time * 60
  )
  • We can use the mutate() to add variables to the data.frame:

mutate() with .before or .after

flights |> 
  mutate(
    gain = dep_delay - arr_delay,
    speed = distance / air_time * 60,
    .before = 1  ## try different position numbers.
  )
flights |> 
  mutate(
    gain = dep_delay - arr_delay,
    speed = distance / air_time * 60,
    .after = day  
  )
  • We can use the .before or .after argument to add the variables to the position of a column
  • The . is a sign that .before is an argument to the function, not the name of variable.
  • In both .before and .after, we can use the variable name instead of a position number.

mutate() with Offsets and Cumulative Functions

Offsets let you reference the previous or next value (useful for changes over time). Cumulative functions build running totals (or running mins/maxes).

df <- data.frame(x = 1:10,
                 y = c(2,4,6,8,10,
                       1,3,5,7,9))
df <- df |> 
  mutate(
    x_lag = lag(x),  # previous value
    x_lead = lead(x),  # next value
    y_cumsum = cumsum(y),    # running sum
    y_cummean = cummean(y),   # running average
    y_cummin = cummin(y),    # running minimum
    y_cummax = cummax(y)    # running maximum
  )

mutate() with Ranking Functions

Ranking helpers are great for creating order, percentiles, or within-group standings.

rank_me <- data.frame(x = c(10, 5, 1, 5, 5, NA))

rank_me |>
  mutate(
    id = row_number(),             # row id (1, 2, 3, ...)
    x_row_number = row_number(x),  # tie-breaking rank (arbitrary but stable)
    x_min_rank   = min_rank(x),    # ranks with gaps when ties occur
    x_dense_rank = dense_rank(x),  # ranks without gaps when ties occur
    x_percent    = percent_rank(x),# 0 to 1 (excluding 1 unless max)
    x_cume       = cume_dist(x)    # 0 to 1 (including 1 for max)
  )

mutate() with if_else() or ifelse()

Use conditional functions to create a new variable based on a rule.

if_else()

flight_season <- flights |>
  mutate(
    summer_month = 
      if_else(month %in% c(6, 7, 8), 
              TRUE, FALSE)
  )
  • Strict: both outputs must be the same type (both logical, both numeric, both character, etc.).

ifelse()

flight_season <- flights |>
  mutate(
    summer_month = 
      ifelse(month %in% c(6, 7, 8), 
             TRUE, 0)  # coerces to 
  )                    #    numeric
  • More flexible (will coerce types if needed), but can be easier to misuse.

if_else() vs. ifelse()

  • Use if_else() when you want type safety (both outcomes must match).
  • Use ifelse() only when you intentionally accept coercion

mutate() with case_when()

scores <- tibble(score = c(95, 82, 74, 61, NA))
scores |>
  mutate(
    grade = case_when(
      is.na(score)      ~ "missing",
      score >= 90       ~ "A",
      score >= 80       ~ "B",
      score >= 70       ~ "C",
      score >= 60       ~ "D",
      TRUE              ~ "F"
    )
  )
  • Use case_when() when you have 3+ conditions (an “if-else” chain).
  • Order matters: the first matching condition is used.
  • Recommend including a default case: TRUE ~ ...
    • Without TRUE ~ ..., any observation that doesn’t match your listed conditions becomes NA.

Grouped Summaries with summarize()

summarize() in one sentence

  • summarize() reduces rows by computing summary statistics.
  • Without group_by(), you get one row for the entire dataset.
  • With group_by(...), you get one row per group.
# One row (entire dataset)
flights |> 
  summarize(mean_dep_delay = mean(dep_delay, na.rm = TRUE))

# One row per day
flights |>
  group_by(year, month, day) |>
  summarize(mean_dep_delay = mean(dep_delay, na.rm = TRUE))

Mental model

group_by() splits → summarize() reduces → output combines

What Happens When We Group Then Summarize?

Split-Apply-Combine

df <- data.frame(
  cat1 = c(rep("a",3), rep("b", 3), rep("c", 3)),
  cat2 = c("j", "j", "k", "j", "k", "k", "j", "j", "k"),
  x = 1:9
)

df |> 
  group_by(cat2) |> 
  summarize(
    avg   = mean(x),
    total = sum(x),
    n     = n()
  )

Key idea

  • group_by(cat2) splits the data into groups (one group per cat2 value).
  • summarize(...) collapses each group down to one row (one row per group).

Example: Destination-level Summaries

Goal: For each destination (dest), compute
(1) number of flights, (2) average distance, (3) average arrival delay.

by_dest <- flights |> 
  group_by(dest) |> 
  summarize(
    n_flights = n(),
    mean_dist = mean(distance, na.rm = TRUE),
    mean_arr_delay = mean(arr_delay, na.rm = TRUE)
  )

Visualize the Relationship

by_dest |> 
  filter(n_flights > 20, dest != "HNL") |> 
  ggplot(aes(x = mean_dist, y = mean_arr_delay)) +
  geom_point(aes(size = n_flights), alpha = 1/3) +
  geom_smooth(se = FALSE)

Missing Values: Why na.rm = TRUE Matters

If you don’t remove missing values, many summaries become NA.

# Missing values included → result may be NA
flights |> 
  group_by(year, month, day) |> 
  summarize(mean_dep_delay = mean(dep_delay))

# Missing values removed → numeric result
flights |> 
  group_by(year, month, day) |> 
  summarize(mean_dep_delay = mean(dep_delay, na.rm = TRUE))

If Missing Means “cancelled”

A common approach is to create a dataset of not-cancelled flights first.

not_cancelled <- flights |> 
  filter(!is.na(dep_delay), !is.na(arr_delay))

Counts: always include a size check

When you compute an average, include a count so you don’t trust tiny groups.

delays_by_plane <- not_cancelled |> 
  group_by(tailnum) |> 
  summarize(
    mean_arr_delay = mean(arr_delay, na.rm = TRUE),
    n_flights = n()
  )
ggplot(delays_by_plane, 
       aes(x = n_flights, 
           y = mean_arr_delay)) +
  geom_point(alpha = 1/10)
delays_by_plane |> 
  filter(n_flights > 25) |> 
  ggplot(aes(x = n_flights, 
             y = mean_arr_delay)) +
  geom_point(alpha = 1/10)

Useful Summary Functions: Location (Center)

Mean: mean(x)

not_cancelled |> 
  summarize(mean_arr_delay = mean(arr_delay, na.rm = TRUE))
  • Average value (adds values and divides by count)
  • Use na.rm = TRUE if there are missing values

Median: median(x)

not_cancelled |> 
  summarize(median_arr_delay = median(arr_delay, na.rm = TRUE))
  • Middle value (50th percentile)
  • More robust to outliers than the mean

Useful Summary Functions: Spread (Variation)

Standard deviation: sd(x)

not_cancelled |> 
  summarize(sd_arr_delay = sd(arr_delay, na.rm = TRUE))
  • Typical distance from the mean
  • Bigger sd → more variability

Interquartile range: IQR(x)

not_cancelled |> 
  summarize(iqr_arr_delay = IQR(arr_delay, na.rm = TRUE))
  • Spread of the middle 50%: Q3 - Q1
  • Robust to extreme outliers

Minimum / maximum: min(x), max(x)

not_cancelled |> 
  group_by(year, month, day) |> 
  summarize(
    earliest_dep = min(dep_time, na.rm = TRUE),
    latest_dep   = max(dep_time, na.rm = TRUE)
  )
  • Smallest and largest values in a group

Quantiles: quantile(x, p)

not_cancelled |> 
  summarize(
    q25_arr_delay = quantile(arr_delay, 0.25, na.rm = TRUE),
    q75_arr_delay = quantile(arr_delay, 0.75, na.rm = TRUE)
  )
  • Value at percentile p (e.g., 25th percentile)

Range: range(x)

not_cancelled |> 
  summarize(dep_time_range = list(range(dep_time, na.rm = TRUE)))
  • Returns both min and max together

First / last: first(x), last(x)

not_cancelled |> 
  arrange(year, month, day, dep_time) |> 
  group_by(year, month, day) |> 
  summarize(
    first_dep = first(dep_time),
    last_dep  = last(dep_time)
  )
  • First and last values in the current row order
  • If you need a specific definition of “first”, sort first!

Nth: nth(x, n)

not_cancelled |> 
  arrange(year, month, day, dep_time) |> 
  group_by(year, month, day) |> 
  summarize(second_dep = nth(dep_time, 2))
  • The n-th value in the current order

Conditional Subsetting Inside a Summary

Example: arr_delay[arr_delay > 0]

not_cancelled |> 
  group_by(year, month, day) |> 
  summarize(
    avg_delay = mean(arr_delay, na.rm = TRUE),
    avg_pos_delay = mean(arr_delay[arr_delay > 0], na.rm = TRUE)
  )
  • Inside a summary, you can “filter” a single column using brackets []
  • arr_delay[arr_delay > 0] means:
    keep only the values of arr_delay that are greater than 0

Why do this?

  • avg_delay can be pulled down by early flights (negative delays)
  • avg_pos_delay answers: “When flights are late, how late are they on average?”

When do the first and last flights leave each day?

Option A: min / max (robust)

not_cancelled |> 
  group_by(year, month, day) |> 
  summarize(
    first_dep = min(dep_time, na.rm = TRUE),
    last_dep  = max(dep_time, na.rm = TRUE)
  )

Option B: first / last (order-dependent)

not_cancelled |> 
  group_by(year, month, day) |> 
  summarize(
    first_dep = first(dep_time),
    last_dep  = last(dep_time)
  )

Option C: rank within group

not_cancelled |> 
  group_by(year, month, day) |> 
  mutate(r = min_rank(desc(dep_time))) |> 
  filter(r %in% range(r))

Counting helpers

# Which destinations have the most carriers?
not_cancelled |> 
  group_by(dest) |> 
  summarize(n_carriers = n_distinct(carrier)) |> 
  arrange(desc(n_carriers))

# Counts by destination
not_cancelled |> 
  count(dest)
  • n() counts rows in each group
  • sum(!is.na(x)) counts non-missing values
  • n_distinct(x) counts distinct values
  • count(x) is a shortcut for group_by(x) |> summarize(n = n())

Counts inside ggplot: histogram bins

A histogram is basically “count by bins”.

ggplot(diamonds) +
  geom_histogram(aes(x = carat), binwidth = 0.5)
diamonds |> 
  count(cut_width(carat, 0.5))

Logical Summaries: Counts + Proportions

# How many flights left before 5am?
not_cancelled |> 
  group_by(year, month, day) |> 
  summarize(n_early = sum(dep_time < 500, na.rm = TRUE))
  • sum(condition) counts how many are TRUE
# What proportion of flights are delayed by more than an hour?
not_cancelled |> 
  group_by(year, month, day) |> 
  summarize(prop_over_60 = mean(arr_delay > 60, na.rm = TRUE))
  • mean(condition) gives the proportion TRUE (because TRUE = 1, FALSE = 0)

Grouping by Multiple Variables: “Peeling” Groups

Each summarize() removes one layer of grouping by default.

daily <- flights |> group_by(year, month, day)

per_day <- daily |> summarize(flights = n())
per_month <- per_day |> summarize(flights = sum(flights))
per_year <- per_month |> summarize(flights = sum(flights))

per_year

Ungrouping

Use ungroup() when you want to stop doing per-group operations.

daily |> 
  ungroup() |>               # no longer grouped by date
  summarize(flights = n())   # all flights in the dataset

Grouped mutate() and filter()

Grouped mutate() and filter()

Grouping is most common with summarize(), but it can also help with:

  • filtering within groups (e.g., “worst flights per day”)
  • creating per-group metrics (e.g., “share of delay within a destination”)

Example 1: “worst” within each group

Goal: find (about) the 10 worst arrival delays within each day.

flights |> 
  group_by(year, month, day) |>
  filter(dense_rank(desc(arr_delay)) <= 10)

Example 2: groups above a threshold

Goal: keep only destinations with lots of flights.

popular_dests <- flights |> 
  group_by(dest) |> 
  filter(n() > 17250)

popular_dests

Example 3: per-group metrics with mutate()

Goal: within each destination-day group, compute each flight’s share of total delay.

popular_dests |> 
  filter(arr_delay > 0) |> 
  group_by(year, month, day, dest) |> 
  mutate(prop_delay = arr_delay / sum(arr_delay, na.rm = TRUE)) |> 
  select(year:day, dest, arr_delay, prop_delay)