Lecture 3

Data Transformation with dplyr

Byeong-Hak Choe

bchoe@geneseo.edu

SUNY Geneseo

February 9, 2026

Reading a Data Table as a `data.frame`

📍 Absolute Pathnames

An absolute pathname tells the computer the full, exact location of a file, starting from the top-level folder of the computer.
- Because it starts from the top, it does not depend on your current working directory.

In R, you can check your working directory (the folder R is currently “looking in” by default) by running:

getwd()

✅ Examples (absolute path to `custdata_rev.csv`)

Mac

/Users/yourname/Documents/data/custdata_rev.csv

Windows

C:\\Users\\yourname\\Documents\\data\\custdata_rev.csv

📎 Relative Pathnames

A relative pathname gives a file location relative to the working directory.
- In many web-related projects (and collaborative projects), relative paths are almost necessary.

Suppose your working directory is

`/Users/yourname/Documents/my_project/`

and your file is stored at:

`/Users/yourname/Documents/my_project/data/custdata_rev.csv`

Then the relative path for the file custdata_rev.csv is:

`data/custdata_rev.csv`

🔎 Getting to Know a `data.frame`

custdata <- read_csv('https://bcdanl.github.io/data/custdata_rev.csv')

Basic structure + a single column

class(custdata)
custdata$age

Size of the `data.frame`

nrow(custdata)  # rows only
ncol(custdata)  # columns only

Column names

colnames(custdata)

The $ operator extracts a single column from a data.frame as a vector (e.g., custdata$age).

nrow()/ncol() returns the number of rows/columns.

colnames() returns a character vector of column names.

Quick Summary with `skimr::skim()`

# Load and summarize
library(skimr)
skim(custdata)

skimr::skim() quickly summarizes types, missing values, and key descriptive stats, and it works well with pipes and grouped data (e.g., custdata |> dplyr::group_by(state_or_res) |> skimr::skim()).

Tidy `data.frame`

table1  # a sample data.frame from the tidyr package (part of the tidyverse)

Rows represent individual units (observations) where data is collected.
Columns represent variables (attributes) measured for each observation.
Tidy data is easier to read, analyze, and share.
A data.frame is tidy when it follows three rules:
1. Each variable is a column.
2. Each observation is a row.
3. Each value is a cell.

Data Transformation

Data Transformation with `dplyr`

dplyr is a core tidyverse package for data manipulation — tasks like filtering, sorting, selecting, renaming, and many more. Common dplyr functions with data.frame DF:

filter(DF, LOGICAL_COND)
arrange(DF, VARIABLES)
distinct(DF, VARIABLES)
select(DF, VARIABLES)
relocate(DF, VARIABLES)
rename(DF, NEW_NAME = CURRENT_NAME)

mutate(DF, NEW_NAME = CURRENT_NAME)
group_by(DF, VARIABLES)
summarize(DF, NEW_NAME = CURRENT_NAME)
slice_*(DF, VARIABLE, n = ..)

dplyr functions take a data.frame as the first argument, and return a data.frame as output.

Making `dplyr` Code Flow with the Pipe Operator

The pipe operator (|> or %>%) makes code easier to read by connecting steps in order.
How it works:
- f(x, y) is the same as x |> f(y).
- Example with a data frame DF:
  - filter(DF, logical_condition)
  - is the same as
    DF |> filter(logical_condition)
You can read the pipe as “then”:
- Example: Take the data, then filter it

Data Transformation with the Pipe

DF |> filter(LOGICAL_CONDITIONS)
DF |> arrange(VARIABLES)
DF |> distinct(VARIABLES)
DF |> select(VARIABLES)
DF |> relocate(VARIABLES)
DF |> rename(NEW_NAME = CURRENT_NAME)
DF |> mutate(NEW_NAME = CURRENT_NAME)
DF |> group_by(VARIABLES)
DF |> summarize(NEW_NAME = CURRENT_NAME)
DF |> slice_*(VARIABLE, n = ..)
This allows you to chain together multiple steps naturally:

Start with data, then filter, then arrange

DF |> filter(...) |> arrange(...)

Using the Pipe in RStudio

To enable the native pipe operator (|>) in RStudio:
- Go to Tools > Global Options > Code (side menu).
- Under Pipe operator, choose Use native pipe operator (|>).

Keyboard shortcut for inserting a pipe:
- Windows: Ctrl + Shift + M
- Mac: Cmd + Shift + M

Filter observations with `filter()`

Filter Observations with `filter()`

install.packages("nycflights13")  # Install once
library(nycflights13)
library(tidyverse)

flights <- nycflights13::flights

flights$month == 1  # A logical test returns TRUE or FALSE
class(flights$month == 12)

# Flights in January or December
jan <- flights |> 
  filter(month == 1)

dec <- flights |> 
  filter(month == 12)

# Flights on January 1st or December 25th
jan1 <- flights |> 
  filter(month == 1, day == 1)

dec25 <- flights |> 
  filter(month == 12, day == 25)

filter() keeps only the observations that meet one or more logical conditions.
- A logical condition evaluates to TRUE (keep the observation) or FALSE (drop the observation).

Logical Conditions with Equality and Inequality

Here, both V1 and V2 are variables, and the comparisons are applied element-wise (vectorized).

For logical conditions using inequalities, we focus on cases where V1 and V2 are integer or numeric.

Logical Conditions - Example 1

df <- data.frame(
  num = c(8, 9, 10, 11),
  chr = c("A", "C", "B", "A")
  )

df |> 
  filter(num > 8)

df |> 
  filter(chr == "A")

Logical Operators — AND, OR, NOT

Here, both x and y are logical conditions/variables.

What logical operations (&, |, !) do is combining logical variables/conditions, which returns a logical variable when executed.

Logical Operations

x and y are logical conditions.
- If x is TRUE, it highlights the left circle.
- If y is TRUE, it highlights the right circle.

The shaded regions show which parts each logical operator selects.

Logical Conditions - Example 2

df <- data.frame(
  num = c(8, 9, 10, 11),
  chr = c("A", "C", "B", "A")
  )

df |> 
  filter(num > 8 & 
         num < 11)
                
df |> 
  filter(num > 8,
         num < 11)

In filter(), separating conditions with a comma is equivalent to combining them with the & operator.

Logical Conditions - Example 3

df <- data.frame(
  num = c(8, 9, 10, 11),
  chr = c("A", "C", "B", "A")
  )

df |> 
  filter(num < 10 & 
         chr == "A")

df |> 
  filter(num < 10, 
         chr == "A")

Logical Conditions - Example 4

df <- data.frame(
  num = c(8, 9, 10, 11),
  chr = c("A", "C", "B", "A")
  )

df |> 
  filter(num < 10 | 
         chr == "A")

Missing Values (`NA`)

NA > 5
10 == NA
NA + 10
NA / 2
(1 + NA + 3) / 3

mean( c(1, NA, 3) )
sd( c(1, NA, 3) )

NA (not available) represents a missing or unknown value in R.

In most calculations, if one value is unknown, the result will also be unknown (NA).

Comparing Missing Values

v1 <- NA
v2 <- NA
v1 == v2

Suppose v1 is Mary’s age (unknown) and v2 is John’s age (unknown).
Can we say they are the same age?

Since both values are missing, R cannot know — so the result is also NA.

Checking for Missing Values with `is.na()`

num_missing <- NA
is.na(num_missing)  # is num_missing NA?

text_missing <- "missing"
is.na(text_missing) # is text_missing NA?

V1 <- c(1, NA, 3)
is.na(V1)  # is V1 NA?
!is.na(V1)  # is V1 not NA?

df <- data.frame(
  v1 = c(1, NA, 3),
  v2 = c(1, 2, 3)
  )

df |> 
  filter( is.na(v1) )

df |> 
  filter( !is.na(v1) )

Use is.na() to test whether a value is missing (NA).

In filter(), you can:
- Use is.na() to keep observations with missing values.
- Use !is.na() to remove observations with missing values.

Arrange Observations with `arrange()`

# Sort observations 
  # by dep_delay (ascending)
flights |> 
  arrange(dep_delay)

# Sort by dep_delay, 
  # then by sched_dep_time
flights |> 
  arrange(dep_delay, sched_dep_time)

arrange() sorts out observations.
When you provide multiple variables, arrange() sorts by the first variable, and then uses the next variable(s) to break ties.

Descending Order with `desc()`

# Sort by departure delay 
  # in descending order
flights |> 
  arrange(desc(dep_delay))

# Equivalent shortcut 
  # for numeric variables
flights |> 
  arrange(-dep_delay)

Use desc(VARIABLE) to sort in descending order.
- For numeric variables, you can also use a leading minus sign.

Arrange Observations with `arrange()` - Example

df <- data.frame(
  year = c(2024, 2021, 2024, 2024),
  month = c(7, 10, 7, 4),
  day = c(20, 19, 15, 9)
)

df |> 
  arrange(year, month, day)

If we provide more than one variable name, each additional variable will be used to break ties in the values of preceding variables.

Find All Unique Observations with `distinct()`

df <- data.frame(
  country = c("USA", "Korea", "USA"),
  city = c("D.C.", "Seoul", "D.C."),
  subregion = c("Georgetown", 
                "Gangnam", 
                "Georgetown") 
  )

# Remove duplicate observations
df |> 
  distinct()

distinct() removes duplicate observations.
- By default, it checks all variables to find unique observations.

Find All Unique Combinations of Selected Variables

# Remove duplicate observations 
  # from the flights data.frame
flights |> 
  distinct()

# Find all unique 
  #  origin-destination pairs
flights |> 
  distinct(origin, dest)

You can pass one or more variables to distinct().
- This returns unique combinations of just those variables (while ignoring the rest).

Select Variables with `select()`

It’s not uncommon to get datasets with hundreds or thousands of variables.
select() allows us to narrow in on the variables we’re actually interested in.
We can select variables by their names.

flights |> 
  select(year, month, day)

Remove Variables with `select()`

flights |> 
  select(-year)

With select(-VARIABLES), we can remove variables.

Helper Functions with `select()`

There are a number of helper functions we can use within select():
- starts_with("abc"): matches names that begin with “abc”.
- ends_with("xyz"): matches names that end with “xyz”.
- contains("ijk"): matches names that contain “ijk”.
- num_range("x", 1:3): matches x1, x2 and x3.

Rename Variables with `rename()`

flights |> 
  rename( tail_num = tailnum )

rename() can be used to rename variables:
- DF |> rename(NEW_NAME = CURRENT_NAME)

Relocate Variables with `relocate()`

flights |> 
  relocate(time_hour, air_time)

We can use relocate() to move variables around.

`relocate()` with `.before` or `.after`

flights |> 
  relocate(year:dep_time, .after = time_hour)

flights |> 
  relocate(starts_with("arr"), .before = dep_time)

We can also use relocate() to move variables around.
- We can specify .before and .after arguments to choose where to put variables:

Add New Variables with `mutate()`

flights |> 
  mutate(
    gain = dep_delay - arr_delay,
    speed = distance / air_time * 60
  )

We can use the mutate() to add variables to the data.frame:

`mutate()` with `.before` or `.after`

flights |> 
  mutate(
    gain = dep_delay - arr_delay,
    speed = distance / air_time * 60,
    .before = 1  ## try different position numbers.
  )

flights |> 
  mutate(
    gain = dep_delay - arr_delay,
    speed = distance / air_time * 60,
    .after = day  
  )

We can use the .before or .after argument to add the variables to the position of a column
The . is a sign that .before is an argument to the function, not the name of variable.
In both .before and .after, we can use the variable name instead of a position number.

`mutate()` with Offsets and Cumulative Functions

Offsets let you reference the previous or next value (useful for changes over time). Cumulative functions build running totals (or running mins/maxes).

df <- data.frame(x = 1:10,
                 y = c(2,4,6,8,10,
                       1,3,5,7,9))
df <- df |> 
  mutate(
    x_lag = lag(x),  # previous value
    x_lead = lead(x),  # next value
    y_cumsum = cumsum(y),    # running sum
    y_cummean = cummean(y),   # running average
    y_cummin = cummin(y),    # running minimum
    y_cummax = cummax(y)    # running maximum
  )

`mutate()` with Ranking Functions

Ranking helpers are great for creating order, percentiles, or within-group standings.

rank_me <- data.frame(x = c(10, 5, 1, 5, 5, NA))

rank_me |>
  mutate(
    id = row_number(),             # row id (1, 2, 3, ...)
    x_row_number = row_number(x),  # tie-breaking rank (arbitrary but stable)
    x_min_rank   = min_rank(x),    # ranks with gaps when ties occur
    x_dense_rank = dense_rank(x),  # ranks without gaps when ties occur
    x_percent    = percent_rank(x),# 0 to 1 (excluding 1 unless max)
    x_cume       = cume_dist(x)    # 0 to 1 (including 1 for max)
  )

`mutate()` with `if_else()` or `ifelse()`

Use conditional functions to create a new variable based on a rule.

`if_else()`

flight_season <- flights |>
  mutate(
    summer_month = 
      if_else(month %in% c(6, 7, 8), 
              TRUE, FALSE)
  )

Strict: both outputs must be the same type (both logical, both numeric, both character, etc.).

`ifelse()`

flight_season <- flights |>
  mutate(
    summer_month = 
      ifelse(month %in% c(6, 7, 8), 
             TRUE, 0)  # coerces to 
  )                    #    numeric

More flexible (will coerce types if needed), but can be easier to misuse.

if_else() vs. ifelse()

Use if_else() when you want type safety (both outcomes must match).
Use ifelse() only when you intentionally accept coercion

`mutate()` with `case_when()`

scores <- tibble(score = c(95, 82, 74, 61, NA))
scores |>
  mutate(
    grade = case_when(
      is.na(score)      ~ "missing",
      score >= 90       ~ "A",
      score >= 80       ~ "B",
      score >= 70       ~ "C",
      score >= 60       ~ "D",
      TRUE              ~ "F"
    )
  )

Use case_when() when you have 3+ conditions (an “if-else” chain).
Order matters: the first matching condition is used.
Recommend including a default case: TRUE ~ ...
- Without TRUE ~ ..., any observation that doesn’t match your listed conditions becomes NA.

Grouped Summaries with `summarize()`

`summarize()` in one sentence

summarize() reduces rows by computing summary statistics.
Without group_by(), you get one row for the entire dataset.
With group_by(...), you get one row per group.

# One row (entire dataset)
flights |> 
  summarize(mean_dep_delay = mean(dep_delay, na.rm = TRUE))

# One row per day
flights |>
  group_by(year, month, day) |>
  summarize(mean_dep_delay = mean(dep_delay, na.rm = TRUE))

Mental model

group_by() splits → summarize() reduces → output combines

What Happens When We Group Then Summarize?

Split-Apply-Combine

df <- data.frame(
  cat1 = c(rep("a",3), rep("b", 3), rep("c", 3)),
  cat2 = c("j", "j", "k", "j", "k", "k", "j", "j", "k"),
  x = 1:9
)

df |> 
  group_by(cat2) |> 
  summarize(
    avg   = mean(x),
    total = sum(x),
    n     = n()
  )

Key idea

group_by(cat2) splits the data into groups (one group per cat2 value).
summarize(...) collapses each group down to one row (one row per group).

Example: Destination-level Summaries

Goal: For each destination (dest), compute
(1) number of flights, (2) average distance, (3) average arrival delay.

by_dest <- flights |> 
  group_by(dest) |> 
  summarize(
    n_flights = n(),
    mean_dist = mean(distance, na.rm = TRUE),
    mean_arr_delay = mean(arr_delay, na.rm = TRUE)
  )

Visualize the Relationship

by_dest |> 
  filter(n_flights > 20, dest != "HNL") |> 
  ggplot(aes(x = mean_dist, y = mean_arr_delay)) +
  geom_point(aes(size = n_flights), alpha = 1/3) +
  geom_smooth(se = FALSE)

Missing Values: Why `na.rm = TRUE` Matters

If you don’t remove missing values, many summaries become NA.

# Missing values included → result may be NA
flights |> 
  group_by(year, month, day) |> 
  summarize(mean_dep_delay = mean(dep_delay))

# Missing values removed → numeric result
flights |> 
  group_by(year, month, day) |> 
  summarize(mean_dep_delay = mean(dep_delay, na.rm = TRUE))

If Missing Means “cancelled”

A common approach is to create a dataset of not-cancelled flights first.

not_cancelled <- flights |> 
  filter(!is.na(dep_delay), !is.na(arr_delay))

Counts: always include a size check

When you compute an average, include a count so you don’t trust tiny groups.

delays_by_plane <- not_cancelled |> 
  group_by(tailnum) |> 
  summarize(
    mean_arr_delay = mean(arr_delay, na.rm = TRUE),
    n_flights = n()
  )

ggplot(delays_by_plane, 
       aes(x = n_flights, 
           y = mean_arr_delay)) +
  geom_point(alpha = 1/10)

delays_by_plane |> 
  filter(n_flights > 25) |> 
  ggplot(aes(x = n_flights, 
             y = mean_arr_delay)) +
  geom_point(alpha = 1/10)

Useful Summary Functions: Location (Center)

Mean: `mean(x)`

not_cancelled |> 
  summarize(mean_arr_delay = mean(arr_delay, na.rm = TRUE))

Average value (adds values and divides by count)
Use na.rm = TRUE if there are missing values

Median: `median(x)`

not_cancelled |> 
  summarize(median_arr_delay = median(arr_delay, na.rm = TRUE))

Middle value (50th percentile)
More robust to outliers than the mean

Useful Summary Functions: Spread (Variation)

Standard deviation: `sd(x)`

not_cancelled |> 
  summarize(sd_arr_delay = sd(arr_delay, na.rm = TRUE))

Typical distance from the mean
Bigger sd → more variability

Interquartile range: `IQR(x)`

not_cancelled |> 
  summarize(iqr_arr_delay = IQR(arr_delay, na.rm = TRUE))

Spread of the middle 50%: Q3 - Q1
Robust to extreme outliers

Minimum / maximum: `min(x)`, `max(x)`

not_cancelled |> 
  group_by(year, month, day) |> 
  summarize(
    earliest_dep = min(dep_time, na.rm = TRUE),
    latest_dep   = max(dep_time, na.rm = TRUE)
  )

Smallest and largest values in a group

Quantiles: `quantile(x, p)`

not_cancelled |> 
  summarize(
    q25_arr_delay = quantile(arr_delay, 0.25, na.rm = TRUE),
    q75_arr_delay = quantile(arr_delay, 0.75, na.rm = TRUE)
  )

Value at percentile p (e.g., 25th percentile)

Range: `range(x)`

not_cancelled |> 
  summarize(dep_time_range = list(range(dep_time, na.rm = TRUE)))

Returns both min and max together

First / last: `first(x)`, `last(x)`

not_cancelled |> 
  arrange(year, month, day, dep_time) |> 
  group_by(year, month, day) |> 
  summarize(
    first_dep = first(dep_time),
    last_dep  = last(dep_time)
  )

First and last values in the current row order
If you need a specific definition of “first”, sort first!

Nth: `nth(x, n)`

not_cancelled |> 
  arrange(year, month, day, dep_time) |> 
  group_by(year, month, day) |> 
  summarize(second_dep = nth(dep_time, 2))

The n-th value in the current order

Conditional Subsetting Inside a Summary

Example: `arr_delay[arr_delay > 0]`

not_cancelled |> 
  group_by(year, month, day) |> 
  summarize(
    avg_delay = mean(arr_delay, na.rm = TRUE),
    avg_pos_delay = mean(arr_delay[arr_delay > 0], na.rm = TRUE)
  )

Inside a summary, you can “filter” a single column using brackets []
arr_delay[arr_delay > 0] means:
keep only the values of arr_delay that are greater than 0

Why do this?

avg_delay can be pulled down by early flights (negative delays)
avg_pos_delay answers: “When flights are late, how late are they on average?”

When do the first and last flights leave each day?

Option A: min / max (robust)

not_cancelled |> 
  group_by(year, month, day) |> 
  summarize(
    first_dep = min(dep_time, na.rm = TRUE),
    last_dep  = max(dep_time, na.rm = TRUE)
  )

Option B: first / last (order-dependent)

not_cancelled |> 
  group_by(year, month, day) |> 
  summarize(
    first_dep = first(dep_time),
    last_dep  = last(dep_time)
  )

Option C: rank within group

not_cancelled |> 
  group_by(year, month, day) |> 
  mutate(r = min_rank(desc(dep_time))) |> 
  filter(r %in% range(r))

Counting helpers

# Which destinations have the most carriers?
not_cancelled |> 
  group_by(dest) |> 
  summarize(n_carriers = n_distinct(carrier)) |> 
  arrange(desc(n_carriers))

# Counts by destination
not_cancelled |> 
  count(dest)

n() counts rows in each group
sum(!is.na(x)) counts non-missing values
n_distinct(x) counts distinct values
count(x) is a shortcut for group_by(x) |> summarize(n = n())

Counts inside ggplot: histogram bins

A histogram is basically “count by bins”.

ggplot(diamonds) +
  geom_histogram(aes(x = carat), binwidth = 0.5)

diamonds |> 
  count(cut_width(carat, 0.5))

Logical Summaries: Counts + Proportions

# How many flights left before 5am?
not_cancelled |> 
  group_by(year, month, day) |> 
  summarize(n_early = sum(dep_time < 500, na.rm = TRUE))

sum(condition) counts how many are TRUE

# What proportion of flights are delayed by more than an hour?
not_cancelled |> 
  group_by(year, month, day) |> 
  summarize(prop_over_60 = mean(arr_delay > 60, na.rm = TRUE))

mean(condition) gives the proportion TRUE (because TRUE = 1, FALSE = 0)

Grouping by Multiple Variables: “Peeling” Groups

Each summarize() removes one layer of grouping by default.

daily <- flights |> group_by(year, month, day)

per_day <- daily |> summarize(flights = n())
per_month <- per_day |> summarize(flights = sum(flights))
per_year <- per_month |> summarize(flights = sum(flights))

per_year

Ungrouping

Use ungroup() when you want to stop doing per-group operations.

daily |> 
  ungroup() |>               # no longer grouped by date
  summarize(flights = n())   # all flights in the dataset

Grouped `mutate()` and `filter()`

Grouping is most common with summarize(), but it can also help with:

filtering within groups (e.g., “worst flights per day”)
creating per-group metrics (e.g., “share of delay within a destination”)

Example 1: “worst” within each group

Goal: find (about) the 10 worst arrival delays within each day.

flights |> 
  group_by(year, month, day) |>
  filter(dense_rank(desc(arr_delay)) <= 10)

Example 2: groups above a threshold

Goal: keep only destinations with lots of flights.

popular_dests <- flights |> 
  group_by(dest) |> 
  filter(n() > 17250)

popular_dests

Example 3: per-group metrics with `mutate()`

Goal: within each destination-day group, compute each flight’s share of total delay.

popular_dests |> 
  filter(arr_delay > 0) |> 
  group_by(year, month, day, dest) |> 
  mutate(prop_delay = arr_delay / sum(arr_delay, na.rm = TRUE)) |> 
  select(year:day, dest, arr_delay, prop_delay)

Grouped `slice_*()`

The `slice_*()` Functions

There are five handy functions that allow you to extract specific rows within each group:

grouped_df |> slice_head(n = 1) takes the first row.
grouped_df |> slice_tail(n = 1) takes the last row.
grouped_df |> slice_min(x, n = 1) takes the row with the smallest value of column x.
grouped_df |> slice_max(x, n = 1) takes the row with the largest value of column x.
grouped_df |> slice_sample(n = 1) takes one random row.
You can vary n to select more than one row.
Instead of n =, you can use prop = 0.1 to select (e.g.) 10% of the rows in each group.

`slice_head()`

flights |> 
  group_by(dest) |> 
  slice_head(n = 2) |> 
  relocate(dest)

flights |> 
  group_by(dest) |> 
  slice_head(prop = 0.02) |> 
  relocate(dest)

slice_head(n = 2) returns the first 2 rows for each destination group.
Useful for quickly previewing the earliest recorded observations per group.

`slice_tail()`

flights |> 
  group_by(dest) |> 
  slice_tail(n = 2) |> 
  relocate(dest)

flights |> 
  group_by(dest) |> 
  slice_tail(prop = 0.02) |> 
  relocate(dest)

slice_tail(n = 2) returns the last 2 rows for each destination group.
Useful for inspecting the most recently recorded observations per group.

`slice_min()`

flights |> 
  group_by(dest) |> 
  slice_min(arr_delay, n = 1) |>
  relocate(dest)

flights |> 
  group_by(dest) |> 
  slice_min(arr_delay, prop = 0.05) |>
  relocate(dest)

slice_min(arr_delay, n = 1) returns the flight with the smallest arrival delay for each destination.
Ties are kept by default.
- Use with_ties = FALSE to return exactly n rows.

`slice_max()`

flights |> 
  group_by(dest) |> 
  slice_max(arr_delay, n = 1) |>
  relocate(dest)

flights |> 
  group_by(dest) |> 
  slice_max(arr_delay, prop = 0.05) |>
  relocate(dest)

slice_max(arr_delay, n = 1) returns the flight with the longest arrival delay for each destination.
Ties are kept by default.
- Use with_ties = FALSE to return exactly n rows.
Great for identifying worst-case observations within each group.

`slice_sample()`

set.seed(1)

flights |> 
  group_by(dest) |> 
  slice_sample(n = 1) |> 
  relocate(dest)

set.seed(1)

flights |> 
  group_by(dest) |> 
  slice_sample(prop = 0.1) |> 
  relocate(dest)

slice_sample(n = 1) draws one random flight from each destination group.
Use prop = 0.1 to randomly sample 10% of rows per group instead of a fixed n.
Results will differ each run — set set.seed() beforehand for reproducibility.

Lecture 3

Reading a Data Table as a data.frame

📍 Absolute Pathnames

✅ Examples (absolute path to custdata_rev.csv)

📎 Relative Pathnames

🔎 Getting to Know a data.frame

Basic structure + a single column

Size of the data.frame

Column names

Quick Summary with skimr::skim()

Tidy data.frame

Data Transformation

Data Transformation with dplyr

Making dplyr Code Flow with the Pipe Operator

Data Transformation with the Pipe

Using the Pipe in RStudio

Filter observations with filter()

Filter Observations with filter()

Logical Conditions with Equality and Inequality

Logical Conditions - Example 1

Logical Operators — AND, OR, NOT

Logical Operations

Logical Conditions - Example 2

Logical Conditions - Example 3

Logical Conditions - Example 4

Missing Values (NA)

Comparing Missing Values

Checking for Missing Values with is.na()

Arrange Observations with arrange()

Arrange Observations with arrange()

Descending Order with desc()

Arrange Observations with arrange() - Example

Find All Unique Observations with distinct()

Find All Unique Observations with distinct()

Find All Unique Combinations of Selected Variables

Select Variables with select()

Select Variables with select()

Remove Variables with select()

Helper Functions with select()

Rename Variables with rename()

Rename Variables with rename()

Relocate Variables with relocate()

Relocate Variables with relocate()

relocate() with .before or .after

Add New Variables with mutate()

Add New Variables with mutate()

mutate() with .before or .after

mutate() with Offsets and Cumulative Functions

mutate() with Ranking Functions

mutate() with if_else() or ifelse()

if_else()

ifelse()

mutate() with case_when()

Grouped Summaries with summarize()

summarize() in one sentence

What Happens When We Group Then Summarize?

Split-Apply-Combine

Example: Destination-level Summaries

Visualize the Relationship

Missing Values: Why na.rm = TRUE Matters

If Missing Means “cancelled”

Counts: always include a size check

Useful Summary Functions: Location (Center)

Mean: mean(x)

Median: median(x)

Useful Summary Functions: Spread (Variation)

Standard deviation: sd(x)

Interquartile range: IQR(x)

Minimum / maximum: min(x), max(x)

Quantiles: quantile(x, p)

Range: range(x)

First / last: first(x), last(x)

Nth: nth(x, n)

Conditional Subsetting Inside a Summary

Example: arr_delay[arr_delay > 0]

Why do this?

When do the first and last flights leave each day?

Option A: min / max (robust)

Option B: first / last (order-dependent)

Option C: rank within group

Reading a Data Table as a `data.frame`

✅ Examples (absolute path to `custdata_rev.csv`)

🔎 Getting to Know a `data.frame`

Size of the `data.frame`

Quick Summary with `skimr::skim()`

Tidy `data.frame`

Data Transformation with `dplyr`

Making `dplyr` Code Flow with the Pipe Operator

Filter observations with `filter()`

Filter Observations with `filter()`

Missing Values (`NA`)

Checking for Missing Values with `is.na()`

Arrange Observations with `arrange()`

Arrange Observations with `arrange()`

Descending Order with `desc()`

Arrange Observations with `arrange()` - Example

Find All Unique Observations with `distinct()`

Find All Unique Observations with `distinct()`

Select Variables with `select()`

Select Variables with `select()`

Remove Variables with `select()`

Helper Functions with `select()`

Rename Variables with `rename()`

Rename Variables with `rename()`

Relocate Variables with `relocate()`

Relocate Variables with `relocate()`

`relocate()` with `.before` or `.after`

Add New Variables with `mutate()`

Add New Variables with `mutate()`

`mutate()` with `.before` or `.after`

`mutate()` with Offsets and Cumulative Functions

`mutate()` with Ranking Functions

`mutate()` with `if_else()` or `ifelse()`

`if_else()`

`ifelse()`

`mutate()` with `case_when()`

Grouped Summaries with `summarize()`

`summarize()` in one sentence

Missing Values: Why `na.rm = TRUE` Matters

Mean: `mean(x)`

Median: `median(x)`

Standard deviation: `sd(x)`

Interquartile range: `IQR(x)`

Minimum / maximum: `min(x)`, `max(x)`

Quantiles: `quantile(x, p)`

Range: `range(x)`

First / last: `first(x)`, `last(x)`

Nth: `nth(x, n)`

Example: `arr_delay[arr_delay > 0]`

Grouped `mutate()` and `filter()`

Grouped `mutate()` and `filter()`

Example 3: per-group metrics with `mutate()`

Grouped `slice_*()`

The `slice_*()` Functions

`slice_head()`

`slice_tail()`

`slice_min()`

`slice_max()`

`slice_sample()`