Code
# install.packages("tidyverse")Data Transformation
YOUR NAME
January 25, 2026
This document introduces the dplyr workflow for cleaning, transforming, and summarizing data in R.
You will learn how to:
select()filter()arrange()mutate()summarise()group_by()We will use the built-in dataset mpg.
Rows: 234
Columns: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
$ model <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
$ displ <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
$ year <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
$ cyl <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
$ trans <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
$ drv <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
$ cty <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
$ hwy <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
$ fl <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
$ class <chr> "compact", "compact", "compact", "compact", "compact", "c…
|>The pipe sends the output of one step into the next step.
# A tibble: 234 Ă— 3
manufacturer model year
<chr> <chr> <int>
1 audi a4 1999
2 audi a4 1999
3 audi a4 2008
4 audi a4 2008
5 audi a4 1999
6 audi a4 1999
7 audi a4 2008
8 audi a4 quattro 1999
9 audi a4 quattro 1999
10 audi a4 quattro 2008
# ℹ 224 more rows
This is equivalent to:
# A tibble: 234 Ă— 3
manufacturer model year
<chr> <chr> <int>
1 audi a4 1999
2 audi a4 1999
3 audi a4 2008
4 audi a4 2008
5 audi a4 1999
6 audi a4 1999
7 audi a4 2008
8 audi a4 quattro 1999
9 audi a4 quattro 1999
10 audi a4 quattro 2008
# ℹ 224 more rows
âś… Use pipes to write code in a clear, step-by-step style.
select() (Choose Columns)# A tibble: 234 Ă— 4
manufacturer model displ hwy
<chr> <chr> <dbl> <int>
1 audi a4 1.8 29
2 audi a4 1.8 29
3 audi a4 2 31
4 audi a4 2 30
5 audi a4 2.8 26
6 audi a4 2.8 26
7 audi a4 3.1 27
8 audi a4 quattro 1.8 26
9 audi a4 quattro 1.8 25
10 audi a4 quattro 2 28
# ℹ 224 more rows
# A tibble: 234 Ă— 4
manufacturer model displ year
<chr> <chr> <dbl> <int>
1 audi a4 1.8 1999
2 audi a4 1.8 1999
3 audi a4 2 2008
4 audi a4 2 2008
5 audi a4 2.8 1999
6 audi a4 2.8 1999
7 audi a4 3.1 2008
8 audi a4 quattro 1.8 1999
9 audi a4 quattro 1.8 1999
10 audi a4 quattro 2 2008
# ℹ 224 more rows
# A tibble: 234 Ă— 9
manufacturer model displ year cyl trans drv fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <chr> <chr>
1 audi a4 1.8 1999 4 auto(l5) f p compact
2 audi a4 1.8 1999 4 manual(m5) f p compact
3 audi a4 2 2008 4 manual(m6) f p compact
4 audi a4 2 2008 4 auto(av) f p compact
5 audi a4 2.8 1999 6 auto(l5) f p compact
6 audi a4 2.8 1999 6 manual(m5) f p compact
7 audi a4 3.1 2008 6 auto(av) f p compact
8 audi a4 quattro 1.8 1999 4 manual(m5) 4 p compact
9 audi a4 quattro 1.8 1999 4 auto(l5) 4 p compact
10 audi a4 quattro 2 2008 4 manual(m6) 4 p compact
# ℹ 224 more rows
filter() (Choose Rows)# A tibble: 47 Ă— 11
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
3 audi a4 2 2008 4 manu… f 20 31 p comp…
4 audi a4 2 2008 4 auto… f 21 30 p comp…
5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
# ℹ 37 more rows
# A tibble: 22 Ă— 11
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi a4 2 2008 4 manu… f 20 31 p comp…
2 audi a4 2 2008 4 auto… f 21 30 p comp…
3 audi a4 3.1 2008 6 auto… f 18 27 p comp…
4 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
5 audi a4 quattro 2 2008 4 auto… 4 19 27 p comp…
6 audi a4 quattro 3.1 2008 6 auto… 4 17 25 p comp…
7 audi a4 quattro 3.1 2008 6 manu… 4 15 25 p comp…
8 subaru impreza a… 2.5 2008 4 auto… 4 20 25 p comp…
9 subaru impreza a… 2.5 2008 4 auto… 4 20 27 r comp…
10 subaru impreza a… 2.5 2008 4 manu… 4 19 25 p comp…
# ℹ 12 more rows
%in%# A tibble: 109 Ă— 11
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
3 audi a4 2 2008 4 manu… f 20 31 p comp…
4 audi a4 2 2008 4 auto… f 21 30 p comp…
5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
# ℹ 99 more rows
arrange() (Sort Rows)# A tibble: 234 Ă— 11
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 dodge dakota pi… 4.7 2008 8 auto… 4 9 12 e pick…
2 dodge durango 4… 4.7 2008 8 auto… 4 9 12 e suv
3 dodge ram 1500 … 4.7 2008 8 auto… 4 9 12 e pick…
4 dodge ram 1500 … 4.7 2008 8 manu… 4 9 12 e pick…
5 jeep grand che… 4.7 2008 8 auto… 4 9 12 e suv
6 chevrolet k1500 tah… 5.3 2008 8 auto… 4 11 14 e suv
7 jeep grand che… 6.1 2008 8 auto… 4 11 14 p suv
8 chevrolet c1500 sub… 5.3 2008 8 auto… r 11 15 e suv
9 chevrolet k1500 tah… 5.7 1999 8 auto… 4 11 15 r suv
10 dodge dakota pi… 5.2 1999 8 auto… 4 11 15 r pick…
# ℹ 224 more rows
# A tibble: 234 Ă— 11
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 volkswagen jetta 1.9 1999 4 manu… f 33 44 d comp…
2 volkswagen new beetle 1.9 1999 4 manu… f 35 44 d subc…
3 volkswagen new beetle 1.9 1999 4 auto… f 29 41 d subc…
4 toyota corolla 1.8 2008 4 manu… f 28 37 r comp…
5 honda civic 1.8 2008 4 auto… f 25 36 r subc…
6 honda civic 1.8 2008 4 auto… f 24 36 c subc…
7 toyota corolla 1.8 1999 4 manu… f 26 35 r comp…
8 toyota corolla 1.8 2008 4 auto… f 26 35 r comp…
9 honda civic 1.8 2008 4 manu… f 26 34 r subc…
10 honda civic 1.6 1999 4 manu… f 28 33 r subc…
# ℹ 224 more rows
# A tibble: 234 Ă— 11
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 chevrolet corvette 5.7 1999 8 manual… r 16 26 p 2sea…
2 chevrolet corvette 6.2 2008 8 manual… r 16 26 p 2sea…
3 chevrolet corvette 6.2 2008 8 auto(s… r 15 25 p 2sea…
4 chevrolet corvette 7 2008 8 manual… r 15 24 p 2sea…
5 chevrolet corvette 5.7 1999 8 auto(l… r 15 23 p 2sea…
6 volkswagen jetta 1.9 1999 4 manual… f 33 44 d comp…
7 toyota corolla 1.8 2008 4 manual… f 28 37 r comp…
8 toyota corolla 1.8 1999 4 manual… f 26 35 r comp…
9 toyota corolla 1.8 2008 4 auto(l… f 26 35 r comp…
10 toyota corolla 1.8 1999 4 auto(l… f 24 33 r comp…
# ℹ 224 more rows
mutate() (Create New Variables)# A tibble: 234 Ă— 12
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
3 audi a4 2 2008 4 manu… f 20 31 p comp…
4 audi a4 2 2008 4 auto… f 21 30 p comp…
5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
# ℹ 224 more rows
# ℹ 1 more variable: hwy_per_liter <dbl>
# A tibble: 234 Ă— 13
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
3 audi a4 2 2008 4 manu… f 20 31 p comp…
4 audi a4 2 2008 4 auto… f 21 30 p comp…
5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
# ℹ 224 more rows
# ℹ 2 more variables: hwy_per_liter <dbl>, cty_per_liter <dbl>
if_else() for conditional logic# A tibble: 2 Ă— 2
big_engine n
<chr> <int>
1 No 148
2 Yes 86
summarise() (Compute Summary Statistics)# A tibble: 1 Ă— 4
mean_hwy sd_hwy max_hwy min_hwy
<dbl> <dbl> <int> <int>
1 23.4 5.95 44 12
âś… Tip: Use na.rm = TRUE if your data has missing values.
group_by() + summarise() (Grouped Summaries)# A tibble: 3 Ă— 2
drv n
<chr> <int>
1 4 103
2 f 106
3 r 25
count() (Fast Frequency Tables)# A tibble: 7 Ă— 2
class n
<chr> <int>
1 suv 62
2 compact 47
3 midsize 41
4 subcompact 35
5 pickup 33
6 minivan 11
7 2seater 5
With two variables:
# A tibble: 12 Ă— 3
class drv n
<chr> <chr> <int>
1 suv 4 51
2 midsize f 38
3 compact f 35
4 pickup 4 33
5 subcompact f 22
6 compact 4 12
7 minivan f 11
8 suv r 11
9 subcompact r 9
10 2seater r 5
11 subcompact 4 4
12 midsize 4 3
distinct() (Unique Values)# A tibble: 15 Ă— 1
manufacturer
<chr>
1 audi
2 chevrolet
3 dodge
4 ford
5 honda
6 hyundai
7 jeep
8 land rover
9 lincoln
10 mercury
11 nissan
12 pontiac
13 subaru
14 toyota
15 volkswagen
Unique combinations:
# A tibble: 30 Ă— 2
manufacturer year
<chr> <int>
1 audi 1999
2 audi 2008
3 chevrolet 2008
4 chevrolet 1999
5 dodge 1999
6 dodge 2008
7 ford 1999
8 ford 2008
9 honda 1999
10 honda 2008
# ℹ 20 more rows
slice_*() (Pick Specific Rows)# A tibble: 5 Ă— 11
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa…
2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa…
3 audi a4 2 2008 4 manual(m6) f 20 31 p compa…
4 audi a4 2 2008 4 auto(av) f 21 30 p compa…
5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa…
# A tibble: 5 Ă— 11
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 volkswagen passat 2 2008 4 auto(s6) f 19 28 p mids…
2 volkswagen passat 2 2008 4 manual(m6) f 21 29 p mids…
3 volkswagen passat 2.8 1999 6 auto(l5) f 16 26 p mids…
4 volkswagen passat 2.8 1999 6 manual(m5) f 18 26 p mids…
5 volkswagen passat 3.6 2008 6 auto(s6) f 17 26 p mids…
# A tibble: 10 Ă— 11
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 volkswagen jetta 1.9 1999 4 manu… f 33 44 d comp…
2 volkswagen new beetle 1.9 1999 4 manu… f 35 44 d subc…
3 volkswagen new beetle 1.9 1999 4 auto… f 29 41 d subc…
4 toyota corolla 1.8 2008 4 manu… f 28 37 r comp…
5 honda civic 1.8 2008 4 auto… f 25 36 r subc…
6 honda civic 1.8 2008 4 auto… f 24 36 c subc…
7 toyota corolla 1.8 1999 4 manu… f 26 35 r comp…
8 toyota corolla 1.8 2008 4 auto… f 26 35 r comp…
9 honda civic 1.8 2008 4 manu… f 26 34 r subc…
10 honda civic 1.6 1999 4 manu… f 28 33 r subc…
across() (Apply Functions to Many Columns)Example: compute means for multiple numeric columns.
# A tibble: 1 Ă— 3
cty hwy displ
<dbl> <dbl> <dbl>
1 16.9 23.4 3.47
You can combine with group_by():
# A tibble: 7 Ă— 3
class cty hwy
<chr> <dbl> <dbl>
1 2seater 15.4 24.8
2 compact 20.1 28.3
3 midsize 18.8 27.3
4 minivan 15.8 22.4
5 pickup 13 16.9
6 subcompact 20.4 28.1
7 suv 13.5 18.1
Joins combine tables using a matching key column.
We will create two small example tables.
left_join() (most common)inner_join()anti_join() (find non-matches)Goal: find the best average highway MPG by car class (with sample size).
# A tibble: 7 Ă— 3
class mean_hwy n
<chr> <dbl> <int>
1 compact 28.3 47
2 subcompact 28.1 35
3 midsize 27.3 41
4 2seater 24.8 5
5 minivan 22.4 11
6 suv 18.1 62
7 pickup 16.9 33
select() to keep only manufacturer, model, year, hwy.drv == "f" and year == 2008.hwy (descending).efficiency = hwy / displ.efficiency.hwy by class and sort from highest to lowest.hwy.Starter code:
---
title: "dplyr Basics"
subtitle: "Data Transformation"
author: "YOUR NAME"
date: last-modified
format:
html:
toc: true
number-sections: true
code-fold: true
code-tools: true
code-summary: "Show the code"
highlight-style: atom-one
execute:
echo: true
warning: false
message: false
---
# Getting Started
This document introduces the **dplyr** workflow for cleaning, transforming, and summarizing data in R.
You will learn how to:
- select columns with `select()`
- filter rows with `filter()`
- sort data with `arrange()`
- create new variables with `mutate()`
- summarize data with `summarise()`
- group operations with `group_by()`
- combine datasets with joins
<br>
# Setup
## Install (one-time)
```{r}
# install.packages("tidyverse")
```
## Load packages
```{r}
library(tidyverse)
```
<br>
# A Quick Look at a Dataset
We will use the built-in dataset **`mpg`**.
```{r}
mpg |> glimpse()
```
<br>
# The Pipe Operator `|>`
The pipe sends the output of one step into the next step.
```{r}
mpg |>
select(manufacturer, model, year)
```
This is equivalent to:
```{r}
select(mpg, manufacturer, model, year)
```
âś… Use pipes to write code in a clear, step-by-step style.
<br>
# `select()` (Choose Columns)
## Select a few columns
```{r}
mpg |>
select(manufacturer, model, displ, hwy)
```
## Select a range of columns
```{r}
mpg |>
select(manufacturer:year)
```
## Remove columns
```{r}
mpg |>
select(-cty, -hwy)
```
<br>
# `filter()` (Choose Rows)
## Filter by a condition
```{r}
mpg |>
filter(class == "compact")
```
## Multiple conditions with AND
```{r}
mpg |>
filter(class == "compact", year == 2008)
```
## OR conditions using `%in%`
```{r}
mpg |>
filter(class %in% c("compact", "suv"))
```
<br>
# `arrange()` (Sort Rows)
## Sort ascending
```{r}
mpg |>
arrange(hwy)
```
## Sort descending
```{r}
mpg |>
arrange(desc(hwy))
```
## Sort by multiple variables
```{r}
mpg |>
arrange(class, desc(hwy))
```
<br>
# `mutate()` (Create New Variables)
## Create a new variable
```{r}
mpg |>
mutate(hwy_per_liter = hwy / displ)
```
## Create multiple new variables
```{r}
mpg |>
mutate(
hwy_per_liter = hwy / displ,
cty_per_liter = cty / displ
)
```
## Use `if_else()` for conditional logic
```{r}
mpg |>
mutate(
big_engine = if_else(displ >= 4, "Yes", "No")
) |>
count(big_engine)
```
<br>
# `summarise()` (Compute Summary Statistics)
## One summary value
```{r}
mpg |>
summarise(mean_hwy = mean(hwy))
```
## Multiple summaries
```{r}
mpg |>
summarise(
mean_hwy = mean(hwy),
sd_hwy = sd(hwy),
max_hwy = max(hwy),
min_hwy = min(hwy)
)
```
âś… Tip: Use `na.rm = TRUE` if your data has missing values.
<br>
# `group_by()` + `summarise()` (Grouped Summaries)
## Mean highway MPG by class
```{r}
mpg |>
group_by(class) |>
summarise(mean_hwy = mean(hwy), .groups = "drop") |>
arrange(desc(mean_hwy))
```
## Count rows by group
```{r}
mpg |>
group_by(drv) |>
summarise(n = n(), .groups = "drop")
```
<br>
# `count()` (Fast Frequency Tables)
```{r}
mpg |>
count(class, sort = TRUE)
```
With two variables:
```{r}
mpg |>
count(class, drv, sort = TRUE)
```
<br>
# `distinct()` (Unique Values)
```{r}
mpg |>
distinct(manufacturer)
```
Unique combinations:
```{r}
mpg |>
distinct(manufacturer, year)
```
<br>
# `slice_*()` (Pick Specific Rows)
## Top rows
```{r}
mpg |>
slice_head(n = 5)
```
## Bottom rows
```{r}
mpg |>
slice_tail(n = 5)
```
## Top rows by a variable
```{r}
mpg |>
arrange(desc(hwy)) |>
slice_head(n = 10)
```
<br>
# `across()` (Apply Functions to Many Columns)
Example: compute means for multiple numeric columns.
```{r}
mpg |>
summarise(
across(c(cty, hwy, displ), mean)
)
```
You can combine with `group_by()`:
```{r}
mpg |>
group_by(class) |>
summarise(
across(c(cty, hwy), mean),
.groups = "drop"
)
```
<br>
# Joining Data (Combining Tables)
Joins combine tables using a matching key column.
We will create two small example tables.
```{r}
students <- tibble(
id = c(101, 102, 103),
name = c("Alex", "Bella", "Chris")
)
grades <- tibble(
id = c(101, 103),
score = c(95, 88)
)
```
## `left_join()` (most common)
```{r}
students |>
left_join(grades, by = "id")
```
## `inner_join()`
```{r}
students |>
inner_join(grades, by = "id")
```
## `anti_join()` (find non-matches)
```{r}
students |>
anti_join(grades, by = "id")
```
<br>
# A Typical dplyr Workflow Example
Goal: find the best average highway MPG by car class (with sample size).
```{r}
mpg |>
group_by(class) |>
summarise(
mean_hwy = mean(hwy),
n = n(),
.groups = "drop"
) |>
arrange(desc(mean_hwy))
```
<br>
# Practice Problems âś…
1. Use `select()` to keep only `manufacturer`, `model`, `year`, `hwy`.
2. Filter to only cars with `drv == "f"` and `year == 2008`.
3. Sort the filtered data by `hwy` (descending).
4. Create a new variable `efficiency = hwy / displ`.
5. Compute the mean of `efficiency`.
6. Compute mean `hwy` by `class` and sort from highest to lowest.
7. Find the top 5 classes by mean `hwy`.
<br>
# Challenge đź’ˇ (Top Manufacturers)
1. Count how many rows each manufacturer has.
2. Keep only manufacturers with **at least 20 cars** in the dataset.
3. Compute mean highway MPG by manufacturer.
4. Sort the results from highest to lowest.
Starter code:
```{r}
mpg |>
count(manufacturer, sort = TRUE)
```