Lecture 8

Data Transformation Basics

Byeong-Hak Choe

SUNY Geneseo

February 24, 2025

Data Transformation

It is rare that we get the data in exactly the right form we need.
We’ll need to create some new variables, summaries, rename the variables or reorder the observations in order to make the data a little easier to work with.
We’ll discuss how to transform data using the dplyr package and a dataset on flights departing New York City in 2013.

library(tidyverse)
library(nycflights13)
library(skimr)
flights <- flights
View(flights) 
?flights
skim(flights)

Data Transformation

`dplyr` basics

We are going to discuss the five key dplyr functions to solve various data manipulation challenges:
- Filter observations by logical conditions about values of variables (filter()).
- Arrange/sort rows (arrange()).
- Select variables by their names (select()).
- Create new variables with functions of existing variables (mutate()).
- Collapse a data.frame down to a summarized version of it (summarize()).
- Group a data.frame by a categorical variable (group_by()).

Data Transformation

`dplyr` basics

filter(<DATA.FRAME>, LOGICAL_STATEMENT)
arrange(<DATA.FRAME>, VARIABLES)
select(<DATA.FRAME>, VARIABLES)
group_by(<DATA.FRAME>, VARIABLES)
mutate(<DATA.FRAME>, NEW_VARIABLE = ... )
summarize(<DATA.FRAME>, NEW_VARIABLE = ...)
The first argument is a data.frame.
The subsequent arguments describe what to do with the data.frame, mostly using the variable names.
The result is also a data.frame.

Data Transformation

Because the first argument is a data.frame and the output is a data.frame, dplyr verbs work well with the pipe, |>
- Ctrl + Shift + M for Windows; command + Shift + M for Mac.
The pipe (|>) takes the thing on its left and passes it along to the function on its right so that
- f(x, y) is equivalent to x |> f(y).
- e.g., filter(<DATA.FRAME>, LOGICAL_STATEMENT) is equivalent to <DATA.FRAME> |> filter(LOGICAL_STATEMENT).
The easiest way to pronounce the pipe (|>) is “then”.
- The pipe (|>) is super useful when we have a chain of data transforming operations to do.

`filter()`

Filter rows with `filter()`

filter() allows us to subset observations based on the value of logical conditions, which are either TRUE or FALSE.

filter(flights, month == 1, day == 1)
flights |> filter(month == 1, day == 1) ## equilvant to the above

jan1 <- flights |> 
  filter(month == 1, day == 1)

dec25 <- flights |> 
  filter(month == 12, day == 25)

class(flights$month == 1)

Filter rows with `filter()`

`logical` conditions

Conditions are expressions that evaluate as logical

Operator	Description
x == y	Is x equal to y?
x != y	Is x not equal to y?
x > y	Is x greater than y?
x >= y	Is x greater than or equal to y?
x < y	Is x less than y?
x <= y	Is x less than or equal to y?

Filter rows with `filter()`

`logical` conditions

The == is an operator that compares the objects on either side and returns TRUE if they have the same values.

df <- tibble( var = c(9, 10, 11) ) 
df |> filter(var == 10)

cond <- num_vec == 10
df |> filter(cond)

Q. What does ! (!F) evaluate to?

Filter rows with `filter()`

De Morgan’s law

!(x & y) is the same as !x | !y.
!(x | y) is the same as !x & !y.
Comparison and logical operators can be used with filter():

filter(flights, month == 11 | month == 12)
nov_dec <- flights |> 
  filter(month %in% c(11, 12))

flights |> 
  filter(!(arr_delay > 120 | dep_delay > 120))

flights |> 
  filter(arr_delay <= 120 & dep_delay <= 120)

Filter rows with `filter()`

Missing values (`NA`)

Almost any operation involving an unknown value (NA) will also be unknown.

NA > 5
10 == NA
NA + 10
NA / 2

NA == NA

Let x be Mary’s age. We don’t know how old she is.
Let y be John’s age. We don’t know how old he is.
Are John and Mary the same age?

x <- NA
y <- NA
x == y

If we want to determine if a value is missing, use is.na().
If we want to preserve missing values, ask filter() for them explicitly.

x <- NA
is.na(x) ## is x NA?

y <- "missing"
is.na(y) ## is y NA?

df <- tibble(y = c(1, NA, 3))

df |> 
  filter(y > 1)

df |> 
  filter(is.na(y) | y > 1)

Filter rows with `filter()`

Exercises

Find all flights that
- Had an arrival delay of two or more hours
- Flew to Houston (IAH or HOU)
- Were operated by United, American, or Delta
- Departed in summer (July, August, and September)
- Arrived more than two hours late, but didn’t leave late
- Were delayed by at least an hour, but made up over 30 minutes in flight
- Departed between midnight and 6am (inclusive)

`arrange()`

Arrange rows with `arrange()`

arrange() changes their order.
If we provide more than one column name, each additional column will be used to break ties in the values of preceding columns.
Use desc() to re-order by a column in descending order.

flights |> arrange(year, month, day)

## re-order observations by dep_delay in descending order.
flights |> arrange([?])

Arrange rows with `arrange()`

Exercises

How could you use arrange() to sort all missing values to the start? (Hint: use is.na()).
Sort flights to find the most delayed flights. Find the flights that left earliest.
Sort flights to find the fastest (highest speed) flights.
Which flights traveled the farthest? Which traveled the shortest?

`select()`

Select columns with `select()`

It’s not uncommon to get datasets with hundreds or thousands of variables.
select() allows us to narrow in on the variables we’re actually interested in.

flights |> 
  select(year, month, day)

flights |> 
  select(year:day)

flights |> 
  select(-(year:day))

Select columns with `select()`

There are a number of helper functions we can use within select():
starts_with("abc"): matches names that begin with “abc”.
ends_with("xyz"): matches names that end with “xyz”.
contains("ijk"): matches names that contain “ijk”.
num_range("x", 1:3): matches x1, x2 and x3.

Select columns with `select()`

Rename variables with `rename()`

rename() can be used to rename variables:
- rename(<DATA>, <NEW_VARIABLE> = <EXISTING_VARIABLE>)

flights |> 
  rename( tail_num = tailnum )

We can use relocate() to move variables around.
We can also use select() with everything() to move variables around.

flights |> 
  relocate(time_hour, air_time)

flights |> 
  select(time_hour, air_time, everything())

We can also use relocate() to move variables around.
- We can specify .before and .after arguments to choose where to put variables:

flights |> 
  relocate(year:dep_time, .after = time_hour)

flights |> 
  relocate(starts_with("arr"), .before = dep_time)

`mutate()`

Add new variables with `mutate()`

mutate()
.before
.after

We can use the mutate() to add variables to the data.frame:

flights |> 
  mutate(
    gain = dep_delay - arr_delay,
    speed = distance / air_time * 60
  )

We can use the .before argument to add the variables to the position of a column:

flights |> 
  mutate(
    gain = dep_delay - arr_delay,
    speed = distance / air_time * 60,
    .before = 1  ## try different position numbers.
  )

The . is a sign that .before is an argument to the function, not the name of variable.

In both .before and .after, we can use the variable name instead of a position number.

flights |> 
  mutate(
    gain = dep_delay - arr_delay,
    speed = distance / air_time * 60,
    .after = day  
  )

Add new variables with `mutate()`

There are many functions for creating new variables that we can use with mutate().
Offsets: lead() and lag() allow us to refer to leading or lagging values.
Cumulative and rolling aggregates: R provides functions for running sums, products, mins and maxes: cumsum(), cummean(), cummin(), cummax(), and cumprod().

(x <- 1:10)
lag(x)
lead(x)
cumsum(x)
cummean(x)

Add new variables with `mutate()`

Useful creation functions

There are many functions for creating new variables that we can use with mutate().
Ranking: min_rank(), dense_rank(), row_number(), percent_rank(), and cume_dist().

rank_me <- tibble( x = c(10, 5, 1, 5, 5, NA) )
rank_me <- rank_me |> 
  mutate(id_x = row_number(x),
         x_min_rank = min_rank(x),
         x_dense_rank = dense_rank(x)
         )

Add new variables with `mutate()`

Useful creation functions (cont.)

To create new variables based on a condition, we can use the if_else() and ifelse() functions.
- if_else(<condition>, <if TRUE>, <else>)
- ifelse(<condition>, <if TRUE>, <else>)

flight_season <- flights |>  mutate(
    summer_month = if_else(month %in% c(6, 7, 8), TRUE, FALSE))

flight_season <- flights |>  mutate(
    summer_month = if_else(month %in% c(6, 7, 8), TRUE, 0))

flight_season <- flights |>  mutate(  ## ifelse() is less restrictive
    summer_month = ifelse(month %in% c(6, 7, 8), TRUE, 0))

`filter()`, `arrange()`, `select()`, and `mutate()`

Exercises

Find all flights that arrived more than two hours late, but didn’t leave late.
Sort flights to find the fastest (highest speed) flights.
Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.
Find the 10 most delayed flights using a ranking function. How do you want to handle ties?

`summarize()`

Grouped summaries with `summarize()`

summarize() collapses a data frame to a single row.
na.rm argument removes the missing values prior to computation when using summarize() or other aggregate functions.
summarize() is not terribly useful unless we pair it with group_by().

flights |> 
  summarize(delay = mean(dep_delay, na.rm = TRUE))

by_day <- flights |> group_by(year, month, day)
by_day |> summarize(delay = mean(dep_delay, na.rm = TRUE))

Grouped summaries with `summarize()`

Example

Explore the relationship between the mean level distance and the mean level arr_delay for each value of dest.

by_dest <- flights |> group_by(dest)
delay <- by_dest |> summarize(
  count = n(),
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE)
)
delay <- delay |> 
  filter(count > 20, dest != "HNL")
ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
  geom_point(aes(size = count), alpha = 1/3) +
  geom_smooth(se = FALSE)

Grouped summaries with `summarize()`

Missing values

What happens if we don’t set na.rm?

flights |> 
  group_by(year, month, day) |> 
  summarize(mean = mean(dep_delay))
  
flights |> 
  group_by(year, month, day) |> 
  summarize(mean = mean(dep_delay, na.rm = TRUE))

Grouped summaries with `summarize()`

Missing values

Suppose missing values represent cancelled flights.

not_cancelled <- flights |> 
  filter(!is.na(dep_delay), !is.na(arr_delay))

not_cancelled |> 
  group_by(year, month, day) |> 
  summarize(mean = mean(dep_delay))

Grouped summaries with `summarize()`

Counts

Let’s look at the planes (identified by their tail number (tailnum)) that have the highest average delays

delays <- not_cancelled |> 
  group_by(tailnum) |> 
  summarize(
    delay = mean(arr_delay)
  )

ggplot(data = delays, mapping = aes(x = delay)) + 
  geom_freqpoly(binwidth = 10)

Grouped summaries with `summarize()`

Counts

When doing any aggregation, it’s always a good idea to include either a count (n()), or a count of non-missing values (sum(!is.na(x))).

delays <- not_cancelled |> 
  group_by(tailnum) |> 
  summarize( delay = mean(arr_delay, na.rm = TRUE),
             n = n() )

ggplot(data = delays, mapping = aes(x = n, y = delay)) + 
  geom_point(alpha = 1/10)

delays |> 
  filter(n > 25) |> 
  ggplot(mapping = aes(x = n, y = delay)) + 
    geom_point(alpha = 1/10)

Grouped summaries with `summarize()`

Useful summary functions

Measures of location: mean(x) and median(x).
Measures of spread: sd(x), IQR(x).

not_cancelled |> 
  group_by(year, month, day) |> 
  summarize( avg_delay1 = mean(arr_delay),
             avg_delay2 = mean(arr_delay[arr_delay > 0]) ) ## the average positive delay
## Brackets[] lets us filter (or subset) data from a column
## Why is distance to some destinations more variable than to others?
not_cancelled |> 
  group_by(dest) |> 
  summarize(distance_sd = sd(distance)) |> 
  arrange(desc(distance_sd))

Grouped summaries with `summarize()`

Useful summary functions

Measures of rank: min(x), quantile(x, 0.25), max(x), range(x).
Measures of position: first(x), nth(x, 2), last(x).
When do the first and last flights leave each day?

not_cancelled |> 
  group_by(year, month, day) |> 
  summarize(first = min(dep_time), 
            last = max(dep_time) )

not_cancelled |> 
  group_by(year, month, day) |> 
  summarize(first_dep = first(dep_time),last_dep = last(dep_time) )

not_cancelled |> 
  group_by(year, month, day) |> 
  mutate(r = min_rank(desc(dep_time))) |> 
  filter(r %in% range(r) )

Grouped summaries with `summarize()`

Useful summary functions

Measures of count: sum(!is.na(x)), n_distinct(x), count(x), and count(x, wt = VARIABLE).

## Which destinations have the most carriers?
not_cancelled |> 
  group_by(dest) |> 
  summarise(carriers = n_distinct(carrier)) |> 
  arrange(desc(carriers))
  
  
not_cancelled |> 
  count(dest)

not_cancelled |> 
  count(tailnum, wt = distance)

Grouped summaries with `summarize()`

Useful summary functions

We can compute the height of bars in a histogram by combining dplyr::count() and ggplot2::cut_width():

ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.5)

diamonds |> 
  count(cut_width(carat, 0.5))

Grouped summaries with `summarize()`

Useful summary functions

Counts and proportions of logical values: sum(x > 10), mean(y == 0).

## How many flights left before 5am? 
## (these usually indicate delayed flights from the previous day)
not_cancelled |> 
  group_by(year, month, day) |> 
  summarize(n_early = sum(dep_time < 500))

## What proportion of flights are delayed by more than an hour?
not_cancelled |> 
  group_by(year, month, day) |> 
  summarize(hour_prop = mean(arr_delay > 60))

Grouped summaries with `summarize()`

Grouping by multiple variables

When you group by multiple variables, each summary peels off one level of the grouping.

daily <- flights |> group_by(year, month, day)

(per_day   <- daily |> summarize(flights = n()))

(per_month <- per_day |> summarize(flights = sum(flights)))

(per_year  <- per_month |> summarize(flights = sum(flights)))

Grouped summaries with `summarize()`

Ungrouping

If you need to remove grouping, and return to operations on ungrouped data, use ungroup().

daily |> 
  ungroup() |>             ## no longer grouped by date
  summarize(flights = n())  ## all flights

Grouped summaries with `summarize()`

Exercises

Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. Consider the following scenarios:
- A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.
- A flight is always 10 minutes late.
- A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.
- 99% of the time a flight is on time. 1% of the time it’s 2 hours late.

Q. Which is more important: arrival delay or departure delay?

Grouped summaries with `summarize()`

Exercises

Come up with another approach that will give you the same output as not_cancelled |> count(dest) and not_cancelled |> count(tailnum, wt = distance) (without using count()).
Our definition of cancelled flights (is.na(dep_delay) | is.na(arr_delay) ) is slightly suboptimal. Why? Which is the most important column?
Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay?

Grouped summaries with `summarize()`

Exercises

Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights |> group_by(carrier, dest) |> summarize(n()))
What does the sort argument to count() do. When might you use it?

Grouped mutates

Grouped mutates (and filters)

Grouping is most useful in conjunction with summarize(), but you can also do convenient operations with mutate() and filter().

worst
threshold
per group metrics

Find the worst members of each group:

flights |> 
  group_by(year, month, day) |>
  filter(dense_rank(desc(arr_delay)) < 10)

Find all groups bigger than a threshold:

popular_dests <- flights |> 
  group_by(dest) |> 
  filter(n() > 17250)
popular_dests

We can compute per group metrics:

popular_dests |> 
  filter(arr_delay > 0) |> 
  mutate(prop_delay = arr_delay / sum(arr_delay)) |> 
  select(year:day, dest, arr_delay, prop_delay)

Grouped mutates (and filters)

Exercises

Refer back to the lists of useful mutate and filtering functions. Describe how each operation changes when you combine it with grouping.
Which plane (tailnum) has the worst on-time record?
What time of day should you fly if you want to avoid delays as much as possible?
For each destination, compute the total minutes of delay. For each flight, compute the proportion of the total delay for its destination.

Grouped mutates (and filters)

Exercises

Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave. Using lag(), explore how the delay of a flight is related to the delay of the immediately preceding flight.
Look at each destination. Can you find flights that are suspiciously fast? (i.e. flights that represent a potential data entry error). Compute the air time of a flight relative to the shortest flight to that destination. Which flights were most delayed in the air?

Grouped mutates (and filters)

Exercises

Find all destinations that are flown by at least two carriers. Use that information to rank the carriers.
For each plane, count the number of flights before the first delay of greater than 1 hour.

Tidy Data

What are tidy data?

There are three rules which make a dataset tidy:
1. Each variable must have its own column.
2. Each observation must have its own row.
3. Each value must have its own cell.

Tidy data

We can represent the same underlying data in multiple ways.

library(tidyverse)
table1
table2
table3

## Spread across two tibbles
table4a  ## cases
table4b

Pivoting

The first step is always to figure out what form of the variables and observations we need.
The second step is to resolve one of two common problems:
- One variable might be spread across multiple columns.
- One observation might be scattered across multiple rows.
To fix these problems, we may need the two most important functions in the tidyr package: pivot_longer() and pivot_wider().

Pivoting

Longer

Some of the column names are not names of variables, but values of a variable.

table4a

We use pivot_longer() when a variable might be spread across multiple columns.

Pivoting

Longer

To tidy a dataset like table4a, we need to pivot the offending columns into a new pair of variables.
To use pivot_longer(), we need to the following three parameters:
- The set of columns (cols) whose names are values, not variables.
- The name of the variable to move the column names to (names_to).
- The name of the variable to move the column values to (values_to).

table4a |> 
  pivot_longer(cols = c(`1999`, `2000`), 
               names_to = "year", 
               values_to = "cases")

Pivoting

Longer

Pivoting

Wider

One observation might be scattered across multiple rows.

table2

We use pivot_wider() when an observation is scattered across multiple rows.

Pivoting

Wider

To use pivot_wider(), we need two parameters:
- The column to take variable names from.
- The column to take values from.
We can add the optimal parameter, names_prefix, for the prefix of column names.

table2
table2 |>
    pivot_wider(names_from = type, values_from = count)

Pivoting

Wider

Separating and Uniting

`separate`

table3 has one column (rate) that contains two variables (cases and population).
separate() takes the name of the column to separate, and the names of the columns to separate into.

table3

table3 |> 
  separate(rate, into = c("cases", "population"))

Separating and Uniting

`separate`

If we want to use a specific chracter to seprate a column, we can pass the character to the sep parameter.

table3 |> 
  separate(rate, into = c("cases", "population"), sep = "/")

Separating and Uniting

`separate`

By default, separate() leaves the type of the column as is.

table3 |> 
  separate(rate, into = c("cases", "population"))

Separating and Uniting

`separate`

separate() will interpret the integers as positions to split at.
- We can also pass a vector of integers to sep.
- separate() will interpret the integer as positions to split at.

table3 |> 
  separate(year, into = c("century", "year"), sep = 2)

Separating and Uniting

`unite`

unite() combines multiple columns into a single column.
The default will place an underscore (_) between the values from different columns.

table5 |> 
  unite(new, century, year)
  
table5 |> 
  unite(new, century, year, sep = "")

Data Assembly

rbind() can row-bind data.frames.
cbind() can column-bind data.frames.

df1 <- data.frame(id = 1:3, 
                  name = c("Alice", "Bob", "Charlie")
                  )
df2 <- data.frame(id = 4:6, 
                  name = c("Dave", "Eve", "Frank")
                  )

df_rbind <- rbind(df1, df2)
df_cbind <- cbind(df1, df2) ## better to rename id before cbind()

Data Assembly

rbind() does not work if the two data.frames have different names of columns.

df1 <- data.frame(id1 = 1:3, 
                  name = c("Alice", "Bob", "Charlie")
                  )
df2 <- data.frame(id2 = 4:6, 
                  name = c("Dave", "Eve", "Frank")
                  )

df_rbind <- rbind(df1, df2)

Data Assembly

cbind() does not work if the two data.frames have different numbers of rows.

df1 <- data.frame(id1 = 1:5, 
                  name = c("Alice", "Bob", "Charlie", "Dan", "Eva")
                  )
df2 <- data.frame(id2 = 6:8, 
                  name = c("Dave", "Eve", "Frank")
                  )

df_cbind <- cbind(df1, df2)

Relational Data

Relational data

It’s rare that a data analysis involves only a single data frame.
Collectively, multiple data frames are called relational data.
To work with relational data, we need verbs that work with pairs of data frames.
join methods add new variables to one data frame from matching observations in another data frame.

Relational data

`nycflights13`

nycflights13 contains four data frames that are related to the data frame, flights, that we used in data transformation.

library(nycflights13)
flights
airlines
airports
planes
weather

Relational data

`nycflights13`

flights connects to …
planes via a single variable, tailnum.
airlines through the carrier variable.
airports in two ways: via the origin and dest variables.
weather via origin (the location), and year, month, day and hour (the time).

Relational data

`nycflights13`

Relational data

Key variables

A key variable (or a set of key variables) is a variable (or a set of variables) that uniquely identifies an observation.
- So, a key variable (or a set of key variables) is used to connect relational data.frames.
- The name of a key variable can be different across relational data.frames.

Relational data

A join allows us to combine two data frames via key variables.
- It first matches observations by their keys, then copies across variables from one data.frame to the other.

Tidyverse	SQL	Description
left	left outer	Keep all the observations from the left
right	right outer	Keep all the observations from the right
full	full outer	Keep all the observations from both left and right
inner	inner	Keep only the observations whose key values exist in both

Relational data

Joins

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     3, "x3" )

y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     4, "y3" )

The colored column represents the “key” variable.
The grey column represents the “value” column.

Joins

inner
left
right
full

An inner join matches pairs of observations whenever their keys are equal:

join_inner <- x |> 
  inner_join(y)

A left join keeps all observations in x.

join_left <- x |> 
  left_join(y)

The most commonly used join is the left join.

A right join keeps all observations in y.

join_right <- x |> 
  right_join(y)

A full join keeps all observations in x and y.

join_full <- x |> 
  full_join(y)

Joins

Duplicate keys

Relations
one-to-many
many-to-many

Relations between data.frames could be …
- One-to-one;
- One-to-many (e.g., airlines and airports);
- Many-to-one;
- Many-to-many (e.g., flights and airplanes).
One-to-many relation is such that one data.frame has duplicate keys
Many-to-many relation is such that both data.frames have duplicate keys.
- In practice, it is recommended to avoid joining data.frames with many-to-many relation.

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     2, "x3",
     1, "x4")

y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2")

x |> 
  left_join(y)

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     2, "x3",
     3, "x4" )

y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     2, "y3",
     3, "y4" )
x |> 
  left_join(y)

Joins

Defining the key columns

by = "a": uses only variable a.
by = c("a" = "b"): match variable a in data frame x to variable b in data frame y.

flights2 |> 
  left_join(weather)
  
flights2 |> 
  left_join(planes, 
            by = "tailnum")

flights2 |> 
  left_join(airports, 
            c("dest" = "faa"))
  
flights2 |> 
  left_join(airports, 
            c("origin" = "faa"))

Factors

Creating factors

In R, factors are categorical variables, variables that have a fixed and known set of possible values.

x1 <- c("Dec", "Apr", "Jan", "Mar")

Using a string to record variable x1 has two problems:
1. There are only twelve possible months, and there’s nothing saving us from typos.
2. It doesn’t sort in a useful way.

x2 <- c("Dec", "Apr", "Jam", "Mar")
sort(x1)

Factors

Creating factors with `factor()`

We can fix both of these problems with factor().
To create a factor, we must start by creating a list of the valid levels.
Any values not in the set will be silently converted to NA.
If we omit the levels, they’ll be taken from the data in alphabetical order:

months <- c(
  "Jan", "Feb", "Mar", "Apr", 
  "May", "Jun", "Jul", "Aug", 
  "Sep", "Oct", "Nov", "Dec")
x1 <- 
  c("Dec", "Apr", "Jan", "Mar")
y1 <- factor(x1, 
             levels = months)
sort(y1)

x2 <- 
  c("Dec", "Apr", "Jam", "Mar")

y2 <- factor(x2, 
             levels = months)
y2

factor(x1)

Factors

Creating factors with `factor()`

Sometimes we’d prefer that the order of the levels match the order of the first appearance in the data.
We can do that when creating the factor by setting levels to unique().
If we ever need to access the set of valid levels directly, we can do so with levels().

x1

f1 <- factor(x1, levels = unique(x1))
f1

levels(f1)

Factors

We’re going to focus on the data frame, forcats::gss_cat.which is a sample of data from the General Social Survey.
When factors are stored in a data frame, we can see them with count().

gss_cat

gss_cat |>
  [?](race)

Factors

Modifying factor order

It’s often useful to change the order of the factor levels in a visualization.
Imagine we want to explore the average number of hours spent watching TV per day across relig:

relig_summary <- gss_cat |>
  group_by([?]) |>
  summarize(
    age = [?](age, [?]),
    tvhours = [?](tvhours, [?]),
    n = [?]
  )

ggplot(relig_summary, 
       aes(tvhours, relig)) + 
  [?]()

Factors

Modifying factor order

We can reorder the levels using fct_reorder(f, x, fun), which can take three arguments.
f: the factor whose levels we want to modify.
x: a numeric vector that we want to use to reorder the levels.
Optionally, fun: a function that’s used if there are multiple values of x for each value of f. The default value is median.

relig_summary |>
  mutate(relig = [?](relig, tvhours)) |>
  ggplot(aes(tvhours, relig)) + [?]()

Factors

Modifying factor order

We can use relevel() to set the first level (reference level).
relevel(x, ref = ...) takes at least the two arguments:
- x: factor variable
- ref: reference level or first level

rincome_summary <- gss_cat |>
  group_by([?]) |>
  summarize(
    age = [?](age, [?]),
    tvhours = [?](tvhours, [?]),
    n = [?]
  )

ggplot(rincome_summary, 
       aes(age, [?](rincome, age) ) )  + 
  geom_point()
ggplot(rincome_summary, 
       aes(age, 
           [?](rincome, 
               "Not applicable") ) ) +
  geom_point()

Strings

Combining strings

To count the length of string, use str_length().
To combine two or more strings, use str_c():
To control how strings are separated, add the sep.
To collapse a vector of strings into a single string, add the collapse.

str_length(c("a", "R for data science", NA))
str_c("x", "y", "z")
str_c(c("x", "y", "z"), sep = ", ")
str_c("prefix-", c("a", "b", "c"), "-suffix")
str_c(c("x", "y", "z"), collapse = ", ")

Strings

Subsetting strings

We can extract parts of a string using str_sub():
str_sub() takes start and end arguments which give the position of the substring:.

x <- c("Apple", "Banana", "Pear")
str_sub(x, 1, 3)

## negative numbers count backwards from end
str_sub(x, -3, -1)

Strings

Matching a pattern of strings

To determine if a character vector matches a pattern, use str_detect().
str_replace() and str_replace_all() allow us to replace matches with new strings.
str_replace_all() can perform multiple replacements by supplying a named vector.

x <- c("apple", "banana", "pear")

str_detect(x, "e")
str_replace(x, "a", "-")
str_replace_all(x, "a", "-")

Strings

Splitting strings

Use str_split() to split a string up into pieces.

sentences |>
  head(5) |> 
  str_split(" ")

Dates and times

Creating date/times

To get the current date or date-time we can use today() or now().
The lubridate functions are a combination of y, m, d, h, m, and s.

today()
now()

ymd("2017-01-31")
[?]("January 31st, 2017")
[?]("31-Jan-2017")

[?](20170131)

[?]("2017-01-31 20:11:59")
[?]("01/31/2017 08:01")

[?](20170131, [?] = "UTC")

Dates and times

Creating date/times

To create a date/time from individual components, use make_date() for dates, or make_datetime() for date-times:

flights |> 
  select(year, month, day, hour, minute)
  
flights |> 
  select(year, month, day, hour, minute) |> 
  mutate(departure = [?](year, month, day, hour, minute))
  
flights |> 
  select(year, month, day, hour, minute) |> 
  mutate(departure = [?](year, month, day))

Dates and times

Creating date/times

We can visualize time series data using lubridate functions.

flights_dt |> 
  ggplot(aes(dep_time)) + 
  geom_freqpoly(binwidth = 86400) ## 86400 seconds = 1 day
  
flights_dt |> 
  filter(dep_time < ymd(20130102)) |> 
  ggplot(aes(dep_time)) + 
  geom_freqpoly(binwidth = 600) ## 600 s = 10 minutes

Dates and times

Creating date/times

as_datetime() and as_date() switch between a date-time and a date.

[?](today())
[?](now())

Lecture 8

Data Transformation

Data Transformation

Data Transformation

dplyr basics

Data Transformation

dplyr basics

Data Transformation

filter()

Filter rows with filter()

Filter rows with filter()

logical conditions

Filter rows with filter()

logical conditions

Filter rows with filter()

De Morgan’s law

Filter rows with filter()

Missing values (NA)

Filter rows with filter()

Exercises

arrange()

Arrange rows with arrange()

Arrange rows with arrange()

Exercises

select()

Select columns with select()

Select columns with select()

Select columns with select()

Rename variables with rename()

Change the Order of Variables

mutate()

Add new variables with mutate()

Add new variables with mutate()

Add new variables with mutate()

Useful creation functions

Add new variables with mutate()

Useful creation functions (cont.)

filter(), arrange(), select(), and mutate()

Exercises

summarize()

Grouped summaries with summarize()

Grouped summaries with summarize()

Example

Grouped summaries with summarize()

Missing values

Grouped summaries with summarize()

Missing values

Grouped summaries with summarize()

Counts

Grouped summaries with summarize()

Counts

Grouped summaries with summarize()

Useful summary functions

Grouped summaries with summarize()

Useful summary functions

Grouped summaries with summarize()

Useful summary functions

Grouped summaries with summarize()

Useful summary functions

Grouped summaries with summarize()

Useful summary functions

Grouped summaries with summarize()

Grouping by multiple variables

Grouped summaries with summarize()

Ungrouping

Grouped summaries with summarize()

Exercises

Grouped summaries with summarize()

Exercises

Grouped summaries with summarize()

Exercises

Grouped mutates

Grouped mutates (and filters)

Grouped mutates (and filters)

Exercises

Grouped mutates (and filters)

Exercises

Grouped mutates (and filters)

Exercises

Tidy Data

`dplyr` basics

`dplyr` basics

`filter()`

Filter rows with `filter()`

Filter rows with `filter()`

`logical` conditions

Filter rows with `filter()`

`logical` conditions

Filter rows with `filter()`

Filter rows with `filter()`

Missing values (`NA`)

Filter rows with `filter()`

`arrange()`

Arrange rows with `arrange()`

Arrange rows with `arrange()`

`select()`

Select columns with `select()`

Select columns with `select()`

Select columns with `select()`

Rename variables with `rename()`

`mutate()`

Add new variables with `mutate()`

Add new variables with `mutate()`

Add new variables with `mutate()`

Add new variables with `mutate()`

`filter()`, `arrange()`, `select()`, and `mutate()`

`summarize()`

Grouped summaries with `summarize()`

Grouped summaries with `summarize()`

Grouped summaries with `summarize()`

Grouped summaries with `summarize()`

Grouped summaries with `summarize()`

Grouped summaries with `summarize()`

Grouped summaries with `summarize()`

Grouped summaries with `summarize()`

Grouped summaries with `summarize()`

Grouped summaries with `summarize()`

Grouped summaries with `summarize()`

Grouped summaries with `summarize()`

Grouped summaries with `summarize()`

Grouped summaries with `summarize()`

Grouped summaries with `summarize()`

Grouped summaries with `summarize()`

`separate`

`separate`

`separate`

`separate`

`unite`

`nycflights13`

`nycflights13`

`nycflights13`

Creating factors with `factor()`

Creating factors with `factor()`