Lecture 8

Data Transformation Basics

Byeong-Hak Choe

SUNY Geneseo

February 24, 2024

Data Transformation

Data Transformation

  • It is rare that we get the data in exactly the right form we need.

  • We’ll need to create some new variables, summaries, rename the variables or reorder the observations in order to make the data a little easier to work with.

  • We’ll discuss how to transform data using the dplyr package and a dataset on flights departing New York City in 2013.

library(tidyverse)
library(nycflights13)
library(skimr)
flights <- flights
View(flights) 
?flights
skim(flights)

Data Transformation

dplyr basics

  • We are going to discuss the five key dplyr functions to solve various data manipulation challenges:
    • Filter observations by logical conditions about values of variables (filter()).
    • Arrange/sort rows (arrange()).
    • Select variables by their names (select()).
    • Create new variables with functions of existing variables (mutate()).
    • Collapse a data.frame down to a summarized version of it (summarize()).
    • Group a data.frame by a categorical variable (group_by()).

Data Transformation

dplyr basics

  • filter(<DATA.FRAME>, LOGICAL_STATEMENT)

  • arrange(<DATA.FRAME>, VARIABLES)

  • select(<DATA.FRAME>, VARIABLES)

  • group_by(<DATA.FRAME>, VARIABLES)

  • mutate(<DATA.FRAME>, NEW_VARIABLE = ... )

  • summarize(<DATA.FRAME>, NEW_VARIABLE = ...)

  • The first argument is a data.frame.

  • The subsequent arguments describe what to do with the data.frame, mostly using the variable names.

  • The result is also a data.frame.

Data Transformation

  • Because the first argument is a data.frame and the output is a data.frame, dplyr verbs work well with the pipe, |>
    • Ctrl + Shift + M for Windows; command + Shift + M for Mac.
  • The pipe (|>) takes the thing on its left and passes it along to the function on its right so that
    • f(x, y) is equivalent to x |> f(y).
    • e.g., filter(<DATA.FRAME>, LOGICAL_STATEMENT) is equivalent to <DATA.FRAME> |> filter(LOGICAL_STATEMENT).
  • The easiest way to pronounce the pipe (|>) is “then”.
    • The pipe (|>) is super useful when we have a chain of data transforming operations to do.

filter()

Filter rows with filter()

  • filter() allows us to subset observations based on the value of logical conditions, which are either TRUE or FALSE.
filter(flights, month == 1, day == 1)
flights |> filter(month == 1, day == 1) ## equilvant to the above

jan1 <- flights |> 
  filter(month == 1, day == 1)

dec25 <- flights |> 
  filter(month == 12, day == 25)

class(flights$month == 1)

Filter rows with filter()

logical conditions

Conditions are expressions that evaluate as logical

Operator Description
x == y Is x equal to y?
x != y Is x not equal to y?
x > y Is x greater than y?
x >= y Is x greater than or equal to y?
x < y Is x less than y?
x <= y Is x less than or equal to y?

Filter rows with filter()

logical conditions

  • The == is an operator that compares the objects on either side and returns TRUE if they have the same values.
df <- tibble( var = c(9, 10, 11) ) 
df |> filter(var == 10)

cond <- num_vec == 10
df |> filter(cond)

Q. What does ! (!F) evaluate to?

Filter rows with filter()

De Morgan’s law

  • !(x & y) is the same as !x | !y.

  • !(x | y) is the same as !x & !y.

  • Comparison and logical operators can be used with filter():

filter(flights, month == 11 | month == 12)
nov_dec <- flights |> 
  filter(month %in% c(11, 12))

flights |> 
  filter(!(arr_delay > 120 | dep_delay > 120))

flights |> 
  filter(arr_delay <= 120 & dep_delay <= 120)

Filter rows with filter()

Missing values (NA)

  • Almost any operation involving an unknown value (NA) will also be unknown.
NA > 5
10 == NA
NA + 10
NA / 2

NA == NA
  • Let x be Mary’s age. We don’t know how old she is.
  • Let y be John’s age. We don’t know how old he is.
  • Are John and Mary the same age?
x <- NA
y <- NA
x == y
  • If we want to determine if a value is missing, use is.na().
  • If we want to preserve missing values, ask filter() for them explicitly.
x <- NA
is.na(x) ## is x NA?

y <- "missing"
is.na(y) ## is y NA?
df <- tibble(y = c(1, NA, 3))

df |> 
  filter(y > 1)

df |> 
  filter(is.na(y) | y > 1)

Filter rows with filter()

Exercises

  • Find all flights that
    • Had an arrival delay of two or more hours
    • Flew to Houston (IAH or HOU)
    • Were operated by United, American, or Delta
    • Departed in summer (July, August, and September)
    • Arrived more than two hours late, but didn’t leave late
    • Were delayed by at least an hour, but made up over 30 minutes in flight
    • Departed between midnight and 6am (inclusive)

arrange()

Arrange rows with arrange()

  • arrange() changes their order.

  • If we provide more than one column name, each additional column will be used to break ties in the values of preceding columns.

  • Use desc() to re-order by a column in descending order.

flights |> arrange(year, month, day)

## re-order observations by dep_delay in descending order.
flights |> arrange([?])

Arrange rows with arrange()

Exercises

  • How could you use arrange() to sort all missing values to the start? (Hint: use is.na()).

  • Sort flights to find the most delayed flights. Find the flights that left earliest.

  • Sort flights to find the fastest (highest speed) flights.

  • Which flights traveled the farthest? Which traveled the shortest?

select()

Select columns with select()

  • It’s not uncommon to get datasets with hundreds or thousands of variables.
  • select() allows us to narrow in on the variables we’re actually interested in.
flights |> 
  select(year, month, day)

flights |> 
  select(year:day)

flights |> 
  select(-(year:day))

Select columns with select()

  • There are a number of helper functions we can use within select():

  • starts_with("abc"): matches names that begin with “abc”.

  • ends_with("xyz"): matches names that end with “xyz”.

  • contains("ijk"): matches names that contain “ijk”.

  • num_range("x", 1:3): matches x1, x2 and x3.

Select columns with select()

Rename variables with rename()

  • rename() can be used to rename variables:

    • rename(<DATA>, <NEW_VARIABLE> = <EXISTING_VARIABLE>)
flights |> 
  rename( tail_num = tailnum )

Change the Order of Variables

  • We can use relocate() to move variables around.

  • We can also use select() with everything() to move variables around.

flights |> 
  relocate(time_hour, air_time)

flights |> 
  select(time_hour, air_time, everything())
  • We can also use relocate() to move variables around.
    • We can specify .before and .after arguments to choose where to put variables:
flights |> 
  relocate(year:dep_time, .after = time_hour)

flights |> 
  relocate(starts_with("arr"), .before = dep_time)

mutate()

Add new variables with mutate()

  • We can use the mutate() to add variables to the data.frame:
flights |> 
  mutate(
    gain = dep_delay - arr_delay,
    speed = distance / air_time * 60
  )
  • We can use the .before argument to add the variables to the position of a column:
flights |> 
  mutate(
    gain = dep_delay - arr_delay,
    speed = distance / air_time * 60,
    .before = 1  ## try different position numbers.
  )
  • The . is a sign that .before is an argument to the function, not the name of variable.
  • In both .before and .after, we can use the variable name instead of a position number.
flights |> 
  mutate(
    gain = dep_delay - arr_delay,
    speed = distance / air_time * 60,
    .after = day  
  )

Add new variables with mutate()

  • There are many functions for creating new variables that we can use with mutate().

  • Offsets: lead() and lag() allow us to refer to leading or lagging values.

  • Cumulative and rolling aggregates: R provides functions for running sums, products, mins and maxes: cumsum(), cummean(), cummin(), cummax(), and cumprod().

(x <- 1:10)
lag(x)
lead(x)
cumsum(x)
cummean(x)

Add new variables with mutate()

Useful creation functions

  • There are many functions for creating new variables that we can use with mutate().

  • Ranking: min_rank(), dense_rank(), row_number(), percent_rank(), and cume_dist().

rank_me <- tibble( x = c(10, 5, 1, 5, 5, NA) )
rank_me <- rank_me |> 
  mutate(id_x = row_number(x),
         x_min_rank = min_rank(x),
         x_dense_rank = dense_rank(x)
         )

Add new variables with mutate()

Useful creation functions (cont.)

  • To create new variables based on a condition, we can use the if_else() and ifelse() functions.
    • if_else(<condition>, <if TRUE>, <else>)
    • ifelse(<condition>, <if TRUE>, <else>)
flight_season <- flights |>  mutate(
    summer_month = if_else(month %in% c(6, 7, 8), TRUE, FALSE))

flight_season <- flights |>  mutate(
    summer_month = if_else(month %in% c(6, 7, 8), TRUE, 0))

flight_season <- flights |>  mutate(  ## ifelse() is less restrictive
    summer_month = ifelse(month %in% c(6, 7, 8), TRUE, 0))

filter(), arrange(), select(), and mutate()

Exercises

  • Find all flights that arrived more than two hours late, but didn’t leave late.

  • Sort flights to find the fastest (highest speed) flights.

  • Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.

  • Find the 10 most delayed flights using a ranking function. How do you want to handle ties?

summarize()

Grouped summaries with summarize()

  • summarize() collapses a data frame to a single row.
  • na.rm argument removes the missing values prior to computation when using summarize() or other aggregate functions.
  • summarize() is not terribly useful unless we pair it with group_by().
flights |> 
  summarize(delay = mean(dep_delay, na.rm = TRUE))

by_day <- flights |> group_by(year, month, day)
by_day |> summarize(delay = mean(dep_delay, na.rm = TRUE))

Grouped summaries with summarize()

Example

  • Explore the relationship between the mean level distance and the mean level arr_delay for each value of dest.
by_dest <- flights |> group_by(dest)
delay <- by_dest |> summarize(
  count = n(),
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE)
)
delay <- delay |> 
  filter(count > 20, dest != "HNL")
ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
  geom_point(aes(size = count), alpha = 1/3) +
  geom_smooth(se = FALSE)

Grouped summaries with summarize()

Missing values

  • What happens if we don’t set na.rm?
flights |> 
  group_by(year, month, day) |> 
  summarize(mean = mean(dep_delay))
  
flights |> 
  group_by(year, month, day) |> 
  summarize(mean = mean(dep_delay, na.rm = TRUE))

Grouped summaries with summarize()

Missing values

  • Suppose missing values represent cancelled flights.
not_cancelled <- flights |> 
  filter(!is.na(dep_delay), !is.na(arr_delay))

not_cancelled |> 
  group_by(year, month, day) |> 
  summarize(mean = mean(dep_delay))

Grouped summaries with summarize()

Counts

  • Let’s look at the planes (identified by their tail number (tailnum)) that have the highest average delays
delays <- not_cancelled |> 
  group_by(tailnum) |> 
  summarize(
    delay = mean(arr_delay)
  )

ggplot(data = delays, mapping = aes(x = delay)) + 
  geom_freqpoly(binwidth = 10)

Grouped summaries with summarize()

Counts

  • When doing any aggregation, it’s always a good idea to include either a count (n()), or a count of non-missing values (sum(!is.na(x))).
delays <- not_cancelled |> 
  group_by(tailnum) |> 
  summarize( delay = mean(arr_delay, na.rm = TRUE),
             n = n() )

ggplot(data = delays, mapping = aes(x = n, y = delay)) + 
  geom_point(alpha = 1/10)

delays |> 
  filter(n > 25) |> 
  ggplot(mapping = aes(x = n, y = delay)) + 
    geom_point(alpha = 1/10)

Grouped summaries with summarize()

Useful summary functions

  • Measures of location: mean(x) and median(x).

  • Measures of spread: sd(x), IQR(x).

not_cancelled |> 
  group_by(year, month, day) |> 
  summarize( avg_delay1 = mean(arr_delay),
             avg_delay2 = mean(arr_delay[arr_delay > 0]) ) ## the average positive delay
## Brackets[] lets us filter (or subset) data from a column
## Why is distance to some destinations more variable than to others?
not_cancelled |> 
  group_by(dest) |> 
  summarize(distance_sd = sd(distance)) |> 
  arrange(desc(distance_sd))

Grouped summaries with summarize()

Useful summary functions

  • Measures of rank: min(x), quantile(x, 0.25), max(x), range(x).

  • Measures of position: first(x), nth(x, 2), last(x).

  • When do the first and last flights leave each day?

not_cancelled |> 
  group_by(year, month, day) |> 
  summarize(first = min(dep_time), 
            last = max(dep_time) )
not_cancelled |> 
  group_by(year, month, day) |> 
  summarize(first_dep = first(dep_time),last_dep = last(dep_time) )
not_cancelled |> 
  group_by(year, month, day) |> 
  mutate(r = min_rank(desc(dep_time))) |> 
  filter(r %in% range(r) )

Grouped summaries with summarize()

Useful summary functions

  • Measures of count: sum(!is.na(x)), n_distinct(x), count(x), and count(x, wt = VARIABLE).
## Which destinations have the most carriers?
not_cancelled |> 
  group_by(dest) |> 
  summarise(carriers = n_distinct(carrier)) |> 
  arrange(desc(carriers))
  
  
not_cancelled |> 
  count(dest)

not_cancelled |> 
  count(tailnum, wt = distance)

Grouped summaries with summarize()

Useful summary functions

  • We can compute the height of bars in a histogram by combining dplyr::count() and ggplot2::cut_width():
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.5)

diamonds |> 
  count(cut_width(carat, 0.5))

Grouped summaries with summarize()

Useful summary functions

  • Counts and proportions of logical values: sum(x > 10), mean(y == 0).
## How many flights left before 5am? 
## (these usually indicate delayed flights from the previous day)
not_cancelled |> 
  group_by(year, month, day) |> 
  summarize(n_early = sum(dep_time < 500))

## What proportion of flights are delayed by more than an hour?
not_cancelled |> 
  group_by(year, month, day) |> 
  summarize(hour_prop = mean(arr_delay > 60))

Grouped summaries with summarize()

Grouping by multiple variables

  • When you group by multiple variables, each summary peels off one level of the grouping.
daily <- flights |> group_by(year, month, day)

(per_day   <- daily |> summarize(flights = n()))

(per_month <- per_day |> summarize(flights = sum(flights)))

(per_year  <- per_month |> summarize(flights = sum(flights)))

Grouped summaries with summarize()

Ungrouping

  • If you need to remove grouping, and return to operations on ungrouped data, use ungroup().
daily |> 
  ungroup() |>             ## no longer grouped by date
  summarize(flights = n())  ## all flights

Grouped summaries with summarize()

Exercises

  • Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. Consider the following scenarios:

    • A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.

    • A flight is always 10 minutes late.

    • A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.

    • 99% of the time a flight is on time. 1% of the time it’s 2 hours late.

Q. Which is more important: arrival delay or departure delay?

Grouped summaries with summarize()

Exercises

  • Come up with another approach that will give you the same output as not_cancelled |> count(dest) and not_cancelled |> count(tailnum, wt = distance) (without using count()).

  • Our definition of cancelled flights (is.na(dep_delay) | is.na(arr_delay) ) is slightly suboptimal. Why? Which is the most important column?

  • Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay?

Grouped summaries with summarize()

Exercises

  • Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights |> group_by(carrier, dest) |> summarize(n()))

  • What does the sort argument to count() do. When might you use it?

Grouped mutates

Grouped mutates (and filters)

  • Grouping is most useful in conjunction with summarize(), but you can also do convenient operations with mutate() and filter().
  • Find the worst members of each group:
flights |> 
  group_by(year, month, day) |>
  filter(rank(desc(arr_delay)) < 10)
  • Find all groups bigger than a threshold:
popular_dests <- flights |> 
  group_by(dest) |> 
  filter(n() > 17250)
popular_dests
  • We can compute per group metrics:
popular_dests |> 
  filter(arr_delay > 0) |> 
  mutate(prop_delay = arr_delay / sum(arr_delay)) |> 
  select(year:day, dest, arr_delay, prop_delay)

Grouped mutates (and filters)

Exercises

  • Refer back to the lists of useful mutate and filtering functions. Describe how each operation changes when you combine it with grouping.

  • Which plane (tailnum) has the worst on-time record?

  • What time of day should you fly if you want to avoid delays as much as possible?

  • For each destination, compute the total minutes of delay. For each flight, compute the proportion of the total delay for its destination.

Grouped mutates (and filters)

Exercises

  • Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave. Using lag(), explore how the delay of a flight is related to the delay of the immediately preceding flight.

  • Look at each destination. Can you find flights that are suspiciously fast? (i.e. flights that represent a potential data entry error). Compute the air time of a flight relative to the shortest flight to that destination. Which flights were most delayed in the air?

Grouped mutates (and filters)

Exercises

  • Find all destinations that are flown by at least two carriers. Use that information to rank the carriers.

  • For each plane, count the number of flights before the first delay of greater than 1 hour.

Tidy Data

Tidy Data

What are tidy data?

  • There are three rules which make a dataset tidy:

    1. Each variable must have its own column.
    2. Each observation must have its own row.
    3. Each value must have its own cell.

Tidy data

  • We can represent the same underlying data in multiple ways.
library(tidyverse)
table1
table2
table3

## Spread across two tibbles
table4a  ## cases
table4b

Pivoting

Pivoting

  • The first step is always to figure out what form of the variables and observations we need.

  • The second step is to resolve one of two common problems:

    • One variable might be spread across multiple columns.

    • One observation might be scattered across multiple rows.

  • To fix these problems, we may need the two most important functions in the tidyr package: pivot_longer() and pivot_wider().

Pivoting

Longer

  • Some of the column names are not names of variables, but values of a variable.
table4a
  • We use pivot_longer() when a variable might be spread across multiple columns.

Pivoting

Longer

  • To tidy a dataset like table4a, we need to pivot the offending columns into a new pair of variables.

  • To use pivot_longer(), we need to the following three parameters:

    • The set of columns (cols) whose names are values, not variables.
    • The name of the variable to move the column names to (names_to).
    • The name of the variable to move the column values to (values_to).
table4a |> 
  pivot_longer(cols = c(`1999`, `2000`), 
               names_to = "year", 
               values_to = "cases")

Pivoting

Longer

Pivoting

Wider

  • One observation might be scattered across multiple rows.
table2
  • We use pivot_wider() when an observation is scattered across multiple rows.

Pivoting

Wider

  • To use pivot_wider(), we need two parameters:
    • The column to take variable names from.
    • The column to take values from.
  • We can add the optimal parameter, names_prefix, for the prefix of column names.
table2
table2 |>
    pivot_wider(names_from = type, values_from = count)

Pivoting

Wider

Separating and Uniting

Separating and Uniting

###separate

  • table3 has one column (rate) that contains two variables (cases and population).

  • separate() takes the name of the column to separate, and the names of the columns to separate into.

table3

table3 |> 
  separate(rate, into = c("cases", "population"))

Separating and Uniting

###separate

  • If we want to use a specific chracter to seprate a column, we can pass the character to the sep parameter.
table3 |> 
  separate(rate, into = c("cases", "population"), sep = "/")

Separating and Uniting

###separate

  • By default, separate() leaves the type of the column as is.
table3 |> 
  separate(rate, into = c("cases", "population"))

Separating and Uniting

###separate

  • separate() will interpret the integers as positions to split at.

    • We can also pass a vector of integers to sep.

    • separate() will interpret the integer as positions to split at.

table3 |> 
  separate(year, into = c("century", "year"), sep = 2)

Separating and Uniting

unite

  • unite() combines multiple columns into a single column.

  • The default will place an underscore (_) between the values from different columns.

table5 |> 
  unite(new, century, year)
  
table5 |> 
  unite(new, century, year, sep = "")

Data Assembly

Data Assembly

  • rbind() can row-bind data.frames.
  • cbind() can column-bind data.frames.
df1 <- data.frame(id = 1:3, 
                  name = c("Alice", "Bob", "Charlie")
                  )
df2 <- data.frame(id = 4:6, 
                  name = c("Dave", "Eve", "Frank")
                  )

df_rbind <- rbind(df1, df2)
df_cbind <- cbind(df1, df2) ## better to rename id before cbind()

Data Assembly

  • rbind() does not work if the two data.frames have different names of columns.
df1 <- data.frame(id1 = 1:3, 
                  name = c("Alice", "Bob", "Charlie")
                  )
df2 <- data.frame(id2 = 4:6, 
                  name = c("Dave", "Eve", "Frank")
                  )

df_rbind <- rbind(df1, df2)

Data Assembly

  • cbind() does not work if the two data.frames have different numbers of rows.
df1 <- data.frame(id1 = 1:5, 
                  name = c("Alice", "Bob", "Charlie", "Dan", "Eva")
                  )
df2 <- data.frame(id2 = 6:8, 
                  name = c("Dave", "Eve", "Frank")
                  )

df_cbind <- cbind(df1, df2)

Relational Data

Relational data

  • It’s rare that a data analysis involves only a single data frame.

  • Collectively, multiple data frames are called relational data.

  • To work with relational data, we need verbs that work with pairs of data frames.

  • join methods add new variables to one data frame from matching observations in another data frame.

Relational data

nycflights13

  • nycflights13 contains four data frames that are related to the data frame, flights, that we used in data transformation.
library(nycflights13)
flights
airlines
airports
planes
weather

Relational data

nycflights13

  • flights connects to …

  • planes via a single variable, tailnum.

  • airlines through the carrier variable.

  • airports in two ways: via the origin and dest variables.

  • weather via origin (the location), and year, month, day and hour (the time).

Relational data

nycflights13

Relational data

Key variables

  • A key variable (or a set of key variables) is a variable (or a set of variables) that uniquely identifies an observation.

    • So, a key variable (or a set of key variables) is used to connect relational data.frames.

    • The name of a key variable can be different across relational data.frames.

Relational data

  • A join allows us to combine two data frames via key variables.
    • It first matches observations by their keys, then copies across variables from one data.frame to the other.
Tidyverse SQL Description
left left outer Keep all the observations from the left
right right outer Keep all the observations from the right
full full outer Keep all the observations from both left and right
inner inner Keep only the observations whose key values exist in both

Relational data

Joins

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     3, "x3" )
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     4, "y3" )
  • The colored column represents the “key” variable.
  • The grey column represents the “value” column.

Joins

  • An inner join matches pairs of observations whenever their keys are equal:

join_inner <- x |> 
  inner_join(y)
  • A left join keeps all observations in x.

join_left <- x |> 
  left_join(y)
  • The most commonly used join is the left join.
  • A right join keeps all observations in y.

join_right <- x |> 
  right_join(y)
  • A full join keeps all observations in x and y.

join_full <- x |> 
  full_join(y)

Joins

Duplicate keys

  • Relations between data.frames could be …
    • One-to-one;
    • One-to-many (e.g., airlines and airports);
    • Many-to-one;
    • Many-to-many (e.g., flights and airplanes).
  • One-to-many relation is such that one data.frame has duplicate keys
  • Many-to-many relation is such that both data.frames have duplicate keys.
    • In practice, it is recommended to avoid joining data.frames with many-to-many relation.

:::: {.columns} ::: {.column width=“50%”}

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     2, "x3",
     1, "x4")
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2")

x |> 
  left_join(y)

many-to-many

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     2, "x3",
     3, "x4" )
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     2, "y3",
     3, "y4" )

x |> 
  left_join(y)

Joins

Defining the key columns

  • by = "a": uses only variable a.
  • by = c("a" = "b"): match variable a in data frame x to variable b in data frame y.
flights2 |> 
  left_join(weather)
  
flights2 |> 
  left_join(planes, 
            by = "tailnum")
flights2 |> 
  left_join(airports, 
            c("dest" = "faa"))
  
flights2 |> 
  left_join(airports, 
            c("origin" = "faa"))

Factors

Factors

Creating factors

  • In R, factors are categorical variables, variables that have a fixed and known set of possible values.
x1 <- c("Dec", "Apr", "Jan", "Mar")
  • Using a string to record variable x1 has two problems:
    1. There are only twelve possible months, and there’s nothing saving us from typos.
    2. It doesn’t sort in a useful way.
x2 <- c("Dec", "Apr", "Jam", "Mar")
sort(x1)

Factors

Creating factors with factor()

  • We can fix both of these problems with factor().
  • To create a factor, we must start by creating a list of the valid levels.
  • Any values not in the set will be silently converted to NA.
  • If we omit the levels, they’ll be taken from the data in alphabetical order:
months <- c(
  "Jan", "Feb", "Mar", "Apr", 
  "May", "Jun", "Jul", "Aug", 
  "Sep", "Oct", "Nov", "Dec")
x1 <- 
  c("Dec", "Apr", "Jan", "Mar")
y1 <- factor(x1, 
             levels = months)
sort(y1)
x2 <- 
  c("Dec", "Apr", "Jam", "Mar")

y2 <- factor(x2, 
             levels = months)
y2

factor(x1)

Factors

Creating factors with factor()

  • Sometimes we’d prefer that the order of the levels match the order of the first appearance in the data.

  • We can do that when creating the factor by setting levels to unique().

  • If we ever need to access the set of valid levels directly, we can do so with levels().

x1

f1 <- factor(x1, levels = unique(x1))
f1

levels(f1)

Factors

General Social Survey

  • We’re going to focus on the data frame, forcats::gss_cat.which is a sample of data from the General Social Survey.

  • When factors are stored in a data frame, we can see them with count().

gss_cat

gss_cat |>
  [?](race)

Factors

Modifying factor order

  • It’s often useful to change the order of the factor levels in a visualization.

  • Imagine we want to explore the average number of hours spent watching TV per day across relig:

relig_summary <- gss_cat |>
  group_by([?]) |>
  summarize(
    age = [?](age, [?]),
    tvhours = [?](tvhours, [?]),
    n = [?]
  )
ggplot(relig_summary, 
       aes(tvhours, relig)) + 
  [?]()

Factors

Modifying factor order

  • We can reorder the levels using fct_reorder(f, x, fun), which can take three arguments.

  • f: the factor whose levels we want to modify.

  • x: a numeric vector that we want to use to reorder the levels.

  • Optionally, fun: a function that’s used if there are multiple values of x for each value of f. The default value is median.

relig_summary |>
  mutate(relig = [?](relig, tvhours)) |>
  ggplot(aes(tvhours, relig)) + [?]()

Factors

Modifying factor order

  • We can use relevel() to set the first level (reference level).
  • relevel(x, ref = ...) takes at least the two arguments:
    • x: factor variable
    • ref: reference level or first level
rincome_summary <- gss_cat |>
  group_by([?]) |>
  summarize(
    age = [?](age, [?]),
    tvhours = [?](tvhours, [?]),
    n = [?]
  )
ggplot(rincome_summary, 
       aes(age, [?](rincome, age) ) )  + 
  geom_point()
ggplot(rincome_summary, 
       aes(age, 
           [?](rincome, 
               "Not applicable") ) ) +
  geom_point()

Strings

Strings

Combining strings

  • To count the length of string, use str_length().

  • To combine two or more strings, use str_c():

  • To control how strings are separated, add the sep.

  • To collapse a vector of strings into a single string, add the collapse.

str_length(c("a", "R for data science", NA))
str_c("x", "y", "z")
str_c(c("x", "y", "z"), sep = ", ")
str_c("prefix-", c("a", "b", "c"), "-suffix")
str_c(c("x", "y", "z"), collapse = ", ")

Strings

Subsetting strings

  • We can extract parts of a string using str_sub():

  • str_sub() takes start and end arguments which give the position of the substring:.

x <- c("Apple", "Banana", "Pear")
str_sub(x, 1, 3)

## negative numbers count backwards from end
str_sub(x, -3, -1)

Strings

Matching a pattern of strings

  • To determine if a character vector matches a pattern, use str_detect().
  • str_count() tells us how many matches there are in a string.
  • str_replace() and str_replace_all() allow us to replace matches with new strings.
  • str_replace_all() can perform multiple replacements by supplying a named vector.
x <- c("apple", "banana", "pear")

str_detect(x, "e")
str_count(x, "a")
str_replace(x, "a", "-")
str_replace_all(x, "a", "-")

Strings

Splitting strings

  • Use str_split() to split a string up into pieces.
sentences |>
  head(5) |> 
  str_split(" ")

Dates and times

Creating date/times

  • To get the current date or date-time we can use today() or now().
  • The lubridate functions are a combination of y, m, d, h, m, and s.
today()
now()
ymd("2017-01-31")
[?]("January 31st, 2017")
[?]("31-Jan-2017")

[?](20170131)

[?]("2017-01-31 20:11:59")
[?]("01/31/2017 08:01")

[?](20170131, [?] = "UTC")

Dates and times

Creating date/times

  • To create a date/time from individual components, use make_date() for dates, or make_datetime() for date-times:
flights |> 
  select(year, month, day, hour, minute)
  
flights |> 
  select(year, month, day, hour, minute) |> 
  mutate(departure = [?](year, month, day, hour, minute))
  
flights |> 
  select(year, month, day, hour, minute) |> 
  mutate(departure = [?](year, month, day))

Dates and times

Creating date/times

  • We can visualize time series data using lubridate functions.
flights_dt |> 
  ggplot(aes(dep_time)) + 
  geom_freqpoly(binwidth = 86400) ## 86400 seconds = 1 day
  
flights_dt |> 
  filter(dep_time < ymd(20130102)) |> 
  ggplot(aes(dep_time)) + 
  geom_freqpoly(binwidth = 600) ## 600 s = 10 minutes

Dates and times

Creating date/times

  • as_datetime() and as_date() switch between a date-time and a date.
[?](today())
[?](now())