Data Transformation Basics
February 24, 2024
It is rare that we get the data in exactly the right form we need.
We’ll need to create some new variables, summaries, rename the variables or reorder the observations in order to make the data a little easier to work with.
We’ll discuss how to transform data using the dplyr
package and a dataset on flights departing New York City in 2013.
dplyr
basicsdplyr
functions to solve various data manipulation challenges:
filter()
).arrange()
).select()
).mutate()
).summarize()
).group_by()
).dplyr
basicsfilter(<DATA.FRAME>, LOGICAL_STATEMENT)
arrange(<DATA.FRAME>, VARIABLES)
select(<DATA.FRAME>, VARIABLES)
group_by(<DATA.FRAME>, VARIABLES)
mutate(<DATA.FRAME>, NEW_VARIABLE = ... )
summarize(<DATA.FRAME>, NEW_VARIABLE = ...)
The first argument is a data.frame.
The subsequent arguments describe what to do with the data.frame, mostly using the variable names.
The result is also a data.frame.
data.frame
and the output is a data.frame
, dplyr
verbs work well with the pipe, |>
|>
) takes the thing on its left and passes it along to the function on its right so that
f(x, y)
is equivalent to x |> f(y)
.filter(<DATA.FRAME>, LOGICAL_STATEMENT)
is equivalent to <DATA.FRAME> |> filter(LOGICAL_STATEMENT)
.|>
) is “then”.
|>
) is super useful when we have a chain of data transforming operations to do.filter()
filter()
filter()
allows us to subset observations based on the value of logical conditions, which are either TRUE
or FALSE
.filter()
logical
conditionsConditions are expressions that evaluate as logical
Operator | Description |
---|---|
x == y | Is x equal to y? |
x != y | Is x not equal to y? |
x > y | Is x greater than y? |
x >= y | Is x greater than or equal to y? |
x < y | Is x less than y? |
x <= y | Is x less than or equal to y? |
filter()
logical
conditions==
is an operator that compares the objects on either side and returns TRUE
if they have the same values.Q. What does ! (!F)
evaluate to?
filter()
!(x & y)
is the same as !x | !y
.
!(x | y)
is the same as !x & !y
.
Comparison and logical operators can be used with filter()
:
filter()
NA
)NA
) will also be unknown.x
be Mary’s age. We don’t know how old she is.y
be John’s age. We don’t know how old he is.filter()
IAH
or HOU
)arrange()
arrange()
arrange()
changes their order.
If we provide more than one column name, each additional column will be used to break ties in the values of preceding columns.
Use desc()
to re-order by a column in descending order.
arrange()
How could you use arrange()
to sort all missing values to the start? (Hint: use is.na()
).
Sort flights to find the most delayed flights. Find the flights that left earliest.
Sort flights to find the fastest (highest speed) flights.
Which flights traveled the farthest? Which traveled the shortest?
select()
select()
select()
allows us to narrow in on the variables we’re actually interested in.select()
There are a number of helper functions we can use within select()
:
starts_with("abc")
: matches names that begin with “abc”.
ends_with("xyz")
: matches names that end with “xyz”.
contains("ijk")
: matches names that contain “ijk”.
num_range("x", 1:3)
: matches x1, x2 and x3.
select()
rename()
rename()
can be used to rename variables:
rename(<DATA>, <NEW_VARIABLE> = <EXISTING_VARIABLE>)
We can use relocate()
to move variables around.
We can also use select()
with everything()
to move variables around.
mutate()
mutate()
mutate()
to add variables to the data.frame:.before
argument to add the variables to the position of a column:.
is a sign that .before
is an argument to the function, not the name of variable.mutate()
There are many functions for creating new variables that we can use with mutate()
.
Offsets: lead()
and lag()
allow us to refer to leading or lagging values.
Cumulative and rolling aggregates: R provides functions for running sums, products, mins and maxes: cumsum()
, cummean()
, cummin()
, cummax()
, and cumprod()
.
mutate()
There are many functions for creating new variables that we can use with mutate()
.
Ranking: min_rank()
, dense_rank()
, row_number()
, percent_rank()
, and cume_dist()
.
mutate()
if_else()
and ifelse()
functions.
if_else(<condition>, <if TRUE>, <else>)
ifelse(<condition>, <if TRUE>, <else>)
flight_season <- flights |> mutate(
summer_month = if_else(month %in% c(6, 7, 8), TRUE, FALSE))
flight_season <- flights |> mutate(
summer_month = if_else(month %in% c(6, 7, 8), TRUE, 0))
flight_season <- flights |> mutate( ## ifelse() is less restrictive
summer_month = ifelse(month %in% c(6, 7, 8), TRUE, 0))
filter()
, arrange()
, select()
, and mutate()
Find all flights that arrived more than two hours late, but didn’t leave late.
Sort flights
to find the fastest (highest speed) flights.
Brainstorm as many ways as possible to select dep_time
, dep_delay
, arr_time
, and arr_delay
from flights.
Find the 10 most delayed flights using a ranking function. How do you want to handle ties?
summarize()
summarize()
summarize()
collapses a data frame to a single row.na.rm
argument removes the missing values prior to computation when using summarize()
or other aggregate functions.summarize()
is not terribly useful unless we pair it with group_by()
.summarize()
distance
and the mean level arr_delay
for each value of dest
.by_dest <- flights |> group_by(dest)
delay <- by_dest |> summarize(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
)
delay <- delay |>
filter(count > 20, dest != "HNL")
ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
geom_point(aes(size = count), alpha = 1/3) +
geom_smooth(se = FALSE)
summarize()
na.rm
?summarize()
summarize()
tailnum
)) that have the highest average delayssummarize()
n()
), or a count of non-missing values (sum(!is.na(x))
).summarize()
Measures of location: mean(x)
and median(x)
.
Measures of spread: sd(x)
, IQR(x)
.
not_cancelled |>
group_by(year, month, day) |>
summarize( avg_delay1 = mean(arr_delay),
avg_delay2 = mean(arr_delay[arr_delay > 0]) ) ## the average positive delay
## Brackets[] lets us filter (or subset) data from a column
## Why is distance to some destinations more variable than to others?
not_cancelled |>
group_by(dest) |>
summarize(distance_sd = sd(distance)) |>
arrange(desc(distance_sd))
summarize()
Measures of rank: min(x)
, quantile(x, 0.25)
, max(x)
, range(x)
.
Measures of position: first(x)
, nth(x, 2)
, last(x)
.
When do the first and last flights leave each day?
summarize()
sum(!is.na(x))
, n_distinct(x)
, count(x)
, and count(x, wt = VARIABLE)
.summarize()
dplyr::count()
and ggplot2::cut_width()
:summarize()
sum(x > 10)
, mean(y == 0)
.## How many flights left before 5am?
## (these usually indicate delayed flights from the previous day)
not_cancelled |>
group_by(year, month, day) |>
summarize(n_early = sum(dep_time < 500))
## What proportion of flights are delayed by more than an hour?
not_cancelled |>
group_by(year, month, day) |>
summarize(hour_prop = mean(arr_delay > 60))
summarize()
summarize()
ungroup()
.summarize()
Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. Consider the following scenarios:
A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.
A flight is always 10 minutes late.
A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.
99% of the time a flight is on time. 1% of the time it’s 2 hours late.
Q. Which is more important: arrival delay or departure delay?
summarize()
Come up with another approach that will give you the same output as not_cancelled
|>
count(dest)
and not_cancelled |> count(tailnum, wt = distance)
(without using count()
).
Our definition of cancelled flights (is.na(dep_delay) | is.na(arr_delay)
) is slightly suboptimal. Why? Which is the most important column?
Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay?
summarize()
Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights |> group_by(carrier, dest) |> summarize(n())
)
What does the sort argument to count()
do. When might you use it?
summarize()
, but you can also do convenient operations with mutate()
and filter()
.Refer back to the lists of useful mutate and filtering functions. Describe how each operation changes when you combine it with grouping.
Which plane (tailnum
) has the worst on-time record?
What time of day should you fly if you want to avoid delays as much as possible?
For each destination, compute the total minutes of delay. For each flight, compute the proportion of the total delay for its destination.
Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave. Using lag()
, explore how the delay of a flight is related to the delay of the immediately preceding flight.
Look at each destination. Can you find flights that are suspiciously fast? (i.e. flights that represent a potential data entry error). Compute the air time of a flight relative to the shortest flight to that destination. Which flights were most delayed in the air?
Find all destinations that are flown by at least two carriers. Use that information to rank the carriers.
For each plane, count the number of flights before the first delay of greater than 1 hour.
There are three rules which make a dataset tidy:
The first step is always to figure out what form of the variables and observations we need.
The second step is to resolve one of two common problems:
One variable might be spread across multiple columns.
One observation might be scattered across multiple rows.
To fix these problems, we may need the two most important functions in the tidyr
package: pivot_longer()
and pivot_wider()
.
pivot_longer()
when a variable might be spread across multiple columns.To tidy a dataset like table4a
, we need to pivot the offending columns into a new pair of variables.
To use pivot_longer()
, we need to the following three parameters:
cols
) whose names are values, not variables.names_to
).values_to
).pivot_wider()
when an observation is scattered across multiple rows.pivot_wider()
, we need two parameters:
names_prefix
, for the prefix of column names.###separate
table3
has one column (rate
) that contains two variables (cases
and population
).
separate()
takes the name of the column to separate, and the names of the columns to separate into.
###separate
sep
parameter.###separate
separate()
leaves the type of the column as is.###separate
separate()
will interpret the integers as positions to split at.
We can also pass a vector of integers to sep
.
separate()
will interpret the integer as positions to split at.
unite
unite()
combines multiple columns into a single column.
The default will place an underscore (_
) between the values from different columns.
rbind()
can row-bind data.frames.cbind()
can column-bind data.frames.rbind()
does not work if the two data.frames have different names of columns.cbind()
does not work if the two data.frames have different numbers of rows.It’s rare that a data analysis involves only a single data frame.
Collectively, multiple data frames are called relational data.
To work with relational data, we need verbs that work with pairs of data frames.
join methods add new variables to one data frame from matching observations in another data frame.
nycflights13
nycflights13
contains four data frames that are related to the data frame, flights
, that we used in data transformation.nycflights13
flights
connects to …
planes
via a single variable, tailnum
.
airlines
through the carrier
variable.
airports
in two ways: via the origin
and dest
variables.
weather
via origin
(the location), and year
, month
, day
and hour
(the time).
nycflights13
A key variable (or a set of key variables) is a variable (or a set of variables) that uniquely identifies an observation.
So, a key variable (or a set of key variables) is used to connect relational data.frames.
The name of a key variable can be different across relational data.frames.
Tidyverse | SQL | Description |
---|---|---|
left | left outer | Keep all the observations from the left |
right | right outer | Keep all the observations from the right |
full | full outer | Keep all the observations from both left and right |
inner | inner | Keep only the observations whose key values exist in both |
x
.airlines
and airports
);flights
and airplanes
).:::: {.columns} ::: {.column width=“50%”}
by = "a"
: uses only variable a
.by = c("a" = "b")
: match variable a
in data frame x
to variable b
in data frame y
.R
, factors are categorical variables, variables that have a fixed and known set of possible values.x1
has two problems:
factor()
factor()
.levels
.NA
.levels
, they’ll be taken from the data in alphabetical order:factor()
Sometimes we’d prefer that the order of the levels match the order of the first appearance in the data.
We can do that when creating the factor by setting levels to unique()
.
If we ever need to access the set of valid levels directly, we can do so with levels()
.
We’re going to focus on the data frame, forcats::gss_cat
.which is a sample of data from the General Social Survey.
When factors are stored in a data frame, we can see them with count()
.
It’s often useful to change the order of the factor levels in a visualization.
Imagine we want to explore the average number of hours spent watching TV per day across relig
:
We can reorder the levels using fct_reorder(f, x, fun)
, which can take three arguments.
f
: the factor whose levels we want to modify.
x
: a numeric vector that we want to use to reorder the levels.
Optionally, fun
: a function that’s used if there are multiple values of x
for each value of f
. The default value is median.
relevel()
to set the first level (reference level).relevel(x, ref = ...)
takes at least the two arguments:
x
: factor variableref
: reference level or first levelTo count the length of string, use str_length()
.
To combine two or more strings, use str_c()
:
To control how strings are separated, add the sep
.
To collapse a vector of strings into a single string, add the collapse
.
We can extract parts of a string using str_sub()
:
str_sub()
takes start
and end
arguments which give the position of the substring:.
str_detect()
.str_count()
tells us how many matches there are in a string.str_replace()
and str_replace_all()
allow us to replace matches with new strings.str_replace_all()
can perform multiple replacements by supplying a named vector.str_split()
to split a string up into pieces.today()
or now()
.lubridate
functions are a combination of y
, m
, d
, h
, m
, and s
.make_date()
for dates, or make_datetime()
for date-times:lubridate
functions.as_datetime()
and as_date()
switch between a date-time and a date.