Lecture 18

Columns - select(), rename(), relocate(), and mutate()

Byeong-Hak Choe

bchoe@geneseo.edu

SUNY Geneseo

April 11, 2024

Data Transformation

Pipe (`|>`) Operator

Because the first argument is a data.frame and the output is a data.frame, dplyr verbs work well with the pipe, |>
- Ctrl + Shift + M for Windows; command + Shift + M for Mac.
The pipe (|>) takes the thing on its left and passes it along to the function on its right so that
- f(x, y) is equivalent to x |> f(y).
- e.g., filter(DATA_FRAME, LOGICAL_STATEMENT) is equivalent to DATA_FRAME |> filter(LOGICAL_STATEMENT).
The easiest way to pronounce the pipe (|>) is “then”.
- The pipe (|>) is super useful when we have a chain of data transforming operations to do.

Select variables with `select()`

It’s not uncommon to get datasets with hundreds or thousands of variables.
select() allows us to narrow in on the variables we’re actually interested in.

Basic
VAR_1:VAR_2
Removal
Select Helpers

We can select variables by their names.

flights |> 
  select(year, month, day)

With select(VAR_1:VAR_2), we can select all the variables between VARIABLE_1 and VARIABLE_2, inclusively:

flights |> 
  select(year:day)

With select(-VARIABLES), we can remove variables:

flights |> 
  select(-(year:day))

There are a number of helper functions we can use within select():
- starts_with("abc"): matches names that begin with “abc”.
- ends_with("xyz"): matches names that end with “xyz”.
- contains("ijk"): matches names that contain “ijk”.
- num_range("x", 1:3): matches x1, x2 and x3.

Rename variables with `rename()`

flights |> 
  rename( tail_num = tailnum )

rename() can be used to rename variables:
- DATA_FRAME |> rename(NEW_VARIABLE = EXISTING_VARIABLE)

Relocate variables with `relocate()`

flights |> 
  relocate(time_hour, air_time)

flights |> 
  relocate(year:dep_time, 
           .before = time_hour)

flights |> 
  relocate(starts_with("arr"), 
           .after = dep_time)

We can use relocate() to move variables around.
- We can specify .before and .after arguments to choose where to put variables.

Add new variables with `mutate()`

Arithmetic operations

mutate() is useful to add new variables that are functions of existing variables.
- New variables can be a result of arithmetic operations.
- Arithmetic operators: +, -, *, /, ^
- Modular arithmetic: %/% (integer division) and %% (remainder).

flights |> 
  select(dep_time) |> 
  mutate(
    hour = dep_time %/% 100,
    minute = dep_time %% 100
    )

Add new variables with `mutate()`

A new variable can be based on the new variable within the mutate() function.

flights |> 
  select(year:day, ends_with("delay"), air_time) |> 
  mutate(gain = dep_delay - arr_delay,
         hours = air_time / 60,
         gain_per_hour = gain / hours )

We can use the .before argument to add the variables to the position of a column:

flights |> 
  mutate(
    gain = dep_delay - arr_delay,
    speed = distance / air_time * 60,
    .before = 1  # try different position numbers.
  )

The . is a sign that .before is an argument to the function, not the name of variable.

In both .before and .after, we can use the variable name instead of a position number.

flights |> 
  mutate(
    gain = dep_delay - arr_delay,
    speed = distance / air_time * 60,
    .after = day  
  )

Add new variables with `mutate()`

Useful creation functions

Offsets: lead() and lag()
If-else conditions: ifelse()
Ranking functions: min_rank(), dense_rank(), percent_rank(), row_number(), and more
Other useful functions: log(), log10(), exp(), sqrt(), round(), factor(), as.character(), as.numeric(), as.integer(), and more

Add new variables with `mutate()`

1. `lead()` and `lag()`

Offsets: lead() and lag() allow us to refer to leading or lagging values.

Offsets
Change

df <- data.frame( x = 1:10 )

df <- df |> 
  mutate(x_lag = lag(x),
         x_lead = lead(x))

A change in GDP in year \(y\) and a percentage change in GDP in year \(y\) are calculated as follows:

\[ \begin{align} \Delta GDP_{y} = GDP_{y} - GDP_{y-1} \end{align} \]

\[ \begin{align} \%\Delta GDP_{y} = \frac{GDP_{y} - GDP_{y-1}}{GDP_{y}} \end{align} \]

df <- data.frame(
  Year = 2015:2022,
  GDP = c(100, 105, 109, 113, 
          118, 121, 119, 118)) 

df <- df |>            
  mutate(GDP_chg = GDP - lag(GDP),
         GDP_growth_pct = 
           100 * GDP_chg/GDP)

Add new variables with `mutate()`

2. `ifelse()`

flight_season <- flights |> 
  mutate(summer_month = ifelse(month %in% c(6, 7, 8), 
                               TRUE, 
                               FALSE))

To create new variables based on a condition, use ifelse()
- ifelse(CONDITION, <if TRUE>, <else>)

Add new variables with `mutate()`

3. Ranking functions

rank_me <- data.frame( x = c(10, 5, 1, 5, 5, NA) )

rank_me_asce <- rank_me |> 
  mutate(x_min_rank = min_rank(x),
         x_dense_rank = dense_rank(x),
         x_row_number = row_number(x),
         x_perc_rank = percent_rank(x) )
         
rank_me_desc <- rank_me |> 
  mutate(x_min_rank = min_rank(-x), # instead of -x, we can use desc(x) 
         x_dense_rank = dense_rank(-x),
         x_row_number = row_number(-x), 
         x_perc_rank = percent_rank(-x) )

To create new variables based on an order of values: min_rank(), dense_rank(), row_number(), percent_rank(), and more

Add new variables with `mutate()`

4. Other useful functions

df <- data.frame( x = c(1:10) ) |> 
  mutate(x_log = log(x),
         x_log10 = log10(x),
         x_exp = exp(x),
         x_sqrt = sqrt(x),
         x_sqrt_round = round(x_sqrt, 2),
         x_fct = factor(x),
         x_chr = as.character(x),
         x_num = as.numeric(x),
         x_int = as.integer(x) )

We can use math functions as well as as.DATATYPE functions:
- log(), log10(), exp(), sqrt(), round(VAR, digit), factor(), as.character(), as.numeric(), as.integer(), and more

Columns: `select()`, `rename()`, `relocate()`, and `mutate()`

Let’s do Classwork 10!

Lecture 18

Data Transformation

Pipe (|>) Operator

Select variables with select()

Select variables with select()

Rename variables with rename()

Rename variables with rename()

Relocate variables with relocate()

Relocate variables with relocate()

Add new variables with mutate()

Add new variables with mutate()

Arithmetic operations

Add new variables with mutate()

Add new variables with mutate()

Add new variables with mutate()

Useful creation functions

Add new variables with mutate()

1. lead() and lag()

Add new variables with mutate()

2. ifelse()

Add new variables with mutate()

3. Ranking functions

Add new variables with mutate()

4. Other useful functions

Columns: select(), rename(), relocate(), and mutate()

Pipe (`|>`) Operator

Select variables with `select()`

Select variables with `select()`

Rename variables with `rename()`

Rename variables with `rename()`

Relocate variables with `relocate()`

Relocate variables with `relocate()`

Add new variables with `mutate()`

Add new variables with `mutate()`

Add new variables with `mutate()`

Add new variables with `mutate()`

Add new variables with `mutate()`

Add new variables with `mutate()`

1. `lead()` and `lag()`

Add new variables with `mutate()`

2. `ifelse()`

Add new variables with `mutate()`

Add new variables with `mutate()`

Columns: `select()`, `rename()`, `relocate()`, and `mutate()`