Lecture 18

Columns - select(), rename(), relocate(), and mutate()

Byeong-Hak Choe

SUNY Geneseo

April 11, 2024

Data Transformation

Pipe (|>) Operator

  • Because the first argument is a data.frame and the output is a data.frame, dplyr verbs work well with the pipe, |>
    • Ctrl + Shift + M for Windows; command + Shift + M for Mac.
  • The pipe (|>) takes the thing on its left and passes it along to the function on its right so that
    • f(x, y) is equivalent to x |> f(y).
    • e.g., filter(DATA_FRAME, LOGICAL_STATEMENT) is equivalent to DATA_FRAME |> filter(LOGICAL_STATEMENT).
  • The easiest way to pronounce the pipe (|>) is “then”.
    • The pipe (|>) is super useful when we have a chain of data transforming operations to do.

Select variables with select()

Select variables with select()

  • It’s not uncommon to get datasets with hundreds or thousands of variables.
  • select() allows us to narrow in on the variables we’re actually interested in.
  • We can select variables by their names.
flights |> 
  select(year, month, day)
  • With select(VAR_1:VAR_2), we can select all the variables between VARIABLE_1 and VARIABLE_2, inclusively:
flights |> 
  select(year:day)
  • With select(-VARIABLES), we can remove variables:
flights |> 
  select(-(year:day))
  • There are a number of helper functions we can use within select():

    • starts_with("abc"): matches names that begin with “abc”.

    • ends_with("xyz"): matches names that end with “xyz”.

    • contains("ijk"): matches names that contain “ijk”.

    • num_range("x", 1:3): matches x1, x2 and x3.

Rename variables with rename()

Rename variables with rename()

flights |> 
  rename( tail_num = tailnum )
  • rename() can be used to rename variables:

    • DATA_FRAME |> rename(NEW_VARIABLE = EXISTING_VARIABLE)

Relocate variables with relocate()

Relocate variables with relocate()

flights |> 
  relocate(time_hour, air_time)
  
flights |> 
  relocate(year:dep_time, 
           .before = time_hour)
flights |> 
  relocate(starts_with("arr"), 
           .after = dep_time)
  • We can use relocate() to move variables around.
    • We can specify .before and .after arguments to choose where to put variables.

Add new variables with mutate()

Add new variables with mutate()

Arithmetic operations

  • mutate() is useful to add new variables that are functions of existing variables.
    • New variables can be a result of arithmetic operations.
    • Arithmetic operators: +, -, *, /, ^
    • Modular arithmetic: %/% (integer division) and %% (remainder).
flights |> 
  select(dep_time) |> 
  mutate(
    hour = dep_time %/% 100,
    minute = dep_time %% 100
    )

Add new variables with mutate()

  • A new variable can be based on the new variable within the mutate() function.
flights |> 
  select(year:day, ends_with("delay"), air_time) |> 
  mutate(gain = dep_delay - arr_delay,
         hours = air_time / 60,
         gain_per_hour = gain / hours ) 

Add new variables with mutate()

  • We can use the .before argument to add the variables to the position of a column:
flights |> 
  mutate(
    gain = dep_delay - arr_delay,
    speed = distance / air_time * 60,
    .before = 1  # try different position numbers.
  )
  • The . is a sign that .before is an argument to the function, not the name of variable.
  • In both .before and .after, we can use the variable name instead of a position number.
flights |> 
  mutate(
    gain = dep_delay - arr_delay,
    speed = distance / air_time * 60,
    .after = day  
  )

Add new variables with mutate()

Useful creation functions

  1. Offsets: lead() and lag()

  2. If-else conditions: ifelse()

  3. Ranking functions: min_rank(), dense_rank(), percent_rank(), row_number(), and more

  4. Other useful functions: log(), log10(), exp(), sqrt(), round(), factor(), as.character(), as.numeric(), as.integer(), and more

Add new variables with mutate()

1. lead() and lag()

  • Offsets: lead() and lag() allow us to refer to leading or lagging values.
df <- data.frame( x = 1:10 )

df <- df |> 
  mutate(x_lag = lag(x),
         x_lead = lead(x))
  • A change in GDP in year \(y\) and a percentage change in GDP in year \(y\) are calculated as follows:

\[ \begin{align} \Delta GDP_{y} = GDP_{y} - GDP_{y-1} \end{align} \]

\[ \begin{align} \%\Delta GDP_{y} = \frac{GDP_{y} - GDP_{y-1}}{GDP_{y}} \end{align} \]

df <- data.frame(
  Year = 2015:2022,
  GDP = c(100, 105, 109, 113, 
          118, 121, 119, 118)) 

df <- df |>            
  mutate(GDP_chg = GDP - lag(GDP),
         GDP_growth_pct = 
           100 * GDP_chg/GDP)

Add new variables with mutate()

2. ifelse()

flight_season <- flights |> 
  mutate(summer_month = ifelse(month %in% c(6, 7, 8), 
                               TRUE, 
                               FALSE))
  • To create new variables based on a condition, use ifelse()
    • ifelse(CONDITION, <if TRUE>, <else>)

Add new variables with mutate()

3. Ranking functions

rank_me <- data.frame( x = c(10, 5, 1, 5, 5, NA) )

rank_me_asce <- rank_me |> 
  mutate(x_min_rank = min_rank(x),
         x_dense_rank = dense_rank(x),
         x_row_number = row_number(x),
         x_perc_rank = percent_rank(x) )
         
rank_me_desc <- rank_me |> 
  mutate(x_min_rank = min_rank(-x), # instead of -x, we can use desc(x) 
         x_dense_rank = dense_rank(-x),
         x_row_number = row_number(-x), 
         x_perc_rank = percent_rank(-x) )
  • To create new variables based on an order of values: min_rank(), dense_rank(), row_number(), percent_rank(), and more

Add new variables with mutate()

4. Other useful functions

df <- data.frame( x = c(1:10) ) |> 
  mutate(x_log = log(x),
         x_log10 = log10(x),
         x_exp = exp(x),
         x_sqrt = sqrt(x),
         x_sqrt_round = round(x_sqrt, 2),
         x_fct = factor(x),
         x_chr = as.character(x),
         x_num = as.numeric(x),
         x_int = as.integer(x) )
  • We can use math functions as well as as.DATATYPE functions:
    • log(), log10(), exp(), sqrt(), round(VAR, digit), factor(), as.character(), as.numeric(), as.integer(), and more

Columns: select(), rename(), relocate(), and mutate()

Let’s do Classwork 10!