Lecture 19

Add a new variable with mutate()

Byeong-Hak Choe

SUNY Geneseo

April 16, 2024

Announcement

Homework Assignments

  • There will be two more homework assignments.

  • The single lowest homework score will be dropped when calculating the total homework score.

  • Each of the remaining homework assignments accounts for 25% of the total homework score.

Team Project

  • A team can be composed of two, three, four, or five members.

  • One representative from each team must send Byeong-Hak an email at bchoe@geneseo with the following details by Tuesday, April 23, 11:59 P.M.:

    • Subject Line: DANL-200-team
    • Email Body:
      • Member 1: Last Name, First Name
      • Member 2: Last Name, First Name
      • Member 3: Last Name, First Name (optional)
      • Member 4: Last Name, First Name (optional)
      • Member 5: Last Name, First Name (optional)

  • For the project, you will analyze data using skim, ggplot, and dplyr functions and present your findings on your personal website.

  • Think of this project as a more comprehensive homework assignment.

  • Details of the project will be announced next week.

    • Each team must choose one data.frame for the project.
    • It is okay to use more than one data.frame if the data.frames are related.
    • While I plan to provide a list of data.frames for the project, each team may choose its own data.frame for the project, subject to Choe’s approval.
  • While this is team work, your website and its corresponding GitHub repository will be independently assessed for the project.

Schedule

  • Homework 4 Due: April 30, 2024, Tuesday
  • Homework 5 Due: May 9, 2024, Thursday
  • Final Exam: May 14, 2024, Friday, 3:30 P.M.-5:30 P.M.
  • Project Due: May 16, 2024, Thursday, 11:59 P.M.

Data Transformation

Pipe (|>) Operator

  • Because each dplyr verb takes a data.frame as its first argument and returns a data.frame, dplyr verbs work well with the pipe, |>.
    • Ctrl + Shift + M for Windows; command + Shift + M for Mac.
  • The pipe (|>) takes the thing on its left and passes it along to the function on its right so that
    • f(x, y) is equivalent to x |> f(y).
    • e.g., filter(DATA_FRAME, LOGICAL_STATEMENT) is equivalent to DATA_FRAME |> filter(LOGICAL_STATEMENT).
  • The easiest way to pronounce the pipe (|>) is “then”.
    • The pipe (|>) is super useful when we have a chain of data transforming operations to do.
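
The equivalence between f(x, y) and x |> f(y) can be checked directly. The sketch below assumes R 4.1 or later (needed for the native pipe) and the dplyr package; the `scores` data.frame is a made-up toy example, not course data.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
scores <- data.frame(name = c("A", "B", "C"),
                     score = c(90, 72, 85))

# Function-call form: f(x, y)
res_call <- filter(scores, score >= 80)

# Pipe form: x |> f(y), read as "take scores, then filter"
res_pipe <- scores |> filter(score >= 80)

# Both forms give the same data.frame
identical(res_call, res_pipe)
```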

dplyr basics

  • data.frame |> filter(LOGICAL_CONDITIONS)

  • data.frame |> arrange(VARIABLES)

  • data.frame |> distinct(VARIABLES)

  • data.frame |> select(VARIABLES)

  • data.frame |> rename(NEW_VARIABLE = EXISTING_VARIABLE)

  • data.frame |> mutate(NEW_VARIABLE = ... )

  • data.frame |> relocate(VARIABLES)

  • data.frame |> group_by(VARIABLES)

  • data.frame |> summarize(NEW_VARIABLE = ...)

  • In each verb, the first argument is the data.frame (supplied here by the pipe); the subsequent arguments describe what to do with the data.frame, mostly using the variable names.

  • The result is a data.frame.
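
Because every verb returns a data.frame, the verbs above can be chained. The following is a minimal sketch with a hypothetical `sales` data.frame (the column names and values are invented for illustration):

```r
library(dplyr)

# Hypothetical toy data
sales <- data.frame(
  region = c("East", "West", "East", "West"),
  units  = c(10, 4, 7, 12),
  price  = c(2.5, 3.0, 2.5, 3.0)
)

result <- sales |>
  filter(units > 5) |>                # keep rows with more than 5 units
  mutate(revenue = units * price) |>  # add a new variable
  select(region, revenue) |>          # keep two columns
  arrange(desc(revenue))              # sort from largest to smallest

result
```

Each step reads as "then": take sales, then filter, then mutate, then select, then arrange.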

Add new variables with mutate()

Arithmetic operations

  • mutate() is useful to add new variables that are functions of existing variables.
    • New variables can be a result of arithmetic operations.
    • Arithmetic operators: +, -, *, /, ^
    • Modular arithmetic: %/% (integer division) and %% (remainder).
flights |> 
  select(dep_time) |> 
  mutate(
    hour = dep_time %/% 100,
    minute = dep_time %% 100
    )
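
To see why %/% and %% split a clock time such as 517 (5:17 AM) into hours and minutes, here is a small base R sketch on a hand-made vector. The values are illustrative and are not taken from flights.

```r
# Illustrative departure times in HHMM format
dep_time <- c(517, 1545, 5, 2359)

hour   <- dep_time %/% 100  # integer division: 5 15 0 23
minute <- dep_time %% 100   # remainder:        17 45 5 59

# Minutes since midnight, rebuilt from the two pieces
minutes_since_midnight <- hour * 60 + minute
minutes_since_midnight      # 317 945 5 1439
```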

  • A new variable can refer to another variable created earlier in the same mutate() call.
flights |> 
  select(year:day, ends_with("delay"), air_time) |> 
  mutate(gain = dep_delay - arr_delay,
         hours = air_time / 60,
         gain_per_hour = gain / hours ) 
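
The same idea on a small, hypothetical `trips` data.frame: `speed_kmh` uses `hours`, which is created one line earlier in the same mutate() call. The order matters, since mutate() evaluates its arguments from top to bottom.

```r
library(dplyr)

# Hypothetical toy data
trips <- data.frame(distance_km = c(120, 300),
                    minutes     = c(90, 200))

trips <- trips |>
  mutate(hours     = minutes / 60,          # created first
         speed_kmh = distance_km / hours)   # uses the new variable `hours`

trips
```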

  • By default, mutate() adds new variables as the last columns. We can use the .before argument to place them before a given column position instead:
flights |> 
  mutate(
    gain = dep_delay - arr_delay,
    speed = distance / air_time * 60,
    .before = 1  # try different position numbers.
  )
  • The leading . signals that .before is an argument to the function, not the name of a variable.
  • In both .before and .after, we can use a variable name instead of a position number.
flights |> 
  mutate(
    gain = dep_delay - arr_delay,
    speed = distance / air_time * 60,
    .after = day  
  )
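
A quick way to see where .before and .after put a new column is to compare the column names. This sketch uses a hypothetical three-column data.frame, `df_pos`:

```r
library(dplyr)

# Hypothetical toy data
df_pos <- data.frame(a = 1:3, b = 4:6, c = 7:9)

default_pos <- df_pos |> mutate(total = a + b + c)
first_pos   <- df_pos |> mutate(total = a + b + c, .before = 1)
after_a     <- df_pos |> mutate(total = a + b + c, .after = a)

names(default_pos)  # "a" "b" "c" "total"
names(first_pos)    # "total" "a" "b" "c"
names(after_a)      # "a" "total" "b" "c"
```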

Useful creation functions

  1. Offsets: lead() and lag()

  2. If-else conditions: ifelse()

  3. Ranking functions: min_rank(), dense_rank(), percent_rank(), row_number(), and more

  4. Other useful functions: log(), log10(), exp(), sqrt(), round(), as.character(), as.numeric(), as.integer(), and more

  5. Factor-related functions: factor(), fct_reorder(), and more

  6. String-related functions: str_detect(), str_replace(), str_replace_all(), str_sub(), and more

1. lead() and lag()

  • Offsets: lead() and lag() allow us to refer to leading or lagging values.
df <- data.frame( x = 1:10 )

df <- df |> 
  mutate(x_lag = lag(x),
         x_lead = lead(x))
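
To make the shifting explicit, the sketch below applies dplyr's lag() and lead() to a short vector. The `n` and `default` arguments shown here are optional arguments of both functions; the calls are written as dplyr::lag() because base R also has a different function named lag().

```r
library(dplyr)

x <- 1:5

x_lag1  <- dplyr::lag(x)                 # NA  1  2  3  4
x_lead1 <- dplyr::lead(x)                #  2  3  4  5 NA
x_lag2  <- dplyr::lag(x, n = 2)          # NA NA  1  2  3
x_lag0  <- dplyr::lag(x, default = 0L)   #  0  1  2  3  4

x_lag1
x_lead1
```

The first value of a lag and the last value of a lead are NA unless `default` is supplied, because there is no earlier or later value to refer to.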
  • The change in GDP in year \(y\) and the percentage change in GDP in year \(y\), both relative to the previous year, are calculated as follows:

\[ \Delta GDP_{y} = GDP_{y} - GDP_{y-1} \]

\[ \%\Delta GDP_{y} = 100 \times \frac{GDP_{y} - GDP_{y-1}}{GDP_{y-1}} \]

df <- data.frame(
  Year = 2015:2022,
  GDP = c(100, 105, 109, 113, 
          118, 121, 119, 118)) 

df <- df |>            
  mutate(GDP_chg = GDP - lag(GDP),
         # percentage change relative to the previous year's GDP
         GDP_growth_pct = 
           100 * GDP_chg / lag(GDP))

2. ifelse()

flight_season <- flights |> 
  mutate(summer_month = ifelse(month %in% c(6, 7, 8), 
                               TRUE, 
                               FALSE))
  • To create new variables based on a condition, use ifelse()
    • ifelse(CONDITION, <if TRUE>, <else>)
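
ifelse() is most useful when the two outcomes are something other than TRUE and FALSE, such as labels. A minimal sketch with a hypothetical `months_df` data.frame:

```r
library(dplyr)

# Hypothetical toy data
months_df <- data.frame(month = c(1, 6, 8, 12))

months_df <- months_df |>
  mutate(
    # label each row depending on the condition
    season = ifelse(month %in% c(6, 7, 8), "summer", "not summer"),
    # the condition itself is already a TRUE/FALSE variable
    summer_month = month %in% c(6, 7, 8)
  )

months_df
```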

3. Ranking functions

rank_me <- data.frame( x = c(10, 5, 1, 5, 5, NA) )

rank_me_asce <- rank_me |> 
  mutate(x_min_rank = min_rank(x),
         x_dense_rank = dense_rank(x),
         x_row_number = row_number(x),
         x_perc_rank = percent_rank(x) )
         
rank_me_desc <- rank_me |> 
  mutate(x_min_rank = min_rank(-x), # instead of -x, we can use desc(x) 
         x_dense_rank = dense_rank(-x),
         x_row_number = row_number(-x), 
         x_perc_rank = percent_rank(-x) )
  • To create new variables based on an order of values: min_rank(), dense_rank(), row_number(), percent_rank(), and more
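
The ranking functions differ mainly in how they treat ties. The sketch below applies them to the same values as rank_me, as a plain vector, with the results worked out by hand in the comments.

```r
library(dplyr)

x <- c(10, 5, 1, 5, 5, NA)

r_min   <- min_rank(x)      # 5 2 1 2 2 NA   ties share the smallest rank, then a gap
r_dense <- dense_rank(x)    # 3 2 1 2 2 NA   ties share a rank, no gaps
r_row   <- row_number(x)    # 5 2 1 3 4 NA   ties are broken by position
r_perc  <- percent_rank(x)  # 1 0.25 0 0.25 0.25 NA   min_rank rescaled to [0, 1]

data.frame(x, r_min, r_dense, r_row, r_perc)
```

In all four functions, missing values stay NA and the smallest value receives the lowest rank.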

4. Other useful functions

df <- data.frame( x = c(1:10) ) |> 
  mutate(x_log = log(x),
         x_log10 = log10(x),
         x_exp = exp(x),
         x_sqrt = sqrt(x),
         x_sqrt_round = round(x_sqrt, 2),
         x_fct = factor(x),
         x_chr = as.character(x),
         x_num = as.numeric(x),
         x_int = as.integer(x) )
  • We can use math functions as well as as.DATATYPE functions:
    • log(), log10(), exp(), sqrt(), round(VAR, digit), factor(), as.character(), as.numeric(), as.integer(), and more
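
A few conversion behaviors are worth checking on small inputs. This base R sketch uses made-up values; note in particular that as.numeric() on a factor returns the internal level codes, not the original numbers.

```r
x_chr <- c("10", "20", "30")

num_from_chr <- as.numeric(x_chr)         # 10 20 30

f <- factor(x_chr)
codes    <- as.numeric(f)                 # 1 2 3 (level codes, not 10 20 30)
num_back <- as.numeric(as.character(f))   # 10 20 30

int_val   <- as.integer(3.9)              # 3 (truncates toward zero, does not round)
round_val <- round(3.14159, 2)            # 3.14
```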

Columns: select(), rename(), relocate(), and mutate()

Let’s do Question 1 and Question 2 up to Q2e in Classwork 10!

5. Factors

df <- data.frame(
  city = c("Rochester", "Buffalo",
           "Geneseo", "Syracuse"),
  income = c(80,  82,  70, 75) )

df <- df |> 
  mutate( city_fct = factor(city) )

df$city
df$city_fct
  • In R, factors are categorical variables, variables that have a fixed and known set of possible values.
  • We can use a factor variable to sort categories in a useful way.
NY_cities <- c("Geneseo", "Rochester",
               "Buffalo", "Syracuse")
df <- df |> 
  mutate(city_fct_new = 
           factor(city, 
                  levels = NY_cities)
  )
  
levels(df$city_fct)
levels(df$city_fct_new)
  • By default, factor() orders the levels alphabetically. If we ever need to set the order of levels directly, we can do so with the levels argument of factor().
df <- df |> 
        mutate(city_reorder = fct_reorder(city_fct, income) )
  • We can reorder the levels using fct_reorder(.f, .x, .fun), which can take three arguments.
    • .f: the factor whose levels we want to modify.
    • .x: a numeric vector that we want to use to reorder the levels.
    • Optionally, .fun: a function that’s used if there are multiple values of .x for each value of .f.
    • The default value for .fun is median.
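
The role of the summary function is easiest to see when each level has more than one value. The sketch below uses the forcats package and hypothetical income values with two observations per city. The `.desc` argument, which reverses the order, is an additional fct_reorder() argument not discussed above.

```r
library(forcats)

# Hypothetical data: two income observations per city
city   <- factor(c("Rochester", "Buffalo", "Geneseo",
                   "Rochester", "Buffalo", "Geneseo"))
income <- c(80, 82, 70, 60, 90, 72)

levels(city)   # alphabetical: "Buffalo" "Geneseo" "Rochester"

# Medians: Rochester 70, Geneseo 71, Buffalo 86
by_median <- fct_reorder(city, income)
levels(by_median)   # "Rochester" "Geneseo" "Buffalo"

# Maximums: Geneseo 72, Rochester 80, Buffalo 90
by_max <- fct_reorder(city, income, .fun = max)
levels(by_max)      # "Geneseo" "Rochester" "Buffalo"

# Largest median first
by_median_desc <- fct_reorder(city, income, .desc = TRUE)
levels(by_median_desc)  # "Buffalo" "Geneseo" "Rochester"
```

Only the order of the levels changes; the values stored in the factor stay the same.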

ggplot with a factor variable - Sorted bar chart/dot plot

  • It’s often useful to change the order of the factor levels in a visualization.
  • Imagine we want to explore the average income across cities.
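
One possible way to draw such a sorted bar chart is sketched below, assuming the ggplot2 and forcats packages. The `city_income` data.frame reuses the city and income values from the factor example; the axis labels are illustrative choices.

```r
library(ggplot2)
library(forcats)

city_income <- data.frame(
  city   = c("Rochester", "Buffalo", "Geneseo", "Syracuse"),
  income = c(80, 82, 70, 75)
)

# Reorder the city levels by income so the bars appear sorted
p <- ggplot(city_income,
            aes(x = income, y = fct_reorder(city, income))) +
  geom_col() +
  labs(x = "Average income", y = "City")

p
```

Without fct_reorder(), the cities would be drawn in alphabetical order, which makes the ranking harder to read.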

6. Strings

df_str <- data.frame(
  fruit = c("apple", "banana", "pear")
)

df_str <- df_str |> 
  mutate( fruit_e = str_detect(fruit, "e")
  )
  • str_detect() returns TRUE if a character value matches a pattern and FALSE otherwise.
df_str <- df_str |> 
  mutate(fruit_replace = str_replace(fruit, "a", "-"),
         fruit_replace_all = str_replace_all(fruit, "a", "-")
         )
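  • str_replace() and str_replace_all() allow us to replace matches with new strings: str_replace() replaces only the first match in each string, while str_replace_all() replaces every match.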
  • str_replace_all() can perform multiple replacements by supplying a named vector.
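
A named vector for str_replace_all() has the form c(pattern = replacement). The sketch below, assuming the stringr package, replaces every "a" with "-" and every "e" with "+" in one call; the replacement symbols are arbitrary choices for illustration.

```r
library(stringr)

fruit <- c("apple", "banana", "pear")

# names are the patterns, values are the replacements
fruit_multi <- str_replace_all(fruit, c("a" = "-", "e" = "+"))

fruit_multi   # "-ppl+" "b-n-n-" "p+-r"
```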
df_str <- df_str |> 
  mutate(fruit_sub1 = str_sub(fruit, 1, 3),
         fruit_sub2 = str_sub(fruit, -3, -1)
  )
  • We can extract parts of a string using str_sub().

  • str_sub() takes start and end arguments, which give the (inclusive) positions of the substring. Negative values count backward from the end of the string.
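
A short sketch of how the start and end positions behave, assuming the stringr package and a few sample strings:

```r
library(stringr)

first3 <- str_sub("banana", 1, 3)     # "ban"  characters 1 through 3
last3  <- str_sub("banana", -3, -1)   # "ana"  last three characters
from2  <- str_sub("apple", 2)         # "pple" end defaults to the last character
short  <- str_sub("pear", 1, 10)      # "pear" positions past the end are ignored
```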

Columns: select(), rename(), relocate(), and mutate()

Let’s do Q2f and Q2g in Classwork 10!