Lecture 19

Add a new variable with mutate()

Byeong-Hak Choe

bchoe@geneseo.edu

SUNY Geneseo

April 16, 2024

Announcement

Homework Assignments

There will be two more homework assignments.
The single lowest homework score will be dropped when calculating the total homework score.
Each homework except for the homework with the lowest score accounts for 25% of the total homework score.

Announcement

Team Project

A team can be composed of two, three, four, or five members.
One representative for a team must send Byeong-Hak an email at bchoe@geneseo.edu with the following details by Tuesday, April 23, 11:59 P.M.:
- Subject Line: DANL-200-team
- Email Body:
  - Member 1: Last Name, First Name
  - Member 2: Last Name, First Name
  - Member 3: Last Name, First Name (optional)
  - Member 4: Last Name, First Name (optional)
  - Member 5: Last Name, First Name (optional)

Announcement

Team Project

The project is about presenting your findings through data analysis using skim, ggplot, and dplyr functions on your personal website.
Think of this project as a more comprehensive homework assignment.
Details of the project will be announced next week.
- Each team must choose one data.frame for the project.
- It is okay to use more than one data.frame if the data.frames are related.
- While I plan to provide a list of data.frames for the project, each team may choose its own data.frame, subject to Choe’s approval.
While this is team work, your website and its corresponding GitHub repository will be independently assessed for the project.

Announcement

Schedule

Homework 4 Due: April 30, 2024, Tuesday
Homework 5 Due: May 9, 2024, Thursday
Final Exam: May 14, 2024, Friday, 3:30 P.M.-5:30 P.M.
Project Due: May 16, 2024, Thursday, 11:59 P.M.

Data Transformation

Pipe (`|>`) Operator

Because the first argument is a data.frame and the output is a data.frame, dplyr verbs work well with the pipe, |>.
- Keyboard shortcut in RStudio: Ctrl + Shift + M for Windows; Command + Shift + M for Mac.
The pipe (|>) takes the thing on its left and passes it along to the function on its right so that
- f(x, y) is equivalent to x |> f(y).
- e.g., filter(DATA_FRAME, LOGICAL_STATEMENT) is equivalent to DATA_FRAME |> filter(LOGICAL_STATEMENT).
The easiest way to pronounce the pipe (|>) is “then”.
- The pipe (|>) is super useful when we have a chain of data transforming operations to do.
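The equivalence above can be checked on a small example. This is a minimal sketch; the data.frame `toy` and its values are made up for illustration and are not part of the lecture data.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- data.frame(id = 1:5, score = c(90, 72, 85, 60, 99))

# Nested call: f(x, y)
nested <- filter(toy, score > 80)

# Piped call: x |> f(y)
piped <- toy |> filter(score > 80)

# Both forms return the same data.frame
identical(nested, piped)

# A chain reads as "take toy, then filter, then arrange"
chained <- toy |>
  filter(score > 80) |>
  arrange(desc(score))
chained
```

The chained version avoids nesting one call inside another, which is why the pipe helps most when several steps follow each other.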

Data Transformation

`dplyr` basics

data.frame |> filter(LOGICAL_CONDITIONS)
data.frame |> arrange(VARIABLES)
data.frame |> distinct(VARIABLES)
data.frame |> select(VARIABLES)
data.frame |> rename(NEW_VARIABLE = EXISTING_VARIABLE)
data.frame |> mutate(NEW_VARIABLE = ... )
data.frame |> relocate(VARIABLES)
data.frame |> group_by(VARIABLES)
data.frame |> summarize(NEW_VARIABLE = ...)
In each verb, the first argument is a data.frame, and the subsequent arguments describe what to do with it, mostly using the variable names.
The result is a data.frame.
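Because every verb takes a data.frame and returns a data.frame, the verbs can be chained. The sketch below assumes a made-up data.frame `sales`; its column names and values are illustrative only.

```r
library(dplyr)

# Hypothetical sales data, used only for illustration
sales <- data.frame(
  store = c("A", "A", "B", "B", "C"),
  units = c(10, 20, 5, 15, 30),
  price = c(2, 2, 4, 4, 1)
)

store_summary <- sales |>
  rename(unit_price = price) |>               # rename a column
  mutate(revenue = units * unit_price) |>     # add a new column
  filter(revenue > 20) |>                     # keep rows meeting a condition
  group_by(store) |>                          # group rows by store
  summarize(total_revenue = sum(revenue)) |>  # one row per group
  arrange(desc(total_revenue))                # sort the result

store_summary
```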

Add new variables with `mutate()`

Arithmetic operations

mutate() is useful to add new variables that are functions of existing variables.
- New variables can be a result of arithmetic operations.
- Arithmetic operators: +, -, *, /, ^
- Modular arithmetic: %/% (integer division) and %% (remainder).

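library(tidyverse)
library(nycflights13)  # provides the flights data.frame

flights |> 
  select(dep_time) |> 
  mutate(
    hour = dep_time %/% 100,   # integer division: the hour part of HHMM
    minute = dep_time %% 100   # remainder: the minute part of HHMM
  )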
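To see what %/% and %% do, it may help to try them on a few hand-picked values. This is a small sketch with made-up departure times in HHMM format; the column `minutes_since_midnight` is an illustrative addition, not part of the lecture code.

```r
library(dplyr)

# Hypothetical departure times in HHMM format (e.g., 517 means 5:17)
times <- data.frame(dep_time = c(517, 1345, 2359, 5))

times <- times |>
  mutate(
    hour = dep_time %/% 100,    # integer division drops the last two digits
    minute = dep_time %% 100,   # remainder keeps the last two digits
    minutes_since_midnight = hour * 60 + minute
  )
times
```

For example, 1345 gives hour 13 and minute 45, which is 825 minutes after midnight.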

Add new variables with `mutate()`

A new variable can be based on another variable created earlier in the same mutate() call.

flights |> 
  select(year:day, ends_with("delay"), air_time) |> 
  mutate(gain = dep_delay - arr_delay,
         hours = air_time / 60,
         gain_per_hour = gain / hours )

By default, mutate() adds new variables on the right-hand side of the data.frame. We can use the .before argument to place them before a given column position instead:

flights |> 
  mutate(
    gain = dep_delay - arr_delay,
    speed = distance / air_time * 60,
    .before = 1  # try different position numbers.
  )

The leading . signals that .before is an argument to the function, not the name of a variable.

In both .before and .after, we can use a variable name instead of a position number.

flights |> 
  mutate(
    gain = dep_delay - arr_delay,
    speed = distance / air_time * 60,
    .after = day  
  )
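The effect of .before and .after is easiest to see on a data.frame with only a few columns. A minimal sketch, assuming a made-up data.frame `toy` with columns a, b, and c:

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- data.frame(a = 1:3, b = 4:6, c = 7:9)

at_end   <- toy |> mutate(total = a + b + c)               # default position
at_front <- toy |> mutate(total = a + b + c, .before = 1)  # by position number
after_a  <- toy |> mutate(total = a + b + c, .after = a)   # by variable name

names(at_end)    # "a" "b" "c" "total"
names(at_front)  # "total" "a" "b" "c"
names(after_a)   # "a" "total" "b" "c"
```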

Add new variables with `mutate()`

Useful creation functions

Offsets: lead() and lag()
If-else conditions: ifelse()
Ranking functions: min_rank(), dense_rank(), percent_rank(), row_number(), and more
Other useful functions: log(), log10(), exp(), sqrt(), round(), as.character(), as.numeric(), as.integer(), and more
Factor-related functions: factor(), fct_reorder(), and more
String-related functions: str_detect(), str_replace(), str_replace_all(), str_sub(), and more

Add new variables with `mutate()`

1. `lead()` and `lag()`

Offsets: lead() and lag() allow us to refer to leading or lagging values.

df <- data.frame( x = 1:10 )

df <- df |> 
  mutate(x_lag = lag(x),
         x_lead = lead(x))
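lag() and lead() also accept an offset `n` and a `default` value for positions that have no neighbor. The sketch below uses a made-up data.frame named `offsets` to show both options.

```r
library(dplyr)

offsets <- data.frame(x = 1:5) |>
  mutate(
    x_lag = lag(x),                      # previous value; NA in the first row
    x_lead = lead(x),                    # next value; NA in the last row
    x_lag2 = lag(x, n = 2),              # value two rows back
    x_lag_filled = lag(x, default = 0L)  # fill the missing first value with 0
  )
offsets
```

The first row of `x_lag` and the last row of `x_lead` are NA because there is no earlier or later row to refer to.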

The change in GDP in year \(y\) and the percentage change in GDP in year \(y\), both measured relative to the previous year, are calculated as follows:

\[ \Delta GDP_{y} = GDP_{y} - GDP_{y-1} \]

\[ \%\Delta GDP_{y} = 100 \times \frac{GDP_{y} - GDP_{y-1}}{GDP_{y-1}} \]

df <- data.frame(
  Year = 2015:2022,
  GDP = c(100, 105, 109, 113, 
          118, 121, 119, 118)) 

df <- df |>            
  mutate(GDP_chg = GDP - lag(GDP),
         # percentage change relative to the previous year's GDP
         GDP_growth_pct = 
           100 * GDP_chg / lag(GDP))

Add new variables with `mutate()`

2. `ifelse()`

flight_season <- flights |> 
  mutate(summer_month = ifelse(month %in% c(6, 7, 8), 
                               TRUE, 
                               FALSE))

To create new variables based on a condition, use ifelse():
- ifelse(CONDITION, <value if TRUE>, <value if FALSE>)
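ifelse() calls can also be nested when there are more than two outcomes. A minimal sketch, assuming a made-up data.frame `months_df` and an illustrative four-season grouping of months:

```r
library(dplyr)

# Hypothetical month numbers, used only for illustration
months_df <- data.frame(month = c(1, 4, 7, 10, 12))

months_df <- months_df |>
  mutate(
    # Two outcomes: one ifelse() is enough
    summer_label = ifelse(month %in% c(6, 7, 8), "summer", "not summer"),
    # Four outcomes: each <else> slot holds the next ifelse()
    season = ifelse(month %in% c(12, 1, 2), "winter",
             ifelse(month %in% 3:5, "spring",
             ifelse(month %in% 6:8, "summer", "fall")))
  )
months_df
```

Each condition is checked in order, so a row takes the value of the first condition it satisfies.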

Add new variables with `mutate()`

3. Ranking functions

rank_me <- data.frame( x = c(10, 5, 1, 5, 5, NA) )

rank_me_asce <- rank_me |> 
  mutate(x_min_rank = min_rank(x),
         x_dense_rank = dense_rank(x),
         x_row_number = row_number(x),
         x_perc_rank = percent_rank(x) )
         
rank_me_desc <- rank_me |> 
  mutate(x_min_rank = min_rank(-x), # instead of -x, we can use desc(x) 
         x_dense_rank = dense_rank(-x),
         x_row_number = row_number(-x), 
         x_perc_rank = percent_rank(-x) )

To create new variables based on the order of values, use min_rank(), dense_rank(), row_number(), percent_rank(), and more.
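The ranking functions differ mainly in how they treat ties. The sketch below applies them directly to the same values used in rank_me, so the results can be compared side by side; the object names (`r_min`, `r_dense`, and so on) are illustrative.

```r
library(dplyr)

x <- c(10, 5, 1, 5, 5, NA)

r_min     <- min_rank(x)        # 5 2 1 2 2 NA : ties share the smallest rank, then a gap
r_dense   <- dense_rank(x)      # 3 2 1 2 2 NA : ties share a rank, no gaps
r_row     <- row_number(x)      # 5 2 1 3 4 NA : ties broken by order of appearance
r_percent <- percent_rank(x)    # 1.00 0.25 0.00 0.25 0.25 NA : min_rank rescaled to [0, 1]
r_desc    <- min_rank(desc(x))  # 1 2 5 2 2 NA : largest value gets rank 1
```

In every case the missing value keeps a missing rank.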

Add new variables with `mutate()`

4. Other useful functions

df <- data.frame( x = c(1:10) ) |> 
  mutate(x_log = log(x),
         x_log10 = log10(x),
         x_exp = exp(x),
         x_sqrt = sqrt(x),
         x_sqrt_round = round(x_sqrt, 2),
         x_fct = factor(x),
         x_chr = as.character(x),
         x_num = as.numeric(x),
         x_int = as.integer(x) )

We can use math functions as well as as.DATATYPE functions:
- log(), log10(), exp(), sqrt(), round(VAR, digit), factor(), as.character(), as.numeric(), as.integer(), and more
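Type conversion matters most when a numeric variable arrives as text. A small sketch, assuming a made-up data.frame `conv` whose column `x_chr` stores numbers as character strings:

```r
library(dplyr)

# Hypothetical data: numbers stored as text
conv <- data.frame(x_chr = c("1", "2.5", "10")) |>
  mutate(
    x_num = as.numeric(x_chr),        # text to numbers: 1, 2.5, 10
    x_int = as.integer(x_num),        # fractional part is dropped: 1, 2, 10
    x_sqrt = round(sqrt(x_num), 2),   # square root, rounded to 2 digits
    x_back = as.character(x_num)      # numbers back to text
  )
conv
```

Note that as.integer() truncates rather than rounds, so 2.5 becomes 2.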

Columns: `select()`, `rename()`, `relocate()`, and `mutate()`

Let’s do Question 1 and Question 2 up to Q2e in Classwork 10!

Add new variables with `mutate()`

5. Factors

df <- data.frame(
  city = c("Rochester", "Buffalo", 
           "Geneseo", "Syracuse"),
  income = c(80,  82,  70, 75) )

df <- df |> 
  mutate( city_fct = factor(city) )

df$city
df$city_fct

In R, factors are categorical variables, variables that have a fixed and known set of possible values.
We can use a factor variable to sort categories in a useful way.

NY_cities <- c("Geneseo", "Rochester", 
               "Buffalo", "Syracuse")
df <- df |> 
  mutate(city_fct_new = 
           factor(city, 
                  levels = NY_cities)
  )
  
levels(df$city_fct)
levels(df$city_fct_new)

If we ever need to set the order of levels directly, we can do so with the levels argument of factor().

df <- df |> 
        mutate(city_reorder = fct_reorder(city_fct, income) )

We can reorder the levels using fct_reorder(.f, .x, .fun), which can take three arguments.
- .f: the factor whose levels we want to modify.
- .x: a numeric vector that we want to use to reorder the levels.
- Optionally, .fun: a function that’s used if there are multiple values of .x for each value of .f.
- The default value for .fun is median.
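The summary function only matters when a level has more than one value. The sketch below assumes a made-up data.frame `scores` in which group "a" has one large value, so ordering by median and ordering by mean disagree. In current versions of forcats the arguments are named .f, .x, and .fun.

```r
library(dplyr)
library(forcats)

# Hypothetical data with several values per group
scores <- data.frame(
  group = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
  value = c(1, 1, 10, 2, 2, 2, 3, 3, 3)
)

scores <- scores |>
  mutate(
    group_fct = factor(group),
    by_median = fct_reorder(group_fct, value),            # default: median
    by_mean = fct_reorder(group_fct, value, .fun = mean)  # order by group means
  )

levels(scores$by_median)  # "a" "b" "c" (medians 1, 2, 3)
levels(scores$by_mean)    # "b" "c" "a" (means 2, 3, 4)
```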

Add new variables with `mutate()`

`ggplot` with a factor variable - Sorted bar chart/dot plot

It’s often useful to change the order of the factor levels in a visualization.
Imagine we want to explore the average income across cities.
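One possible way to draw such a sorted bar chart is sketched below. It reuses the city and income values from the factor example; the object names `city_income`, `city_sorted`, and `p_sorted`, and the choice of geom_col(), are illustrative assumptions rather than the lecture's own plot.

```r
library(dplyr)
library(forcats)
library(ggplot2)

city_income <- data.frame(
  city = c("Rochester", "Buffalo", "Geneseo", "Syracuse"),
  income = c(80, 82, 70, 75)
)

# Reorder the city levels by income so the bars appear in sorted order
city_sorted <- city_income |>
  mutate(city = fct_reorder(city, income))

p_sorted <- ggplot(city_sorted, aes(x = income, y = city)) +
  geom_col()
p_sorted

levels(city_sorted$city)  # "Geneseo" "Syracuse" "Rochester" "Buffalo"
```

Without fct_reorder(), the cities would be placed in alphabetical order, which makes the income ranking harder to read.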

Add new variables with `mutate()`

6. Strings

df_str <- data.frame(
  fruit = c("apple", "banana", "pear")
)

df_str <- df_str |> 
  mutate( fruit_e = str_detect(fruit, "e")
  )

str_detect() returns TRUE if a character value matches a pattern, and FALSE otherwise.

df_str <- df_str |> 
  mutate(fruit_replace = str_replace(fruit, "a", "-"),
         fruit_replace_all = str_replace_all(fruit, "a", "-")
         )

str_replace() and str_replace_all() allow us to replace matches with new strings.
str_replace_all() can perform multiple replacements by supplying a named vector.
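The difference between the two functions, and the named-vector form, can be seen on the same fruit values. In this sketch the replacement symbols "-" and "+" and the object names are arbitrary choices for illustration.

```r
library(stringr)

fruit <- c("apple", "banana", "pear")

first_only <- str_replace(fruit, "a", "-")      # "-pple" "b-nana" "pe-r"
every_one  <- str_replace_all(fruit, "a", "-")  # "-pple" "b-n-n-" "pe-r"

# Named vector: each name is a pattern, each value is its replacement
multi <- str_replace_all(fruit, c("a" = "-", "e" = "+"))
multi                                           # "-ppl+" "b-n-n-" "p+-r"
```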

df_str <- df_str |> 
  mutate(fruit_sub1 = str_sub(fruit, 1, 3),
         fruit_sub2 = str_sub(fruit, -3, -1)
  )

We can extract parts of a string using str_sub().
str_sub() takes start and end arguments which give the position of the substring.
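Positive positions count from the start of the string and negative positions count from the end. A short sketch on the same fruit values, with illustrative object names:

```r
library(stringr)

fruit <- c("apple", "banana", "pear")

first3 <- str_sub(fruit, 1, 3)    # "app" "ban" "pea"
last3  <- str_sub(fruit, -3, -1)  # "ple" "ana" "ear"
from2  <- str_sub(fruit, 2)       # end defaults to the last character:
                                  # "pple" "anana" "ear"
```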

Columns: `select()`, `rename()`, `relocate()`, and `mutate()`

Let’s do Q2f and Q2g in Classwork 10!
