Lecture 14

Data Preparation and Management with R

Byeong-Hak Choe

SUNY Geneseo

September 30, 2024

Joining Relational data.frames with left_join()

Pipe (|>) Operator

  • tidyverse functions work well with the pipe, |>, because the first argument of a tidyverse function is a data.frame and the output is a data.frame.
    • Ctrl + Shift + M for Windows; command + Shift + M for Mac.
  • The pipe (|>) takes the thing on its left and passes it along to the function on its right so that
    • f(x, y) is equivalent to x |> f(y).
    • e.g., left_join(DATA.FRAME_1, DATA.FRAME_2) is equivalent to DATA.FRAME_1 |> left_join(DATA.FRAME_2).
  • The easiest way to pronounce the pipe (|>) is “then”.
    • The pipe (|>) is super useful when we have a chain of data transforming operations to do.

Pipe (|>) Operator

  • To use the (native) pipe operator (|>), we should set the option as follows:

    • Tools > Global Options > Code from the side menu > Choose “Use native pipe operator, |>”.

Joining Relational data.frames with left_join()

x <- data.frame(
    key = c(1, 2, 3),
    val_x = c('x1', 'x2', 'x3')
)
y <- data.frame(
    key = c(1, 2, 4),
    val_y = c('y1', 'y2', 'y3')
)
  • The colored column represents the “key” variable.
  • The grey column represents the “value” column.

Relational Data

x |>
  left_join(y)
  • A left join keeps all observations in x.
    • The left join is most commonly used.
  • Note: NA stands for “not available” (i.e., a missing value).

Data Transformation with R tidyverse

Data Transformation

  • DATA.FRAME |> filter(LOGICAL_CONDITIONS)

  • DATA.FRAME |> arrange(VARIABLES)

  • DATA.FRAME |> select(VARIABLES)

  • DATA.FRAME |> rename(NEW_VARIABLE = EXISTING_VARIABLE)

  • DATA.FRAME |> mutate(NEW_VARIABLE = ... )

  • The subsequent arguments describe what to do with the data.frame, mostly using the variable names.

  • The result is a data.frame.

Filter observations with filter()

Filter observations with filter()

jan1 <- flights |> 
  filter(month == 1, day == 1)

dec25 <- flights |> 
  filter(month == 12, day == 25)

class(flights$month == 1)
  • filter() allows us to subset observations based on the value of logical conditions, which are either TRUE or FALSE.

Filter observations with filter()

Logicals and Conditions

  • Logical variables have either TRUE or FALSE value.
  • Conditions are expressions that evaluate as logical.
  • What logical operations do is combining logical conditions, which returns a logical value when executed.

Filter observations with filter()

Logical Operations

  • x is the left-hand circle, y is the right-hand circle, and the shaded region show which parts each operator selects.

Filter observations with filter()

logical conditions with equality and inequality

Filter observations with filter()

logical conditions

df <- data.frame(
  num = c(8, 9, 10, 11),
  chr = c("A", "C", "B", "A"))
df |> filter(num > 8 & 
             num < 11)
                
df |> filter(num > 8,
             num < 11)


Filter observations with filter()

logical conditions

df <- data.frame(
  num = c(8, 9, 10, 11),
  chr = c("A", "C", "B", "A"))
df |> filter(num < 10 & 
             chr == "A")

df |> filter(num < 10, 
             chr == "A")


Filter observations with filter()

logical conditions

df <- data.frame(
  num = c(8, 9, 10, 11),
  chr = c("A", "C", "B", "A"))
df |> filter(num < 10 | 
             chr == "A")