Lecture 15

Data Preparation and Management with R

Byeong-Hak Choe

SUNY Geneseo

October 4, 2024

Data Transformation with R tidyverse

Data Transformation

  • DATA.FRAME |> filter(LOGICAL_CONDITIONS)

  • DATA.FRAME |> arrange(VARIABLES)

  • DATA.FRAME |> distinct(VARIABLES)

  • DATA.FRAME |> select(VARIABLES)

  • DATA.FRAME |> rename(NEW_VARIABLE = EXISTING_VARIABLE)

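  • The first argument of each verb is a data.frame, supplied above through the pipe |>.

  • The subsequent arguments describe what to do with the data.frame, mostly using the variable names.

  • The result is a new data.frame; the original data.frame is not modified.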
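
As a minimal sketch of the five verbs listed above, the following applies each one to a small made-up data.frame. The `students` data and its column names are hypothetical and not part of the lecture; only the dplyr verbs themselves come from the slide.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration
students <- data.frame(
  name  = c("Ana", "Ben", "Cal", "Ben"),
  major = c("Econ", "Math", "Econ", "Math"),
  score = c(88, 92, 75, 92)
)

students |> filter(score > 80)       # keep rows that meet the condition
students |> arrange(score)           # sort rows by score, ascending
students |> distinct(name, major)    # keep unique name-major combinations
students |> select(name, score)      # keep only these columns
students |> rename(points = score)   # points is the new name for score

# Each verb returns a data.frame, so verbs can be chained
result <- students |>
  filter(major == "Econ") |>
  arrange(desc(score)) |>
  select(name, score)
```

Because none of these calls assigns back to `students`, the original data.frame still has all four rows and three columns after the code runs.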

Filter observations with filter()

Missing values (NA)

  • Almost any operation involving an unknown value (NA) will also be unknown.
NA > 5                 # NA
10 == NA               # NA
NA + 10                # NA
NA / 2                 # NA
(1 + NA + 3) / 3       # NA
mean( c(1, NA, 3) )    # NA
sd( c(1, NA, 3) )      # NA
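
The word "almost" matters: a few operations have a result that does not depend on the unknown value. The sketch below shows some of these cases, along with the `na.rm = TRUE` argument that base R summary functions such as `mean()`, `sd()`, and `sum()` accept to drop missing values before computing. The variable names are chosen here for illustration only.

```r
v <- c(1, NA, 3)

# Summaries are NA unless missing values are removed first
mean_all   <- mean(v)                 # NA
mean_known <- mean(v, na.rm = TRUE)   # mean of 1 and 3
sd_known   <- sd(v, na.rm = TRUE)     # sd of 1 and 3
sum_known  <- sum(v, na.rm = TRUE)    # 1 + 3

# Cases where the answer is the same whatever the unknown value is
pow_zero  <- NA^0         # any number to the power 0 is 1
or_true   <- NA | TRUE    # TRUE or anything is TRUE
and_false <- NA & FALSE   # FALSE and anything is FALSE
```

Whether removing missing values is appropriate depends on why they are missing, so `na.rm = TRUE` is a choice to make deliberately rather than a default.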

  • Let x be Mary’s age. We don’t know how old she is.
  • Let y be John’s age. We don’t know how old he is.
  • Are John and Mary the same age?
x <- NA    # Mary's age (unknown)
y <- NA    # John's age (unknown)
x == y     # NA: we don't know

is.na()

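  • To determine whether a value is missing, use is.na(); a comparison such as x == NA always returns NA.
  • filter() keeps only the rows where the condition is TRUE, so it drops rows where the condition is FALSE or NA.
  • If we want to preserve missing values, ask filter() for them explicitly.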
x <- NA
is.na(x)    # is x NA? TRUE

y <- "missing"
is.na(y)    # is y NA? FALSE: the string "missing" is an ordinary value

v1 <- c(1, NA, 3)
is.na(v1)   # which elements of v1 are NA? FALSE TRUE FALSE
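
library(tidyverse)   # provides filter()

df <- data.frame(v1 = c(1, NA, 3),
                 v2 = c(1, 2, 3))

# Keep only the rows where v1 is missing
df |> 
  filter( is.na(v1) )

# Keep only the rows where v1 is not missing
df |> 
  filter( !is.na(v1) )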
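
To illustrate what asking filter() for missing values explicitly can look like, here is a short sketch using the same `df`. The condition `v1 > 1` and the names `dropped` and `kept` are illustrative choices, not from the lecture.

```r
library(dplyr)

df <- data.frame(v1 = c(1, NA, 3),
                 v2 = c(1, 2, 3))

# v1 > 1 evaluates to FALSE, NA, TRUE;
# filter() keeps only the row where the condition is TRUE
dropped <- df |>
  filter(v1 > 1)

# Adding is.na(v1) to the condition keeps the row with the missing value too
kept <- df |>
  filter(is.na(v1) | v1 > 1)
```

Here `dropped` contains only the row with `v1` equal to 3, while `kept` also contains the row where `v1` is NA.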