Lecture 16

Filtering observations with filter()

Byeong-Hak Choe

SUNY Geneseo

April 2, 2024

Data Transformation

dplyr basics

  • We will continue to discuss the key dplyr functions (verbs) for solving various data manipulation challenges:
    • Filter observations by logical conditions on the values of variables (filter()).
    • Arrange/sort rows (arrange()).
    • Select variables by their names (select()).
    • Rename variables by their names (rename()).
    • Create new variables with functions of existing variables (mutate()).
    • Relocate existing variables by their names (relocate()).
    • Collapse a data.frame down to a summarized version of it (summarize()).
    • Group a data.frame by a categorical variable (group_by()).

  • Because the first argument is a data.frame and the output is a data.frame, dplyr verbs work well with the pipe, |>.
    • RStudio keyboard shortcut: Ctrl + Shift + M on Windows; Command + Shift + M on Mac.
  • The pipe (|>) takes the thing on its left and passes it along to the function on its right so that
    • f(x, y) is equivalent to x |> f(y).
    • e.g., filter(DATA_FRAME, LOGICAL_STATEMENT) is equivalent to DATA_FRAME |> filter(LOGICAL_STATEMENT).
  • The easiest way to pronounce the pipe (|>) is “then”.
    • The pipe (|>) is especially useful when we have a chain of data-transforming operations to perform.
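
The equivalence between f(x, y) and x |> f(y) can be checked with a small base R sketch. The vector x below is a hypothetical example, not taken from the lecture, and the native pipe assumes R 4.1 or later.

```r
# Hypothetical numbers, used only for illustration
x <- c(1.234, 5.678, 9.1011)

# Nested calls are read from the inside out
nested <- round(sqrt(x), 1)

# The same computation with the pipe: take x, then sqrt(), then round()
piped <- x |> sqrt() |> round(1)

identical(nested, piped)   # TRUE
```

Both forms run the same function calls; the piped version simply lists the steps in the order they happen.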

  • To make the RStudio keyboard shortcut insert the native pipe operator (|>), set the following option:

    • Tools > Global Options > Code (in the side menu) > check “Use native pipe operator, |>”.

dplyr basics

  • DATA_FRAME |> filter(LOGICAL_CONDITIONS)

  • DATA_FRAME |> arrange(VARIABLES)

  • DATA_FRAME |> select(VARIABLES)

  • DATA_FRAME |> rename(NEW_VARIABLE = EXISTING_VARIABLE)

  • DATA_FRAME |> mutate(NEW_VARIABLE = ... )

  • DATA_FRAME |> relocate(VARIABLES)

  • DATA_FRAME |> group_by(VARIABLES)

  • DATA_FRAME |> summarize(NEW_VARIABLE = ...)

  • The first argument is always a data.frame; the subsequent arguments describe what to do with it, mostly using the variable names.

  • The result is a data.frame.
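
As a rough sketch of how these templates chain together, the example below applies several verbs to a small data.frame. The scores data and its variable names are hypothetical and used only for illustration; the sketch assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical toy data, not from the lecture
scores <- data.frame(
  student = c("Ann", "Bob", "Cai", "Dee"),
  section = c("A", "B", "A", "B"),
  points  = c(82, 91, 77, 68)
)

result <- scores |>
  filter(points >= 70) |>             # keep rows with at least 70 points
  mutate(percent = points / 100) |>   # create a new variable
  arrange(desc(points)) |>            # sort from highest to lowest points
  select(student, percent)            # keep only two variables

result
```

Each step receives a data.frame and returns a data.frame, which is why the verbs can be piped one after another.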

Filter observations with filter()

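library(dplyr)          # filter() and the other verbs
library(nycflights13)   # assumed source of the flights data.frame

# Flights on January 1
jan1 <- flights |>
  filter(month == 1, day == 1)

# Flights on December 25
dec25 <- flights |>
  filter(month == 12, day == 25)

# A condition evaluates to a logical vector
class(flights$month == 1)   # "logical"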
  • filter() allows us to subset observations using logical conditions, which evaluate to either TRUE or FALSE for each row; the rows where the conditions are TRUE are kept.

Logicals and Conditions

  • Logical variables take the value TRUE or FALSE.
  • Conditions are expressions that evaluate to a logical value.
  • Logical operations combine logical conditions and return a logical value when executed.
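
A minimal base R sketch of a condition producing a logical vector follows. The vector num is a hypothetical example.

```r
# Hypothetical values, used only for illustration
num <- c(8, 9, 10, 11)

cond <- num > 9   # a condition: one TRUE/FALSE per element
cond              # FALSE FALSE TRUE TRUE
class(cond)       # "logical"

# TRUE counts as 1 and FALSE as 0, so sum() counts the matches
n_match <- sum(cond)
n_match           # 2
```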

Boolean Operations

  • x is the left-hand circle, y is the right-hand circle, and the shaded regions show which parts each operator selects.
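
The regions each operator selects can also be traced on a pair of logical vectors. This is a small base R sketch; x and y are hypothetical vectors covering all four TRUE/FALSE combinations.

```r
# All four combinations of x and y
x <- c(TRUE, TRUE, FALSE, FALSE)
y <- c(TRUE, FALSE, TRUE, FALSE)

both       <- x & y       # in x and in y
either     <- x | y       # in x or in y (or both)
only_one   <- xor(x, y)   # in exactly one of x and y
not_x      <- !x          # outside x
x_but_no_y <- x & !y      # in x but not in y

both         # TRUE FALSE FALSE FALSE
either       # TRUE TRUE TRUE FALSE
only_one     # FALSE TRUE TRUE FALSE
x_but_no_y   # FALSE TRUE FALSE FALSE
```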

logical conditions

df <- data.frame(
  num = c(8, 9, 10, 11),
  chr = c("A", "C", "B", "A"))

# AND (&): keep rows where both conditions are TRUE (num 9 and 10)
df |> filter(num > 8 &
               num < 11)


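# AND (&): num below 10 and chr equal to "A" (only the first row)
df |> filter(num < 10 &
               chr == "A")

# Conditions separated by commas are combined with &, so this is equivalent
df |> filter(num < 10,
             chr == "A")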


# OR (|): keep rows where at least one condition is TRUE (num 8, 9, and 11)
df |> filter(num < 10 |
               chr == "A")


De Morgan’s law

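# Flights with neither arr_delay nor dep_delay above 120
flights |>
  filter( !( arr_delay > 120 |
               dep_delay > 120 ) )

# Equivalent form: negate each condition and switch | to &
flights |>
  filter( arr_delay <= 120 &
            dep_delay <= 120 )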
  • !(x & y) is the same as !x | !y.
  • !(x | y) is the same as !x & !y.
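
Both laws can be checked over every TRUE/FALSE combination with a short base R sketch. The vectors x and y are hypothetical.

```r
# All four combinations of x and y
x <- c(TRUE, TRUE, FALSE, FALSE)
y <- c(TRUE, FALSE, TRUE, FALSE)

# !(x & y) is the same as !x | !y
law_and <- identical(!(x & y), !x | !y)

# !(x | y) is the same as !x & !y
law_or <- identical(!(x | y), !x & !y)

law_and   # TRUE
law_or    # TRUE
```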

%in% operator

  • When the or operator | is used repeatedly to compare the same variable against several values, we can use the %in% operator instead.
flights |> 
  filter( month == 10 | 
            month == 11 | 
            month == 12 )
flights |> 
  filter(month %in% c(10, 11, 12))
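
The two forms can be compared on a plain vector. This is a base R sketch with a hypothetical month vector that includes one missing value. The only difference appears at the NA: repeated == comparisons return NA, while %in% returns FALSE. Inside filter() the row is dropped either way, because only TRUE rows are kept.

```r
# Hypothetical month values, including one missing value
month <- c(1, 10, 11, 12, 7, NA)

with_or <- month == 10 | month == 11 | month == 12
with_in <- month %in% c(10, 11, 12)

with_or   # FALSE TRUE TRUE TRUE FALSE NA
with_in   # FALSE TRUE TRUE TRUE FALSE FALSE
```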

Missing values (NA)

  • Almost any operation involving an unknown value (NA) will also be unknown.
NA > 5     # NA
10 == NA   # NA
NA + 10    # NA
NA / 2     # NA

NA == NA   # NA
  • Let x be Mary’s age. We don’t know how old she is.
  • Let y be John’s age. We don’t know how old he is.
  • Are John and Mary the same age?
x <- NA
y <- NA
x == y   # NA: we cannot know
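  • If we want to determine whether a value is missing, use is.na().
  • filter() keeps only the rows where the condition is TRUE, so rows where it is FALSE or NA are dropped. If we want to preserve missing values, ask filter() for them explicitly.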
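x <- NA
is.na(x) # is x NA? TRUE

y <- "missing"
is.na(y) # is y NA? FALSE: a character string is not a missing value

df <- data.frame(y = c(1, NA, 3))

# The NA row is dropped because NA > 1 is NA, not TRUE
df |>
  filter(y > 1)

# Keep the NA row by asking for it explicitly
df |>
  filter( is.na(y) | y > 1 )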
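
The word "almost" matters: a logical operator can return a known value when the answer does not depend on the unknown. The base R sketch below shows this, then mirrors the two filter() calls above with plain vector subsetting. The vector y is a hypothetical example, and which() is used here only to imitate how filter() drops NA.

```r
# A logical operator can sometimes resolve an unknown value
NA | TRUE    # TRUE:  TRUE whatever the unknown is
NA & FALSE   # FALSE: FALSE whatever the unknown is
NA | FALSE   # NA
NA & TRUE    # NA

# Hypothetical vector with one missing value
y <- c(1, NA, 3)
cond <- y > 1   # FALSE NA TRUE

# which() returns only the TRUE positions, so NA is dropped
kept <- y[which(cond)]
kept            # 3

# Ask for the missing value explicitly to keep it
kept_with_na <- y[which(is.na(y) | cond)]
kept_with_na    # NA 3
```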

Find all unique observations with distinct()

# Remove duplicate observations, 
#   if any
flights |> 
  distinct()
# Find all unique 
#   origin and destination pairs
flights |> 
  distinct(origin, dest)

# If we want to keep other variables
flights |> 
  distinct(origin, dest, 
           .keep_all = TRUE)
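  • distinct() finds all the unique observations in a data.frame.
    • We can optionally provide variable names to distinct() to find the unique combinations of those variables.
    • With .keep_all = TRUE, the other variables are kept, using the first row of each combination.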
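
The three uses of distinct() can be compared on a small data.frame. This sketch assumes the dplyr package is installed; the trips data is hypothetical and only loosely modeled on the flights variables.

```r
library(dplyr)

# Hypothetical toy data: rows 1 and 2 are exact duplicates
trips <- data.frame(
  origin = c("JFK", "JFK", "JFK", "LGA", "JFK"),
  dest   = c("LAX", "LAX", "LAX", "ORD", "ORD"),
  delay  = c(5, 5, 20, 12, 30)
)

# Drop exact duplicate rows: 4 rows remain
unique_rows <- trips |> distinct()

# Unique origin and destination pairs: 3 rows, 2 variables
unique_pairs <- trips |> distinct(origin, dest)

# Same 3 pairs, keeping delay from the first row of each pair
first_per_pair <- trips |> distinct(origin, dest, .keep_all = TRUE)

first_per_pair$delay   # 5 12 30
```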