Lecture 17

Filter observations with filter(); Arrange observations with arrange()

Byeong-Hak Choe

SUNY Geneseo

April 4, 2024

Data Transformation

  • Because the first argument is a data.frame and the output is a data.frame, dplyr verbs work well with the pipe, |>.
    • RStudio keyboard shortcut: Ctrl + Shift + M on Windows; Command + Shift + M on Mac.
  • The pipe (|>) takes the thing on its left and passes it along to the function on its right so that
    • f(x, y) is equivalent to x |> f(y).
    • e.g., filter(DATA_FRAME, LOGICAL_STATEMENT) is equivalent to DATA_FRAME |> filter(LOGICAL_STATEMENT).
  • The easiest way to pronounce the pipe (|>) is “then”.
    • The pipe (|>) is super useful when we have a chain of data transforming operations to do.
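As a small illustration of the equivalence above, the sketch below writes the same two-step transformation once with nested calls and once with the pipe. The `scores` data.frame and its columns are made up for this example, and the code assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical data, not from the lecture
scores <- data.frame(
  name  = c("Ann", "Bo", "Cy"),
  score = c(90, 72, 85)
)

# Nested calls: read from the inside out
nested <- arrange(filter(scores, score > 80), score)

# Piped calls: read left to right as
# "take scores, then filter, then arrange"
piped <- scores |>
  filter(score > 80) |>
  arrange(score)

# Both forms give the same data.frame
identical(nested, piped)
```

The piped version lists the steps in the order they happen, which is why it tends to be easier to read once a chain grows beyond one or two steps.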

Data Transformation

  • To make the RStudio keyboard shortcut insert the (native) pipe operator (|>), we should set the option as follows:

    • Tools > Global Options > Code (in the side menu) > check “Use native pipe operator, |>”.

Data Transformation

dplyr basics

  • DATA_FRAME |> filter(LOGICAL_CONDITIONS)

  • DATA_FRAME |> arrange(VARIABLES)

  • DATA_FRAME |> select(VARIABLES)

  • DATA_FRAME |> rename(NEW_VARIABLE = EXISTING_VARIABLE)

  • DATA_FRAME |> mutate(NEW_VARIABLE = ... )

  • DATA_FRAME |> relocate(VARIABLES)

  • DATA_FRAME |> group_by(VARIABLES)

  • DATA_FRAME |> summarize(NEW_VARIABLE = ...)

  • The subsequent arguments describe what to do with the data.frame, mostly using the variable names.

  • The result is a data.frame.
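The pattern above can be sketched with a short chain. This is only an illustration: the `sales` data.frame and its column names are hypothetical, and the code assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical data, not from the lecture
sales <- data.frame(
  store = c("A", "A", "B", "B"),
  units = c(3, 5, 2, 8),
  price = c(10, 10, 20, 20)
)

# Each verb takes a data.frame as its first argument (supplied by the pipe),
# refers to variables by name, and returns a data.frame.
store_revenue <- sales |>
  mutate(revenue = units * price) |>        # add a new variable
  group_by(store) |>                        # group observations by store
  summarize(total_revenue = sum(revenue))   # one row per store

store_revenue
is.data.frame(store_revenue)
```

Because every step returns a data.frame, any verb in the list above could be appended to this chain with another pipe.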

Filter observations with filter()

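library(dplyr)         # filter(), arrange(), distinct(), ...
library(nycflights13)  # assumed source of the flights data.frame

# Flights on January 1
jan1 <- flights |>
  filter(month == 1, day == 1)

# Flights on December 25
dec25 <- flights |>
  filter(month == 12, day == 25)

# A condition such as month == 1 evaluates to a logical vector
class(flights$month == 1)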
  • filter() allows us to subset observations based on the value of logical conditions, which are either TRUE or FALSE.
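To make the role of the logical condition concrete, here is a minimal sketch on a three-row table. The `mini` data.frame is a hypothetical stand-in for a flights-like table, and the code assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical miniature of a flights-like table
mini <- data.frame(
  month = c(1, 1, 12),
  day   = c(1, 2, 25)
)

# The condition is evaluated once per row and yields a logical vector
keep <- mini$month == 1 & mini$day == 1
keep
class(keep)

# filter() keeps the rows where the condition is TRUE
kept_rows <- mini |>
  filter(month == 1, day == 1)
nrow(kept_rows)
```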

Filter observations with filter()

Logicals and Conditions

  • Logical variables have either a TRUE or a FALSE value.
  • Conditions are expressions that evaluate to a logical value.
  • Logical operations combine logical conditions and return a logical value when executed.
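The three ideas above can be sketched in base R. The variable names and values here are made up for illustration.

```r
# Hypothetical values, not from the lecture
age    <- 20
has_id <- TRUE               # a logical variable

# Conditions: expressions that evaluate to a logical value
is_adult  <- age >= 18       # TRUE
is_senior <- age >= 65       # FALSE

# Logical operations: combine conditions into a new logical value
can_enter     <- is_adult & has_id       # "and"
gets_discount <- is_senior | !has_id     # "or", combined with "not"

can_enter
gets_discount
```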

Filter observations with filter()

Boolean Operations

  • x is the left-hand circle, y is the right-hand circle, and the shaded regions show which parts each operator selects.
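Since the figure itself is not reproduced here, the sketch below shows the same idea on two logical vectors that cover all four combinations of x and y. It assumes the figure covers the usual base R operators `&`, `|`, `!`, and `xor()`.

```r
# All four combinations of x and y
x <- c(TRUE, TRUE, FALSE, FALSE)
y <- c(TRUE, FALSE, TRUE, FALSE)

both        <- x & y       # the overlap of the two circles
either      <- x | y       # everything inside at least one circle
only_x      <- x & !y      # the part of x outside y
only_y      <- !x & y      # the part of y outside x
exactly_one <- xor(x, y)   # inside one circle but not both

data.frame(x, y, both, either, only_x, only_y, exactly_one)
```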

Filter observations with filter()

logical conditions

df <- data.frame(
  num = c(8, 9, 10, 11),
  chr = c("A", "C", "B", "A"))
df |> filter(num > 8 & 
             num < 11)

df |> filter(num > 8,
             num < 11)


Filter observations with filter()

logical conditions

df <- data.frame(
  num = c(8, 9, 10, 11),
  chr = c("A", "C", "B", "A"))
df |> filter(num < 10 & 
             chr == "A")

df |> filter(num < 10, 
             chr == "A")


Filter observations with filter()

logical conditions

df <- data.frame(
  num = c(8, 9, 10, 11),
  chr = c("A", "C", "B", "A"))
df |> filter(num < 10 | 
             chr == "A")
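To compare the "and" and "or" versions side by side, the following sketch reuses the same `df` and stores each result. The result names (`and_rows`, `comma_rows`, `or_rows`) are chosen here for illustration; the code assumes dplyr is installed.

```r
library(dplyr)

df <- data.frame(
  num = c(8, 9, 10, 11),
  chr = c("A", "C", "B", "A"))

# Both conditions must hold
and_rows <- df |> filter(num < 10 & chr == "A")

# Separating conditions with a comma combines them with "and"
comma_rows <- df |> filter(num < 10, chr == "A")

# At least one condition must hold
or_rows <- df |> filter(num < 10 | chr == "A")

and_rows$num     # 8
comma_rows$num   # 8
or_rows$num      # 8 9 11
```

The "or" version returns more rows because a row qualifies as soon as either condition is TRUE.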


Filter observations with filter()

De Morgan’s law

flights |> 
  filter( !( arr_delay > 120 | 
              dep_delay > 120) )
flights |> 
  filter( arr_delay <= 120 & 
            dep_delay <= 120 )
  • !(x & y) is the same as !x | !y.
  • !(x | y) is the same as !x & !y.
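Both laws can be checked directly in base R by trying every combination of TRUE and FALSE. This is a quick sanity check, not a proof.

```r
# All four combinations of x and y
x <- c(TRUE, TRUE, FALSE, FALSE)
y <- c(TRUE, FALSE, TRUE, FALSE)

# !(x & y) is the same as !x | !y
law_and <- identical(!(x & y), !x | !y)

# !(x | y) is the same as !x & !y
law_or <- identical(!(x | y), !x & !y)

law_and
law_or
```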

Filter observations with filter()

%in% operator

  • When the or operator (|) is used repeatedly to compare one variable against several values, we can consider using the %in% operator instead.
flights |> 
  filter( month == 10 | 
            month == 11 | 
            month == 12 )
flights |> 
  filter(month %in% c(10, 11, 12))
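The equivalence can be sketched on a plain vector in base R. The `month` values below are made up for illustration.

```r
# Hypothetical month values, not from the flights data
month <- c(1, 10, 11, 12, 6)

# Repeated "or" comparisons
with_or <- month == 10 | month == 11 | month == 12

# The same test with %in%
with_in <- month %in% c(10, 11, 12)

with_or
with_in
identical(with_or, with_in)
```

One difference to keep in mind: for a missing value, `NA == 10` is NA while `NA %in% c(10, 11, 12)` is FALSE. Inside filter() the row is dropped in both cases.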

Filter observations with filter()

Missing values (NA)

  • Almost any operation involving an unknown value (NA) will also be unknown.
NA > 5
10 == NA
NA + 10
NA / 2

NA == NA
  • Let x be Mary’s age. We don’t know how old she is.
  • Let y be John’s age. We don’t know how old he is.
  • Are John and Mary the same age?
x <- NA
y <- NA
x == y
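  • If we want to determine if a value is missing, use is.na().
  • filter() keeps only the rows where the condition is TRUE, so rows where it is FALSE or NA are dropped. If we want to preserve missing values, ask filter() for them explicitly.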
x <- NA
is.na(x) # is x NA?

y <- "missing"
is.na(y) # is y NA?
df <- data.frame(y = c(1, NA, 3))

df |> 
  filter(y > 1)

df |> 
  filter( is.na(y) | y > 1 )
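To see why the two filters differ, the sketch below evaluates the condition on its own before filtering, using the same `df`. The result names (`cond`, `without_na`, `with_na`) are chosen here for illustration; the code assumes dplyr is installed.

```r
library(dplyr)

df <- data.frame(y = c(1, NA, 3))

# The condition is NA for the row with the missing value
cond <- df$y > 1
cond

# filter() drops rows where the condition is FALSE or NA
without_na <- df |>
  filter(y > 1)

# Asking for missing values explicitly keeps the NA row
with_na <- df |>
  filter(is.na(y) | y > 1)

nrow(without_na)
nrow(with_na)

# Count the missing values in y
sum(is.na(df$y))
```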

Find all unique observations with distinct()

# Remove duplicate observations, 
#  if any
flights |> 
  distinct()
# Find all unique 
#  origin and destination pairs
flights |> 
  distinct(origin, dest)

# If we want to keep other variables
flights |> 
  distinct(origin, dest, 
           .keep_all = TRUE)
  • distinct() finds all the unique observations in a data.frame.
    • We can also optionally provide variable names to distinct() to find the unique combinations of those variables.
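A small sketch may help show what each form of distinct() returns. The `trips` data.frame is hypothetical and much smaller than flights; the code assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical data, not from the lecture
trips <- data.frame(
  origin  = c("JFK", "JFK", "LGA", "JFK"),
  dest    = c("LAX", "LAX", "ORD", "ORD"),
  carrier = c("AA", "DL", "UA", "AA")
)

# Unique origin and destination pairs; other variables are dropped
pairs <- trips |>
  distinct(origin, dest)

# Same pairs, but keep the other variables
# from the first row in which each pair appears
pairs_all <- trips |>
  distinct(origin, dest, .keep_all = TRUE)

pairs
pairs_all
```

Here the second JFK to LAX row is dropped in both results, so with `.keep_all = TRUE` the carrier shown for that pair is the one from its first occurrence.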

Arrange observations with arrange()

flights |> 
  arrange(year, month, day)

# re-order observations by `dep_delay` in descending order.
flights |> 
  arrange(desc(dep_delay))
  • arrange() sorts observations by the values of the given variables.
    • If we provide more than one variable name, each additional variable will be used to break ties in the values of preceding variables.
  • Use desc() to re-order by a column in descending order.
    • Adding - before a numeric variable (-NUMERIC_VARIABLE) also works.
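The tie-breaking rule and the two ways of sorting in descending order can be sketched on a tiny table. The `toy` data.frame and its columns are made up for this example; the code assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical data, not from the lecture
toy <- data.frame(
  grp = c("b", "a", "b", "a"),
  val = c(2, 5, 9, 1)
)

# Sort by grp first; val breaks ties within each grp
by_grp_val <- toy |>
  arrange(grp, val)

# Descending order with desc()
by_desc <- toy |>
  arrange(desc(val))

# Descending order with a minus sign (numeric variables only)
by_minus <- toy |>
  arrange(-val)

by_grp_val
by_desc
identical(by_desc$val, by_minus$val)
```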

Rows: filter(), distinct(), and arrange()

Let’s do Classwork 9!