Lecture 16

Filtering observations with filter()

Byeong-Hak Choe

bchoe@geneseo.edu

SUNY Geneseo

April 2, 2024

Data Transformation

`dplyr` basics

We will continue to discuss the key dplyr functions to solve various data manipulation challenges:
- Filter observations by logical conditions about values of variables (filter()).
- Arrange/sort rows (arrange()).
- Select variables by their names (select()).
- Rename variables by their names (rename()).
- Create new variables with functions of existing variables (mutate()).
- Relocate existing variables by their names (relocate()).
- Collapse a data.frame down to a summarized version of it (summarize()).
- Group a data.frame by a categorical variable (group_by()).

Data Transformation

Because the first argument is a data.frame and the output is a data.frame, dplyr verbs work well with the pipe, |>.
- Keyboard shortcut in RStudio: Ctrl + Shift + M on Windows; Command + Shift + M on Mac.
The pipe (|>) takes the thing on its left and passes it along to the function on its right so that
- f(x, y) is equivalent to x |> f(y).
- e.g., filter(DATA_FRAME, LOGICAL_STATEMENT) is equivalent to DATA_FRAME |> filter(LOGICAL_STATEMENT).
The easiest way to pronounce the pipe (|>) is “then”.
- The pipe (|>) is super useful when we have a chain of data transforming operations to do.
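
To see the equivalence between f(x, y) and x |> f(y) in action, here is a minimal sketch. The `scores` data.frame is a hypothetical toy example, not part of the lecture data, and the native pipe assumes R 4.1 or later.

```r
library(dplyr)

# Hypothetical toy data.frame, used only for illustration
scores <- data.frame(
  name  = c("Ann", "Ben", "Cal"),
  score = c(90, 72, 85))

# f(x, y): the data.frame is written as the first argument
nested <- filter(scores, score > 80)

# x |> f(y): the pipe passes `scores` in as the first argument
piped <- scores |> filter(score > 80)

identical(nested, piped)   # TRUE: both calls return the same rows
```

Reading the piped version aloud: "take scores, then filter the rows where score is above 80."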

Data Transformation

To make the keyboard shortcut insert the native pipe operator (|>) in RStudio, we should set the option as follows:
- Tools > Global Options > Code from the side menu > Choose “Use native pipe operator, |>”.

Data Transformation

`dplyr` basics

DATA_FRAME |> filter(LOGICAL_CONDITIONS)
DATA_FRAME |> arrange(VARIABLES)
DATA_FRAME |> select(VARIABLES)
DATA_FRAME |> rename(NEW_VARIABLE = EXISTING_VARIABLE)
DATA_FRAME |> mutate(NEW_VARIABLE = ... )
DATA_FRAME |> relocate(VARIABLES)
DATA_FRAME |> group_by(VARIABLES)
DATA_FRAME |> summarize(NEW_VARIABLE = ...)
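Each verb takes a data.frame as its first argument, which the pipe supplies from its left-hand side.
The subsequent arguments describe what to do with the data.frame, mostly using the variable names.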
The result is a data.frame.
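
Because every verb takes a data.frame and returns a data.frame, the verbs can be chained. The sketch below uses a hypothetical `trips` data.frame (the variable names are assumptions made for illustration) and four of the verbs listed above.

```r
library(dplyr)

# Hypothetical toy data.frame, used only for illustration
trips <- data.frame(
  city  = c("Rochester", "Buffalo", "Rochester", "Albany"),
  miles = c(30, 65, 45, 230),
  hours = c(0.6, 1.1, 1.0, 3.5))

result <- trips |>
  filter(miles < 100) |>           # keep the short trips, then
  mutate(mph = miles / hours) |>   # create a new variable, then
  select(city, mph) |>             # keep two variables, then
  arrange(desc(mph))               # sort from fastest to slowest

result
is.data.frame(result)   # TRUE: the result is still a data.frame
```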

Filter observations with `filter()`

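library(dplyr)         # filter() and the other verbs
library(nycflights13)  # assumed source of the flights data.frame

jan1 <- flights |> 
  filter(month == 1, day == 1)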

dec25 <- flights |> 
  filter(month == 12, day == 25)

class(flights$month == 1)

filter() allows us to subset observations based on the value of logical conditions, which are either TRUE or FALSE.
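
A condition such as month == 1 produces one TRUE or FALSE value per row, and filter() keeps the rows where it is TRUE. The sketch below shows this with a hypothetical `toy` data.frame standing in for flights.

```r
library(dplyr)

# Hypothetical toy data.frame standing in for flights
toy <- data.frame(
  month = c(1, 1, 12, 12),
  day   = c(1, 2, 25, 1))

# The condition is a logical vector with one value per row
cond <- toy$month == 1 & toy$day == 1
cond          # TRUE FALSE FALSE FALSE
class(cond)   # "logical"

# filter() keeps exactly the rows where the condition is TRUE
kept <- toy |> filter(month == 1, day == 1)
nrow(kept)    # 1
```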

Filter observations with `filter()`

Logicals and Conditions

Logical variables take the value TRUE or FALSE.
Conditions are expressions that evaluate to a logical value.
Logical operations combine logical conditions and return a logical value when executed.
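
As a small illustration of these three ideas, the sketch below builds two conditions from a numeric vector and then combines them. The object names are hypothetical.

```r
x <- c(3, 7, 10)

# Conditions: expressions that evaluate to logical values
is_big  <- x > 5         # FALSE TRUE TRUE
is_even <- x %% 2 == 0   # FALSE FALSE TRUE

# Logical operations: combine conditions into a new logical value
is_big & is_even   # FALSE FALSE TRUE
is_big | is_even   # FALSE TRUE TRUE
!is_big            # TRUE FALSE FALSE
```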

Filter observations with `filter()`

`logical` conditions

Filter observations with `filter()`

Boolean Operations

x is the left-hand circle, y is the right-hand circle, and the shaded regions show which parts each operator selects.
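
The regions of such a diagram can also be written out as a truth table. This is a sketch assuming the usual operators &, |, ! and xor(); the four positions cover every combination of x and y.

```r
# Every combination of x and y
x <- c(TRUE, TRUE, FALSE, FALSE)
y <- c(TRUE, FALSE, TRUE, FALSE)

x & y       # TRUE FALSE FALSE FALSE : the overlap of the two circles
x | y       # TRUE TRUE TRUE FALSE   : everything inside either circle
x & !y      # FALSE TRUE FALSE FALSE : inside x but outside y
xor(x, y)   # FALSE TRUE TRUE FALSE  : inside exactly one circle
```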

Filter observations with `filter()`

`logical` conditions

df <- data.frame(
  num = c(8, 9, 10, 11),
  chr = c("A", "C", "B", "A"))

df |> filter(num > 8 & 
                num < 11)

Filter observations with `filter()`

`logical` conditions

df <- data.frame(
  num = c(8, 9, 10, 11),
  chr = c("A", "C", "B", "A"))

df |> filter(num < 10 & 
                chr == "A")

df |> filter(num < 10, 
              chr == "A")

Filter observations with `filter()`

`logical` conditions

df <- data.frame(
  num = c(8, 9, 10, 11),
  chr = c("A", "C", "B", "A"))

df |> filter(num < 10 | 
                chr == "A")
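
To check our reading of &, the comma, and |, the sketch below reruns the filters on the same df and stores each result. The result names are hypothetical.

```r
library(dplyr)

df <- data.frame(
  num = c(8, 9, 10, 11),
  chr = c("A", "C", "B", "A"))

between_rows <- df |> filter(num > 8 & num < 11)      # rows with num 9 and 10
and_rows     <- df |> filter(num < 10 & chr == "A")   # only the row (8, "A")
comma_rows   <- df |> filter(num < 10, chr == "A")    # a comma works like &
or_rows      <- df |> filter(num < 10 | chr == "A")   # rows with num 8, 9, and 11

identical(and_rows, comma_rows)   # TRUE
```

The | version returns more rows than the & version because a row needs to satisfy only one of the two conditions.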

Filter observations with `filter()`

De Morgan’s law

flights |> 
  filter( !( arr_delay > 120 | 
              dep_delay > 120) )

flights |> 
  filter( arr_delay <= 120 & 
            dep_delay <= 120 )

!(x & y) is the same as !x | !y.
!(x | y) is the same as !x & !y.
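
We can confirm that the two filters select the same rows. The sketch below uses a hypothetical `delays` data.frame standing in for flights, with a few missing values included.

```r
library(dplyr)

# Hypothetical toy data.frame standing in for flights
delays <- data.frame(
  arr_delay = c(10, 150, 90, NA, 200),
  dep_delay = c(5, 30, 130, 20, NA))

# !(x | y)
not_or <- delays |>
  filter(!(arr_delay > 120 | dep_delay > 120))

# !x & !y, with each negated comparison written as <=
and_not <- delays |>
  filter(arr_delay <= 120 & dep_delay <= 120)

identical(not_or, and_not)   # TRUE
nrow(not_or)                 # 1: only the row (10, 5) is kept
```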

Filter observations with `filter()`

`%in%` operator

When the or operator | is repeatedly used, we can consider using the %in% operator instead.

flights |> 
  filter( month == 10 | 
            month == 11 | 
            month == 12 )

flights |> 
  filter(month %in% c(10, 11, 12))
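
The two versions return the same rows, which we can check on a small example. The `toy` data.frame below is hypothetical and stands in for flights.

```r
library(dplyr)

# Hypothetical toy data.frame standing in for flights
toy <- data.frame(month = c(1, 10, 11, 12, 6, 12))

with_or <- toy |>
  filter(month == 10 | month == 11 | month == 12)

with_in <- toy |>
  filter(month %in% c(10, 11, 12))

identical(with_or, with_in)   # TRUE
nrow(with_in)                 # 4
```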

Filter observations with `filter()`

Missing values (`NA`)

Almost any operation involving an unknown value (NA) will also be unknown.

NA > 5    # NA
10 == NA  # NA
NA + 10   # NA
NA / 2    # NA

NA == NA  # NA

Let x be Mary’s age. We don’t know how old she is.
Let y be John’s age. We don’t know how old he is.
Are John and Mary the same age?

x <- NA
y <- NA
x == y

If we want to determine if a value is missing, use is.na().
filter() only includes rows where the condition is TRUE; it excludes both FALSE and NA values.
If we want to preserve missing values, ask filter() for them explicitly.

x <- NA
is.na(x) # is x NA?

y <- "missing"
is.na(y) # is y NA?

df <- data.frame(y = c(1, NA, 3))

df |> 
  filter(y > 1)

df |> 
  filter( is.na(y) | y > 1 )
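
The sketch below reruns the two filters on the same df and stores the results, so we can see which rows survive. The result names are hypothetical, and the last filter is an extra illustration of keeping only the known values.

```r
library(dplyr)

df <- data.frame(y = c(1, NA, 3))

df$y > 1   # FALSE NA TRUE: the comparison with NA is unknown

dropped <- df |> filter(y > 1)                # keeps 3 only; the NA row is dropped
kept    <- df |> filter(is.na(y) | y > 1)     # keeps NA and 3
known   <- df |> filter(!is.na(y))            # keeps 1 and 3

nrow(dropped)   # 1
nrow(kept)      # 2
nrow(known)     # 2
```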

Find all unique observations with `distinct()`

# Remove duplicate observations, 
#   if any
flights |> 
  distinct()

# Find all unique 
#   origin and destination pairs
flights |> 
  distinct(origin, dest)

# If we want to keep other variables
flights |> 
  distinct(origin, dest, 
           .keep_all = TRUE)

distinct() finds all the unique observations in a data.frame.
- We can also optionally provide variable names to distinct().
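
To make the three uses of distinct() concrete, here is a minimal sketch with a hypothetical `routes` data.frame standing in for flights. The carrier values are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data.frame standing in for flights
routes <- data.frame(
  origin  = c("JFK", "JFK", "LGA", "JFK", "LGA"),
  dest    = c("LAX", "LAX", "ORD", "ORD", "ORD"),
  carrier = c("AA",  "AA",  "UA",  "DL",  "AA"))

# Rows 1 and 2 are exact duplicates, so 4 rows remain
all_unique <- routes |> distinct()

# Unique origin and destination pairs: 3 rows, 2 variables
pairs <- routes |> distinct(origin, dest)

# Same 3 pairs, keeping the first row found for each pair
pairs_full <- routes |> distinct(origin, dest, .keep_all = TRUE)

nrow(all_unique)     # 4
nrow(pairs)          # 3
pairs_full$carrier   # "AA" "UA" "DL"
```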
