Lecture 17

Filter observations with filter(); Arrange observations with arrange()

Byeong-Hak Choe

bchoe@geneseo.edu

SUNY Geneseo

April 4, 2024

Data Transformation

Because the first argument is a data.frame and the output is a data.frame, dplyr verbs work well with the pipe, |>.
- In RStudio, the keyboard shortcut for inserting the pipe is Ctrl + Shift + M on Windows and Command + Shift + M on Mac.
The pipe (|>) takes the thing on its left and passes it along to the function on its right so that
- f(x, y) is equivalent to x |> f(y).
- e.g., filter(DATA_FRAME, LOGICAL_STATEMENT) is equivalent to DATA_FRAME |> filter(LOGICAL_STATEMENT).
The easiest way to pronounce the pipe (|>) is “then”.
- The pipe (|>) is super useful when we have a chain of data transforming operations to do.
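
As a small illustration of the equivalence above, the following sketch uses base R's round(), sum(), and sqrt(). These functions and the vector x are chosen here only as examples and are not part of the lecture's data.

```r
# The native pipe (|>) requires R 4.1 or later.
x <- c(1.234, 5.678, 9.1011)

# f(x, y) written the usual way
nested <- round(x, 1)

# x |> f(y): the left-hand side becomes the first argument of f
piped <- x |> round(1)

identical(nested, piped)   # TRUE

# A chain reads left to right: take x, then sum it, then take the square root
chained <- x |> sum() |> sqrt()
identical(chained, sqrt(sum(x)))   # TRUE
```

Reading the chain as "take x, then sum, then sqrt" is usually easier than reading the nested call from the inside out.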

Data Transformation

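By default, the RStudio keyboard shortcut inserts the magrittr pipe (%>%). To make it insert the native pipe operator (|>), which requires R 4.1 or later, set the option as follows:
- Tools > Global Options > Code from the side menu > Choose “Use native pipe operator, |>”.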

Data Transformation

`dplyr` basics

DATA_FRAME |> filter(LOGICAL_CONDITIONS)
DATA_FRAME |> arrange(VARIABLES)
DATA_FRAME |> select(VARIABLES)
DATA_FRAME |> rename(NEW_VARIABLE = EXISTING_VARIABLE)
DATA_FRAME |> mutate(NEW_VARIABLE = ... )
DATA_FRAME |> relocate(VARIABLES)
DATA_FRAME |> group_by(VARIABLES)
DATA_FRAME |> summarize(NEW_VARIABLE = ...)
The first argument is a data.frame, supplied here through the pipe. The subsequent arguments describe what to do with the data.frame, mostly using the variable names.
The result is a data.frame.
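
To see how several verbs fit together, here is a minimal sketch on a made-up data.frame. The name scores and its columns are hypothetical, introduced only for this illustration.

```r
library(dplyr)

# A small made-up data.frame, used only for illustration
scores <- data.frame(
  student = c("Ann", "Bob", "Cho", "Dee"),
  section = c("A", "B", "A", "B"),
  score   = c(90, 72, 85, 64)
)

# Each verb takes a data.frame first and returns a data.frame,
# so the verbs can be chained with the pipe
result <- scores |>
  filter(score >= 70) |>               # keep students with score >= 70
  mutate(score_pct = score / 100) |>   # add a new variable
  arrange(score) |>                    # sort by score, ascending
  select(student, score_pct)           # keep two variables

result
class(result)   # "data.frame"
```

Because every step returns a data.frame, any step can be removed or reordered without changing how the others are written.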

Filter observations with `filter()`

jan1 <- flights |> 
  filter(month == 1, day == 1)

dec25 <- flights |> 
  filter(month == 12, day == 25)

class(flights$month == 1)

filter() allows us to subset observations based on logical conditions, which evaluate to either TRUE or FALSE for each observation. Only the observations for which every condition is TRUE are kept.
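
The sketch below shows what filter() works with, using a small made-up data.frame (trips, a hypothetical stand-in for flights) so that it runs without the nycflights13 package.

```r
library(dplyr)

# A small made-up data.frame standing in for flights
trips <- data.frame(
  month = c(1, 1, 12, 12),
  day   = c(1, 2, 25, 31)
)

# The condition is a logical vector with one value per observation
is_jan1 <- trips$month == 1 & trips$day == 1
is_jan1          # TRUE FALSE FALSE FALSE
class(is_jan1)   # "logical"

# filter() keeps the observations where the condition is TRUE
jan1_trips <- trips |>
  filter(month == 1, day == 1)

nrow(jan1_trips) == sum(is_jan1)   # TRUE
```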

Filter observations with `filter()`

Logicals and Conditions

Logical variables take the value TRUE or FALSE.
Conditions are expressions that evaluate to a logical value.
Logical operations combine logical conditions and return a logical value when executed.
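
A minimal sketch of the three ideas above. The names age, has_id, is_adult, can_enter, and either_ok are made up for illustration.

```r
age    <- 20
has_id <- TRUE            # a logical variable

is_adult <- age >= 18     # a condition: evaluates to TRUE
class(is_adult)           # "logical"

# Logical operations combine conditions into a single logical value
can_enter <- is_adult & has_id     # and: TRUE
either_ok <- is_adult | !has_id    # or, with negation: TRUE
```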

Filter observations with `filter()`

Boolean Operations

x is the left-hand circle, y is the right-hand circle, and the shaded regions show which parts each operator selects.
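
The same regions can be checked on two short logical vectors. This is only a sketch; the vectors x and y below are made up so that every combination of TRUE and FALSE appears once.

```r
x <- c(TRUE, TRUE, FALSE, FALSE)
y <- c(TRUE, FALSE, TRUE, FALSE)

x & y       # both x and y:        TRUE FALSE FALSE FALSE
x | y       # x or y (or both):    TRUE  TRUE  TRUE FALSE
x & !y      # x but not y:        FALSE  TRUE FALSE FALSE
xor(x, y)   # exactly one of them: FALSE  TRUE  TRUE FALSE
```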

Filter observations with `filter()`

`logical` conditions

df <- data.frame(
  num = c(8, 9, 10, 11),
  chr = c("A", "C", "B", "A"))

# Inside filter(), & and a comma between conditions both mean "and"
df |> filter(num > 8 &
             num < 11)

df |> filter(num > 8,
             num < 11)

Filter observations with `filter()`

`logical` conditions

df <- data.frame(
  num = c(8, 9, 10, 11),
  chr = c("A", "C", "B", "A"))

df |> filter(num < 10 & 
             chr == "A")

df |> filter(num < 10, 
             chr == "A")

Filter observations with `filter()`

`logical` conditions

df <- data.frame(
  num = c(8, 9, 10, 11),
  chr = c("A", "C", "B", "A"))

df |> filter(num < 10 |
             chr == "A")
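
To compare "and" with "or" side by side, the sketch below runs both filters on the same df and looks at which values of num survive. The result names both and either are introduced here for illustration.

```r
library(dplyr)

df <- data.frame(
  num = c(8, 9, 10, 11),
  chr = c("A", "C", "B", "A"))

# and: both conditions must be TRUE
both <- df |> filter(num < 10 & chr == "A")

# or: at least one condition must be TRUE
either <- df |> filter(num < 10 | chr == "A")

both$num     # 8
either$num   # 8 9 11
```

The "or" filter can never return fewer observations than the "and" filter on the same conditions.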

Filter observations with `filter()`

De Morgan’s law

flights |> 
  filter( !( arr_delay > 120 | 
              dep_delay > 120) )

flights |> 
  filter( arr_delay <= 120 & 
            dep_delay <= 120 )

!(x & y) is the same as !x | !y.
!(x | y) is the same as !x & !y.
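
Both identities can be verified directly. The sketch below uses made-up logical vectors and made-up delay values (in minutes), with no missing values.

```r
x <- c(TRUE, TRUE, FALSE, FALSE)
y <- c(TRUE, FALSE, TRUE, FALSE)

identical(!(x & y), !x | !y)   # TRUE
identical(!(x | y), !x & !y)   # TRUE

# The same idea with numeric conditions, as in the flights example
arr_delay <- c(10, 130, 50, 200)
dep_delay <- c(5, 20, 150, 180)

not_late_1 <- !(arr_delay > 120 | dep_delay > 120)
not_late_2 <- arr_delay <= 120 & dep_delay <= 120

identical(not_late_1, not_late_2)   # TRUE
```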

Filter observations with `filter()`

`%in%` operator

When the or operator (|) is used repeatedly to compare the same variable against several values, we can consider using the %in% operator instead.

flights |> 
  filter( month == 10 | 
            month == 11 | 
            month == 12 )

flights |> 
  filter(month %in% c(10, 11, 12))
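
A quick check that the two forms agree, on a made-up vector of months. The last two lines show one difference worth knowing: %in% returns FALSE for a missing value, while == returns NA.

```r
month <- c(1, 10, 11, 12, 6)

with_or <- month == 10 | month == 11 | month == 12
with_in <- month %in% c(10, 11, 12)

identical(with_or, with_in)   # TRUE

NA %in% c(10, 11, 12)   # FALSE
NA == 10                # NA
```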

Filter observations with `filter()`

Missing values (`NA`)

Almost any operation involving an unknown value (NA) will also be unknown.

NA > 5
10 == NA
NA + 10
NA / 2

NA == NA

Let x be Mary’s age. We don’t know how old she is.
Let y be John’s age. We don’t know how old he is.
Are John and Mary the same age?

x <- NA
y <- NA
x == y

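If we want to determine if a value is missing, use is.na().
filter() keeps only the observations for which the condition is TRUE, so it drops both FALSE and NA results. If we want to preserve missing values, ask filter() for them explicitly.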

x <- NA
is.na(x) # is x NA?

y <- "missing"
is.na(y) # is y NA?

df <- data.frame(y = c(1, NA, 3))

df |> 
  filter(y > 1)

df |> 
  filter( is.na(y) | y > 1 )
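
The sketch below spells out why the two filters above differ, and adds one more common case, dropping the missing values with !is.na(). The result names kept, with_na, and no_na are made up for illustration.

```r
library(dplyr)

df <- data.frame(y = c(1, NA, 3))

# The condition itself is NA for the missing value
df$y > 1   # FALSE NA TRUE

kept    <- df |> filter(y > 1)              # NA is dropped: y = 3
with_na <- df |> filter(is.na(y) | y > 1)   # NA is kept:    y = NA, 3
no_na   <- df |> filter(!is.na(y))          # only non-missing: y = 1, 3

# Count the missing values in y
sum(is.na(df$y))   # 1
```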

Find all unique observations with `distinct()`

# Remove duplicate observations, 
#  if any
flights |> 
  distinct()

# Find all unique 
#  origin and destination pairs
flights |> 
  distinct(origin, dest)

# If we want to keep other variables
flights |> 
  distinct(origin, dest, 
           .keep_all = TRUE)

distinct() finds all the unique observations in a data.frame.
- We can also optionally provide variable names to distinct() to find the unique combinations of those variables.
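
The three calls above can be compared on a small made-up data.frame (routes, a hypothetical stand-in for flights) in which one observation is an exact duplicate and another repeats only the origin and destination pair.

```r
library(dplyr)

routes <- data.frame(
  origin = c("JFK", "JFK", "JFK", "LGA"),
  dest   = c("LAX", "LAX", "LAX", "ORD"),
  delay  = c(10, 10, 30, 5)
)

# Drops the exact duplicate (the second observation): 3 observations remain
all_unique <- routes |> distinct()

# Unique origin and destination pairs: 2 observations, 2 variables
pairs <- routes |> distinct(origin, dest)

# Same 2 pairs, but keeping all variables from the first occurrence of each
first_rows <- routes |> distinct(origin, dest, .keep_all = TRUE)

nrow(all_unique)   # 3
nrow(pairs)        # 2
first_rows$delay   # 10 5
```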

Arrange observations with `arrange()`

flights |>
  arrange(year, month, day)

# re-order observations by `dep_delay` in descending order.
flights |>
  arrange([?])

arrange() sorts observations by the values of the given variables.
- If we provide more than one variable name, each additional variable will be used to break ties in the values of preceding variables.
Use desc() to re-order by a column in descending order.
- Adding - before a numeric variable (-NUMERIC_VARIABLE) also works.
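
A minimal sketch of tie-breaking and descending order, on a made-up data.frame (df and its columns grp, val, and id are hypothetical).

```r
library(dplyr)

df <- data.frame(
  grp = c("b", "a", "b", "a"),
  val = c(2, 3, 1, 3),
  id  = c(1, 2, 3, 4)
)

# Sort by grp first; val breaks ties within each value of grp
by_grp <- df |> arrange(grp, val)
by_grp$val   # 3 3 1 2

# Descending order: desc() and a leading minus agree for a numeric variable
by_val_desc  <- df |> arrange(desc(val))
by_val_minus <- df |> arrange(-val)
by_val_desc$val                                   # 3 3 2 1
identical(by_val_desc$val, by_val_minus$val)      # TRUE

# A second variable breaks the tie between the two observations with val = 3
tie_broken <- df |> arrange(desc(val), desc(id))
tie_broken$id   # 4 2 1 3
```

desc() also works on character variables, whereas the leading minus is limited to numeric ones.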

Rows: `filter()`, `distinct()`, and `arrange()`

Let’s do Classwork 9!
