Filtering observations with filter()
April 2, 2024
dplyr basics
dplyr provides functions to solve various data manipulation challenges:
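Filter observations by their values (filter()).
Reorder the observations (arrange()).
Pick variables by their names (select()).
Rename variables (rename()).
Create new variables from existing ones (mutate()).
Change the order of variables (relocate()).
Collapse many values down to a single summary (summarize()).
Operate group by group (group_by()).
Because the first argument is a data.frame and the output is a data.frame, dplyr verbs work well with the pipe, |>.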
The pipe (|>) takes the thing on its left and passes it along to the function on its right, so that f(x, y) is equivalent to x |> f(y).
filter(DATA_FRAME, LOGICAL_STATEMENT) is equivalent to DATA_FRAME |> filter(LOGICAL_STATEMENT).
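The equivalence between the nested call and the piped call can be checked directly. The following is a minimal sketch; the data.frame df and its variables id and score are hypothetical toy data, not part of the original slides, and the native pipe assumes R 4.1 or later.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
df <- data.frame(id = 1:5, score = c(90, 72, 85, 60, 77))

# Nested form: filter(DATA_FRAME, LOGICAL_STATEMENT)
nested <- filter(df, score > 75)

# Piped form: DATA_FRAME |> filter(LOGICAL_STATEMENT)
piped <- df |> filter(score > 75)

# Both forms give the same data.frame
identical(nested, piped)
```

Both calls keep the three observations whose score is above 75; the pipe only changes how the call is written, not what it does.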
The pipe (|>) can be read as “then”.
The pipe (|>) is super useful when we have a chain of data-transforming operations to do.
To use the (native) pipe operator (|>), we should first set the corresponding option.
dplyr basics
DATA_FRAME |> filter(LOGICAL_CONDITIONS)
DATA_FRAME |> arrange(VARIABLES)
DATA_FRAME |> select(VARIABLES)
DATA_FRAME |> rename(NEW_VARIABLE = EXISTING_VARIABLE)
DATA_FRAME |> mutate(NEW_VARIABLE = ... )
DATA_FRAME |> relocate(VARIABLES)
DATA_FRAME |> group_by(VARIABLES)
DATA_FRAME |> summarize(NEW_VARIABLE = ...)
The subsequent arguments describe what to do with the data.frame, mostly using the variable names.
The result is a data.frame.
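Because every verb takes a data.frame and returns a data.frame, the templates above can be chained with the pipe. The sketch below illustrates this under stated assumptions: the data.frame sales and its variables store, units, and price are hypothetical toy data invented for this example.

```r
library(dplyr)

# Hypothetical toy data standing in for a real data.frame
sales <- data.frame(
  store = c("A", "A", "B", "B", "C"),
  units = c(10, 4, 7, 12, 3),
  price = c(2.5, 2.5, 3.0, 3.0, 4.0)
)

result <- sales |>
  filter(units >= 4) |>                        # keep observations with at least 4 units
  mutate(revenue = units * price) |>           # create a new variable
  group_by(store) |>                           # operate store by store
  summarize(total_revenue = sum(revenue)) |>   # collapse to one row per store
  arrange(desc(total_revenue))                 # largest total first

result
```

Reading each |> as "then" gives: take sales, then filter, then mutate, then group, then summarize, then arrange. Store C is dropped by the filter, so the result has one row each for stores B and A.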
filter()
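library(dplyr)
library(nycflights13)  # assumption: flights is the nycflights13 data.frame

# Flights on January 1
jan1 <- flights |>
  filter(month == 1, day == 1)
# Flights on December 25
dec25 <- flights |>
  filter(month == 12, day == 25)
class(flights$month == 1)  # class of a logical condition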
filter() allows us to subset observations based on the value of logical conditions, which are either TRUE or FALSE.
A logical condition returns a TRUE or FALSE value for each observation, and the class of the result is logical.
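To make the link between a logical condition and filter() concrete, here is a small sketch. The data.frame df is hypothetical toy data, not the flights data used in the slides.

```r
library(dplyr)

# Hypothetical toy data
df <- data.frame(month = c(1, 1, 12, 6), day = c(1, 15, 25, 1))

# A logical condition gives one TRUE or FALSE per observation
cond <- df$month == 1
cond
class(cond)

# filter() keeps the observations for which the condition is TRUE
kept <- df |> filter(month == 1)
kept
```

Here cond is TRUE for the first two observations only, so kept contains exactly those two rows.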
Logical conditions
In the diagram of logical operators, x is the left-hand circle, y is the right-hand circle, and the shaded regions show which parts each operator selects.
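The regions in that diagram correspond to the operators & (and), | (or), ! (not), and xor(). The sketch below applies each one inside filter(); the data.frame flights_toy and its values are hypothetical and only mimic the variable names of the flights data.

```r
library(dplyr)

# Hypothetical toy data with flights-like variable names
flights_toy <- data.frame(
  month     = c(1, 11, 12, 12),
  dep_delay = c(5, 130, -2, 200)
)

# x & y: both conditions hold
and_rows <- flights_toy |> filter(month == 12 & dep_delay > 120)

# x | y: at least one condition holds
or_rows <- flights_toy |> filter(month == 11 | month == 12)

# !x: the condition does not hold
not_rows <- flights_toy |> filter(!(month == 12))

# xor(x, y): exactly one of the two conditions holds
xor_rows <- flights_toy |> filter(xor(month == 12, dep_delay > 120))
```

Note that separating conditions with a comma inside filter(), as in filter(month == 1, day == 1), combines them the same way & does.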
!(x & y) is the same as !x | !y.
!(x | y) is the same as !x & !y.
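These two identities (De Morgan's laws) can be verified on every combination of TRUE and FALSE, and they are handy for rewriting a negated filter() condition. In the sketch below, the data.frame df with variables arr_delay and dep_delay is hypothetical toy data.

```r
library(dplyr)

# All four combinations of TRUE and FALSE
x <- c(TRUE, TRUE, FALSE, FALSE)
y <- c(TRUE, FALSE, TRUE, FALSE)

law1 <- identical(!(x & y), !x | !y)
law2 <- identical(!(x | y), !x & !y)

# Hypothetical toy data: the same rule applied inside filter()
df <- data.frame(arr_delay = c(10, 150, 30, 200),
                 dep_delay = c(5, 20, 140, 180))

# "Not delayed by more than 120 on arrival or on departure" ...
a <- df |> filter(!(arr_delay > 120 | dep_delay > 120))
# ... is the same as "at most 120 on arrival and at most 120 on departure"
b <- df |> filter(arr_delay <= 120, dep_delay <= 120)

identical(a, b)
```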
%in% operator
If | is repeatedly used, we can consider using the %in% operator instead.
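As a brief illustration of that replacement, the sketch below writes the same condition both ways. The data.frame df and its month values are hypothetical toy data.

```r
library(dplyr)

# Hypothetical toy data
df <- data.frame(month = c(1, 10, 11, 12, 11))

# Repeated use of |
with_or <- df |> filter(month == 10 | month == 11 | month == 12)

# The same condition with %in%: TRUE when month matches any listed value
with_in <- df |> filter(month %in% c(10, 11, 12))

identical(with_or, with_in)
```

The %in% version stays short however many values are listed, which makes it easier to read and to extend.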
Missing values (NA)
Almost any operation involving an unknown value (NA) will also be unknown.
Let x be Mary’s age. We don’t know how old she is.
Let y be John’s age. We don’t know how old he is.
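The Mary and John example can be written out in R. The sketch below also shows how filter() treats NA; the data.frame people and its age variable are hypothetical toy data.

```r
library(dplyr)

x <- NA  # Mary's age: unknown
y <- NA  # John's age: unknown

# Are Mary and John the same age? We don't know, so the result is NA
same_age <- x == y

# is.na() is the way to test for a missing value
x_missing <- is.na(x)

# Hypothetical toy data
people <- data.frame(name = c("Ann", "Bob", "Cy"), age = c(25, NA, 40))

# filter() keeps only observations where the condition is TRUE,
# so observations where it is NA are dropped along with the FALSE ones
over_30 <- people |> filter(age > 30)

# To keep the missing values as well, ask for them explicitly
over_30_or_na <- people |> filter(is.na(age) | age > 30)
```

Here over_30 contains only Cy, while over_30_or_na contains Bob and Cy.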
distinct()
distinct() finds all the unique observations in a data.frame.
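A short sketch of distinct() follows. The data.frame routes and its variables origin and dest are hypothetical toy data chosen to contain one duplicated observation.

```r
library(dplyr)

# Hypothetical toy data: the first two observations are identical
routes <- data.frame(
  origin = c("JFK", "JFK", "LGA", "JFK"),
  dest   = c("LAX", "LAX", "ORD", "SFO")
)

# Unique observations across all variables
unique_routes <- routes |> distinct()

# Unique values of selected variables only; other variables are dropped
unique_origins <- routes |> distinct(origin)
```

With no variables named, distinct() removes fully duplicated observations; when variables are supplied, it returns the unique combinations of just those variables.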