Data Preparation and Management with R
October 3, 2024
tidyverse
DATA.FRAME |> filter(LOGICAL_CONDITIONS)
DATA.FRAME |> arrange(VARIABLES)
DATA.FRAME |> distinct(VARIABLES)
DATA.FRAME |> select(VARIABLES)
DATA.FRAME |> rename(NEW_VARIABLE = EXISTING_VARIABLE)
In each case, the data.frame comes first (here, on the left of the pipe). The subsequent arguments describe what to do with the data.frame, mostly using the variable names.
The result is a data.frame.
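The shared pattern can be sketched with a small made-up data.frame. The name students and its columns are hypothetical and not part of the course data, and the sketch assumes the dplyr package is installed and R is version 4.1 or later (for the |> pipe).

```r
library(dplyr)

# Hypothetical toy data, for illustration only.
students <- data.frame(
  name  = c("Ana", "Ben", "Cleo", "Dev"),
  score = c(82, 91, 77, 91)
)

# The data.frame is piped in as the first argument; the remaining
# arguments refer to its variables by name.
result <- students |>
  filter(score > 80) |>
  arrange(name)

# The result is again a data.frame, which is why verbs can be chained.
is.data.frame(result)
result
```

Because every verb takes a data.frame and returns one, the output of one step can be piped directly into the next.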
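filter()
filter() keeps the observations for which the logical condition is TRUE. It drops observations for which the condition is FALSE or NA.
A comparison with NA returns NA. Let x be Mary’s age. We don’t know how old she is. Let y be John’s age. We don’t know how old he is. Then x == y is NA: we cannot tell whether they are the same age.
To check whether a value is missing, use is.na().
Because filter() drops NA values, ask for them explicitly if you want to keep them.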
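A minimal sketch of how filter() treats missing values, assuming dplyr is installed. The data.frame people and its columns are hypothetical and used only for illustration.

```r
library(dplyr)

# Comparisons with a missing value are themselves missing.
x <- NA   # Mary's age, unknown
y <- NA   # John's age, unknown
x == y    # NA: we cannot say whether they are the same age
is.na(x)  # TRUE: is.na() is the way to test for missingness

# Hypothetical toy data with one missing age.
people <- data.frame(
  name = c("Ana", "Ben", "Cleo"),
  age  = c(NA, 34, 12)
)

# filter() keeps only rows where the condition is TRUE,
# so the row with the missing age is dropped along with the FALSE row.
adults <- people |> filter(age >= 18)

# To keep rows with missing values, ask for them explicitly.
adults_or_unknown <- people |> filter(is.na(age) | age >= 18)
```

Here adults contains only Ben, while adults_or_unknown also keeps Ana, whose age is missing.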
arrange()
arrange() sorts observations by the values of the given variables, in ascending order by default.
desc()
# Arrange observations by `dep_delay` in descending order.
flights |>
  arrange(desc(dep_delay))

# Equivalent for a numeric variable: negate it.
flights |>
  arrange(-dep_delay)
Use desc(VARIABLE) to re-order by VARIABLE in descending order.
Placing - before a numeric variable (-NUMERIC_VARIABLE) also works.
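The two ways of sorting in descending order can be compared on a small made-up data.frame. The name delays and its values are hypothetical stand-ins for a larger table, and the sketch assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data, for illustration only.
delays <- data.frame(
  carrier   = c("AA", "UA", "DL", "UA"),
  dep_delay = c(5, 120, -3, 45)
)

# Two equivalent ways to sort a numeric variable in descending order.
by_desc  <- delays |> arrange(desc(dep_delay))
by_minus <- delays |> arrange(-dep_delay)

# desc() also works on character variables, where unary minus would fail.
# Later variables break ties among earlier ones.
by_carrier <- delays |> arrange(desc(carrier), dep_delay)
```

In by_carrier, the two UA rows come first and are ordered among themselves by ascending dep_delay.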
distinct()
distinct() finds all the unique observations in a data.frame.
With variables supplied, distinct(VARIABLES) finds the unique combinations of those variables.
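A short sketch of distinct() with and without variables, assuming dplyr is installed. The data.frame visits and its columns are hypothetical.

```r
library(dplyr)

# Hypothetical toy data containing one fully duplicated row.
visits <- data.frame(
  patient = c("p1", "p1", "p2", "p2", "p3"),
  clinic  = c("north", "north", "north", "south", "south")
)

# With no variables, distinct() drops fully duplicated rows.
unique_rows <- visits |> distinct()

# With variables, it returns the unique combinations of those variables;
# by default only the named columns are kept.
unique_clinics <- visits |> distinct(clinic)
```

If the other columns are needed as well, distinct() accepts .keep_all = TRUE, which keeps the first row found for each unique combination.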
select()
It’s not uncommon to get datasets with hundreds or thousands of variables. select() allows us to narrow in on the variables we’re actually interested in.
We can select variables by their names.
With select(-VARIABLES), we can remove variables.
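Both uses of select() can be sketched on a small made-up data.frame. The name survey and its columns are hypothetical, and dplyr is assumed to be installed.

```r
library(dplyr)

# Hypothetical toy data, for illustration only.
survey <- data.frame(
  id     = 1:3,
  age    = c(25, 41, 33),
  income = c(52000, 61000, 48000),
  notes  = c("a", "b", "c")
)

# Keep only the named variables.
kept <- survey |> select(id, age)

# Remove a variable and keep everything else.
dropped <- survey |> select(-notes)

names(kept)
names(dropped)
```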
rename()
rename() can be used to rename variables:
DATA.FRAME |> rename(NEW_VARIABLE = EXISTING_VARIABLE)
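As a concrete instance of this template, the sketch below renames one column of a made-up data.frame. The names survey, inc, and income are hypothetical, and dplyr is assumed to be installed.

```r
library(dplyr)

# Hypothetical toy data with a terse column name.
survey <- data.frame(
  id  = 1:3,
  inc = c(52000, 61000, 48000)
)

# The new name goes on the left, the existing name on the right.
renamed <- survey |> rename(income = inc)

names(renamed)
```

Only the name changes; the values and the column order stay the same.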