Lecture 16

Data Preparation and Management with R

Byeong-Hak Choe

SUNY Geneseo

October 7, 2024

Arrange observations with arrange()

# arrange observations by `dep_delay` in ascending order.
flights |> 
  arrange(dep_delay)
  • arrange() sorts observations (rows) by the values of the given variable, in ascending order by default.
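To see the default ordering on something small enough to read at a glance, here is a minimal sketch. The data frame `toy` and its values are made up for illustration (they are not from `flights`), and the code assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical toy data standing in for `flights`
toy <- data.frame(
  flight    = c("A", "B", "C", "D"),
  dep_delay = c(15, NA, -3, 40)
)

# Sort rows by dep_delay, smallest value first
sorted <- toy |>
  arrange(dep_delay)

sorted$flight  # "C" "A" "D" "B"
```

Note that the row with a missing `dep_delay` ends up last: arrange() places NA values at the end.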

Arrange observations with arrange()

Descending order with desc()

# arrange observations by `dep_delay` in descending order.
flights |> 
  arrange(desc(dep_delay))
  
flights |> 
  arrange(-dep_delay)
  • Use desc(VARIABLE) to re-order by a VARIABLE in descending order.
    • Adding - before a numeric variable (-NUMERIC_VARIABLE) also works.
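A small sketch comparing the two approaches. The data frame `toy` is hypothetical; the code assumes dplyr is installed. It also shows desc() applied to a character variable, where the minus sign is not an option.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  carrier   = c("AA", "UA", "DL"),
  dep_delay = c(5, 30, -2)
)

# Two ways to sort a numeric variable from largest to smallest
by_desc  <- toy |> arrange(desc(dep_delay))
by_minus <- toy |> arrange(-dep_delay)

by_desc$carrier   # "UA" "AA" "DL"
by_minus$carrier  # "UA" "AA" "DL"

# desc() also works on character variables (reverse alphabetical order)
by_carrier <- toy |> arrange(desc(carrier))
by_carrier$carrier  # "UA" "DL" "AA"
```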

Arrange observations with arrange()

df <- data.frame(
  year = c(2024, 2021, 2024, 2024),
  month = c(7, 10, 7, 4),
  day = c(20, 19, 15, 9)
)
df |> 
  arrange(year, month, day)

  • If we provide more than one variable name, each additional variable will be used to break ties in the values of preceding variables.
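Running the example above and inspecting one column makes the tie-breaking visible. This sketch reuses the slide's `df`; the second call (sorting by `month` first) is an added variation to show that the order of the variables matters. It assumes dplyr is installed.

```r
library(dplyr)

df <- data.frame(
  year  = c(2024, 2021, 2024, 2024),
  month = c(7, 10, 7, 4),
  day   = c(20, 19, 15, 9)
)

# year first; ties in year broken by month; ties in month broken by day
sorted <- df |>
  arrange(year, month, day)
sorted$day  # 19 9 15 20

# Changing the variable order changes the result
by_month <- df |>
  arrange(month, day)
by_month$day  # 9 15 20 19
```

The two July 2024 rows tie on `year` and `month`, so `day` decides their order (15 before 20).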

Find all unique observations with distinct()

Find all unique observations with distinct()

df <- data.frame(
  v1 = c("USA", "Korea", "USA"),
  v2 = c("D.C.", "Seoul", "D.C."),
  v3 = c("Georgetown", "Gangnam", 
         "Georgetown")
)

# Remove duplicate observations
df |> 
  distinct()

# Remove duplicate observations, if any
flights |> 
  distinct()
  • distinct() returns the unique observations (rows) in a data.frame, dropping exact duplicates.
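A quick way to check what distinct() did is to compare row counts before and after. This sketch rebuilds the three-row `df` from the slide; it assumes dplyr is installed.

```r
library(dplyr)

df <- data.frame(
  v1 = c("USA", "Korea", "USA"),
  v2 = c("D.C.", "Seoul", "D.C."),
  v3 = c("Georgetown", "Gangnam", "Georgetown")
)

# Rows 1 and 3 are identical, so only one of them is kept
unique_rows <- df |>
  distinct()

nrow(df)           # 3
nrow(unique_rows)  # 2
unique_rows$v1     # "USA" "Korea"
```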

Find all unique observations with distinct()

# Find all unique 
#  origin and destination pairs
flights |> 
  distinct(origin, dest)
  • We can also provide variable names to distinct() to find the unique combinations of those variables.
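A minimal sketch of distinct() with variable names, using a hypothetical `toy` data frame in place of `flights` (the values are made up). It also shows the `.keep_all = TRUE` argument, which keeps the remaining columns from the first row of each combination. It assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data standing in for `flights`
toy <- data.frame(
  origin    = c("JFK", "JFK", "LGA", "JFK"),
  dest      = c("LAX", "LAX", "ORD", "ORD"),
  dep_delay = c(5, 12, -1, 30)
)

# Unique origin-destination pairs; other columns are dropped
pairs <- toy |>
  distinct(origin, dest)
names(pairs)  # "origin" "dest"
nrow(pairs)   # 3

# Keep all columns, taking the first row for each pair
first_rows <- toy |>
  distinct(origin, dest, .keep_all = TRUE)
first_rows$dep_delay  # 5 -1 30
```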

Select variables with select()

Select variables with select()

Basic

  • It’s not uncommon to get datasets with hundreds or thousands of variables.

  • select() allows us to narrow in on the variables we’re actually interested in.

  • We can select variables by their names.

flights |> 
  select(year, month, day)
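The same idea on a small, hypothetical data frame (`toy`, with made-up values), so the result can be checked by eye. The second call is an added illustration that the output columns follow the order given to select(). It assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical one-row data frame with a few flights-like columns
toy <- data.frame(
  year = 2013, month = 1, day = 1,
  dep_delay = 2, carrier = "UA"
)

# Keep only the named variables
picked <- toy |>
  select(year, month, day)
names(picked)  # "year" "month" "day"

# Columns come back in the order they are listed
reordered <- toy |>
  select(carrier, year)
names(reordered)  # "carrier" "year"
```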

Select variables with select()

Removal

flights |> 
  select(-year)
  • With select(-VARIABLES), we can remove variables.
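A short sketch of removal on a hypothetical `toy` data frame (made-up values). Dropping several variables at once by wrapping them in c() is shown as one possible way to do it; it assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical one-row data frame
toy <- data.frame(year = 2013, month = 1, day = 1, dep_delay = 2)

# Drop a single variable
no_year <- toy |>
  select(-year)
names(no_year)  # "month" "day" "dep_delay"

# Drop several variables at once
no_year_month <- toy |>
  select(-c(year, month))
names(no_year_month)  # "day" "dep_delay"
```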

Rename variables with rename()

Rename variables with rename()

flights |> 
  rename(tail_num = tailnum)
  • rename() can be used to rename variables:

    • DATA_FRAME |> rename(NEW_VARIABLE = EXISTING_VARIABLE)
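A minimal sketch of the NEW = EXISTING pattern on a hypothetical `toy` data frame (the values are made up). It also illustrates that rename() returns a new data frame and leaves the original untouched unless the result is assigned back. It assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  tailnum   = c("N14228", "N24211"),
  dep_delay = c(2, 4)
)

# New name on the left, existing name on the right
renamed <- toy |>
  rename(tail_num = tailnum)

names(renamed)  # "tail_num" "dep_delay"
names(toy)      # "tailnum" "dep_delay" (original is unchanged)
```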