Data Transformation with R
September 29, 2025
Reading a CSV file as a data.frame
To see the current working directory, run getwd() in the Console.
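Example absolute paths to custdata_rev.csv:
/cloud/project/custdata_rev.csv
/Users/user/documents/data/custdata_rev.csv
C:\\Users\\user\\Documents\\data\\custdata_rev.csv
In a Windows-style path, write each backslash as a double backslash (\\) because a single backslash (\) is treated as a special character in R.
Path to custdata_rev.csv: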
/cloud/project/data/custdata_rev.csv
Download custdata_rev.csv from the Class Files module in Brightspace.
Then place the custdata_rev.csv file into the data folder of your project.
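The path rules above can be tried out with a minimal base-R sketch. The folder and file names are the ones used in these slides, and no file is actually read here; the variable names (wd, win_path, rel_path, abs_path) are only illustrative.

```r
# Where R currently looks for files
wd <- getwd()
print(wd)

# Inside an R string, "\\" stands for ONE backslash character
win_path <- "C:\\Users\\user\\Documents\\data\\custdata_rev.csv"
cat(win_path, "\n")          # displayed with single backslashes
n_backslash <- nchar("\\")   # length of the escaped backslash

# Forward slashes need no escaping; file.path() joins the pieces with "/"
rel_path <- file.path("data", "custdata_rev.csv")
abs_path <- file.path(wd, rel_path)

# Check that the file is where we expect before trying to read it
file_found <- file.exists(abs_path)
print(file_found)
```

If file_found is FALSE, the file is not in the data folder under the working directory, so either move the file or adjust the path.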
view(DF) (or View(DF)) opens the data.frame DF in a spreadsheet-like viewer.
# Read the CSV file directly from the web (GitHub repo)
library(tidyverse) # Provides read_csv()
custdata_web <- read_csv(
'https://bcdanl.github.io/data/custdata_rev.csv')
Inspecting a data.frame
The $ operator extracts a single column from a data.frame as a vector.
dim() shows both the number of rows and columns.
nrow() and ncol() give the row count and column count separately.
summary() gives a quick overview, while skimr::skim() provides a more detailed, user-friendly summary of variables across all data types.
Observations in a data.frame
Rows in a data.frame represent individual units or entities for which data is collected.
Variables in a data.frame
Columns in a data.frame represent attributes or characteristics measured across multiple observations.
Example sets of variables:
Name, Age, Grade, Major
EmployeeID, Name, Age, Department
CustomerID, Name, Age, Income, HousingType
Note
In a data.frame, a variable is a column of data.
Tidy data.frame
A data.frame is tidy if it follows three rules:
Each variable has its own column.
Each observation has its own row.
Each value has its own cell.
A tidy data.frame keeps your data organized, making it easier to understand, analyze, and share in any data analysis.
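As a small illustration of rows as observations, columns as variables, and the inspection helpers mentioned earlier ($, dim(), nrow(), ncol(), summary()), here is a base-R sketch. The students data is hypothetical and invented for this example; the skimr call runs only if that package happens to be installed.

```r
# Hypothetical data: each row is one student (an observation),
# each column is one attribute (a variable)
students <- data.frame(
  Name  = c("Ana", "Ben", "Chloe"),
  Age   = c(20, 22, 21),
  Major = c("Economics", "Biology", "Data Analytics")
)

ages  <- students$Age     # one column, returned as a plain vector
shape <- dim(students)    # number of rows, then number of columns
n_obs <- nrow(students)   # row count only
n_var <- ncol(students)   # column count only

print(summary(students))  # quick overview of every variable

# Optional, more detailed summary (only if skimr is available)
if (requireNamespace("skimr", quietly = TRUE)) {
  print(skimr::skim(students))
}
```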
dplyr
dplyr is a core tidyverse package for data manipulation: tasks like filtering, sorting, selecting, and renaming.
dplyr functions with data.frame DF:
filter(DF, LOGICAL_CONDITIONS)
arrange(DF, VARIABLES)
distinct(DF, VARIABLES)
select(DF, VARIABLES)
rename(DF, NEW_NAME = CURRENT_NAME)
dplyr functions take a data.frame as the first argument.
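The five function templates listed above can be sketched on a tiny table. The orders data and its column names are hypothetical, made up only to show the calling pattern in which the data.frame comes first.

```r
library(dplyr)

# Hypothetical order data, invented for illustration
orders <- tibble(
  id     = c(1, 2, 2, 3),
  region = c("East", "West", "West", "East"),
  amount = c(50, 20, 20, 75)
)

big_orders  <- filter(orders, amount > 30)    # keep rows meeting the condition
by_amount   <- arrange(orders, amount)        # sort rows by amount (ascending)
unique_rows <- distinct(orders)               # drop the fully repeated row
two_columns <- select(orders, id, amount)     # keep only the named columns
renamed     <- rename(orders, order_id = id)  # NEW_NAME = CURRENT_NAME

print(big_orders)
print(renamed)
```

None of these calls modifies orders itself; each returns a new data.frame, which is why the results are assigned to new names.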
dplyr Code Flow with the Pipe Operator
The pipe operator (|> or %>%) makes code easier to read by connecting steps in order.
f(x, y) is the same as x |> f(y).
For a data.frame DF:
filter(DF, logical_condition) is the same as DF |> filter(logical_condition)
dplyr functions with the pipe operator:
DF |> filter(LOGICAL_CONDITIONS)
DF |> arrange(VARIABLES)
DF |> distinct(VARIABLES)
DF |> select(VARIABLES)
DF |> rename(NEW_NAME = CURRENT_NAME)
dplyr functions usually take a data.frame as the first argument, so steps can be chained:
DF |> filter(...) |> arrange(...)
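To make the equivalence between f(x, y) and x |> f(y) concrete, here is a short sketch on hypothetical scores data (the table and its column names are invented). It assumes R 4.1 or later, where the native pipe |> is available.

```r
library(dplyr)

# Hypothetical exam scores, invented for illustration
scores <- tibble(
  student = c("A", "B", "C", "D"),
  score   = c(91, 67, 85, 78)
)

# The same step written two ways
nested <- filter(scores, score >= 75)
piped  <- scores |> filter(score >= 75)
same_result <- identical(nested, piped)
print(same_result)

# Chaining: each step receives the result of the previous one
top_down <- scores |>
  filter(score >= 75) |>
  arrange(desc(score))
print(top_down)
```

Read the chain top to bottom: start with scores, keep scores of 75 or more, then sort from highest to lowest.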
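Keyboard shortcut for the pipe (|>) in RStudio:
Ctrl + Shift + M (Windows) or Cmd + Shift + M (Mac)
If the shortcut inserts %>%, enable "Use native pipe operator" under Tools > Global Options > Code so that it inserts the native pipe (|>).
filter()
install.packages("nycflights13") # Install once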
library(nycflights13)
library(tidyverse)
flights <- nycflights13::flights
flights$month == 1 # A logical test returns TRUE or FALSE
class(flights$month == 12) # The result is a logical vector
filter() keeps only the observations that meet one or more logical conditions.
Each logical condition evaluates to TRUE (keep the observation) or FALSE (drop the observation).
In a comparison of V1 with V2, V1 and V2 are variables, and the comparisons are applied element-wise (vectorized).
V1 and V2 are integer or numeric.
In a logical operation, x and y are logical conditions/variables.
The logical operators (&, |, !) combine logical variables/conditions and return a logical variable when executed.
In the diagram, x and y are logical conditions.
If x is TRUE, it highlights the left circle.
If y is TRUE, it highlights the right circle.
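The element-wise comparisons and the logical operators described above can be checked with a few short vectors. This is a base-R sketch; the vectors v1, v2, x, and y hold made-up values chosen so that every TRUE/FALSE combination appears.

```r
# Element-wise (vectorized) comparisons between two numeric vectors
v1 <- c(1, 5, 10)
v2 <- c(2, 5, 3)
is_equal <- v1 == v2   # compares position by position
is_less  <- v1 <  v2

# Logical vectors covering all four TRUE/FALSE combinations
x <- c(TRUE, TRUE, FALSE, FALSE)
y <- c(TRUE, FALSE, TRUE, FALSE)

both   <- x & y   # TRUE only where x AND y are TRUE
either <- x | y   # TRUE where x OR y (or both) is TRUE
not_x  <- !x      # flips TRUE and FALSE

print(both)
print(either)
print(not_x)
print(class(both))  # combining logical vectors gives a logical vector
```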
In filter(), separating conditions with a comma is equivalent to combining them with the & operator.
Missing values (NA)
NA (not available) represents a missing or unknown value in R.
Operations that involve an unknown value generally return an unknown value (NA).
Example: v1 is Mary’s age (unknown) and v2 is John’s age (unknown).
We cannot tell whether v1 == v2, so the result is NA.
is.na()
Use is.na() to test whether a value is missing (NA).
In filter(), you can:
Use is.na() to keep observations with missing values.
Use !is.na() to remove observations with missing values.
arrange()
arrange() sorts observations.
With more than one variable, arrange() sorts by the first variable, and then uses the next variable(s) to break ties.
desc()
Use desc(VARIABLE) to sort in descending order.
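The points above about missing values, comma-separated conditions, and sorting can be sketched together on one small table. The staff data is hypothetical; the unknown ages for Mary and John echo the example in the text, and everything else is invented.

```r
library(dplyr)

# Hypothetical staff data; the ages of Mary and John are unknown
staff <- tibble(
  name = c("Mary", "John", "Ana", "Raj", "Lee"),
  dept = c("A", "B", "A", "B", "A"),
  age  = c(NA, NA, 30, 25, 41)
)

# Comparing two unknown values gives an unknown result
unknown_cmp <- NA == NA
print(unknown_cmp)

# Keep or remove observations with a missing age
missing_age <- staff |> filter(is.na(age))
known_age   <- staff |> filter(!is.na(age))

# A comma between conditions behaves like &
with_comma <- staff |> filter(dept == "A", age > 25)
with_and   <- staff |> filter(dept == "A" & age > 25)
same_rows  <- identical(with_comma, with_and)

# Sort by dept first; within a dept, oldest first
sorted <- known_age |> arrange(dept, desc(age))
print(sorted)
```

Note that Mary is in dept A but is dropped by age > 25: the condition evaluates to NA for her row, and filter() keeps only rows where the condition is TRUE.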
arrange() - Example
distinct()
distinct() removes duplicate observations.
To keep only the unique values of particular variables, list those variables inside distinct().
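A brief sketch of distinct() with and without variables listed, using a hypothetical visits table (the data and column names are invented for illustration):

```r
library(dplyr)

# Hypothetical store visits; the first two rows are exact duplicates
visits <- tibble(
  customer = c("c1", "c1", "c2", "c2", "c3"),
  store    = c("North", "North", "North", "South", "South")
)

# No variables listed: drop rows that repeat across ALL columns
unique_visits <- visits |> distinct()

# One variable listed: keep each unique value of that variable
unique_stores <- visits |> distinct(store)

print(unique_visits)
print(unique_stores)
```

When variables are listed, the result contains only those variables by default.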
select()
It’s not uncommon to get datasets with hundreds or thousands of variables.
select() allows us to narrow in on the variables we’re actually interested in.
We can select variables by their names.
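Selecting variables by name can be sketched as follows. The customers table is hypothetical: it borrows variable names listed earlier in these slides (CustomerID, Name, Age, Income, HousingType), and all values are invented.

```r
library(dplyr)

# Hypothetical customer data; values are made up for illustration
customers <- tibble(
  CustomerID  = c(101, 102, 103),
  Name        = c("Kim", "Omar", "Zoe"),
  Age         = c(34, 51, 27),
  Income      = c(52000, 87000, 41000),
  HousingType = c("Rented", "Owned", "Rented")
)

# Keep only the variables we name, in the order we name them
name_income <- customers |> select(Name, Income)
print(name_income)
```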
With select(-VARIABLES), we can remove variables.
rename()
rename() can be used to rename variables:
DF |> rename(NEW_NAME = CURRENT_NAME)
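Removing variables with select(-VARIABLES) and renaming with rename(NEW_NAME = CURRENT_NAME) might look like this on a hypothetical customers table (the values are invented, and customer_id is just an illustrative new name):

```r
library(dplyr)

# Hypothetical customer data; values are made up for illustration
customers <- tibble(
  CustomerID  = c(101, 102, 103),
  Name        = c("Kim", "Omar", "Zoe"),
  Income      = c(52000, 87000, 41000),
  HousingType = c("Rented", "Owned", "Rented")
)

# A minus sign drops the named variables and keeps the rest
without_money <- customers |> select(-Income, -HousingType)

# The new name goes on the left, the current name on the right
renamed <- customers |> rename(customer_id = CustomerID)

print(without_money)
print(renamed)
```

rename() changes only the name of the variable; its values and its position in the data.frame stay the same.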