Data Transformation with R
September 29, 2025
data.frame
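To find your current working directory, run getwd() in the Console.
Example working directory: /cloud/project/
Absolute pathname for custdata_rev.csv:
Mac: /Users/user/documents/data/custdata_rev.csv
Windows: C:\\Users\\user\\Documents\\data\\custdata_rev.csv
On Windows, use double backslashes (\\) because a single backslash (\) is treated as a special (escape) character in R.
Relative pathname for custdata_rev.csv:
Absolute pathname: /cloud/project/data/custdata_rev.csv
Working directory: /cloud/project/
Relative pathname: data/custdata_rev.csv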
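The difference between absolute and relative pathnames can be tried out in a throwaway folder. The sketch below is only an illustration: the folder demo_project, the file custdata_demo.csv, and its columns are hypothetical stand-ins, and base R's read.csv() is used so that no packages are needed.

```r
# Hypothetical demo project in a temporary folder (not the course project)
proj <- file.path(tempdir(), "demo_project")
dir.create(file.path(proj, "data"), recursive = TRUE, showWarnings = FALSE)

# Write a tiny stand-in CSV; the columns are made up for illustration
demo <- data.frame(custid = 1:3, age = c(24, 82, 31))
write.csv(demo, file.path(proj, "data", "custdata_demo.csv"), row.names = FALSE)

old_wd <- setwd(proj)   # make the demo folder the working directory
getwd()                 # shows the current working directory

abs_path <- file.path(proj, "data", "custdata_demo.csv")  # absolute pathname
rel_path <- "data/custdata_demo.csv"                      # relative to getwd()

df_abs <- read.csv(abs_path)
df_rel <- read.csv(rel_path)
same_data <- identical(df_abs, df_rel)  # both pathnames reach the same file

# "\\" in R source code is a single backslash character in the actual string
one_backslash <- nchar("\\")

setwd(old_wd)           # restore the original working directory
```

A relative pathname only works while the working directory is the folder it is relative to, which is why the sketch sets and then restores the working directory.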
data.frame
Download custdata_rev.csv from the Class Files module in Brightspace.
Create a data folder in your project directory and place the custdata_rev.csv file into it.
view(DF) (or View(DF)) opens the data.frame DF in a spreadsheet-like viewer.
# Read the CSV file directly from the web (GitHub repo)
library(tidyverse)  # provides read_csv()
custdata_web <- read_csv(
  'https://bcdanl.github.io/data/custdata_rev.csv')
data.frame
The $ operator extracts a single column from a data.frame as a vector.
dim() shows both the number of rows and columns.
nrow() and ncol() give the row count and column count separately.
summary() gives a quick overview, while skimr::skim() provides a more detailed, user-friendly summary of variables across all data types.
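These inspection functions can be tried on a small table. The data.frame below is hypothetical (its values are invented for illustration), and the skimr call is left commented out because it assumes the skimr package is installed.

```r
# Hypothetical data.frame; values are invented for illustration
DF <- data.frame(
  CustomerID  = c(101, 102, 103, 104),
  Age         = c(34, 51, 27, 45),
  HousingType = c("Rented", "Owned", "Rented", "Owned")
)

ages <- DF$Age        # $ extracts one column as a vector
is.vector(ages)       # TRUE

dim(DF)               # rows and columns together: 4 3
nrow(DF)              # 4
ncol(DF)              # 3
summary(DF)           # quick overview of every column
# skimr::skim(DF)     # richer summary; requires the skimr package
```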
Observations (rows) in a data.frame represent individual units or entities for which data is collected.
Variables (columns) in a data.frame represent attributes or characteristics measured across multiple observations.
Examples of variables:
Name, Age, Grade, Major
EmployeeID, Name, Age, Department
CustomerID, Name, Age, Income, HousingType
Note: In a data.frame, a variable is a column of data.
Tidy data.frame
A data.frame is tidy if it follows three rules:
Each variable has its own column.
Each observation has its own row.
Each value has its own cell.
A tidy data.frame keeps your data organized, making it easier to understand, analyze, and share.
dplyr
dplyr is a core tidyverse package for data manipulation: tasks like filtering, sorting, selecting, and renaming.
dplyr functions with data.frame DF:
filter(DF, LOGICAL_CONDITIONS)
arrange(DF, VARIABLES)
distinct(DF, VARIABLES)
select(DF, VARIABLES)
rename(DF, NEW_NAME = CURRENT_NAME)
dplyr functions take a data.frame as the first argument.
dplyr Code Flow with the Pipe Operator
The pipe operator (|> or %>%) makes code easier to read by connecting steps in order.
f(x, y) is the same as x |> f(y).
For a data.frame DF, these two calls are equivalent:
filter(DF, logical_condition)
DF |> filter(logical_condition)
dplyr functions with the pipe operator:
DF |> filter(LOGICAL_CONDITIONS)
DF |> arrange(VARIABLES)
DF |> distinct(VARIABLES)
DF |> select(VARIABLES)
DF |> rename(NEW_NAME = CURRENT_NAME)
dplyr functions usually take a data.frame as the first argument, so steps can be chained:
DF |> filter(...) |> arrange(...)
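To see that the nested and piped forms give the same result, here is a minimal sketch. The data.frame and its values are hypothetical, and the native pipe |> assumes R 4.1 or later.

```r
library(dplyr)

# Hypothetical data; values are invented for illustration
DF <- data.frame(name = c("Ann", "Bo", "Cy", "Di"),
                 age  = c(31, 19, 45, 27))

# Nested call: has to be read from the inside out
nested <- arrange(filter(DF, age > 20), age)

# Piped call: reads top to bottom, one step per line
piped <- DF |>
  filter(age > 20) |>
  arrange(age)

identical(nested, piped)  # both forms give the same data.frame
```

The piped version states the steps in the order they happen, which becomes more helpful as the chain grows.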
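Keyboard shortcut for the pipe operator (|>) in RStudio:
Windows/Linux: Ctrl + Shift + M
Mac: Cmd + Shift + M
To have the shortcut insert |> rather than %>%, go to Tools > Global Options > Code and check "Use native pipe operator" (|>).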
filter()
install.packages("nycflights13") # Install once
library(nycflights13)
library(tidyverse)
flights <- nycflights13::flights
flights$month == 1 # A logical test returns TRUE or FALSE
class(flights$month == 12)
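The same kind of logical test can be tried without downloading anything. In this small sketch the vector month is a hypothetical stand-in for a column such as flights$month.

```r
# Hypothetical vector standing in for a column such as flights$month
month <- c(1, 12, 6, 1)

is_jan <- month == 1   # the comparison is made element by element
is_jan                 # TRUE FALSE FALSE TRUE
class(is_jan)          # "logical"
sum(is_jan)            # TRUE counts as 1, so this is the number of matches: 2
```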
filter() keeps only the observations that meet one or more logical conditions.
A logical condition evaluates to TRUE (keep the observation) or FALSE (drop the observation).
When comparing two variables V1 and V2 (integer or numeric), the comparison is applied element-wise (vectorized).
Logical operators (&, |, !) combine logical variables or conditions and return a logical variable. Here x and y denote logical conditions or variables.
[Figure: Venn diagrams of logical operations on x and y, where x and y are logical conditions. The left circle is highlighted where x is TRUE, and the right circle where y is TRUE.]
In filter(), separating conditions with a comma is equivalent to combining them with the & operator.
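Missing values (NA)
NA (not available) represents a missing or unknown value in R.
Almost any operation involving an unknown value also returns an unknown value (NA).
Suppose v1 is Mary’s age (unknown) and v2 is John’s age (unknown). We cannot tell whether they are the same age, so v1 == v2 is NA.
is.na()
Use is.na() to test whether a value is missing (NA).
In filter(), you can:
Use is.na() to keep observations with missing values.
Use !is.na() to remove observations with missing values.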
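A short sketch of how NA behaves in comparisons and in filter(). The data.frame is hypothetical, with one age deliberately left missing.

```r
library(dplyr)

v1 <- NA   # Mary's age (unknown)
v2 <- NA   # John's age (unknown)
v1 == v2   # NA: we cannot tell whether two unknown values are equal
is.na(v1)  # TRUE

# Hypothetical data; values are invented for illustration
DF <- data.frame(name = c("Ann", "Bo", "Cy"),
                 age  = c(31, NA, 45))

missing_age <- DF |> filter(is.na(age))    # keep observations with missing age
known_age   <- DF |> filter(!is.na(age))   # remove observations with missing age

# filter() keeps only TRUE, so a condition that evaluates to NA drops the row
over_30 <- DF |> filter(age > 30)
```

Note that Bo is absent from over_30 even though nothing says Bo is 30 or younger; the comparison is NA, and filter() treats that as "do not keep".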
arrange()
arrange() sorts observations.
arrange() sorts by the first variable, and then uses the next variable(s) to break ties.
desc()
Use desc(VARIABLE) to sort in descending order.
arrange() - Example
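As a stand-in example, the sketch below sorts a small hypothetical table (column names and values are invented) first by two variables and then in descending order.

```r
library(dplyr)

# Hypothetical data; values are invented for illustration
DF <- data.frame(year  = c(2013, 2013, 2012, 2013),
                 month = c(5, 1, 12, 1),
                 delay = c(10, 25, 7, 3))

# Sort by year first; month breaks ties within the same year
by_date <- DF |> arrange(year, month)

# Largest delay first
slowest_first <- DF |> arrange(desc(delay))
```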
distinct()
distinct() removes duplicate observations.
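A minimal sketch of distinct() on a hypothetical table that contains one repeated row (column names and values are invented):

```r
library(dplyr)

# Hypothetical data with one duplicated observation (rows 1 and 2)
DF <- data.frame(origin = c("JFK", "JFK", "LGA", "JFK"),
                 dest   = c("LAX", "LAX", "ORD", "SFO"))

# No variables given: drop rows that are duplicates across all columns
unique_rows <- DF |> distinct()

# Variables given: keep the unique values of just those variables
unique_origins <- DF |> distinct(origin)
```

When variables are named, the result contains only those variables, so unique_origins has a single column.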
select()
It’s not uncommon to get datasets with hundreds or thousands of variables.
select() allows us to narrow in on the variables we’re actually interested in.
We can select variables by their names.
With select(-VARIABLES), we can remove variables.
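Selecting by name and removing with a minus sign can reach the same result from opposite directions. A small sketch, with hypothetical column names and values:

```r
library(dplyr)

# Hypothetical data; values are invented for illustration
DF <- data.frame(year  = c(2013, 2013),
                 month = c(1, 2),
                 day   = c(10, 20),
                 delay = c(5, 12))

keep_dates <- DF |> select(year, month, day)  # name the variables to keep
drop_delay <- DF |> select(-delay)            # name the variable to remove

identical(keep_dates, drop_delay)  # same three columns either way
```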
rename()
rename()
can be used to rename variables:
DF |> rename(NEW_NAME = CURRENT_NAME)
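A short sketch of rename() on a hypothetical table. The new name departure_delay is made up for illustration.

```r
library(dplyr)

# Hypothetical data; values are invented for illustration
DF <- data.frame(dep_delay = c(5, 12),
                 arr_delay = c(-2, 20))

# NEW_NAME = CURRENT_NAME
renamed <- DF |> rename(departure_delay = dep_delay)

names(renamed)  # "departure_delay" "arr_delay"
names(DF)       # the original is unchanged: "dep_delay" "arr_delay"
```

Like the other dplyr functions, rename() returns a new data.frame, so the result has to be assigned to keep the new name.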