Lecture 6

Data Transformation with R

Byeong-Hak Choe

SUNY Geneseo

September 29, 2025

Reading a Data Table as a data.frame

CSV Files

  • A CSV (comma-separated values) file is a plain text file where each value is separated by a comma.
    • CSV files are widely used for storing data from spreadsheets and databases.
  • Example

Absolute Pathnames

  • An absolute pathname tells the computer the exact location of a file, starting from the very top folder of your computer.
    • This location never changes, no matter where you are working in R.
  • In R, you can see the working directory — the folder where R is currently “looking” for files — by running getwd() in the Console.
    • In a Posit Cloud, the working directory is /cloud/project/
  • Examples of an absolute pathname for custdata_rev.csv:
    • On a Mac:
      /Users/user/documents/data/custdata_rev.csv
    • On Windows:
      C:\\Users\\user\\Documents\\data\\custdata_rev.csv
      • Note: In Windows, we use double backslashes (\\) because a single backslash (\) is treated as a special character in R.

Relative Pathnames

  • A relative pathname specifies the location of a file relative to the working directory.
  • Examples of a relative pathname for custdata_rev.csv:
    • Absolute pathname:
      /cloud/project/data/custdata_rev.csv
    • Working directory:
      /cloud/project/
    • Relative pathname:
      data/custdata_rev.csv
  • In Posit Cloud, we typically use relative pathnames to access files inside the project folder.

Steps for Reading a CSV File as a data.frame

  1. Download custdata_rev.csv from the Class Files module in Brightspace.
    • The file will usually be saved in your computer’s Downloads folder.
    • Do not open this file in a spreadsheet app (e.g., Excel or Numbers).
  2. In Posit Cloud, create a subfolder named data (Files Pane+ Folder).
  3. Click the data folder, then upload the custdata_rev.csv file into it.
    • At the top of the Files Pane, click Upload to choose and add your file.
  4. Read the file using a relative pathname:
library(tidyverse)
# Read the CSV file from the data subfolder
custdata <- read_csv("data/custdata_rev.csv")
view(custdata)  # Opens a `data.frame` in a spreadsheet-like viewer
  • view(DF) (or View(DF)) opens a DF in a spreadsheet-like viewer.

Load a CSV File Directly from the Web into R

# Read the CSV file directly from the web (GitHub repo)
custdata_web <- read_csv(
        'https://bcdanl.github.io/data/custdata_rev.csv')
  • We can load a CSV file directly from the web into R.

Getting to Know a data.frame

class(custdata)
custdata$age
dim(custdata)
nrow(custdata)
ncol(custdata)
summary(custdata)
# Install once
install.packages("skimr")  
library(skimr)
skim(custdata)
  • The $ operator extracts a single column from a data.frame as a vector.
  • dim() shows both the number of rows and columns.
  • nrow() and ncol() give the row count and column count separately.
  • summary() gives a quick overview, while skimr::skim() provides a more detailed, user-friendly summary of variables across all data types.

Observations in data.frame

  • Rows in a data.frame represent individual units or entities for which data is collected.
  • Examples:
    • Student Information: Each row = one student
    • Employee Information: Each row = one employee
    • Daily S&P 500 Index Data: Each row = one trading day
    • Household Survey Data: Each row = one household

Variables in data.frame

  • Columns in a data.frame represent attributes or characteristics measured across multiple observations.
  • Examples:
    • Student Data: Name, Age, Grade, Major
    • Employee Data: EmployeeID, Name, Age, Department
    • Customer Data: CustomerID, Name, Age, Income, HousingType

Note

  • In a data.frame, a variable is a column of data.
  • In general programming, a variable is the name of an object.

Tidy data.frame

Variables, Observations, and Values

table1  # a sample data.frame from the tidyr package (part of the tidyverse)

  • A data.frame is tidy if it follows three rules:

    1. Each variable has its own column.
    2. Each observation has its own row.
    3. Each value has its own cell.
  • A tidy data.frame keeps your data organized, making it easier to understand, analyze, and share in any data analysis.

Data Transformation with dplyr

Data Transformation with dplyr

  • dplyr is a core tidyverse package for data manipulation — tasks like filtering, sorting, selecting, and renaming.
  • Common dplyr functions with data.frame DF:
    • filter(DF, LOGICAL_CONDITIONS)
    • arrange(DF, VARIABLES)
    • distinct(DF, VARIABLES)
    • select(DF, VARIABLES)
    • rename(DF, NEW_NAME = CURRENT_NAME)
  • dplyr functions take a data.frame as the first argument.
    • They return a data.frame as output.

Making dplyr Code Flow with the Pipe Operator

  • The pipe operator (|> or %>%) makes code easier to read by connecting steps in order.
  • How it works:
    • f(x, y) is the same as x |> f(y).
    • Example with a data frame DF:
      • filter(DF, logical_condition)
      • is the same as
        DF |> filter(logical_condition)
  • You can read the pipe as “then”:
    • Example: Take the data, then filter it

Data Transformation Functions with the Pipe

  • Common dplyr functions with the pipe operator:
    • DF |> filter(LOGICAL_CONDITIONS)
    • DF |> arrange(VARIABLES)
    • DF |> distinct(VARIABLES)
    • DF |> select(VARIABLES)
    • DF |> rename(NEW_NAME = CURRENT_NAME)
  • Why this works so well:
    • dplyr functions usually take a data.frame as the first argument.
    • They return a data.frame as output.
  • This allows you to chain together multiple steps naturally:
    • Example: Start with data, then filter, then arrange
      \(\qquad\quad\;\;\) DF |> filter(...) |> arrange(...)

Using the Pipe in Posit Cloud

  • To enable the native pipe operator (|>) in RStudio:
    • Go to Tools > Global Options > Code (side menu).
    • Under Pipe operator, choose Use native pipe operator (|>).
  • Keyboard shortcut for inserting a pipe:
    • Windows: Ctrl + Shift + M
    • Mac: Cmd + Shift + M

Filter observations with filter()

Filter Observations with filter()

install.packages("nycflights13")  # Install once
library(nycflights13)
library(tidyverse)

flights <- nycflights13::flights

flights$month == 1  # A logical test returns TRUE or FALSE
class(flights$month == 12)  
# Flights in January or December
jan <- flights |> 
  filter(month == 1)

dec <- flights |> 
  filter(month == 12)
# Flights on January 1st or December 25th
jan1 <- flights |> 
  filter(month == 1, day == 1)

dec25 <- flights |> 
  filter(month == 12, day == 25)
  • filter() keeps only the observations that meet one or more logical conditions.
    • A logical condition evaluates to TRUE (keep the observation) or FALSE (drop the observation).

Logical Conditions with Equality and Inequality

  • Here, both V1 and V2 are variables, and the comparisons are applied element-wise (vectorized).
  • For logical conditions using inequalities, we focus on cases where V1 and V2 are integer or numeric.

Logical Conditions - Example 1

df <- data.frame(
  num = c(8, 9, 10, 11),
  chr = c("A", "C", "B", "A")
  )
df |> 
  filter(num > 8)
df |> 
  filter(chr == "A")

Logical Operators — AND, OR, NOT

  • Here, both x and y are logical conditions/variables.
  • What logical operations (&, |, !) do is combining logical variables/conditions, which returns a logical variable when executed.

Logical Operations

  • x and y are logical conditions.
    • If x is TRUE, it highlights the left circle.
    • If y is TRUE, it highlights the right circle.

  • The shaded regions show which parts each logical operator selects.

Logical Conditions - Example 2

df <- data.frame(
  num = c(8, 9, 10, 11),
  chr = c("A", "C", "B", "A")
  )
df |> 
  filter(num > 8 & 
         num < 11)
                
df |> 
  filter(num > 8,
         num < 11)
  • In filter(), separating conditions with a comma is equivalent to combining them with the & operator.

Logical Conditions - Example 3

df <- data.frame(
  num = c(8, 9, 10, 11),
  chr = c("A", "C", "B", "A")
  )
df |> 
  filter(num < 10 & 
         chr == "A")

df |> 
  filter(num < 10, 
         chr == "A")

Logical Conditions - Example 4

df <- data.frame(
  num = c(8, 9, 10, 11),
  chr = c("A", "C", "B", "A")
  )
df |> 
  filter(num < 10 | 
         chr == "A")

Missing Values (NA)

NA > 5
10 == NA
NA + 10
NA / 2
(1 + NA + 3) / 3

mean( c(1, NA, 3) )
sd( c(1, NA, 3) )
  • NA (not available) represents a missing or unknown value in R.
  • In most calculations, if one value is unknown, the result will also be unknown (NA).

Comparing Missing Values

v1 <- NA
v2 <- NA
v1 == v2
  • Suppose v1 is Mary’s age (unknown) and v2 is John’s age (unknown).
  • Can we say they are the same age?
  • Since both values are missing, R cannot know — so the result is also NA.

Checking for Missing Values with is.na()

num_missing <- NA
is.na(num_missing)  # is num_missing NA?

text_missing <- "missing"
is.na(text_missing) # is text_missing NA?

V1 <- c(1, NA, 3)
is.na(V1)  # is V1 NA?
!is.na(V1)  # is V1 not NA?
df <- data.frame(
  v1 = c(1, NA, 3),
  v2 = c(1, 2, 3)
  )

df |> 
  filter( is.na(v1) )

df |> 
  filter( !is.na(v1) )
  • Use is.na() to test whether a value is missing (NA).
  • In filter(), you can:
    • Use is.na() to keep observations with missing values.
    • Use !is.na() to remove observations with missing values.

Arrange Observations with arrange()

Arrange Observations with arrange()

# Sort observations 
  # by dep_delay (ascending)
flights |> 
  arrange(dep_delay)
# Sort by dep_delay, 
  # then by sched_dep_time
flights |> 
  arrange(dep_delay, sched_dep_time)
  • arrange() sorts out observations.
  • When you provide multiple variables, arrange() sorts by the first variable, and then uses the next variable(s) to break ties.

Descending Order with desc()

# Sort by departure delay 
  # in descending order
flights |> 
  arrange(desc(dep_delay))
# Equivalent shortcut 
  # for numeric variables
flights |> 
  arrange(-dep_delay)
  • Use desc(VARIABLE) to sort in descending order.
    • For numeric variables, you can also use a leading minus sign.

Arrange Observations with arrange() - Example

df <- data.frame(
  year = c(2024, 2021, 2024, 2024),
  month = c(7, 10, 7, 4),
  day = c(20, 19, 15, 9)
)
df |> 
  arrange(year, month, day)
  • If we provide more than one variable name, each additional variable will be used to break ties in the values of preceding variables.

Find All Unique Observations with distinct()

Find All Unique Observations with distinct()

df <- data.frame(
  country = c("USA", "Korea", "USA"),
  city = c("D.C.", "Seoul", "D.C."),
  subregion = c("Georgetown", 
                "Gangnam", 
                "Georgetown") 
  )
# Remove duplicate observations
df |> 
  distinct()
  • distinct() removes duplicate observations.
    • By default, it checks all variables to find unique observations.

Find All Unique Combinations of Selected Variables

# Remove duplicate observations 
  # from the flights data.frame
flights |> 
  distinct()
# Find all unique 
  #  origin-destination pairs
flights |> 
  distinct(origin, dest)
  • You can pass one or more variables to distinct().
    • This returns unique combinations of just those variables (while ignoring the rest).

Select Variables with select()

Select Variables with select()

  • It’s not uncommon to get datasets with hundreds or thousands of variables.

  • select() allows us to narrow in on the variables we’re actually interested in.

  • We can select variables by their names.

flights |> 
  select(year, month, day)

Remove Variables with select()

flights |> 
  select(-year)
  • With select(-VARIABLES), we can remove variables.

Rename Variables with rename()

Rename Variables with rename()

flights |> 
  rename( tail_num = tailnum )
  • rename() can be used to rename variables:

    • DF |> rename(NEW_NAME = CURRENT_NAME)