Lecture 6

Data Transformation with R

Byeong-Hak Choe

bchoe@geneseo.edu

SUNY Geneseo

September 29, 2025

Reading a Data Table as a `data.frame`

CSV Files

A CSV (comma-separated values) file is a plain text file where each value is separated by a comma.
- CSV files are widely used for storing data from spreadsheets and databases.
Example
- https://bcdanl.github.io/data/tvshows.csv

Absolute Pathnames

An absolute pathname tells the computer the exact location of a file, starting from the very top folder of your computer.
- This location never changes, no matter where you are working in R.
In R, you can see the working directory — the folder where R is currently “looking” for files — by running getwd() in the Console.
- In a Posit Cloud, the working directory is /cloud/project/
Examples of an absolute pathname for custdata_rev.csv:
- On a Mac:
  /Users/user/documents/data/custdata_rev.csv
- On Windows:
  C:\\Users\\user\\Documents\\data\\custdata_rev.csv
  - Note: In Windows, we use double backslashes (\\) because a single backslash (\) is treated as a special character in R.

Relative Pathnames

A relative pathname specifies the location of a file relative to the working directory.
Examples of a relative pathname for custdata_rev.csv:
- Absolute pathname:
  /cloud/project/data/custdata_rev.csv
- Working directory:
  /cloud/project/
- Relative pathname:
  data/custdata_rev.csv
In Posit Cloud, we typically use relative pathnames to access files inside the project folder.

Steps for Reading a CSV File as a `data.frame`

Download custdata_rev.csv from the Class Files module in Brightspace.
- The file will usually be saved in your computer’s Downloads folder.
- Do not open this file in a spreadsheet app (e.g., Excel or Numbers).
In Posit Cloud, create a subfolder named data (Files Pane → + Folder).
Click the data folder, then upload the custdata_rev.csv file into it.
- At the top of the Files Pane, click Upload to choose and add your file.
Read the file using a relative pathname:

library(tidyverse)
# Read the CSV file from the data subfolder
custdata <- read_csv("data/custdata_rev.csv")
view(custdata)  # Opens a `data.frame` in a spreadsheet-like viewer

view(DF) (or View(DF)) opens a DF in a spreadsheet-like viewer.

Load a CSV File Directly from the Web into R

# Read the CSV file directly from the web (GitHub repo)
custdata_web <- read_csv(
        'https://bcdanl.github.io/data/custdata_rev.csv')

We can load a CSV file directly from the web into R.

Getting to Know a `data.frame`

class(custdata)
custdata$age

dim(custdata)
nrow(custdata)
ncol(custdata)

summary(custdata)

# Install once
install.packages("skimr")  
library(skimr)
skim(custdata)

The $ operator extracts a single column from a data.frame as a vector.
dim() shows both the number of rows and columns.
nrow() and ncol() give the row count and column count separately.
summary() gives a quick overview, while skimr::skim() provides a more detailed, user-friendly summary of variables across all data types.

Observations in `data.frame`

Rows in a data.frame represent individual units or entities for which data is collected.
Examples:
- Student Information: Each row = one student
- Employee Information: Each row = one employee
- Daily S&P 500 Index Data: Each row = one trading day
- Household Survey Data: Each row = one household

Variables in `data.frame`

Columns in a data.frame represent attributes or characteristics measured across multiple observations.
Examples:
- Student Data: Name, Age, Grade, Major
- Employee Data: EmployeeID, Name, Age, Department
- Customer Data: CustomerID, Name, Age, Income, HousingType

Note

In a data.frame, a variable is a column of data.
In general programming, a variable is the name of an object.

Tidy `data.frame`

Variables, Observations, and Values

table1  # a sample data.frame from the tidyr package (part of the tidyverse)

A data.frame is tidy if it follows three rules:
1. Each variable has its own column.
2. Each observation has its own row.
3. Each value has its own cell.
A tidy data.frame keeps your data organized, making it easier to understand, analyze, and share in any data analysis.

Data Transformation with `dplyr`

dplyr is a core tidyverse package for data manipulation — tasks like filtering, sorting, selecting, and renaming.

Common dplyr functions with data.frame DF:
- filter(DF, LOGICAL_CONDITIONS)
- arrange(DF, VARIABLES)
- distinct(DF, VARIABLES)
- select(DF, VARIABLES)
- rename(DF, NEW_NAME = CURRENT_NAME)
dplyr functions take a data.frame as the first argument.
- They return a data.frame as output.

Making `dplyr` Code Flow with the Pipe Operator

The pipe operator (|> or %>%) makes code easier to read by connecting steps in order.
How it works:
- f(x, y) is the same as x |> f(y).
- Example with a data frame DF:
  - filter(DF, logical_condition)
  - is the same as
    DF |> filter(logical_condition)
You can read the pipe as “then”:
- Example: Take the data, then filter it

Data Transformation Functions with the Pipe

Common dplyr functions with the pipe operator:
- DF |> filter(LOGICAL_CONDITIONS)
- DF |> arrange(VARIABLES)
- DF |> distinct(VARIABLES)
- DF |> select(VARIABLES)
- DF |> rename(NEW_NAME = CURRENT_NAME)
Why this works so well:
- dplyr functions usually take a data.frame as the first argument.
- They return a data.frame as output.
This allows you to chain together multiple steps naturally:
- Example: Start with data, then filter, then arrange
  $\qquad\quad\;\;$ DF |> filter(...) |> arrange(...)

Using the Pipe in Posit Cloud

To enable the native pipe operator (|>) in RStudio:
- Go to Tools > Global Options > Code (side menu).
- Under Pipe operator, choose Use native pipe operator (|>).

Keyboard shortcut for inserting a pipe:
- Windows: Ctrl + Shift + M
- Mac: Cmd + Shift + M

Filter observations with `filter()`

Filter Observations with `filter()`

install.packages("nycflights13")  # Install once
library(nycflights13)
library(tidyverse)

flights <- nycflights13::flights

flights$month == 1  # A logical test returns TRUE or FALSE
class(flights$month == 12)

# Flights in January or December
jan <- flights |> 
  filter(month == 1)

dec <- flights |> 
  filter(month == 12)

# Flights on January 1st or December 25th
jan1 <- flights |> 
  filter(month == 1, day == 1)

dec25 <- flights |> 
  filter(month == 12, day == 25)

filter() keeps only the observations that meet one or more logical conditions.
- A logical condition evaluates to TRUE (keep the observation) or FALSE (drop the observation).

Logical Conditions with Equality and Inequality

Here, both V1 and V2 are variables, and the comparisons are applied element-wise (vectorized).

For logical conditions using inequalities, we focus on cases where V1 and V2 are integer or numeric.

Logical Conditions - Example 1

df <- data.frame(
  num = c(8, 9, 10, 11),
  chr = c("A", "C", "B", "A")
  )

df |> 
  filter(num > 8)

df |> 
  filter(chr == "A")

Logical Operators — AND, OR, NOT

Here, both x and y are logical conditions/variables.

What logical operations (&, |, !) do is combining logical variables/conditions, which returns a logical variable when executed.

Logical Operations

x and y are logical conditions.
- If x is TRUE, it highlights the left circle.
- If y is TRUE, it highlights the right circle.

The shaded regions show which parts each logical operator selects.

Logical Conditions - Example 2

df <- data.frame(
  num = c(8, 9, 10, 11),
  chr = c("A", "C", "B", "A")
  )

df |> 
  filter(num > 8 & 
         num < 11)
                
df |> 
  filter(num > 8,
         num < 11)

In filter(), separating conditions with a comma is equivalent to combining them with the & operator.

Logical Conditions - Example 3

df <- data.frame(
  num = c(8, 9, 10, 11),
  chr = c("A", "C", "B", "A")
  )

df |> 
  filter(num < 10 & 
         chr == "A")

df |> 
  filter(num < 10, 
         chr == "A")

Logical Conditions - Example 4

df <- data.frame(
  num = c(8, 9, 10, 11),
  chr = c("A", "C", "B", "A")
  )

df |> 
  filter(num < 10 | 
         chr == "A")

Missing Values (`NA`)

NA > 5
10 == NA
NA + 10
NA / 2
(1 + NA + 3) / 3

mean( c(1, NA, 3) )
sd( c(1, NA, 3) )

NA (not available) represents a missing or unknown value in R.

In most calculations, if one value is unknown, the result will also be unknown (NA).

Comparing Missing Values

v1 <- NA
v2 <- NA
v1 == v2

Suppose v1 is Mary’s age (unknown) and v2 is John’s age (unknown).
Can we say they are the same age?

Since both values are missing, R cannot know — so the result is also NA.

Checking for Missing Values with `is.na()`

num_missing <- NA
is.na(num_missing)  # is num_missing NA?

text_missing <- "missing"
is.na(text_missing) # is text_missing NA?

V1 <- c(1, NA, 3)
is.na(V1)  # is V1 NA?
!is.na(V1)  # is V1 not NA?

df <- data.frame(
  v1 = c(1, NA, 3),
  v2 = c(1, 2, 3)
  )

df |> 
  filter( is.na(v1) )

df |> 
  filter( !is.na(v1) )

Use is.na() to test whether a value is missing (NA).

In filter(), you can:
- Use is.na() to keep observations with missing values.
- Use !is.na() to remove observations with missing values.

Arrange Observations with `arrange()`

# Sort observations 
  # by dep_delay (ascending)
flights |> 
  arrange(dep_delay)

# Sort by dep_delay, 
  # then by sched_dep_time
flights |> 
  arrange(dep_delay, sched_dep_time)

arrange() sorts out observations.
When you provide multiple variables, arrange() sorts by the first variable, and then uses the next variable(s) to break ties.

Descending Order with `desc()`

# Sort by departure delay 
  # in descending order
flights |> 
  arrange(desc(dep_delay))

# Equivalent shortcut 
  # for numeric variables
flights |> 
  arrange(-dep_delay)

Use desc(VARIABLE) to sort in descending order.
- For numeric variables, you can also use a leading minus sign.

Arrange Observations with `arrange()` - Example

df <- data.frame(
  year = c(2024, 2021, 2024, 2024),
  month = c(7, 10, 7, 4),
  day = c(20, 19, 15, 9)
)

df |> 
  arrange(year, month, day)

If we provide more than one variable name, each additional variable will be used to break ties in the values of preceding variables.

Find All Unique Observations with `distinct()`

df <- data.frame(
  country = c("USA", "Korea", "USA"),
  city = c("D.C.", "Seoul", "D.C."),
  subregion = c("Georgetown", 
                "Gangnam", 
                "Georgetown") 
  )

# Remove duplicate observations
df |> 
  distinct()

distinct() removes duplicate observations.
- By default, it checks all variables to find unique observations.

Find All Unique Combinations of Selected Variables

# Remove duplicate observations 
  # from the flights data.frame
flights |> 
  distinct()

# Find all unique 
  #  origin-destination pairs
flights |> 
  distinct(origin, dest)

You can pass one or more variables to distinct().
- This returns unique combinations of just those variables (while ignoring the rest).

Select Variables with `select()`

It’s not uncommon to get datasets with hundreds or thousands of variables.
select() allows us to narrow in on the variables we’re actually interested in.
We can select variables by their names.

flights |> 
  select(year, month, day)

Remove Variables with `select()`

flights |> 
  select(-year)

With select(-VARIABLES), we can remove variables.

Rename Variables with `rename()`

flights |> 
  rename( tail_num = tailnum )

rename() can be used to rename variables:
- DF |> rename(NEW_NAME = CURRENT_NAME)

Lecture 6

Reading a Data Table as a data.frame

CSV Files

Absolute Pathnames

Relative Pathnames

Steps for Reading a CSV File as a data.frame

Load a CSV File Directly from the Web into R

Getting to Know a data.frame

Observations in data.frame

Variables in data.frame

Tidy data.frame

Variables, Observations, and Values

Data Transformation with dplyr

Data Transformation with dplyr

Making dplyr Code Flow with the Pipe Operator

Data Transformation Functions with the Pipe

Using the Pipe in Posit Cloud

Filter observations with filter()

Filter Observations with filter()

Logical Conditions with Equality and Inequality

Logical Conditions - Example 1

Logical Operators — AND, OR, NOT

Logical Operations

Logical Conditions - Example 2

Logical Conditions - Example 3

Logical Conditions - Example 4

Missing Values (NA)

Comparing Missing Values

Checking for Missing Values with is.na()

Arrange Observations with arrange()

Arrange Observations with arrange()

Descending Order with desc()

Arrange Observations with arrange() - Example

Find All Unique Observations with distinct()

Find All Unique Observations with distinct()

Find All Unique Combinations of Selected Variables

Select Variables with select()

Select Variables with select()

Remove Variables with select()

Rename Variables with rename()

Rename Variables with rename()

Reading a Data Table as a `data.frame`

Steps for Reading a CSV File as a `data.frame`

Getting to Know a `data.frame`

Observations in `data.frame`

Variables in `data.frame`

Tidy `data.frame`

Data Transformation with `dplyr`

Data Transformation with `dplyr`

Making `dplyr` Code Flow with the Pipe Operator

Filter observations with `filter()`

Filter Observations with `filter()`

Missing Values (`NA`)

Checking for Missing Values with `is.na()`

Arrange Observations with `arrange()`

Arrange Observations with `arrange()`

Descending Order with `desc()`

Arrange Observations with `arrange()` - Example

Find All Unique Observations with `distinct()`

Find All Unique Observations with `distinct()`

Select Variables with `select()`

Select Variables with `select()`

Remove Variables with `select()`

Rename Variables with `rename()`

Rename Variables with `rename()`