Lecture 5

R Basics

Byeong-Hak Choe

SUNY Geneseo

September 17, 2025

Posit Cloud & R Packages

Posit Cloud

  • Posit Cloud (formerly RStudio Cloud) is a web service that delivers a browser-based experience similar to RStudio, the standard IDE for the R language.

  • For our course, we use Posit Cloud for the R programming component.

    • If you would like to install R and RStudio on your own laptop, come to my office hours for help.

🚀 Getting Started with Posit Cloud

  1. Click Log In at the top-right corner.
  2. Choose the Sign Up tab from the menu bar.
  3. Create your account using one of the following:
    • Google account, or
    • Your geneseo.edu or personal email
  4. After logging in, go to New Project → New RStudio Project.
  5. Create a new R script:
    • Click the + icon (top-left), or
    • Go to File → New File → R Script from the menu bar, or
    • Use the shortcut:
      • Command + Shift + N (Mac)
      • Ctrl + Shift + N (Windows)

Posit Cloud Environment

  • Script Pane is where you write and save R commands in a script file.
    • An R script is simply a plain text file containing R code.
    • Posit Cloud (RStudio Cloud) automatically color-codes your code to make it easier to read.
  • Try typing a <- 1 in the Script Pane.
    • With the cursor ( ┃ ) on the same line, run the code using:
      • Ctrl + Enter (Windows)
      • Command + Enter (Mac)

Posit Cloud Environment

  • Console Pane lets you interact directly with the R interpreter: type a command there and R executes it immediately.

Posit Cloud Environment

  • Environment Pane shows everything you have created in R so far.
    • For example, if you make a variable or a data frame, it will appear here.
    • Think of it like a workspace where R keeps track of all the things you’re working on.

Posit Cloud Environment

  • Plots Pane contains any graphics that you generate from your R code.

📦 R Packages

  • R packages are collections of ready-made tools for R.

    • A package usually includes functions (pre-written commands), data sets, and sometimes extra code to make your work easier.
  • Many packages are already built into R, and thousands more can be installed from the internet (like downloading apps on your phone).

  • Why use packages?

    • They save time—you don’t need to write everything from scratch.
    • They give you access to powerful tools for data analysis, visualization, and more.
  • Examples:

    • readr → for reading data files quickly
    • dplyr → for cleaning and transforming data
    • ggplot2 → for making beautiful graphs and charts

🌐 tidyverse

  • The tidyverse is a collection of R packages built for data analytics.
    • They share a common design philosophy, grammar, and data structures.
  • Popular packages in the tidyverse include:
    • readr → data reading
    • dplyr → data transformation
    • ggplot2 → data visualization

📦 Installing R Packages with install.packages("packageName")

install.packages("tidyverse")
  • Use the base R function install.packages("packageName") to install new packages.
  • Example: to install the tidyverse, type and run the command above in the R Console.
  • While installing, you may see a pop-up question (e.g., about creating a personal library).
    • It’s usually safe to answer “No” if you’re unsure.

📂 Loading R Packages with library(packageName)

library(tidyverse)
mpg
  • After installation, use library(packageName) to load a package into your R session.
  • Example: running library(tidyverse) loads all the R packages in the tidyverse, including readr, dplyr, and ggplot2.
  • Once loaded, you can use their functions and datasets.
  • For instance, mpg is a built-in dataset from ggplot2, which is part of the tidyverse.

Workflow for R packages: Install → Load → Use

  1. Install (once)
install.packages("tidyverse")
  2. Load (at the top of every new R script)
library(tidyverse)
  3. Use (functions & datasets)
df <- read_csv("https://bcdanl.github.io/data/spotify_all.csv")
  • The read_csv() function comes from the readr package, which is part of the tidyverse.

Note

  • 🔑 Tip: In Posit Cloud, you need to install a package once per project.
  • After that, just load it whenever you start a new R script.

Workflow: Naming and File Management

  • Always save your R script for each class session.
    • Go to File → Save (or Save As…), or
    • Click the 💾 save icon.
  • ✅ Recommended file naming style (no spaces):
    • Example: danl-101-2024-0917.R
  • Tips for naming files:
    • ❌ Avoid spaces in file names.
    • ✅ Use lowercase letters, with hyphens (-) or underscores (_) to separate words.

Workflow: Code and comment style

  • The two main principles for coding and managing data are:
    • Make things easier for your future self.
    • Don’t trust your future self.
  • The # mark is R’s comment character.
    • In R scripts (*.R files), # indicates that the rest of the line is to be ignored.
    • Write comments before the line that you want the comment to apply to.

Workflow: Shortcuts in Posit Cloud

  • Windows
    • Alt + - adds an assignment operator (<-)
    • Ctrl + Enter runs the current line of code
    • Ctrl + Shift + C turns the current line into a comment (# ...)
    • Ctrl + Shift + R inserts a section header (# Section ----)
  • Mac
    • option + - adds an assignment operator (<-)
    • command + return runs the current line of code
    • command + shift + C turns the current line into a comment (# ...)
    • command + shift + R inserts a section header (# Section ----)

Workflow: Shortcuts in Posit Cloud

  • Ctrl (command for Mac Users) + Z undoes the previous action.
  • Ctrl (command for Mac Users) + Shift + Z redoes the action that was just undone.

Workflow: Shortcuts in Posit Cloud

  • Ctrl (command for Mac Users) + F finds a phrase in the R script (and can replace it).

Workflow: Auto-completion

libr

  • Auto-completion saves typing and helps avoid spelling mistakes in function names.
    • Type libr in an R script in RStudio and wait for a second; RStudio suggests completions such as library.

Workflow: STOP icon

  • When the code is running, RStudio shows the STOP icon ( 🛑 ) at the top right corner in the Console Pane.
    • Do not click it unless you want to stop the running code.

Posit Cloud Options Setting

  • Open the options menu from the menu bar:
    • Tools → Global Options
  • Check the boxes as shown on the left.
  • For "Save workspace to .RData on exit", choose Never.

R Programming Basics

Values, Variables, and Types

  • A value is a datum (literal) such as a number or a piece of text.

  • There are different types of values:

    • 352.3 is known as a float or double;
    • 22 is an integer;
    • “Hello World!” is a string.
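
As a quick check, R's class() function reports how a value is stored. The sketch below uses the three example values above. One R-specific detail, added here as a side note: a plain 22 is stored as a double (class "numeric") unless it is written with the L suffix.

```r
class(352.3)            # "numeric" (a double)
class(22)               # "numeric": a plain 22 is also stored as a double
class(22L)              # "integer": the L suffix requests an integer
class("Hello World!")   # "character" (R's name for a string)
```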

Values, Variables, and Types

a <- 10    # The most popular assignment operator in R is `<-`.
a

  • A variable is a name that refers to a value.
    • We can think of a variable as a box that has a value, or multiple values, packed inside it.
  • A variable is just a name!

Objects

  • Sometimes you will hear variables referred to as objects.

  • Everything that is not a literal value, such as 10, is an object.

Assignment

x <- 2
x < - 3
  • What is going on here?

  • The shortcut for the assignment <- is:

    • Windows: Alt + -
    • Mac: option + -
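
One way to read the two lines on this slide: with a space between < and -, R treats the line as a comparison ("is x less than negative 3?") rather than an assignment. A minimal sketch of the difference:

```r
x <- 2          # assignment: x now refers to 2
x < - 3         # comparison: is 2 less than -3? Evaluates to FALSE
x               # still 2, because nothing was assigned

x <- 3          # without the space, this is an assignment
x               # now 3
```

Using the keyboard shortcut for <- avoids accidentally typing the space.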

Assignment

x <- 2
y <- x + 12
  • In programming code, everything on the right side needs to have a value.
    • The right side can be a literal value, or a variable that has already been assigned a value, or a combination.
  • When R reads y <- x + 12, it does the following:
    1. Sees the <- in the middle.
    2. Knows that this is an assignment.
    3. Calculates the right side (gets the value of the object referred to by x and adds it to 12).
    4. Assigns the result to the left-side variable, y.
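
The steps above can be traced in a short sketch. It also illustrates a consequence of step 3: y stores the computed result, not the formula, so changing x later does not change y.

```r
x <- 2
y <- x + 12   # the right side is evaluated first: 2 + 12
y             # 14

x <- 100      # changing x afterwards does not change y
y             # still 14
```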

Data Types

  • Logical: TRUE or FALSE.
  • Numeric: Numbers with decimals
  • Integer: Integers
  • Character: Text strings
  • Factor: Categorical values.
    • Each possible value of a factor is known as a level.

Data Containers

  • vector → a single column of values, all of the same type
    • Example:
      • c(1, 2, 3)
      • c("red", "blue", "green")
  • data.frame → a table with rows and columns, where each column can be a different type
    • Example: one column with names, another with ages
    • A data.frame is basically several vectors put together side by side.
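
As a small illustration of "vectors put side by side", the sketch below builds a data.frame from two vectors. The names and ages are made up for the example, and stringsAsFactors = FALSE is set explicitly so that the name column stays as text in older versions of R as well.

```r
# Hypothetical example data (made up for illustration)
student_name <- c("Ana", "Ben", "Chloe")   # character vector
student_age  <- c(19, 21, 20)              # numeric vector

students <- data.frame(name = student_name,
                       age  = student_age,
                       stringsAsFactors = FALSE)

students               # prints a table with 3 rows and 2 columns
class(students$name)   # "character"
class(students$age)    # "numeric"
nrow(students)         # 3
ncol(students)         # 2
```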

Data Types

orig_number <- 4.39898498
class(orig_number)

mod_number <- as.integer(orig_number)
class(mod_number)
# Logical values (TRUE/FALSE) can be converted to numbers:
# TRUE converts to 1; FALSE converts to 0.
as.numeric(TRUE)
as.numeric(FALSE)
  • Values have different data types (e.g., numeric, integer, character, factor, logical).
  • Sometimes we need to convert (cast) a value from one type to another.
  • Use built-in functions like:
    • as.character() → convert to character
    • as.integer() → convert to whole numbers
    • as.numeric() → convert to numbers with decimals
    • as.factor() → convert to categorical (factor)
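
A few conversions, as a sketch of how these functions behave. Two details are worth noticing: as.integer() drops the decimal part instead of rounding, and text that does not look like a number converts to NA (a missing value) with a warning. The warning is silenced below with suppressWarnings() only to keep the example quiet.

```r
as.character(8.8)     # "8.8"
as.integer("42")      # 42
as.integer(4.9)       # 4: the decimal part is dropped, not rounded
as.numeric("3.5")     # 3.5
as.logical("TRUE")    # TRUE

# Text that is not a number cannot be converted; the result is NA
bad <- suppressWarnings(as.numeric("abc"))
is.na(bad)            # TRUE
```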

Data Types - Character

myname <- "my_name"
class(myname) # class() returns the data type of an object.
  • Strings (text) are stored as the data type character.
  • Wrap text in either double quotes (" ") or single quotes (' ').
  • Example: "hello" or 'hello'
  • Most IDEs, including Posit Cloud (RStudio), will auto-complete the closing quote when you type the first one.

Data Types - Numbers

favorite.integer <- as.integer(2)
class(favorite.integer)

favorite.numeric <- as.numeric(8.8)
class(favorite.numeric)
  • Numbers can belong to different classes.
  • The two most common are:
    • integer → whole numbers (e.g., 2, -5, 100)
    • numeric → numbers with decimals (e.g., 8.8, -3.14, 0.5)

Data Types - Logical (TRUE/FALSE)

class(TRUE)
class(FALSE)

favorite.numeric == 8.8
favorite.numeric == 9.9
class(favorite.numeric == 8.8)
  • Logical values represent TRUE or FALSE.
  • The operator == is used to test for equality.
  • Example:
    • favorite.numeric == 8.8 returns TRUE
    • favorite.numeric == 9.9 returns FALSE

Data Types - Vectors

a <- 1:10   # create a sequence using the colon operator
b <- c("3", 4, 5)   # mixing numbers and text
beers <- c("BUD LIGHT", "BUSCH LIGHT", "COORS LIGHT", 
           "MILLER LITE", "NATURAL LIGHT")

class(a)
class(b)
class(beers)
  • A vector is a one-dimensional data structure in R.
  • All elements in a vector must be of the same type (numeric, character, or logical).
  • If types are mixed, R will coerce them to a common type (e.g., numbers become text in b).
  • The function c(...) (combine or concatenate) creates vectors by putting values together in order.
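
To make the coercion rule concrete, here is a short sketch with three mixed vectors. In each case R converts every element to the most flexible type present.

```r
c("3", 4, 5)        # "3" "4" "5": numbers become text
c(1, TRUE, FALSE)   # 1 1 0: logicals become numbers
c(1L, 2.5)          # 1.0 2.5: integers become doubles

class(c("3", 4, 5))        # "character"
class(c(1, TRUE, FALSE))   # "numeric"
class(c(1L, 2.5))          # "numeric"
```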

Data Types - Factors

beers <- as.factor(beers)
class(beers)

levels(beers)
nlevels(beers)
  • Factors are used to store categorical data (data that falls into groups or categories).
  • Internally, R stores factors as integers with a text label for each unique value (for speed and efficiency).
  • Example: if you have a factor of "Freshman", "Sophomore", "Junior", and "Senior" student classifications, R stores them as numbers (e.g., 1, 2, 3, 4) but displays the labels.
  • Functions:
    • levels() → shows the categories (unique labels)
    • nlevels() → shows how many categories there are
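
The student classification example can be sketched as follows. The data are made up, and the sketch uses factor() with an explicit levels argument (rather than as.factor()) so that the categories appear in their natural order instead of alphabetical order.

```r
# Hypothetical data for illustration
class_year <- c("Junior", "Freshman", "Senior", "Sophomore", "Freshman")

# levels = ... sets the order of the categories explicitly;
# without it, the levels are ordered alphabetically.
class_year <- factor(class_year,
                     levels = c("Freshman", "Sophomore", "Junior", "Senior"))

levels(class_year)       # "Freshman" "Sophomore" "Junior" "Senior"
nlevels(class_year)      # 4
as.integer(class_year)   # 3 1 4 2 1: the integer codes R stores internally
```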

Workflow: Quotation marks, parentheses, and +

x <- "hello
  • Quotation marks and parentheses must always come in pairs.
  • If one is missing, the R Console will show a continuation prompt: +
    • This means R is still waiting for more input and doesn’t think your command is finished.

⚙️ Functions

  • A function is a reusable piece of code: it takes input (parameters) and produces output (results).
  • R comes with many built-in functions (e.g., sum(), mean()).
  • You can also write your own functions in R.
    • In our course, we will use only built-in functions.

Functions, Arguments, and Parameters

library(tidyverse)

# The function `str_c()`, from the `stringr` package (part of the `tidyverse`), concatenates strings.
str_c("Data", "Analytics")
str_c("Data", "Analytics", sep = "!")
  • We use a function by entering its name and a pair of opening and closing parentheses.

  • Much as a cooking recipe can accept ingredients, a function invocation can accept inputs called arguments.

  • We pass arguments sequentially inside the parentheses, separated by commas.

  • A parameter is a name given to an expected function argument.

  • A default argument is a fallback value that R passes to a parameter if the function invocation does not explicitly provide one.
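
To illustrate parameters and default arguments, the sketch below uses two base R functions, round() and paste(), instead of str_c() so that it runs without loading any package. The idea is the same: a parameter that is not supplied falls back to its default.

```r
# round(x, digits = 0): `x` and `digits` are parameters;
# `digits` has a default argument of 0.
round(3.14159)               # 3: digits falls back to its default, 0
round(3.14159, digits = 2)   # 3.14: the default is overridden by name
round(3.14159, 2)            # 3.14: arguments can also be matched by position

# paste() has a parameter `sep` whose default is a single space.
paste("Data", "Analytics")              # "Data Analytics"
paste("Data", "Analytics", sep = "!")   # "Data!Analytics"
```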

Math Algebra

5 + 3
5 - 3
5 * 3
5 / 3
5^3
( 3 + 4 )^2
3 + 4^2
3 + 2 * 4^2
3 + 2 * 4 + 2
(3 + 2) * (4 + 2)
  • All of the basic arithmetic operators from mathematics, along with parentheses for grouping, are available in R.

  • R can be used for a wide range of mathematical calculations.
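
R follows the usual order of operations: parentheses first, then exponents, then multiplication and division, then addition and subtraction. The expressions from this slide are annotated below as a worked example.

```r
(3 + 4)^2          # 49: parentheses first, then the exponent
3 + 4^2            # 19: the exponent (16) is computed before the addition
3 + 2 * 4^2        # 35: exponent (16), then multiplication (32), then addition
3 + 2 * 4 + 2      # 13: multiplication (8) before the two additions
(3 + 2) * (4 + 2)  # 30: both parentheses first, then 5 * 6
```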

Math functions

5 * abs(-3)
sqrt(17) / 2
exp(3)
log(3)
log(exp(3))
exp(log(3))
  • R has many built-in mathematical functions that facilitate calculations and data analysis.
  • abs(x): the absolute value \(|x|\)
  • sqrt(x): the square root \(\sqrt{x}\)
  • exp(x): the exponential value \(e^x\), where \(e = 2.718...\)
  • log(x): the natural logarithm \(\log_{e}(x)\), or simply \(\log(x)\)

Vectorized Operations

a <- c(1, 2, 3, 4, 5)
b <- c(5, 4, 3, 2, 1)

a + b
a - b
a * b
a / b
sqrt(a)
  • Vectorized operations mean applying a function to every element of a vector without explicitly writing a loop.
    • This is possible because most functions in R are vectorized, meaning they are designed to operate on vectors element-wise.
    • Vectorized operations are a powerful feature of R, enabling efficient and concise code for data analysis and manipulation.
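
To show what "without explicitly writing a loop" means, the sketch below computes the same element-wise sum twice: once with a vectorized expression and once with a for loop. The loop is included only for comparison.

```r
a <- c(1, 2, 3, 4, 5)
b <- c(5, 4, 3, 2, 1)

# Vectorized: one expression adds the vectors element by element
vec_sum <- a + b             # 6 6 6 6 6

# The same result with an explicit loop
loop_sum <- numeric(length(a))
for (i in seq_along(a)) {
  loop_sum[i] <- a[i] + b[i]
}

identical(vec_sum, loop_sum) # TRUE
a * 2                        # 2 4 6 8 10: the single value 2 is applied to every element
```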

Accessing Package Objects and Functions

# Access an object directly from a package
ggplot2::mpg   # PACKAGE_NAME::DATA_FRAME_NAME

# Call a function directly from a package
ggplot2::ggplot()   # PACKAGE_NAME::FUNCTION_NAME
  • Use the double colon :: operator when you want to access:
    • A data frame from a specific package (e.g., ggplot2::mpg)
    • A function without loading the entire package (e.g., ggplot2::ggplot())
  • ✅ This can be useful because:
    • It avoids name conflicts if two packages have functions with the same name.
    • It lets you use a single function or dataset without attaching the entire package with library().
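
The same :: syntax works for any installed package. The sketch below uses the stats, utils, and datasets packages, which ship with R, so it runs even before the tidyverse is installed.

```r
# No library() call is needed for any of these lines.
stats::median(c(1, 5, 9))            # 5: median() from the stats package
utils::head(datasets::mtcars, 3)     # first 3 rows of the mtcars data frame
nrow(datasets::mtcars)               # 32
```

A common name conflict: after library(tidyverse), filter() refers to dplyr's version, while the one from the stats package can still be reached as stats::filter().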

Descriptive Statistics

  • Descriptive statistics condense data into manageable summaries, making it easier to understand key characteristics of the data.

    • They help reveal patterns, trends, and relationships within the data that might not be immediately apparent from raw numbers.

Descriptive Statistics

  • Data quality assessment:
    • Descriptive statistics can highlight potential issues in data quality, such as outliers or unexpected distributions, prompting further investigation.
  • Foundation for further analysis:
    • Descriptive statistics often serve as a starting point for more advanced statistical analyses and predictive modeling.
  • Data visualization enhancement:
    • Descriptive statistics often form the basis for effective data visualizations, making complex data more accessible and understandable.

Measures of Central Tendency

  • Measures of centrality are used to describe the central or typical value in a given vector.
    • They represent the “center” or most representative value of a data set.
  • To describe this centrality, several statistical measures are commonly used:
    • Mean: The arithmetic average of all values in the data set.
    • Median: The middle value when the data set is ordered from least to greatest.
    • Mode: The most frequently occurring value in the data set.

Measures of Central Tendency - Mean

\[ \overline{x} = \frac{x_{1} + x_{2} + \cdots + x_{N}}{N} \]

x <- c(1, 2, 3, 4, 5)
sum(x)
mean(x)
  • The arithmetic mean (or simply mean or average) is the sum of all the values divided by the number of observations in the data set.
    • mean() calculates the mean of the values in a vector.
    • For a given vector \(x\), if we happen to have \(N\) observations \((x_{1}, x_{2}, \cdots , x_{N})\), we can write the arithmetic mean of the data sample as above.
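
The formula can be checked by hand: divide the sum of the values by the number of observations and compare with mean(). Here N is just a variable name chosen to match the formula.

```r
x <- c(1, 2, 3, 4, 5)
N <- length(x)       # number of observations: 5
sum(x) / N           # 3
mean(x)              # 3
```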

Measures of Central Tendency - Median

x <- c(1, 2, 3, 4, 5)
median(x)
  • The median is the middle value of a vector once its values are sorted; with an even number of values, it is the average of the two middle values.
    • median() calculates the median of the values in a vector.
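
A short sketch of how median() handles an odd and an even number of values, followed by made-up data with one extreme value to show that the median moves much less than the mean.

```r
median(c(5, 1, 3, 2, 4))    # 3: sorted values are 1 2 3 4 5; the middle one is 3
median(c(1, 2, 3, 4))       # 2.5: average of the two middle values, 2 and 3

# Made-up data with one extreme value
y <- c(1, 2, 3, 4, 100)
mean(y)                     # 22
median(y)                   # 3
```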

Measures of Central Tendency - Mode

  • The mode is the value(s) that occurs most frequently in a given vector.

  • Mode is useful, although it is often not a very good representation of centrality.

  • The R package modeest provides the mfv(x) function, which calculates the mode (most frequent value or values) of vector x.
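
Base R has no built-in function for the statistical mode, but it can be found without an extra package by counting values with table(). This is one possible sketch, using made-up data with two modes; it is not the approach used by any particular package.

```r
x <- c(1, 2, 2, 3, 3, 4)    # made-up data with two modes

counts <- table(x)          # how many times each value occurs
counts

# Keep the value(s) whose count equals the largest count
modes <- as.numeric(names(counts)[as.vector(counts) == max(counts)])
modes                       # 2 3
```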

Measures of Dispersion

  • Measures of dispersion are used to describe the degree of variation in a given vector.
    • They are a representation of the numerical spread of a given data set.
  • To describe this dispersion, several statistical measures are commonly used:
    • Range
    • Variance
    • Standard deviation
    • Quartile

Measures of Dispersion - Range

\[ (\text{range of x}) \,=\, (\text{maximum value in x}) \,-\, (\text{minimum value in x}) \]

x <- c(1, 2, 3, 4, 5)
max(x)
min(x)
x_range <- max(x) - min(x)   # named x_range to avoid confusion with base R's range() function
x_range
  • The range is the difference between the largest and the smallest values in a given vector.
    • max(x) returns the maximum value of the values in a given vector \(x\).
    • min(x) returns the minimum value of the values in a given vector \(x\).

Measures of Dispersion - Variance

\[ \overline{s}^{2} = \frac{(x_{1}-\overline{x})^{2} + (x_{2}-\overline{x})^{2} + \cdots + (x_{N}-\overline{x})^{2}}{N-1}\;\, \]

x <- c(1, 2, 3, 4, 5)
var(x)
  • The variance measures how far, on average, the data points in a given vector deviate from the mean.
    • The larger the variance, the more the data are spread out from the mean and the more variability one can observe in the data sample.
    • To prevent the offsetting of negative and positive differences, the variance takes into account the square of the distances from the mean.
  • var(x) calculates the variance of the values in a vector \(x\).
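
The variance formula can be reproduced step by step. The sketch below also shows why the deviations are squared: without squaring, they sum to zero.

```r
x <- c(1, 2, 3, 4, 5)
N <- length(x)

deviations <- x - mean(x)        # -2 -1 0 1 2
sum(deviations)                  # 0: positive and negative deviations cancel out
sum(deviations^2) / (N - 1)      # 2.5
var(x)                           # 2.5
```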

Measures of Dispersion - Standard Deviation

\[ \overline{s} = \sqrt{ \left( \frac{(x_{1}-\overline{x})^{2} + (x_{2}-\overline{x})^{2} + \cdots + (x_{N}-\overline{x})^{2}}{N-1}\;\, \right) } \]

x <- c(1, 2, 3, 4, 5)
sd(x)
  • The standard deviation (SD)—the square root of the variance—is also a measure of the spread of values within a given vector.
    • sd(x) calculates the standard deviation of the values in a vector \(x\)
    • SD helps us understand how representative the mean is of the data.
      • A low SD suggests that the mean is a good summary, while a high SD suggests greater variability around the mean.
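
As a sketch of the last point, the two made-up vectors below have the same mean but very different standard deviations. The first lines confirm that sd() is the square root of var().

```r
x <- c(1, 2, 3, 4, 5)
sqrt(var(x))          # same value as sd(x)
sd(x)

# Two made-up vectors with the same mean but different spread
tight <- c(4, 5, 6)
wide  <- c(0, 5, 10)
mean(tight)           # 5
mean(wide)            # 5
sd(tight)             # 1
sd(wide)              # 5
```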

Measures of Dispersion - Quartiles

x <- c(1, 2, 3, 4, 5)
quantile(x) # the minimum, the three quartiles, and the maximum
quantile(x, 0) # the minimum
quantile(x, 0.25) # the 1st quartile
quantile(x, 0.5) # the 2nd quartile
quantile(x, 0.75) # the 3rd quartile
quantile(x, 1) # the maximum
  • Quartiles are the cut points (Q1, Q2, and Q3) that split the sorted values of a vector into four parts, each holding about a quarter of the data points.
    • Quartiles are determined by first sorting the values and then splitting the sorted values into four disjoint smaller data sets.
    • Quartiles are a useful measure of dispersion because they are much less affected by outliers or skewness than measures computed from every value, such as the range or the variance.

Measures of Dispersion - Interquartile Range

  • An interquartile range describes the difference between the third quartile (Q3) and the first quartile (Q1), telling us about the range of the middle half of the scores in the distribution.
    • The quartile-driven descriptive measures (both centrality and dispersion) are best explained with a popular plot called a box plot.
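
The interquartile range can be computed as Q3 minus Q1 or with base R's IQR() function. The sketch below uses quantile() with its default settings and adds made-up data with one extreme value to show that the IQR is unaffected by it while the range is not.

```r
x <- c(1, 2, 3, 4, 5)
q1 <- quantile(x, 0.25)   # 2
q3 <- quantile(x, 0.75)   # 4
unname(q3 - q1)           # 2
IQR(x)                    # 2

# Made-up data with one extreme value
y <- c(1, 2, 3, 4, 100)
max(y) - min(y)           # 99: the range is pulled up by the outlier
IQR(y)                    # 2: the IQR is unchanged
```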

Accessing a Subset of a vector

  • A key part of working with vectors is knowing how to index them.
    • Indexing allows you to filter and extract subsets of data for further analysis.
  • For vectors, there are three common ways to index:
  1. Single index
vec[n]   # element at position n
  2. Multiple indices
vec[c(i, j, k)]   # elements at positions i, j, and k
  3. Logical indexing
vec[condition]   # elements where condition is TRUE

Positional Indexing

  • An index is a positional reference (e.g., 1, 2, 3) used to access individual elements within data structures like a vector.
my_vector <- c(10, 20, 30, 40, 50, 60)

my_vector[2]
my_vector[4]
my_vector[6]
  • In R, an index is a positive integer, and counting starts at 1 (not 0).

A Vector of Indices

  • Selecting multiple elements by providing a vector of indices
my_vector <- c(10, 20, 30, 40, 50, 60)

my_vector[ c(3,4,5) ]
my_vector[ 3:5 ]

Logical Indexing

  • Using a logical condition to filter elements of a vector.
my_vector <- c(10, 20, 30, 40, 50, 60)

# Filter elements greater than 10
is_greater_than_10 <- my_vector > 10  # Creates logical vector
my_vector[ is_greater_than_10 ]
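
The same idea can be written in one step by putting the condition directly inside the brackets. The sketch below also combines two conditions with & ("and") and counts matching elements with sum(), which works because TRUE counts as 1 and FALSE as 0.

```r
my_vector <- c(10, 20, 30, 40, 50, 60)

my_vector > 10                              # FALSE TRUE TRUE TRUE TRUE TRUE
my_vector[my_vector > 10]                   # 20 30 40 50 60

# & means "and": keep elements that satisfy both conditions
my_vector[my_vector > 10 & my_vector < 50]  # 20 30 40

sum(my_vector > 10)                         # 5
```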