Lecture 8

R Basics

Byeong-Hak Choe

SUNY Geneseo

September 13, 2024

R Basics

Descriptive Statistics

  • Descriptive statistics condense data into manageable summaries, making it easier to understand key characteristics of the data.

    • They help reveal patterns, trends, and relationships within the data that might not be immediately apparent from raw numbers.

Descriptive Statistics

  • Data quality assessment:
    • Descriptive statistics can highlight potential issues in data quality, such as outliers or unexpected distributions, prompting further investigation.
  • Foundation for further analysis:
    • Descriptive statistics often serve as a starting point for more advanced statistical analyses and predictive modeling.
  • Data visualization enhancement:
    • Descriptive statistics often form the basis for effective data visualizations, making complex data more accessible and understandable.

Descriptive Statistics

Measures of Central Tendency

  • Measures of centrality are used to describe the central or typical value in a given vector.
    • They represent the “center” or most representative value of a data set.
  • To describe this centrality, several statistical measures are commonly used:
    • Mean: The arithmetic average of all values in the data set.
    • Median: The middle value when the data set is ordered from least to greatest.
    • Mode: The most frequently occurring value in the data set.

Measures of Central Tendency

Mean

\[ \overline{x} = \frac{x_{1} + x_{2} + \cdots + x_{N}}{N} \]

x <- c(1, 2, 3, 4, 5)
sum(x)
mean(x)
  • The arithmetic mean (or simply mean or average) is the sum of all the values divided by the number of observations in the data set.
    • mean() calculates the mean of the values in a vector.
    • For a given vector \(x\), if we happen to have \(N\) observations \((x_{1}, x_{2}, \cdots , x_{N})\), we can write the arithmetic mean of the data sample as above.

Measures of Central Tendency

Median

x <- c(1, 2, 3, 4, 5)
median(x)
  • The median is the measure of center value in a given vector.
    • median() calculates the median of the values in a vector.

Measures of Central Tendency

Mode

  • The mode is the value(s) that occurs most frequently in a given vector.

  • Mode is useful, although it is often not a very good representation of centrality.

  • The R package, modest, provides the mfw(x) function that calculate the mode of values in vector x.

Descriptive Statistics

Measures of Dispersion

  • Measures of dispersion are used to describe the degree of variation in a given vector.
    • They are a representation of the numerical spread of a given data set.
  • To describe this dispersion, a number of statistical measures are developed
    • Range
    • Variance
    • Standard deviation
    • Quartile

Measures of Dispersion

Range

\[ (\text{range of x}) \,=\, (\text{maximum value in x}) \,-\, (\text{minimum value in x}) \]

x <- c(1, 2, 3, 4, 5)
max(x)
min(x)
range <- max(x) - min(x)
  • The range is the difference between the largest and the smallest values in a given vector.
    • max(x) returns the maximum value of the values in a given vector \(x\).
    • min(x) returns the minimum value of the values in a given vector \(x\).

Measures of Dispersion

Variance

\[ \overline{s}^{2} = \frac{(x_{1}-\overline{x})^{2} + (x_{2}-\overline{x})^{2} + \cdots + (x_{N}-\overline{x})^{2}}{N-1}\;\, \]

x <- c(1, 2, 3, 4, 5)
var(x)
  • The variance is used to calculate the deviation of all data points in a given vector from the mean.
    • The larger the variance, the more the data are spread out from the mean and the more variability one can observe in the data sample.
    • To prevent the offsetting of negative and positive differences, the variance takes into account the square of the distances from the mean.
  • var(x) calculates the variance of the values in a vector \(x\).

Measures of Dispersion

Standard Deviation

\[ \overline{s} = \sqrt{ \left( \frac{(x_{1}-\overline{x})^{2} + (x_{2}-\overline{x})^{2} + \cdots + (x_{N}-\overline{x})^{2}}{N-1}\;\, \right) } \]

x <- c(1, 2, 3, 4, 5)
sd(x)
  • The standard deviation (SD)—the square root of the variance—is also a measure of the spread of values within a given vector.
    • sd(x) calculates the standard deviation of the values in a vector \(x\)
    • SD helps us understand how representative the mean is of the data.
      • A low SD suggests that the mean is a good summary, while a high SD suggests greater variability around the mean.

Measures of Dispersion

Quartiles

quantile(x)
quantile(x, 0) # the minimum
quantile(x, 0.25) # the 1st quartile
quantile(x, 0.5) # the 2nd quartile
quantile(x, 0.75) # the 3rd quartile
quantile(x, 1) # the maximum
  • A quartile is a quarter of the number of data points in a given vector.
    • Quartiles are determined by first sorting the values and then splitting the sorted values into four disjoint smaller data sets.
    • Quartiles are a useful measure of dispersion because they are much less affected by outliers or a skewness in the data set than the equivalent measures in the whole data set.

Measures of Dispersion

Interquartile Range

  • An interquartile range describes the difference between the third quartile (Q3) and the first quartile (Q1), telling us about the range of the middle half of the scores in the distribution.
    • The quartile-driven descriptive measures (both centrality and dispersion) are best explained with a popular plot called a box plot.

R Basics

CSV file

  • A CSV (comma-separated values) file is a text file in which values are separated by commas.

  • CSV files are most commonly encountered in spreadsheets and databases.

  • Example

    • https://bcdanl.github.io/data/tvshows.csv

R Basics

Absolute Pathnames

  • Complete path from the root directory to the target file or directory.
  • Independent of the working directory.
    • getwd() returns the pathname of the working directory.
    • The working directory for a Posit Cloud project is /cloud/project/
  • Example of absolute pathnames for custdata_rev.csv
    • Mac:
      • /Users/user/documents/data/custdata_rev.csv
    • Windows:
      • C:\\Users\\user\\Documents\\data\\custdata_rev.csv

R Basics

Relative Pathnames

  • Path relative to the working directory.

  • Example:

    • Absolute pathname for custdata_rev.csv is /Users/user/documents/data/custdata_rev.csv.
    • Suppose the working directory is /Users/user/documents/.
    • Then, the relative pathname for custdata_rev.csv is dada/custdata_rev.csv.
  • When using the Posit Cloud project, we can use a relative path to read a file.

R Basics

Working with Data from Files

  • We use the read_csv() function to read a comma-separated values (CSV) file.
  1. Download the CSV file, custdata_rev.csv from the Class Files module in our Brightspace.

  2. Create a sub-directory, data, by clicking “New Folder” in the Files Pane in Posit Cloud.

  3. Upload the custdata_rev.csv file to the sub-directory data.

  4. Provide the relative pathname for the file, custdata_rev.csv, to the read_csv() function.

custdata <- read_csv('HERE WE PROVIDE A RELATIVE PATHNAME FOR custdata_rev.csv')
View(custdata)
  • View() displays the data in a simple spreadsheet-like grid.

R Basics

Examining data.frames

class(custdata)
dim(custdata)
nrow(custdata)
ncol(custdata)
summary(custdata_rev)
library(skimr)
skim(custdata_rev)
  • dim() shows how many rows and columns are in the data for data.frame.
  • nrow() and ncol() shows the number of rows and columns for data.frame respectively.
  • skimr::skim() refers to the skim() function from the skimr package.
    • This provides a more comprehensive summary.
    • It’s a more user-friendly alternative to functions like base-R’s summary(), offering both numerical and categorical summaries.