Lecture 11

Data Visualization - Aesthetic Mappings and Facets

Byeong-Hak Choe

SUNY Geneseo

March 5, 2024

RStudio Workflow

Must-know Quarto Shortcuts

Mac

  • option + command + I: to create an R chunk
  • option + - : the shortcut for <-
  • command + return: to run the current line or selected lines
  • command + shift + return: to run all the code in the current R chunk
  • command + shift + C: to comment out a line
  • command + shift + K: to render the current Quarto file

Windows

  • Alt + Ctrl + I: to create an R chunk
  • Alt + - : the shortcut for <-
  • Ctrl + Enter: to run the current line or selected lines
  • Ctrl + Shift + Enter: to run all the code in the current R chunk
  • Ctrl + Shift + C: to comment out a line
  • Ctrl + Shift + K: to render the current Quarto file

Documenting Workflow

Must-know Shortcuts

  • Use the shortcuts below whenever you edit a document, including Quarto:

Mac

  • command + ⬆️/⬇️/⬅️/➡️

  • shift + ⬆️/⬇️/⬅️/➡️

  • command + shift + ⬆️/⬇️/⬅️/➡️

  • command + PgUp/PgDn

  • shift + PgUp/PgDn

  • command + shift + PgUp/PgDn

Windows

  • Ctrl + ⬆️/⬇️/⬅️/➡️

  • Shift + ⬆️/⬇️/⬅️/➡️

  • Ctrl + shift + ⬆️/⬇️/⬅️/➡️

  • Ctrl + PgUp/PgDn

  • Shift + PgUp/PgDn

  • Ctrl + Shift + PgUp/PgDn

Review

Aesthetic Mappings

  • An aesthetic is a visual property (e.g., size, shape, color) of the objects (e.g., points) in our plot.

  • We can display a point in different ways by changing the values of its aesthetic properties.

Overplotting

  • Many points overlap each other in a scatterplot with a sufficiently large number of data points.
    • This problem is known as overplotting.
    • Overplotting can obscure patterns and outliers, leading to potentially misleading conclusions.
  • The simplest solution to overplotting is to make the points semi-transparent by setting alpha manually.
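
As a minimal sketch of that fix, the code below draws the mpg scatterplot with semi-transparent points. The value 0.3 is only an illustrative choice, and the object name p_alpha is hypothetical.

```r
library(ggplot2)

# alpha is set outside aes(), so one fixed transparency value
# applies to every point (0 = invisible, 1 = fully opaque).
p_alpha <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy),
             alpha = 0.3)

p_alpha
```

Regions where many points overlap now appear darker, so dense areas remain visible instead of collapsing into a single dot.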

ggplot Basics

Learning Objectives

  • Aesthetic Mappings
  • Facets
  • Relationship Plots
  • Distribution Plots
  • Time Trend Plots
  • Geometric Objects

Data Visualization - First Steps

Graphing Template

ggplot(data = <DATA.FRAME>) + 
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
  • To make a ggplot plot, replace the bracketed sections in the code above with a data.frame, a geom function, and a collection of mappings such as x = VAR_1 and y = VAR_2.
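
One possible way to fill in the template is sketched below, using the mpg data.frame that ships with ggplot2. The choice of cty and hwy and the object name p_template are illustrative assumptions.

```r
library(ggplot2)

# <DATA.FRAME>    -> mpg
# <GEOM_FUNCTION> -> geom_point
# <MAPPINGS>      -> x = cty, y = hwy
p_template <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = cty, y = hwy))

p_template
```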

Aesthetic Mappings

Categorical vs. Continuous Variables

  • A categorical variable is a variable whose value is one of a limited set of groups or categories.
    • Students’ letter grade (A/A-/B+/…)
    • Student Classification (Freshman/Sophomore/Junior/Senior)
    • Beer brands
    • US state/county
    • Something with a small number of categories
  • We can use as.factor(variable) to make a variable categorical.
  • A continuous variable is a variable whose value is obtained by measuring and can have a decimal or fractional value.
    • Height/weight of students
    • Time it takes to get to school
    • Fuel efficiency of a vehicle (e.g., miles per gallon)
    • Atmospheric carbon dioxide (CO2) concentrations
    • Something that can take any numeric value within a particular range
  • We can use as.numeric(variable) to make a variable continuous.
  • For data visualization, integer-type variables could be treated as either categorical or continuous, depending on the context of analysis.

  • If the values of an integer-type variable represent an intensity or an order, the integer variable could be continuous.

    • A variable of age integers (18, 19, 20, 21, …) could be continuous.
    • A variable of integer-valued MPG (27, 28, 29, 30, …) could be continuous.
  • If not, the integer variable is categorical.

    • A variable of month integers (1, 2, …, 12) could be categorical.
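
The difference shows up directly in a plot. The sketch below maps the integer variable cyl from mpg to color twice, once as-is and once wrapped in as.factor(). The object names are hypothetical.

```r
library(ggplot2)

# cyl (number of cylinders) is stored as an integer in mpg.
class(mpg$cyl)

# Mapped as-is, cyl is treated as continuous: a color gradient.
p_continuous <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy,
                           color = cyl))

# Wrapped in as.factor(), cyl is treated as categorical:
# one distinct color per level.
p_categorical <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy,
                           color = as.factor(cyl)))

levels(as.factor(mpg$cyl))
```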

Facets

  • One way to add a variable, particularly useful for categorical variables, is to split our plot into facets: subplots that each display one subset of the data.

facet_wrap( VAR ~ . )

ggplot(data = mpg) + 
  geom_point(mapping = 
               aes(x = displ, 
                   y = hwy), 
             alpha = .5) + 
  facet_wrap( class ~ .)

  • To facet our plot by a single variable, we can use facet_wrap().

facet_wrap( VAR ~ . )

ggplot(data = mpg) + 
  geom_point(mapping = 
               aes(x = displ, 
                   y = hwy), 
             alpha = .5) + 
  facet_wrap( class ~ . , nrow = 2)

  • nrow (ncol) determines the number of rows (columns) to use when laying out the facets.

facet_grid( VAR_ROW ~ VAR_COL )

  • To facet our plot on the combination of two variables, add facet_grid( VAR_ROW ~ VAR_COL ) to our plot call.

  • The first argument of facet_grid() is also a formula.

    • This time the formula should contain two variable names separated by a ~.
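
The formula can also facet in one dimension only: a . in place of a variable name means "do not facet in this dimension". A small sketch, with hypothetical object names:

```r
library(ggplot2)

base <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy),
             alpha = .5)

# Rows only: one row of panels per value of drv.
p_rows <- base + facet_grid(drv ~ .)

# Columns only: one column of panels per value of cyl.
p_cols <- base + facet_grid(. ~ cyl)

p_rows
p_cols
```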

facet_grid( VAR_ROW ~ VAR_COL )

ggplot(data = mpg) + 
  geom_point(mapping = 
               aes(x = displ, 
                   y = hwy),
             alpha = .5) + 
  facet_grid(drv ~ cyl)

scales in Faceting

  • The scales option in facet_*() controls whether the axis scales are
    • fixed ("fixed", the default),
    • free in one dimension ("free_x", "free_y"), or
    • free in two dimensions ("free").
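
To see the effect of the option, the sketch below builds the same faceted plot with fixed and with fully free scales. It uses facet_wrap() on class as an illustrative case, and the object names are hypothetical.

```r
library(ggplot2)

base <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy),
             alpha = .5)

# Default: every panel shares the same x and y axis ranges.
p_fixed <- base + facet_wrap(class ~ .)

# "free": each panel gets x and y ranges fitted to its own data.
p_free <- base + facet_wrap(class ~ ., scales = "free")

p_fixed
p_free
```

Fixed scales make panels directly comparable, while free scales make the pattern inside each panel easier to see.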

scales in Faceting

ggplot(data = mpg) + 
  geom_point(mapping = 
               aes(x = displ, 
                   y = hwy),
             alpha = .5) + 
  facet_grid(drv ~ cyl, 
             scales = "free_x")

Making Discoveries from Data Visualization

Geometric Objects

How are these two plots similar?

  • A geom_*() is the geometrical object that a plot uses to represent data.
    • Bar charts use geom_bar() or geom_col();
    • Histograms use geom_histogram() or geom_freqpoly();
    • Line charts use geom_line();
    • Boxplots use geom_boxplot();
    • Scatterplots use geom_point();
    • Fitted lines use geom_smooth();
    • and many more!
  • We can use different geom_*() to plot the same data.
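
For instance, the same variable can be drawn with two different geoms. The sketch below uses hwy from mpg. The binwidth of 2 and the object names are illustrative assumptions.

```r
library(ggplot2)

# Same data and same mapping, drawn as bars ...
p_hist <- ggplot(data = mpg) +
  geom_histogram(mapping = aes(x = hwy),
                 binwidth = 2)

# ... and as a connected line through the bin counts.
p_freqpoly <- ggplot(data = mpg) +
  geom_freqpoly(mapping = aes(x = hwy),
                binwidth = 2)

p_hist
p_freqpoly
```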

Relationship

  • From the plots with two or more variables, we want to see co-variation, the tendency for the values of two or more variables to vary together in a related way.

  • What type of co-variation occurs between variables?

    • Are they positively associated?
    • Are they negatively associated?
    • Is there no association between them?

Describing a Relationship with Scatterplots

ggplot(data = mpg) + 
  geom_point(mapping = 
               aes(x = displ, 
                   y = hwy),
             alpha = .3)

Describing a Relationship with Fitted lines

ggplot(data = mpg) + 
  geom_smooth(mapping = 
                aes(x = displ, 
                    y = hwy))

Describing a Relationship with Scatterplots plus Fitted lines

ggplot(data = mpg) + 
  geom_point(mapping = 
               aes(x = displ, 
                   y = hwy),
             alpha = .3) +
  geom_smooth(mapping = 
                aes(x = displ, 
                    y = hwy))

Distribution

  • A distribution of a variable refers to the way its values are spread or arranged across a data.frame.
    • What type of variation occurs within a variable?
  • Variation is the tendency of the values of a variable to change from measurement to measurement.
    • We can see variation easily in real life; if we measure any continuous variable twice, we will likely get two different values.
    • Which values are the most common? Why?
    • Which values are rare? Why? Does that match your expectations?

Describing a Distribution with Histograms

ggplot(data = mpg) + 
  geom_histogram(mapping = 
                aes(x = displ))

Describing a Distribution with Boxplots

ggplot(data = mpg) + 
  geom_boxplot(mapping = 
                aes(x = class, 
                    y = hwy))

  • geom_boxplot() is used to create box plots (also known as box-and-whisker plots).
    • Box plots provide a visual summary of the distribution of a variable, showing the median, quartiles, and potential outliers.
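
The numbers behind a box plot can be checked by hand. The sketch below computes the per-class medians and, for one class, the quartiles that define the box. The object names are hypothetical.

```r
library(ggplot2)

p_box <- ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = class, y = hwy))

# Median highway mileage per class: the line inside each box.
hwy_median <- tapply(mpg$hwy, mpg$class, median)
hwy_median

# Quartiles for one class (SUVs): the box edges and the median.
quantile(mpg$hwy[mpg$class == "suv"],
         probs = c(.25, .5, .75))
```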

Describing a Distribution with Bar charts

ggplot(data = diamonds) + 
  geom_bar(mapping = 
             aes(x = cut))

  • The diamonds data.frame comes with ggplot2 and contains information about ~54,000 diamonds, including the cut variable.

Describing a Distribution with Bar charts

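  • The figure below describes how geom_bar() transforms the data.frame: it counts the number of rows for each value of x (here, cut) and draws one bar per value with a height equal to that count.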
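
The counting step that geom_bar() performs can be sketched by hand. The code below computes the counts with table() and plots them with geom_col(), which uses the values as given. The names cut_counts, p_bar, and p_col are hypothetical.

```r
library(ggplot2)

# geom_bar() counts the rows for each value of cut internally.
p_bar <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

# The same counts computed by hand; the result has columns
# cut and Freq.
cut_counts <- as.data.frame(table(cut = diamonds$cut))
cut_counts

# geom_col() plots the pre-computed counts as bar heights.
p_col <- ggplot(data = cut_counts) +
  geom_col(mapping = aes(x = cut, y = Freq))
```

Both plots show the same bars; the only difference is whether ggplot2 or we do the counting.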

Time Trend

  • A time trend plot (also known as a time series plot) is used to visualize trends, patterns, and fluctuations in a variable over a specific time period.

    • The x-axis typically represents time, while the y-axis represents the variable being measured.
  • We can check the overall direction in which the time-series variable is moving: upward, downward, or staying relatively constant over time.
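
As an offline sketch of such a plot, the code below uses the economics data.frame that ships with ggplot2 rather than the lecture's own data. The dataset choice and the object name p_trend are assumptions made for illustration.

```r
library(ggplot2)

# economics holds monthly US economic indicators;
# date is a Date column and unemploy is the number of
# unemployed persons (in thousands).
p_trend <- ggplot(data = economics) +
  geom_line(mapping = aes(x = date, y = unemploy))

p_trend
```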

Describing Time Trend with Line charts

library(tidyverse)  # provides read_csv() and ggplot()

path <- 
  "https://bcdanl.github.io/data/NVDA.csv"
nvda <- read_csv(path)

ggplot(data = nvda) + 
  geom_line(mapping = 
                aes(x = Date, 
                    y = Close))

  • The nvda data.frame includes NVIDIA’s stock information from 2019-01-02 to 2024-03-04.

Data Visualization with ggplot()

  • Step 1. Figure out whether the variables of interest are categorical or continuous.

  • Step 2. Think about which geometric objects, aesthetic mappings, and faceting are appropriate to visualize distributions and relationships.

  • Step 3. If needed, transform a given data.frame (e.g., filtered observations, new variables, summarized data) and try new visualizations.
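
A rough sketch of Step 3 is shown below: the mpg data.frame is filtered and summarized with dplyr, and the result is plotted. The specific filter, the new variable mean_hwy, and the object names are hypothetical choices for illustration.

```r
library(dplyr)
library(ggplot2)

mpg_by_class <- mpg %>%
  filter(class != "2seater") %>%    # filtered observations
  group_by(class) %>%
  summarize(mean_hwy = mean(hwy))   # summarized data, new variable

# class is categorical and mean_hwy is continuous,
# so a bar per class is one reasonable choice.
p_mean <- ggplot(data = mpg_by_class) +
  geom_col(mapping = aes(x = class, y = mean_hwy))

p_mean
```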

Geometric objects

  • A distribution of a categorical variable (e.g., geom_bar() and more)
  • A distribution of a continuous variable (e.g., geom_histogram() and more)
  • A relationship between two categorical variables (e.g., geom_bar() and more)
  • A relationship between two continuous variables (e.g., geom_point() with geom_smooth() and more)
  • A relationship between a categorical variable and a continuous variable (e.g., geom_boxplot() and more)
  • A time trend of a categorical variable (e.g., geom_bar() and more)
  • A time trend of a continuous variable (e.g., geom_line() and more)

Geometric Objects

  • Every geom function in ggplot2 takes a mapping argument.

  • However, not every aesthetic works with every geom.

    • We could set the shape of a point, but we could not set the shape of a line;
    • We could set the linetype of a line.
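
The two cases can be sketched side by side: shape mapped in a point layer and linetype mapped in a line layer. Using method = lm and se = FALSE here is only to keep the lines simple, and the object names are hypothetical.

```r
library(ggplot2)

# shape is understood by geom_point(): one point shape per drv.
p_shape <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy,
                           shape = drv))

# linetype is understood by line geoms such as geom_smooth():
# one line pattern per drv.
p_linetype <- ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy,
                            linetype = drv),
              method = lm, se = FALSE)

p_shape
p_linetype
```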

geom_smooth()

ggplot( data = mpg ) + 
  geom_smooth( mapping = 
                 aes( x = displ, 
                      y = hwy, 
                      linetype = drv) )

geom_smooth(method = lm)

  • Setting method = lm manually in geom_smooth() fits a straight line (a linear regression line) to the data points.
ggplot( data = mpg ) + 
  geom_smooth( mapping = 
                 aes( x = displ, 
                      y = hwy),
               method = lm)

geom_smooth(group = CATEGORICAL_VAR)

  • We can set the group aesthetic to a categorical variable to draw multiple objects.
    • ggplot2 will draw a separate object for each unique value of the grouping variable.

geom_smooth(group = CATEGORICAL_VAR)

ggplot(data = mpg) +
  geom_smooth(mapping = 
                aes(x = displ, 
                    y = hwy))

geom_smooth(group = CATEGORICAL_VAR)

ggplot(data = mpg) +
  geom_smooth(mapping = 
                aes(x = displ, 
                    y = hwy, 
                    group = drv))

  • In practice, ggplot2 will automatically group the data for these geoms whenever we map an aesthetic to a categorical variable (as in the linetype example).
ggplot(data = mpg) +
  geom_smooth(
    mapping = aes(x = displ, 
                  y = hwy, 
                  color = drv),
    show.legend = FALSE
  )

  • To display multiple geometric objects in the same plot, add multiple geom_*() functions to ggplot():
ggplot(data = mpg) + 
  geom_point(mapping = 
               aes(x = displ, 
                   y = hwy),
             alpha = .3) +
  geom_smooth(mapping = 
                aes(x = displ, 
                    y = hwy)) +
  geom_smooth(mapping = 
                aes(x = displ, 
                    y = hwy), 
              method = lm, 
              color = 'red')

Layers

  • Using geom_point(), geom_smooth(), and geom_smooth(method = lm) together is an excellent option to visualize the relationship between the two variables.

Layers

  • If we place mappings in a geom function, ggplot2 will treat them as local mappings for that layer only; mappings placed in ggplot() are global and apply to every layer.
ggplot(data = mpg, 
       mapping = 
         aes(x = displ, 
             y = hwy)) + 
  geom_point(mapping = 
               aes(color = class),
             alpha = .3) + 
  geom_smooth()
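
To make the distinction concrete, the sketch below builds the plot above in two equivalent ways, once with global x and y mappings and once with the mappings repeated locally in each layer. The object names p_global and p_local are hypothetical.

```r
library(ggplot2)

# Global mappings: x and y are declared once in ggplot()
# and inherited by both layers; color is local to the points.
p_global <- ggplot(data = mpg,
                   mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class), alpha = .3) +
  geom_smooth()

# Local mappings only: the same plot, with x and y
# written out in every layer.
p_local <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy,
                           color = class),
             alpha = .3) +
  geom_smooth(mapping = aes(x = displ, y = hwy))
```

Because color is mapped only inside geom_point(), the smooth line is fitted to all cars together rather than once per class.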

Multiple data.frames

df_subcompact <- filter(mpg, class == "subcompact")
  • We can use the same idea to specify different data for each layer.

  • Here, our smooth line displays just a subset of the mpg data.frame, the subcompact cars.

    • filter() is the tidyverse way (from the dplyr package) to filter observations in a data.frame.
  • The local data argument in geom_smooth() overrides the global data argument in ggplot() for that layer only.

Multiple data.frames

ggplot(data = mpg, 
       mapping = 
         aes(x = displ, 
             y = hwy)) + 
  geom_point(mapping = 
               aes(color = class), 
             alpha = .3) + 
  geom_smooth(data = df_subcompact, 
              se = FALSE)

  • se = FALSE hides the confidence band that geom_smooth() draws around the fitted line by default. The band is built from the standard error (se), which measures how uncertain the fitted values are.
