Lecture 3

ggplot basics

Byeong-Hak Choe

SUNY Geneseo

January 29, 2025

Learning Objectives

  • ggplot grammar

Data Visualization with ggplot()

Grammar of Graphics

A grammar of graphics is a tool that enables us to concisely describe the components of a graphic

Aesthetic Mappings

Aesthetic Mappings

  • An aesthetic is a visual property (e.g., size, shape, color) of the objects (e.g., class) in your plot.

  • You can display a point in different ways by changing the values of its aesthetic properties.

Aesthetic Mappings

Adding a color to the plot

ggplot(data = mpg) + 
  geom_point(mapping = 
               aes(x = displ, 
                   y = hwy, 
                   color = class) )

Aesthetic Mappings

Adding a shape to the plot

ggplot(data = mpg) + 
  geom_point(mapping = 
               aes(x = displ, 
                   y = hwy, 
                   shape = class) )

Aesthetic Mappings

Adding a size to the plot

ggplot(data = mpg) + 
  geom_point(mapping = 
               aes(x = displ, 
                   y = hwy, 
                   size = class) )

Aesthetic Mappings

Adding an alpha (transparency) to the plot

ggplot(data = mpg) + 
  geom_point(mapping = 
               aes(x = displ, 
                   y = hwy, 
                   alpha = class) )

Aesthetic Mappings

Overplotting problem

  • Many points overlap each other.

    • This problem is known as overplotting.
  • When points overlap, it’s hard to know how many data points are at a particular location.

  • Overplotting can obscure patterns and outliers, leading to potentially misleading conclusions.

  • We can set a transparency level (alpha) between 0 (full transparency) and 1 (no transparency).

Aesthetic Mappings

Overplotting and alpha

ggplot(data = mpg) + 
  geom_point(mapping = 
               aes(x = displ, 
                   y = hwy),
             alpha = .2)

Aesthetic Mappings

Overplotting and size

  • We can also consider setting size smaller than 1.
ggplot(data = mpg) + 
  geom_point(mapping = 
               aes(x = displ, 
                   y = hwy),
             size = .5)

Aesthetic Mappings

Specifying a color to the plot

ggplot(data = mpg) + 
  geom_point(mapping = 
               aes(x = displ, 
                   y = hwy), 
             color = "blue")

Aesthetic Mappings

  • To set an aesthetic manually, set the aesthetic by name as an argument of your geom_ function; i.e. it goes outside of aes().
    • You’ll need to pick a level that makes sense for that aesthetic:
      • The name of a color as a character string.
      • The size of a point in mm.
      • The shape of a point as a number, as shown below.

Aesthetic Mappings

Specifying a color to the plot?

ggplot(data = mpg) + 
  geom_point( mapping = 
                aes(x = displ, 
                    y = hwy, 
                    color = "blue") )

Common problems in ggplot()

  • One common problem when creating ggplot2 graphics is to put the + in the wrong place.
ggplot(data = mpg) 
 + geom_point( mapping = 
                 aes(x = displ, 
                     y = hwy) )

Facets

Facets

  • One way to add a variable, particularly useful for categorical variables, is to use facets to split our plot into facets, subplots that each display one subset of the data.

Facets

  • To facet our plot by a single variable, use facet_wrap().
ggplot(data = mpg) + 
  geom_point(mapping = 
               aes(x = displ, 
                   y = hwy), 
             alpha = .5) + 
  facet_wrap( . ~ class, nrow = 2)

Facets

  • To facet our plot on the combination of two variables, add facet_grid( VAR_ROW ~ VAR_COL ) to our plot call.

Facets

  • The first argument of facet_grid() is also a formula.
    • This time the formula should contain two variable names separated by a ~.

Facets

ggplot(data = mpg) + 
  geom_point(mapping = 
               aes(x = displ, 
                   y = hwy),
             alpha = .5) + 
  facet_grid(drv ~ cyl)

Facets

  • Option scales in facet_*() is whether scales is
    • fixed ("fixed", the default),
    • free in one dimension ("free_x", "free_y"), or
    • free in two dimensions ("free").

Facets

ggplot(data = mpg) + 
  geom_point(mapping = 
               aes(x = displ, 
                   y = hwy),
             alpha = .5) + 
  facet_grid(drv ~ cyl, 
             scales = "free_x")

Geometric Objects

Geometric Objects

How are these two plots similar?

Geometric Objects

  • A geom_*() is the geometrical object that a plot uses to represent data.
    • Bar charts use geom_bar();
    • Line charts use geom_line();
    • Boxplots use the geom_boxplot();
    • Scatterplots use the geom_point();
    • Fitted lines use the geom_smooth();
    • and many more!
  • We can use different geom_*() to plot the same data.

Geometric Objects

Scatterplot

ggplot(data = mpg) + 
  geom_point(mapping = 
               aes(x = displ, 
                   y = hwy),
             alpha = .3)

Geometric Objects

Fitted lines

ggplot(data = mpg) + 
  geom_smooth(mapping = 
                aes(x = displ, 
                    y = hwy))

Geometric Objects

  • Every geom function in ggplot2 takes a mapping argument.

  • However, not every aesthetic works with every geom.

    • We could set the shape of a point, but you couldn’t set the shape of a line;
    • We could set the linetype of a line.

Geometric Objects

ggplot( data = mpg ) + 
  geom_smooth( mapping = 
                 aes( x = displ, 
                      y = hwy, 
                      linetype = drv) )

Geometric Objects

  • Setting method = lm manually in geom_smooth() gives a straight line that fits into data points.
ggplot( data = mpg ) + 
  geom_smooth( mapping = 
                 aes( x = displ, 
                      y = hwy),
               method = lm)

Geometric Objects

  • We can set the group aesthetic to a categorical variable to draw multiple objects.
    • ggplot2 will draw a separate object for each unique value of the grouping variable.

Geometric Objects

ggplot(data = mpg) +
  geom_smooth(mapping = 
                aes(x = displ, 
                    y = hwy))

Geometric Objects

ggplot(data = mpg) +
  geom_smooth(mapping = 
                aes(x = displ, 
                    y = hwy, 
                    group = drv))

Geometric Objects

  • In practice, ggplot2 will automatically group the data for these geoms whenever we map an aesthetic to a discrete variable (as in the linetype example).
ggplot(data = mpg) +
  geom_smooth(
    mapping = aes(x = displ, 
                  y = hwy, 
                  color = drv),
    show.legend = FALSE
  )

Geometric Objects

  • To display multiple geometric objects in the same plot, add multiple geom_*() functions to ggplot():
ggplot(data = mpg) + 
  geom_point(mapping = 
               aes(x = displ, 
                   y = hwy),
             alpha = .3) +
  geom_smooth(mapping = 
                aes(x = displ, 
                    y = hwy)) +
  geom_smooth(mapping = 
                aes(x = displ, 
                    y = hwy), 
              method = lm, 
              color = 'red')

  • Using geom_point(), geom_smooth(), and geom_smooth(method = lm) together is an excellent option to visualize the relationship between the two variables.

Geometric Objects

  • If we place mappings in a geom function, ggplot2 will treat them as local mappings for the layer.
ggplot(data = mpg, 
       mapping = 
         aes(x = displ, 
             y = hwy)) + 
  geom_point(mapping = 
               aes(color = class),
             alpha = .3) + 
  geom_smooth()

Geometric Objects

ggplot(data = mpg, 
       mapping = 
         aes(x = displ, 
             y = hwy)) + 
  geom_point(mapping = 
               aes(color = class), 
             alpha = .3) + 
  geom_smooth(data = 
                filter(mpg, 
                       class == "subcompact"), 
              se = FALSE)

Geometric Objects

  • We can use the same idea to specify different data for each layer.

  • Here, our smooth line displays just a subset of the mpg data.frame, the subcompact cars.

    • filter() is the tidyverse-way to filter observations in a data.frame.
  • The local data argument in geom_smooth() overrides the global data argument in ggplot() for that layer only.

  • The standard error (se) tells us how much the predicted values from a model might differ from the actual values we’re trying to predict.

    • It’s a way to measure how good a quality of prediction is.

Statistical Transformation

Statistical Transformation

  • Many graphs, including bar charts, calculate new values to plot:

    • geom_bar(), geom_histogram(), and geom_freqpoly() bin our data and then plot bin counts, the number of observations that fall in each bin.

    • geom_boxplot() computes a summary of the distribution and then display a specially formatted box.

    • geom_smooth() fits a model to our data and then plot predictions from the model.

Statistical Transformation

  • A histogram geom_histogram() is a continuous version of a bar chart.
ggplot(diamonds) +
  geom_histogram(aes(x = price))

Statistical Transformation

  • When using geom_histogram(), we should experiment on either bins or binwidth.
ggplot(diamonds) +
  geom_histogram(aes(x = price), 
                 bins = 200)

Statistical Transformation

  • A frequency polygon geom_freqpoly() is a line version of a histogram.
ggplot(diamonds) +
  geom_freqpoly(aes(x = price), 
                 bins = 200)

A Little Bit of Math for log()

  • The logarithm function, \(y = \log_{b}\,(\,x\,)\), looks like ….

  • \(\log_{10}\,(\,100\,)\): the base \(10\) logarithm of \(100\) is \(2\), because \(10^{2} = 100\)

  • \(\log_{e}\,(\,x\,)\): the base \(e\) logarithm is called the natural log, where \(e = 2.718\cdots\) is the mathematical constant, the Euler’s number.

  • \(\log\,(\,x\,)\) or \(\ln\,(\,x\,)\): the natural log of \(x\) .

  • \(\log_{e}\,(\,7.389\cdots\,)\): the natural log of \(7.389\cdots\) is \(2\), because \(e^{2} = 7.389\cdots\).

The use of log()

NYC Housing Sales

library(tidyverse)
sale_df <- read_csv(
  "https://bcdanl.github.io/data/home_sales_nyc.csv")
  • sale_df contains data for residential property sales from September 2017 and August 2018 in NYC.
    • Let’s focus on sale.price, a property’s sales price.

The use of log()

Percentage Change

  1. We should consider using a logarithmic scale when percent change, or change in orders of magnitude, is more important than changes in absolute units.
  • For small changes in variable \(x\) from \(x_{0}\) to \(x_{1}\), the following can be shown:

\[\Delta \log(x) \,= \, \log(x_{1}) \,-\, \log(x_{0}) \approx\, \frac{x_{1} \,-\, x_{0}}{x_{0}} \,=\, \frac{\Delta\, x}{x_{0}}.\]

  • For example, a difference in sale.price of $10,000 means something very different across people with different income/wealth levels.

The use of log()

Wide Range of Skewed Data

  1. We should consider using a log scale when a variable is heavily skewed.
  • It can help visualize both small and large values effectively.
ggplot(sale_df, 
       aes(x = sale.price), 
       bins = 500) +
  geom_histogram()

Statistical Transformation

  • Bar charts seem simple, but they are interesting because they reveal something subtle about plots.

  • Consider a basic bar chart, as drawn with geom_bar().

  • The following bar chart displays the total number of diamonds in the ggplot2::diamonds data.frame, grouped by cut.

  • The diamonds data.frame comes in ggplot2 and contains information about ~54,000 diamonds, including the price, carat, color, clarity, and cut of each diamond.

Statistical Transformation

  • geom_bar() bins our data and then plot bin counts, the number of observations that fall in each bin.
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut))

Statistical Transformation

  • The algorithm used to calculate new values for a graph is called a stat, short for statistical transformation.

  • The figure below describes how this process works with geom_bar().

Statistical Transformation

Observed Value vs. Number of Observations

  • There are three reasons we might need to use a stat explicitly:

Statistical Transformation

Observed Value vs. Number of Observations

  • 1. We might want to override the default stat.
# to make a simple data.frame
demo <- tribble(         
  ~cut,         ~freq,
  "Fair",       1610,
  "Good",       4906,
  "Very Good",  12082,
  "Premium",    13791,
  "Ideal",      21551 )

ggplot(data = demo) +
  geom_bar(mapping = 
             aes(x = cut, 
                 y = freq), 
           stat = "identity")

Statistical Transformation

Count vs. Proportion

  • 2. We might want to override the default mapping from transformed variables to aesthetics.
ggplot(data = diamonds) + 
  geom_bar(mapping = 
             aes(x = cut, 
                 y = stat(prop), 
                 group = 1))

Statistical Transformation

Stat summary

  • 3. We might want to draw greater attention to the statistical transformation in our code.
ggplot(data = diamonds) + 
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.min = min,
    fun.max = max,
    fun = median
  )

Position Adjustment

Position Adjustment

color and fill aesthetic

  • We can color a bar chart using either the color aesthetic, or, more usefully, fill.
ggplot(data = diamonds) + 
  geom_bar(mapping = 
             aes(x = cut, 
                 color = cut))

Position Adjustment

fill

ggplot(data = diamonds) + 
  geom_bar(mapping = 
             aes(x = cut, 
                 fill = cut))

Position Adjustment

Stacked bar charts with fill aesthetic

  • Note that the bars are automatically stacked if we map the fill aesthetic to another variable.
ggplot(data = diamonds) + 
  geom_bar(mapping = 
             aes(x = cut, 
                 fill = clarity) )

Position Adjustment

Stacked bar charts with fill aesthetic

  • The stacking is performed automatically by the position adjustment specified by the position argument.
ggplot(data = diamonds) + 
  geom_bar(mapping = 
             aes(x = cut, 
                 fill = clarity),
           position = "stack")

Position Adjustment

  • If we don’t want a stacked bar chart with counts, we can use one of two other position options: fill or dodge.

  • position = "fill" works like stacking, but makes each set of stacked bars the same height.

    • This makes it easier to compare proportions across groups.
  • position = "dodge" places overlapping objects directly beside one another.

Position Adjustment

position = "fill"

ggplot(data = diamonds) + 
  geom_bar(mapping = 
             aes(x = cut, 
                 fill = clarity), 
           position = "fill")

Position Adjustment

position = "dodge"

ggplot(data = diamonds) + 
  geom_bar(mapping = 
             aes(x = cut, 
                 fill = clarity), 
           position = "dodge")

Coordinate Systems

Coordinate Systems

geom_abline()

  • What does geom_abline() do?
ggplot(data = mpg, 
       mapping = 
         aes(x = cty, 
             y = hwy)) + 
  geom_point() + 
  geom_abline() # math: y = ax + b

Coordinate Systems

coord_fixed()

  • What does coord_fixed() do?
ggplot(data = mpg, 
       mapping = 
         aes(x = cty, 
             y = hwy)) + 
  geom_point() + 
  geom_abline() +
  coord_fixed()

ggplot Grammar

The Layered Grammar of Graphics

  • Below summarizes the layered grammar of graphics—position adjustments, stats, coordinate systems, and faceting.
    • Additionally, we consider scales, themes, and guides!
ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(
     mapping = aes(<MAPPINGS>),
     stat = <STAT>, 
     position = <POSITION>) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION> +
  <SCALE_FUNCTIONS> +
  <GUIDES> +
  <THEME>