Lecture 17

Data Visualization with ggplot; Relationship ggplot()

Byeong-Hak Choe

SUNY Geneseo

October 29, 2024

Data Visualization with ggplot

Grammar of Graphics

  • A grammar of graphics is a tool that enables us to concisely describe the components of a graphic.

Data Visualization - First Steps

library(tidyverse)
mpg
?mpg
  • The mpg data frame, provided by ggplot2, contains observations collected by the US Environmental Protection Agency on 38 models of car.

  • Q. Do cars with big engines use more fuel than cars with small engines?

    • displ: a car’s engine size, in liters.
    • hwy: a car’s fuel efficiency on the highway, in miles per gallon (mpg).
  • What does the relationship between engine size and fuel efficiency look like?
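Before plotting, it can help to check the size of the data and the range of the two variables in the question. The lines below are a minimal sketch that loads only ggplot2 (rather than the full tidyverse) and uses base R functions; the object names n_obs, n_vars, n_models, and displ_hwy_summary are chosen here for illustration.

```r
library(ggplot2)  # provides the mpg data frame

# Size of the data: one row per observation, one column per variable
n_obs  <- nrow(mpg)
n_vars <- ncol(mpg)

# Number of distinct car models in the data
n_models <- length(unique(mpg$model))

# Numeric summary of the two variables used in the question
displ_hwy_summary <- summary(mpg[, c("displ", "hwy")])

print(c(observations = n_obs, variables = n_vars, models = n_models))
print(displ_hwy_summary)
```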

Data Visualization - First Steps

Creating a Scatterplot with ggplot

ggplot( data = mpg,
        mapping = 
          aes(x = displ, 
              y = hwy) ) + 
  geom_point()

  • Running the code above plots mpg with displ on the x-axis and hwy on the y-axis.

Data Visualization - First Steps

Components in the Grammar of Graphics

ggplot( data = DATA.FRAME,
        mapping = 
          aes( MAPPINGS ) ) + 
  GEOM_FUNCTION()
  • A ggplot graphic is a mapping of variables in data to aesthetic attributes of geometric objects.

  • Three Essential Components in ggplot() Graphics:

    1. data: data.frame containing the variables of interest.
    2. geom_*(): geometric object in the plot (e.g., point, line, bar, histogram, boxplot).
    3. aes(): aesthetic attributes of the geometric object (e.g., x-axis, y-axis, color, shape, size, fill) mapped to variables in the data.frame.
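As one way to read the template, the sketch below fills in the three components with a different choice of variables. It assumes the cty column of mpg (city miles per gallon) as the y variable purely for illustration; the name p_city is also made up for this example.

```r
library(ggplot2)

# 1. data: the mpg data frame
# 2. geom: points, via geom_point()
# 3. aes:  engine size (displ) on x, city fuel efficiency (cty) on y
p_city <- ggplot(data = mpg,
                 mapping = aes(x = displ, y = cty)) +
  geom_point()

p_city
```

Changing any one of the three components (another data frame, another geom, another mapping) gives a different graphic without changing the overall structure of the code.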

Data Visualization - First Steps

Creating a Scatterplot with ggplot

ggplot( data = mpg,
        mapping = 
          aes(x = displ, 
              y = hwy) ) + 
  geom_point()
  • Three Essential Components in This Particular ggplot():
    1. data = mpg
    2. geom_point()
    3. aes(x = displ, y = hwy)

Relationship ggplot()

Scatterplot with geom_point()

ggplot( data = mpg,
        mapping = 
          aes(x = displ, 
              y = hwy) ) + 
  geom_point()

Relationship ggplot()

Fitted Curve with geom_smooth()

ggplot( data = mpg,
        mapping = 
          aes(x = displ, 
              y = hwy) ) + 
  geom_smooth()

Relationship ggplot()

geom_point() with geom_smooth()

# To add a layer of 
# a `ggplot()` component, 
# we can simply add it to 
# the `ggplot()` with `+`.

ggplot( data = mpg,
        mapping = 
          aes(x = displ, 
              y = hwy) ) + 
  geom_point() + 
  geom_smooth()

  • The geometric object geom_smooth() draws a smooth curve fitted to the data.

ggplot() workflow

Common problems in ggplot()

# Incorrect: the last line starts with `+`,
# so geom_smooth() is not added to the plot
ggplot(data = mpg,
       mapping = 
          aes(x = displ, 
              y = hwy) ) +
 geom_point()
 + geom_smooth()
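  • A common mistake when creating ggplot2 graphics is putting the + in the wrong place.
    • R treats a line as finished whenever it forms a complete expression, so a + at the start of the next line is not attached to the plot above it.
    • Correct Approach: Always place the + at the end of the previous line, NOT at the beginning of the next line.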
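For comparison, here is a sketch of the same plot with every + at the end of a line. Storing the result in a variable named p is a choice made here for illustration; typing the expression directly works the same way.

```r
library(ggplot2)

# Each line that continues the plot ends with `+`,
# so R keeps reading until the last layer.
p <- ggplot(data = mpg,
            mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

p
```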

Relationship ggplot()

About geom_smooth()

  • geom_smooth() uses regression, one of the machine learning methods, to visualize the predicted value of the y variable for a given value of the x variable.

  • What Does the Grey Ribbon Represent?

    • The grey ribbon illustrates the uncertainty around the estimated prediction curve.
    • By default, the ribbon is a 95% confidence interval: we are 95% confident that the actual relationship between the x and y variables falls within the grey ribbon.
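The ribbon can be adjusted through the level and se arguments of geom_smooth(). The sketch below is one way to see their effect; the object names (base, p_95, p_99, p_none, width_95, width_99) are made up for this example, and layer_data() is used only to read back the ribbon bounds that ggplot2 computed.

```r
library(ggplot2)

base <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy))

# Default: ribbon drawn at the 95% confidence level
p_95 <- base + geom_smooth()

# Wider ribbon: 99% confidence level
p_99 <- base + geom_smooth(level = 0.99)

# No ribbon at all
p_none <- base + geom_smooth(se = FALSE)

# Average height of the ribbon (ymax - ymin) at each level
width_95 <- with(layer_data(p_95, 1), mean(ymax - ymin))
width_99 <- with(layer_data(p_99, 1), mean(ymax - ymin))
print(c(level_95 = width_95, level_99 = width_99))
```

A higher confidence level gives a wider ribbon, because a wider band is needed to be more confident that it covers the actual relationship.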

Relationship ggplot()

geom_point() with geom_smooth(method = "lm")

ggplot( data = mpg,
        mapping = 
          aes(x = displ, 
              y = hwy) ) + 
  geom_point() + 
  geom_smooth(method = "lm")

  • method = "lm" specifies that the curve is fitted with a linear model (lm), also called a linear regression model, so the fitted curve is a straight line.
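The straight line can also be fitted outside the plot with lm() from base R. The sketch below assumes the default formula used by geom_smooth(method = "lm"), which is y ~ x, so the matching model here is hwy ~ displ; the names fit, line_coefs, and pred_3l and the 3-liter engine are illustrative choices.

```r
library(ggplot2)  # for the mpg data frame

# Fit the same straight line outside the plot
fit <- lm(hwy ~ displ, data = mpg)

# Intercept and slope of the line
line_coefs <- coef(fit)
print(line_coefs)

# Predicted highway mpg for a hypothetical 3-liter engine
pred_3l <- predict(fit, newdata = data.frame(displ = 3))
print(pred_3l)
```

The sign of the displ coefficient gives the direction of the line: a negative slope means predicted hwy falls as displ rises.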

Relationship ggplot()

  • How many points are in this plot?
  • How many observations are in the mpg data.frame?

Relationship ggplot()

Overplotting problem

  • Many points overlap each other.

    • This problem is known as overplotting.
  • When points overlap, it’s hard to know how many data points are at a particular location.

  • Overplotting can obscure patterns and outliers, leading to potentially misleading conclusions.
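One way to check how much overplotting there is: compare the number of observations with the number of distinct (displ, hwy) locations. The sketch below uses base R only; the names n_obs, n_locations, and per_location are made up for this example, and the model column is used merely as something to count within each location.

```r
library(ggplot2)  # for the mpg data frame

# Number of observations (rows) in mpg
n_obs <- nrow(mpg)

# Number of distinct (displ, hwy) locations, i.e. visible points
n_locations <- nrow(unique(mpg[, c("displ", "hwy")]))

# How many observations sit at each location, most crowded first
per_location <- aggregate(model ~ displ + hwy, data = mpg, FUN = length)
names(per_location)[3] <- "n"
per_location <- per_location[order(-per_location$n), ]

print(c(observations = n_obs, locations = n_locations))
print(head(per_location))
```

If n_locations is smaller than n_obs, some points in the scatterplot stand for more than one observation.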

Relationship ggplot()

Overplotting and Transparency with alpha

# alpha = 0.33 should be located
# within the geom function,
# NOT within the aesthetic function

ggplot( data = mpg,
        mapping = 
          aes(x = displ, 
              y = hwy) ) + 
  geom_point( alpha = 0.33 ) 

  • We can manually set a transparency level (alpha) between 0 (fully transparent) and 1 (fully opaque).

Relationship ggplot()

Overplotting and Transparency with alpha

ggplot( data = mpg,
        mapping = 
          aes(x = displ, 
              y = hwy) ) + 
  geom_point( alpha = .33 )

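  • To set an aesthetic property manually to a fixed value, as seen above, pass it as an argument to the geom_*() function, not inside the aes() function. Inside aes(), a value is treated as data to be mapped rather than as a fixed setting.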
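The difference between setting and mapping an aesthetic can be sketched with color. The color "blue" and the names base, p_set, and p_mapped are illustrative choices, not part of the original slides.

```r
library(ggplot2)

base <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy))

# Setting: every point is blue with alpha = 0.33
p_set <- base + geom_point(color = "blue", alpha = 0.33)

# Mapping: inside aes(), "blue" is treated as data (a single category
# labeled "blue"), so ggplot2 picks a color from its default scale
# and adds a legend
p_mapped <- base + geom_point(aes(color = "blue"))

p_set
p_mapped
```

In the second plot the points are not drawn in blue, which is why fixed values such as alpha = 0.33 belong in the geom function.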

Relationship ggplot()

  • Be mindful of the variable placement on the axes.
    • It’s common practice to place the input variable along the x-axis and the outcome variable along the y-axis.
  • Input Variable: Represents the potential “cause.”
  • Outcome Variable: Represents the potential “effect.”
  • Example: Advertising budget (input) and sales revenue (outcome).

Relationship ggplot()

Correlation does not imply causation

  • Just because you uncover a relationship doesn’t mean you’ve identified the “causal” relationship.

Relationship ggplot()

  • Caution: Correlation does not imply causation
    • A strong correlation does not imply that one variable causes the other to change.
  • Correlation measures the strength and direction of a linear relationship between two variables.
    • Positive/negative correlation
    • Strong/weak correlation
    • No correlation
  • Causation indicates that one variable directly influences or causes a change in another.
    • Establishing causation requires controlled experimentation or additional evidence.
    • E.g., Smoking causes an increase in lung cancer risk (causation).
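The strength and direction of a linear relationship can be computed with cor() from base R. The sketch below applies it to displ and hwy from the earlier plots; it assumes the Pearson correlation, which is the default of cor(), and the name r is chosen here for illustration.

```r
library(ggplot2)  # for the mpg data frame

# Pearson correlation between engine size and highway fuel efficiency
r <- cor(mpg$displ, mpg$hwy)
print(r)
```

The sign of r gives the direction and its distance from 0 gives the strength of the linear relationship. The number alone says nothing about whether one variable causes the other to change.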