Lecture 22

Relationship ggplot()

Byeong-Hak Choe

SUNY Geneseo

October 28, 2024

Relationship ggplot()

Scatterplot with geom_point()

ggplot( data = mpg,
        mapping = 
          aes(x = displ, 
              y = hwy) ) + 
  geom_point()

Fitted Curve with geom_smooth()

ggplot( data = mpg,
        mapping = 
          aes(x = displ, 
              y = hwy) ) + 
  geom_smooth()

geom_point() with geom_smooth()

# To add another layer
# to a plot, append it
# to the `ggplot()` call
# with `+`.

ggplot( data = mpg,
        mapping = 
          aes(x = displ, 
              y = hwy) ) + 
  geom_point() + 
  geom_smooth()

  • The geometric object geom_smooth() draws a smooth curve fitted to the data.
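
When no method is given, geom_smooth() picks one automatically and reports its choice in a message. For a data set with fewer than 1,000 observations, such as mpg, that choice is a loess smoother. As a minimal sketch, writing the default out explicitly should draw the same curve without the message. The object name p_loess is only illustrative.

```r
library(ggplot2)

# Same scatterplot and smooth curve as above, with the
# smoothing method and formula spelled out explicitly.
p_loess <- ggplot(data = mpg,
                  mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x)

p_loess  # print the plot
```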

ggplot() workflow

Common problems in ggplot()

ggplot(data = mpg,
       mapping = 
          aes(x = displ, 
              y = hwy) ) +
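 geom_point()
 + geom_smooth()  # Error: this line starts with `+`

  • A common mistake when creating ggplot2 graphics is putting the + in the wrong place.
    • Correct Approach: Always place the + at the end of the previous line, NOT at the beginning of the next line. Otherwise, R treats the line ending in geom_point() as a complete command and then fails on the line that starts with +.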
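
One way to see the rule in action is to build the plot one layer at a time. In this sketch, every + sits either at the end of a line or between two objects on the same line, so R always knows that more is coming. The object name p is only illustrative.

```r
library(ggplot2)

# Start with the scatterplot layer. The first command's line ends
# with `+`, so R keeps reading the next line as part of it.
p <- ggplot(data = mpg,
            mapping = aes(x = displ, y = hwy)) +
  geom_point()

# Add the smooth curve to the saved plot; `+` sits between two objects.
p <- p + geom_smooth()

p  # print the finished plot
```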

About geom_smooth()

  • geom_smooth() uses regression, one of the machine learning methods, to visualize the predicted value of the y variable for a given value of the x variable.

  • What Does the Grey Ribbon Represent?

    • The grey ribbon illustrates the uncertainty around the estimated prediction curve.
    • By default, the ribbon is a 95% confidence interval: we are 95% confident that the actual relationship between the x and y variables falls within it.
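
The ribbon can be adjusted through two arguments of geom_smooth(): se turns it on or off, and level sets the confidence level. The sketch below compares three versions of the same plot. The object names (base, p_default, p_wide, p_none) are only illustrative.

```r
library(ggplot2)

# Shared scatterplot that each version builds on
base <- ggplot(data = mpg,
               mapping = aes(x = displ, y = hwy)) +
  geom_point()

# Default: ribbon shown at the 95% confidence level
p_default <- base + geom_smooth()

# Higher confidence level, so the ribbon gets wider
p_wide <- base + geom_smooth(level = 0.99)

# No ribbon at all, only the fitted curve
p_none <- base + geom_smooth(se = FALSE)

p_wide  # print one of the plots
```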

geom_point() with geom_smooth(method = "lm")

ggplot( data = mpg,
        mapping = 
          aes(x = displ, 
              y = hwy) ) + 
  geom_point() + 
  geom_smooth(method = "lm")

  • method = "lm" specifies that a linear model (lm), also called a linear regression model, is used to fit the line.
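
To connect the line in the plot to the underlying model, the same regression can be fitted directly with R's lm() function. This is a sketch that assumes the default formula y ~ x used by geom_smooth(method = "lm"), which here corresponds to hwy ~ displ. The object name fit is only illustrative.

```r
library(ggplot2)  # provides the mpg data set

# Fit highway mileage (hwy) as a linear function of engine displacement (displ)
fit <- lm(hwy ~ displ, data = mpg)

# Intercept and slope of the fitted straight line
coef(fit)
```

The slope is negative, which matches the downward-sloping line in the plot: cars with larger engines tend to have lower highway mileage.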

  • How many points are in this plot?
  • How many observations are in the mpg data.frame?

Overplotting problem

  • Many points overlap each other.

    • This problem is known as overplotting.
  • When points overlap, it’s hard to know how many data points are at a particular location.

  • Overplotting can obscure patterns and outliers, leading to potentially misleading conclusions.
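
A quick check of how much overlap there is: the sketch below compares the number of observations in mpg with the number of distinct (displ, hwy) positions. The object names n_obs and n_points are only illustrative.

```r
library(ggplot2)  # provides the mpg data set

# Number of observations (rows) in mpg
n_obs <- nrow(mpg)

# Number of distinct (displ, hwy) combinations,
# i.e., distinct point locations in the scatterplot
n_points <- nrow(unique(mpg[, c("displ", "hwy")]))

c(observations = n_obs, distinct_points = n_points)
```

If the second number is smaller than the first, several observations share the same location and are drawn on top of each other.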

Overplotting and Transparency with alpha

# alpha = 0.33 should be located
# within the geom function,
# NOT within the aesthetic function

ggplot( data = mpg,
        mapping = 
          aes(x = displ, 
              y = hwy) ) + 
  geom_point( alpha = 0.33 ) 

  • We can manually set the transparency level (alpha) to a value between 0 (fully transparent) and 1 (fully opaque).

  • We can set an aesthetic property manually by passing it as an argument to the geom_*() function, outside the aes() function, as seen above.
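
The same rule applies to other aesthetic properties such as color and size. The sketch below contrasts setting a fixed value with a common mistake, placing the fixed value inside aes(). The object names p_set and p_mapped are only illustrative.

```r
library(ggplot2)

# Set manually: fixed values go outside aes(),
# so every point is blue, size 2, and semi-transparent.
p_set <- ggplot(data = mpg,
                mapping = aes(x = displ, y = hwy)) +
  geom_point(color = "blue", size = 2, alpha = 0.33)

# Common mistake: inside aes(), "blue" is treated as a data value,
# not as a color name, so the points are not drawn in blue.
p_mapped <- ggplot(data = mpg,
                   mapping = aes(x = displ, y = hwy)) +
  geom_point(aes(color = "blue"))

p_set  # print the correctly styled plot
```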

  • Be mindful of the variable placement on the axes.
    • It’s common practice to place the input variable along the x-axis and the outcome variable along the y-axis.
  • Input Variable: Represents the potential “cause.”
  • Outcome Variable: Represents the potential “effect.”
  • Example: Advertising budget (input) and sales revenue (outcome).

Correlation does not imply causation

  • Just because you uncover a relationship doesn’t mean you’ve identified the “causal” relationship.

  • Caution: Correlation does not imply causation
    • A strong correlation does not imply that one variable causes the other to change.
  • Correlation measures the strength and direction of a linear relationship between two variables.
    • Positive/negative correlation
    • Strong/weak correlation
    • No correlation
  • Causation indicates that one variable directly influences or causes a change in another.
    • Establishing causation requires controlled experimentation or additional evidence.
    • E.g., smoking causes an increase in lung cancer risk (causation).
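
As a small illustration of the correlation side, the sketch below computes the correlation between displ and hwy in the mpg data with R's cor() function, which returns the Pearson correlation by default. The object name r is only illustrative.

```r
library(ggplot2)  # provides the mpg data set

# Correlation between engine displacement and highway mileage;
# the result is always between -1 and 1.
r <- cor(mpg$displ, mpg$hwy)
r
```

The value is negative, in line with the downward pattern in the scatterplot. On its own, it describes how the two variables move together; it does not show that one causes the other.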