Lecture 12

Data Visualization - Geometric Objects

Byeong-Hak Choe

SUNY Geneseo

March 7, 2024

RStudio Workflow

Must-know Quarto Shortcuts

Mac

  • option + command + I: to create an R chunk
  • option + - : the shortcut for <-
  • command + return: to run the current line or selected lines
  • command + shift + return: to run the code in the R chunk
  • command + shift + C: to comment out a line
  • command + shift + K: to render a current Quarto file

Windows

  • Alt + Ctrl + I: to create an R chunk
  • Alt + - : the shortcut for <-
  • Ctrl + Enter: to run the current line or selected lines
  • Ctrl + Shift + Enter : to run the code in the R chunk
  • Ctrl + Shift + C: to comment out a line
  • Ctrl + Shift + K: to render a current Quarto file

Documenting Workflow

Must-know Shortcuts

  • Use the shortcuts below whenever you edit a document, including Quarto:

Mac

  • command + ⬆️/⬇️/⬅️/➡️

  • shift + ⬆️/⬇️/⬅️/➡️

  • command + shift + ⬆️/⬇️/⬅️/➡️

  • command + PgUp/PgDn

  • shift + PgUp/PgDn

  • command + shift + PgUp/PgDn

Windows

  • Ctrl + ⬆️/⬇️/⬅️/➡️

  • Shift + ⬆️/⬇️/⬅️/➡️

  • Ctrl + shift + ⬆️/⬇️/⬅️/➡️

  • Ctrl + PgUp/PgDn

  • Shift + PgUp/PgDn

  • Ctrl + Shift + PgUp/PgDn

ggplot Basics

Learning Objectives

  • Relationship Plots
  • Distribution Plots
  • Time Trend Plots
  • Geometric Objects

Data Visualization

Facets

Making Discoveries from Data Visualization

Key Points in Data Visualization

  • A graphic should display as much information as it can, with the lowest possible cognitive strain on the viewer.

  • Strive for clarity.

    • Make the data stand out. Specific tips for increasing clarity include these:
      • Avoid too many superimposed elements, such as too many curves in the same graphing space.
      • Avoid having the data all skewed to one side or the other of your graph.
  • Visualization is an iterative process.

    • We should try to make each visualization as informative as we can.

Geometric Objects

  • A geom_*() is the geometrical object that a plot uses to represent data.
    • Bar charts use geom_bar() or geom_col();
    • Histograms use geom_histogram() or geom_freqpoly();
    • Line charts use geom_line();
    • Boxplots use geom_boxplot();
    • Scatterplots use geom_point();
    • Fitted lines use geom_smooth();
    • and many more!
  • We can use different geom_*() to plot the same data.

Relationship

  • From the plots with two or more variables, we want to see co-variation, the tendency for the values of two or more variables to vary together in a related way.

  • What type of co-variation occurs between variables?

    • Are they positively associated?
    • Are they negatively associated?
    • Is there no association between them?
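
One way to put a number on the direction of co-variation is the correlation coefficient. The lines below are only an illustrative sketch: they assume the mpg data.frame from ggplot2 and use base R's cor(); the name r is made up for this example.

```r
library(ggplot2)  # provides the mpg data.frame

# Pearson correlation between engine displacement and highway mileage
r <- cor(mpg$displ, mpg$hwy)

# -1 means a negative association, 1 a positive association,
# and a correlation near 0 suggests little linear association
sign(r)
```

The correlation only captures linear association, so it complements a plot rather than replacing it.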

Describing a Relationship with Scatterplots

ggplot(data = mpg) + 
  geom_point(mapping = 
               aes(x = displ, 
                   y = hwy),
             alpha = .3)

Describing a Relationship with Fitted lines

ggplot(data = mpg) + 
  geom_smooth(mapping = 
                aes(x = displ, 
                    y = hwy))

Describing a Relationship with Scatterplots plus Fitted lines

ggplot(data = mpg) + 
  geom_point(mapping = 
               aes(x = displ, 
                   y = hwy),
             alpha = .3) +
  geom_smooth(mapping = 
                aes(x = displ, 
                    y = hwy))

Distribution

  • A distribution of a variable refers to the way its values are spread or arranged across a data.frame.
    • What type of variation occurs within a variable?
  • Variation refers to how much the values are spread out.
    • Which values are the most common? Why?
      • The mode of a variable is the value that appears most frequently within the set of that variable’s values.
    • Which ranges of values are relatively rare? Why? Does that match your expectations?
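
As a rough sketch of how the most common value could be found, the lines below count the values of class in mpg with base R's table(). The names class_counts and mode_class are made up for this illustration.

```r
library(ggplot2)  # provides the mpg data.frame

# Count how often each value of class appears, most frequent first
class_counts <- sort(table(mpg$class), decreasing = TRUE)

# The mode is the value with the highest count
mode_class <- names(class_counts)[1]
mode_class
```

The counts at the other end of class_counts point to the relatively rare values.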

Distribution

  • For a distribution of a continuous variable or an ordered categorical variable, we can consider skewness, a measure of the asymmetry of the distribution.
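
Skewness can also be computed directly. The helper below is a hypothetical, moment-based version written for this illustration (it is not a ggplot2 or base R function), applied to the price variable of the diamonds data.frame from ggplot2.

```r
library(ggplot2)  # provides the diamonds data.frame

# Hypothetical helper: moment-based sample skewness
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Symmetric values give a skewness of 0
skewness(c(1, 2, 3))

# A positive value indicates a longer right tail
price_skew <- skewness(diamonds$price)
price_skew
```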

Describing a Distribution with Histograms

ggplot(data = mpg) + 
  geom_histogram(mapping = 
                aes(x = displ))

Describing a Distribution with Frequency Polygons

ggplot(data = mpg) + 
  geom_freqpoly(mapping = 
                aes(x = displ))

Describing a Distribution with Histograms

# bins is an argument of geom_histogram(), not an aesthetic,
# so it goes outside aes()
ggplot(data = mpg) + 
  geom_histogram(mapping = 
                   aes(x = displ), 
                 bins = 5)

Describing a Distribution with Histograms

# binwidth is an argument of geom_histogram(), not an aesthetic,
# so it goes outside aes()
ggplot(data = mpg) + 
  geom_histogram(mapping = 
                   aes(x = displ), 
                 binwidth = 1)

Describing a Distribution with Boxplots

ggplot(data = mpg) + 
  geom_boxplot(mapping = 
                aes(x = class, 
                    y = hwy))

  • geom_boxplot() is used to create box plots (also known as box-and-whisker plots).
    • Box plots provide a visual summary of the distribution of a variable, showing the median, quartiles, and potential outliers.
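
The numbers behind a box plot can be sketched with base R. The example below assumes the hwy variable of mpg and uses the common 1.5 * IQR rule for flagging potential outliers; the names q, iqr, upper_fence, and outliers are made up for this illustration.

```r
library(ggplot2)  # provides the mpg data.frame

# Quartiles of highway mileage: the box spans the 25% to 75% values,
# and the line inside the box marks the median (50%)
q <- quantile(mpg$hwy, probs = c(0.25, 0.5, 0.75))

# Values more than 1.5 * IQR above the box are drawn as individual points
iqr <- q[["75%"]] - q[["25%"]]
upper_fence <- q[["75%"]] + 1.5 * iqr
outliers <- mpg$hwy[mpg$hwy > upper_fence]
outliers
```

In the plot above, the same calculation is done separately for each class.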

Describing a Distribution with Bar charts

ggplot(data = diamonds) + 
  geom_bar(mapping = 
             aes(x = cut))

  • The diamonds data.frame comes with ggplot2 and contains information about ~54,000 diamonds, including the cut variable.

Describing a Distribution with Bar charts

  • The figure below describes how geom_bar() transforms the data.frame.
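
The transformation can also be sketched by hand. The example below assumes dplyr is available and uses count() to build roughly the table that geom_bar() computes internally, then draws it with geom_col(); the names cut_counts and p_col are made up for this illustration.

```r
library(ggplot2)
library(dplyr)

# Count the rows for each value of cut
cut_counts <- count(diamonds, cut)
cut_counts

# geom_col() plots the values as they are, so this reproduces the bar chart
p_col <- ggplot(data = cut_counts) +
  geom_col(mapping = aes(x = cut, y = n))
p_col
```

This is the main difference between the two geoms: geom_bar() counts rows for us, while geom_col() expects the bar heights to already be in the data.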

Time Trend

  • A time trend plot (also known as a time series plot) is used to visualize trends, patterns, and fluctuations in a variable over a specific time period.

    • The x-axis typically represents time, while the y-axis represents the variable being measured.
  • We can check the overall direction in which the time-series variable is moving: upwards, downwards, or staying relatively constant over time.
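
As a small sketch that runs without downloading anything, the example below uses the economics data.frame that ships with ggplot2, with date on the x-axis and unemploy on the y-axis. The name p_trend is made up for this illustration.

```r
library(ggplot2)  # provides the economics data.frame

# Time goes on the x-axis, the measured variable on the y-axis
p_trend <- ggplot(data = economics) +
  geom_line(mapping = aes(x = date, y = unemploy))
p_trend
```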

Describing Time Trend with Line charts

library(tidyverse)  # provides read_csv() and ggplot()

path <- 
  "https://bcdanl.github.io/data/NVDA.csv"
nvda <- read_csv(path)

ggplot(data = nvda) + 
  geom_line(mapping = 
                aes(x = Date, 
                    y = Close))

  • The nvda data.frame includes NVIDIA’s stock information from 2019-01-02 to 2024-03-04.

Data Visualization with ggplot()

  • Step 1. Figure out whether the variables of interest are categorical or continuous.

  • Step 2. Think about which geometric objects, aesthetic mappings, and faceting are appropriate to visualize distributions and relationships.

  • Step 3. If needed, transform a given data.frame (e.g., filtered observations, new variables, summarized data) and try new visualizations.
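
Step 1 can be sketched with a few base R checks. The example assumes the mpg data.frame; the name var_types is made up for this illustration.

```r
library(ggplot2)  # provides the mpg data.frame

# Check the type of each variable of interest
is.numeric(mpg$displ)    # TRUE: displ is continuous
is.character(mpg$class)  # TRUE: class is categorical

# Overview of every column's type
var_types <- sapply(mpg, class)
var_types
```

A numeric type does not always mean a variable is best treated as continuous; a variable such as cyl takes only a few distinct values.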

Geometric objects

  • A distribution of a categorical variable (e.g., geom_bar() and more)
  • A distribution of a continuous variable (e.g., geom_histogram() and more)
  • A relationship between two categorical variables (e.g., geom_bar() and more)
  • A relationship between two continuous variables (e.g., geom_point() with geom_smooth() and more)
  • A relationship between a categorical variable and a continuous variable (e.g., geom_boxplot() and more)
  • A time trend of a categorical variable (e.g., geom_bar() and more)
  • A time trend of a continuous variable (e.g., geom_line() and more)
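
The relationship between two categorical variables is the one case above without an example elsewhere in this lecture. One possible sketch, assuming the class and drv variables of mpg, maps the second variable to fill; the name p_two_cat is made up for this illustration.

```r
library(ggplot2)  # provides the mpg data.frame

# Bars count the cars in each class; fill splits each bar by drive type
p_two_cat <- ggplot(data = mpg) +
  geom_bar(mapping = aes(x = class, fill = drv))
p_two_cat
```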

Geometric Objects

  • Every geom function in ggplot2 takes a mapping argument.

  • However, not every aesthetic works with every geom.

    • We could set the shape of a point, but we could not set the shape of a line;
    • We could set the linetype of a line.

geom_smooth()

ggplot( data = mpg ) + 
  geom_smooth( mapping = 
                 aes( x = displ, 
                      y = hwy, 
                      linetype = drv) )

geom_smooth(method = lm)

  • Setting method = lm in geom_smooth() draws a straight line fitted to the data points.

ggplot( data = mpg ) + 
  geom_smooth( mapping = 
                 aes( x = displ, 
                      y = hwy),
               method = lm)
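
The straight line corresponds to a linear regression of hwy on displ. As a sketch, the same model can be fitted directly with base R's lm() to read off the intercept and slope; the names fit and slope are made up for this illustration.

```r
library(ggplot2)  # provides the mpg data.frame

# Fit the same kind of model that method = lm uses: hwy as a linear function of displ
fit <- lm(hwy ~ displ, data = mpg)

# Intercept and slope of the fitted line
coef(fit)
slope <- coef(fit)[["displ"]]
```

A negative slope matches the downward-sloping line in the plot.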

geom_smooth(aes(group = CATEGORICAL_VAR))

  • We can set the group aesthetic to a categorical variable to draw multiple objects.
    • ggplot2 will draw a separate object for each unique value of the grouping variable.

geom_smooth(aes(group = CATEGORICAL_VAR))

ggplot(data = mpg) +
  geom_smooth(mapping = 
                aes(x = displ, 
                    y = hwy))

geom_smooth(aes(group = CATEGORICAL_VAR))

ggplot(data = mpg) +
  geom_smooth(mapping = 
                aes(x = displ, 
                    y = hwy, 
                    group = drv))

  • In practice, ggplot2 will automatically group the data for these geoms whenever we map an aesthetic to a categorical variable (as in the linetype example).

ggplot(data = mpg) +
  geom_smooth(
    mapping = aes(x = displ, 
                  y = hwy, 
                  color = drv),
    show.legend = FALSE
  )
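
The automatic grouping can be checked with layer_data(), which returns the data ggplot2 computed for a layer. This is only a sketch; the names p_color and n_groups are made up for this illustration.

```r
library(ggplot2)  # provides the mpg data.frame

p_color <- ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, color = drv))

# The group column shows how the rows were split into separate smooth lines
n_groups <- length(unique(layer_data(p_color)$group))
n_groups
```

The number of groups equals the number of distinct values of drv, even though group was never set explicitly.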

  • To display multiple geometric objects in the same plot, add multiple geom_*() functions to ggplot():

ggplot(data = mpg) + 
  geom_point(mapping = 
               aes(x = displ, 
                   y = hwy),
             alpha = .3) +
  geom_smooth(mapping = 
                aes(x = displ, 
                    y = hwy)) +
  geom_smooth(mapping = 
                aes(x = displ, 
                    y = hwy), 
              method = lm, 
              color = 'red')

Layers

  • Using geom_point(), geom_smooth(), and geom_smooth(method = lm) together is an excellent option to visualize the relationship between the two variables.

Layers

  • If we place mappings in a geom function, ggplot2 will treat them as local mappings for the layer.

ggplot(data = mpg, 
       mapping = 
         aes(x = displ, 
             y = hwy)) + 
  geom_point(mapping = 
               aes(color = class),
             alpha = .3) + 
  geom_smooth()

Multiple data.frames

df_subcompact <- filter(mpg, class == "subcompact")

  • We can use the same idea to specify different data for each layer.

  • Here, our smooth line displays just a subset of the mpg data.frame, the subcompact cars.

    • filter() is the tidyverse-way to filter observations in a data.frame.
  • The local data argument in geom_smooth() overrides the global data argument in ggplot() for that layer only.
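
As a sketch of what filter() does here, the example below builds the same subset with base R indexing for comparison. It assumes dplyr is loaded for filter(); the name df_base is made up for this illustration.

```r
library(ggplot2)  # provides the mpg data.frame
library(dplyr)    # provides filter()

# Keep only the rows where class is "subcompact"
df_subcompact <- filter(mpg, class == "subcompact")

# Base R equivalent, for comparison
df_base <- mpg[mpg$class == "subcompact", ]

# Both approaches keep the same number of rows
nrow(df_subcompact)
nrow(df_base)
```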

Multiple data.frames

ggplot(data = mpg, 
       mapping = 
         aes(x = displ, 
             y = hwy)) + 
  geom_point(mapping = 
               aes(color = class), 
             alpha = .3) + 
  geom_smooth(data = df_subcompact, 
              se = FALSE)

  • Setting se = FALSE hides the confidence band that geom_smooth() draws around the fitted line by default. The band is based on the standard error (se), which measures how uncertain the fitted line is.
