Lecture 2

Data Visualization with ggplot

Byeong-Hak Choe

bchoe@geneseo.edu

SUNY Geneseo

January 26, 2026

🚀 Data Visualization with `ggplot` - First Steps

Grammar of Graphics

A grammar of graphics is a tool that enables us to concisely describe the components of a graphic.

Data Visualization - First Steps

library(tidyverse)
mpg
?mpg

The mpg data frame, provided by ggplot2, contains observations collected by the US Environmental Protection Agency on 38 models of car.
Q. Do cars with big engines use more fuel than cars with small engines?
- displ: a car’s engine size, in liters.
- hwy: a car’s fuel efficiency on the highway, in miles per gallon (mpg).
What does the relationship between engine size and fuel efficiency look like?

Creating a Scatterplot with `ggplot`

ggplot( data = mpg,
        mapping = 
          aes(x = displ, 
              y = hwy) ) + 
  geom_point()

To plot mpg, run the above code to put displ on the x-axis and hwy on the y-axis.

Components in the Grammar of Graphics

ggplot( data = DATA.FRAME,
        mapping = 
          aes( MAPPINGS ) ) + 
  GEOM_FUNCTION()

A ggplot graphic is a mapping of variables in data to aesthetic attributes of geometric objects.
Three Essential Components in ggplot() Graphics:
1. data: data.frame containing the variables of interest.
2. geom_*(): geometric object in the plot (e.g., point, line, bar, histogram, boxplot).
3. aes(): aesthetic attributes of the geometric object (e.g., x-axis, y-axis, color, shape, size, fill) mapped to variables in the data.frame.

Creating a Scatterplot with `ggplot`

ggplot( data = mpg,
        mapping = 
          aes(x = displ, 
              y = hwy) ) + 
  geom_point()

Three Essential Components in This Particular ggplot():
1. data = mpg
2. geom_point()
3. aes(x = displ, y = hwy)

Relationship `ggplot()`

Scatterplot with `geom_point()`

ggplot( data = mpg,
        mapping = 
          aes(x = displ, 
              y = hwy) ) + 
  geom_point()

Fitted Curve with `geom_smooth()`

ggplot( data = mpg,
        mapping = 
          aes(x = displ, 
              y = hwy) ) + 
  geom_smooth()

`geom_point()` with `geom_smooth()`

# To add a layer of 
# a `ggplot()` component, 
# we can simply add it to 
# the `ggplot()` with `+`.

ggplot( data = mpg,
        mapping = 
          aes(x = displ, 
              y = hwy) ) + 
  geom_point() + 
  geom_smooth()

The geometric object geom_smooth() draws a smooth curve fitted to the data.

`ggplot()` workflow

Common problems in `ggplot()`

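# Incorrect: the `+` begins a new line, so R treats
# the plot as complete after geom_point() and then
# fails on the dangling `+ geom_smooth()`.
ggplot(data = mpg,
       mapping = 
          aes(x = displ, 
              y = hwy) ) +
 geom_point()
 + geom_smooth()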

One common problem when creating ggplot2 graphics is to put the + in the wrong place.
- Correct Approach: Always place the + at the end of the previous line, NOT at the beginning of the next line.
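A minimal sketch of the correct placement, where every `+` ends the line it continues. The object name `p` is used only for illustration.

```r
library(ggplot2)

# Each `+` sits at the END of a line, so R knows the expression continues.
p <- ggplot(data = mpg,
            mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

p  # draws the points and the smooth curve together
```

Storing the plot in an object before printing it is optional; it just makes it easy to add more layers later with `p + ...`.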

About `geom_smooth()`

Using regression, one of the machine learning methods, geom_smooth() visualizes the predicted value of the y variable for a given value of the x variable.
What Does the Grey Ribbon Represent?
- The grey ribbon illustrates the uncertainty around the estimated prediction curve.
- By default, the ribbon is a 95% confidence interval: we are 95% confident that the actual relationship between the x and y variables falls within it.
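To see how the ribbon depends on the confidence level, here is a small sketch that compares two values of the `level` argument of `geom_smooth()`. It assumes a linear fit (`method = "lm"`) rather than the default smoother, only to keep the example simple; the object names are illustrative.

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = displ, y = hwy))

# Same linear fit, two confidence levels for the ribbon.
p95 <- base + geom_smooth(method = "lm", formula = y ~ x, level = 0.95)
p50 <- base + geom_smooth(method = "lm", formula = y ~ x, level = 0.50)

# layer_data() returns the values ggplot2 computed for a layer:
# y is the fitted curve, ymin/ymax are the ribbon edges.
ribbon95 <- layer_data(p95, 1)
ribbon50 <- layer_data(p50, 1)

width95 <- mean(ribbon95$ymax - ribbon95$ymin)  # average width of the 95% ribbon
width50 <- mean(ribbon50$ymax - ribbon50$ymin)  # narrower: lower level, tighter band
c(width95, width50)
```

A higher confidence level requires a wider ribbon, which is why the 95% band is wider than the 50% band.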

`geom_point()` with `geom_smooth(method = lm)`

ggplot( data = mpg,
        mapping = 
          aes(x = displ, 
              y = hwy) ) + 
  geom_point() + 
  geom_smooth(method = "lm")

method = "lm" specifies that a linear model (lm), also called a linear regression model, is used to fit the line.

Relationship `ggplot()`

How many points are in this plot?
How many observations are in the mpg data.frame?

Overplotting problem

Many points overlap each other.
- This problem is known as overplotting.
When points overlap, it’s hard to know how many data points are at a particular location.
Overplotting can obscure patterns and outliers, leading to potentially misleading conclusions.
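One rough way to quantify overplotting in this scatterplot is to compare the number of observations with the number of distinct (displ, hwy) positions. The sketch below uses only base R on the mpg data; the variable names are illustrative.

```r
library(ggplot2)  # provides the mpg data

n_obs   <- nrow(mpg)                               # rows in the data
n_spots <- nrow(unique(mpg[, c("displ", "hwy")]))  # distinct (displ, hwy) pairs

n_obs            # 234 observations
n_spots          # distinct positions that can show up as separate points
n_obs - n_spots  # observations hidden behind another point
```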

Overplotting and Transparency with `alpha`

# alpha = 0.33 should be located
# within the geom function,
# NOT within the aesthetic function

ggplot( data = mpg,
        mapping = 
          aes(x = displ, 
              y = hwy) ) + 
  geom_point( alpha = 0.33 )

We can set a transparency level (alpha) between 0 (full transparency) and 1 (no transparency) manually.

Overplotting and Transparency with `alpha`

ggplot( data = mpg,
        mapping = 
          aes(x = displ, 
              y = hwy) ) + 
  geom_point( alpha = .33 )

We can set an aesthetic property manually, as seen above, not within the aes() function but within the geom_*() function.

🎨✨ Aesthetic Mappings

Aesthetic Mappings

In the plot above, one group of points (highlighted in red) seems to fall outside of the linear trend.
- How can you explain these cars? Are those hybrids?

Aesthetic Mappings

An aesthetic is a visual property (e.g., size, shape, color) of the objects (e.g., points) in your plot.
You can display a point in different ways by changing the values of its aesthetic properties.

Adding a `color` to the Plot

ggplot( data = mpg,
        mapping = 
          aes(x = displ,
              y = hwy, 
              color = class) ) + 
  geom_point()

Adding a `shape` to the Plot

ggplot( data = mpg,
        mapping = 
          aes(x = displ,
              y = hwy, 
              shape = class) ) + 
  geom_point()

Adding a `size` to the Plot

ggplot( data = mpg,
        mapping =
          aes(x = displ,
              y = hwy,
              size = class) ) +
  geom_point()

Specifying a `color` to the Plot, Manually

ggplot(data = mpg,
       mapping = 
         aes(x = displ, 
             y = hwy) ) + 
  geom_point(color = "blue")

Specifying a `color` to the Plot, Manually

ggplot(data = mpg,
       mapping = 
         aes(x = displ, 
             y = hwy) ) + 
  geom_smooth(color = "darkorange")

Specifying a `fill` to the Plot, Manually

ggplot(data = mpg,
       mapping = 
         aes(x = displ, 
             y = hwy) ) + 
  geom_smooth(color = "darkorange",
              fill = "darkorange")

In general, each geom_*() has a different set of aesthetic parameters.
- E.g., fill is available for geom_smooth(), but not geom_point() in general (cf., Aesthetics Finder).

Specifying a `size` to the Plot, Manually

ggplot(data = mpg,
       mapping =
         aes(x = displ,
             y = hwy) ) +
  geom_point(size = 3)  # in *mm*

Specifying a `shape` to the Plot, Manually

ggplot(data = mpg,
       mapping =
         aes(x = displ,
             y = hwy) ) +
  # Integers 0-25
  geom_point(shape = 23)

Specifying a `shape` to the Plot, Manually

In ggplot2, point shapes are specified using numbers, as shown above.
- R provides 26 built-in point shapes, identified by integers 0–25.

🎨 How `color` and `fill` interact with `shapes`

Some shapes look similar (e.g., 0, 15, and 22 are all squares), but they behave differently depending on whether the shape supports color and/or fill.

Hollow shapes (0–14)
- Use color for the outline
- Do not use fill
Solid shapes (15–20)
- Use color for the entire point
- There is no separate fill

Filled shapes with outline (21–25)
- Use color for the border
- Use fill for the inside
- These are the most flexible when you want both an outline and a fill color
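As a small illustration of the filled shapes, the sketch below sets shape 21 with separate border and interior colors. The specific colors are arbitrary choices for the example.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(shape = 21,        # filled circle with an outline
             color = "black",   # border
             fill  = "gold",    # interior
             size  = 3)
p

# With a solid shape such as 16, `fill` has no visible effect;
# only `color` is drawn.
```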

Specifying an `alpha` to the Plot, Manually

ggplot(data = mpg,
       mapping = 
         aes(x = displ, 
             y = hwy) ) + 
  geom_point(alpha = .3)

We’ve done this to address the issue of overplotting in the scatterplot.

Mapping Aesthetics vs. Setting Them Manually

Aesthetic Mapping

Links data variables to visible properties on the graph
Different categories → different colors or shapes

Setting Aesthetics Manually

Customize visual properties directly in geom_*() outside of aes()
Useful for setting fixed colors, sizes, or transparency unrelated to data variables
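A common slip is to put a fixed value inside aes(). The sketch below contrasts the two cases and inspects the colors ggplot2 actually uses; the object names are illustrative.

```r
library(ggplot2)

# Setting: a fixed color given outside aes().
p_set <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "blue")

# Mapping by mistake: "blue" is treated as a data value with one category,
# so ggplot2 picks a color from its default palette and adds a legend.
p_map <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = "blue"))

set_colors <- unique(layer_data(p_set)$colour)  # the color that was requested
map_colors <- unique(layer_data(p_map)$colour)  # a palette color instead
set_colors
map_colors
```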

Multiple `data.frames` and `aes()` in ggplot

ggplot(data = mpg, 
       mapping = 
         aes(x = displ, y = hwy)) +
  geom_point(
    mapping = aes(color = class), 
    alpha = 0.3) +
  geom_smooth(
    # local data.frame
    data = mpg |> 
      filter(class == "subcompact"),
    se = FALSE   # standard error
  )

🌍 Global vs. Local `data.frame` in ggplot Layers

ggplot(data = mpg, 
       mapping = 
         aes(x = displ, y = hwy)) +
  geom_point(
    mapping = aes(color = class), 
    alpha = 0.3) +
  geom_smooth(
    # local data.frame
    data = mpg |> 
      filter(class == "subcompact"),  
    se = FALSE   # standard error
  )

In ggplot, you can set global settings in ggplot() (they apply to all layers),
then override them locally inside a specific geom_*() layer.

ggplot(data = mpg, ...) sets the global data to the full mpg data.frame.
geom_point(...) uses the full dataset (mpg).
geom_smooth(data = mpg |> filter(class == "subcompact"), ...) uses local data, so the smooth line is fit only to subcompact cars.
Key rule: a local data = ... inside a geom_*() layer overrides the global data = ...
for that layer only.
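One way to check which data each layer used is to inspect the layers with layer_data(). This sketch assumes a linear fit (`method = "lm"`) instead of the default smoother, only to keep the output quiet, and uses base R's subset() in place of filter().

```r
library(ggplot2)

subcompact <- subset(mpg, class == "subcompact")  # local data for one layer

p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.3) +                  # inherits the global data
  geom_smooth(data = subcompact,             # overrides it for this layer
              method = "lm", formula = y ~ x, se = FALSE)

n_points   <- nrow(layer_data(p, 1))     # one row per car in the full mpg data
line_range <- range(layer_data(p, 2)$x)  # x-range covered by the fitted line
n_points
line_range
range(subcompact$displ)                  # engine sizes of subcompact cars only
```

The fitted line spans only the subcompact engine sizes, while the point layer still contains every car.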

🌍 Global vs. Local `aes()` in ggplot Layers

ggplot(data = mpg, 
       mapping = 
         aes(x = displ, y = hwy)) +
  geom_point(
    mapping = aes(color = class), 
    alpha = 0.3) +
  geom_smooth(
    # local data.frame
    data = mpg |> 
      filter(class == "subcompact"),  
    se = FALSE   # standard error
  )

In ggplot, you can set global settings in ggplot() (they apply to all layers),
then override them locally inside a specific geom_*() layer.

ggplot(mapping = aes(...)) sets the global aesthetic mapping for all layers.
geom_point(aes(...), alpha = .3) adds a local aesthetic mapping (color = class) and a fixed alpha that apply only to the points.
geom_smooth(se = FALSE) sets a layer argument, not an aesthetic: se = FALSE hides the confidence ribbon for the smooth line only.
Key rule: a local aes(...) inside a geom_*() layer adds to or overrides the global aes(...)
for that layer only.
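The difference between a global and a local mapping shows up clearly with color. In this sketch (linear fits are assumed for simplicity), mapping color globally makes geom_smooth() fit one line per class, while mapping it locally in geom_point() leaves a single line.

```r
library(ggplot2)

# color = class set globally: every layer inherits it,
# so geom_smooth() fits one line per class.
p_global <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# color = class set locally in geom_point(): only the points are colored,
# and geom_smooth() fits a single line to all cars.
p_local <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

lines_global <- length(unique(layer_data(p_global, 2)$group))
lines_local  <- length(unique(layer_data(p_local, 2)$group))
c(lines_global, lines_local)  # number of fitted lines in each plot
```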

🧹 Clutter is Your Enemy!

Cognitive Load

Every element on a slide or screen adds to the viewer’s cognitive load —
the mental effort required to process information.
The more elements you include, the more brainpower your audience must spend deciphering the message.
Example: Complex graphs, dense text, or excessive decoration can easily overwhelm viewers.
When cognitive load is too high, people lose focus or misinterpret the main idea.
🎯 Goal: Communicate as much insight as possible with the least mental effort required from the audience.

Why Reduce Clutter?

Clutter: Any visual element that takes up space without improving understanding.
Clutter distracts, slows comprehension, and obscures your main point.
Strive for clarity: Clean, purposeful visuals promote focus and engagement.
- Less clutter → clearer message → more attention to what matters.
✨ Practical Tips
- Keep data visually balanced — avoid crowding one side of the graph.
- Limit overlapping or superimposed elements (e.g., no more than 3–4 lines per plot).
- Use whitespace strategically to let key information breathe.
- Eliminate anything that doesn’t support your story.

Clutter is Your Enemy!

Which one do you prefer?

Log Transformation: Reducing Clutter

Problem: When data points are densely packed, patterns become hard to see.
- This often happens with extreme values or skewed distributions.
- Dense clusters create visual clutter, hiding meaningful relationships.
Solution: Apply a log transformation!
- 📏 Spreads out data — makes points more evenly distributed across the plot.
- 🔍 Reduces outlier impact, revealing clearer patterns and trends.
- 🧠 Improves interpretability — helps audiences grasp relationships faster.
- 🎯 Promotes focused, informative visualization without unnecessary complexity.

A Little Bit of Math for Logarithm

The logarithm function, $y = \log_{b}\,(\,x\,)$, looks like ….

🧮 A Little Bit of Math for Logarithms

Common Logarithm

💻 In R

log10(x)  # Common log (base 10)

$\color{blue}{\log_{10}(x)}$: The base-$10$ logarithm is called the common logarithm.
$\log_{10}(100) = 2$: The base-10 logarithm of 100 is 2, because $10^{2} = 100$.

Natural Logarithm

💻 In R

log(x)    # Natural log (base e)

$\log_{e}(x)$: The base-$e$ logarithm is called the natural logarithm, where $e = 2.718\ldots$ is Euler’s number.
$\color{blue}{\log(x)}$ or $\ln(x)$: Both denote the natural log of $x$.
$\log_{e}(7.389\ldots) = 2$: The natural log of $7.389\ldots$ is 2, because $e^{2} = 7.389\ldots$.
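The two worked examples can be checked directly in R. The `base` argument of log() is shown as an additional option for any other base.

```r
common  <- log10(100)           # 2, because 10^2 = 100
natural <- log(exp(2))          # 2, because exp(2) is e^2 = 7.389...
general <- log(100, base = 10)  # same as log10(100)

c(common, natural, general)
```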

The Use of Logarithms: Handling Skewed Data

Consider a logarithmic scale when a variable is heavily skewed.
- It helps visualize both small and large values effectively.

Without Log Transformation

With Log Transformation

The Use of Logarithms: Focusing on Percentage Change

Consider a logarithmic scale when percentage changes are more meaningful than absolute changes.

Percentage changes are widely used to interpret relative differences:
- Stock prices — show proportional gains or losses relative to initial value.
- Housing prices — highlight comparable market trends across regions.
- GDP growth — expressed as a percentage to reflect economic performance.
- Income levels — a $1,000 increase matters more to lower-income individuals.

🧮 Logarithms and Percentage Change

For a small change in a variable $x$ from $x_{0}$ to $x_{1}$:
\[ \Delta \log(x) = \log(x_{1}) - \log(x_{0}) \approx \frac{x_{1} - x_{0}}{x_{0}} = \frac{\Delta x}{x_{0}} \]

This shows that a change in log(x) approximates the percentage change in x (expressed as a proportion) when the change is small.
- Logs emphasize relative rather than absolute differences.
- ✅ Interpretation: An increase in log(x) of 0.01 corresponds to approximately a 1% increase in x.
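A quick numeric check shows how good the approximation is for a small change and how it breaks down for a large one. The starting value of 100 is an arbitrary choice for illustration.

```r
x0 <- 100

# Small change: 100 -> 101 (a 1% increase)
small_log <- log(101) - log(x0)   # about 0.00995, close to 0.01
small_pct <- (101 - x0) / x0      # exactly 0.01

# Large change: 100 -> 200 (a 100% increase)
large_log <- log(200) - log(x0)   # about 0.693, far from 1
large_pct <- (200 - x0) / x0      # exactly 1

c(small_log, small_pct, large_log, large_pct)
```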

🌍 Example: GDP per Capita vs. Life Expectancy

Linear Scale

Log Scale

✅ Slope = 8.4: A 1-unit increase in log(GDP per capita) is associated with an increase in life expectancy of about 8.4 years.
- A 1% increase in GDP per capita is associated with an increase in life expectancy of about 0.084 years (≈ 30.7 days).
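The arithmetic behind this interpretation can be sketched as follows. It takes the slope of 8.4 from the slide and assumes it was estimated with the natural log of GDP per capita; with a base-10 log the numbers would differ.

```r
slope <- 8.4   # from the slide; assumed to be on the natural-log scale

# Approximation: a 1% increase moves log(x) by about 0.01.
approx_years <- slope * 0.01

# Exact change in log(x) for a 1% increase is log(1.01).
exact_years <- slope * log(1.01)

approx_years * 365   # about 30.7 days
exact_years * 365    # about 30.5 days
```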

Try it out → Classwork 2: Relationship Plots.

`facet_wrap(~ VAR)`

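# Assumes the gapminder package is installed;
# it provides the gapminder data.frame.
library(gapminder)

ggplot(data = gapminder,
       mapping = 
         aes(x = log10(gdpPercap), 
             y = lifeExp)) + 
  geom_point(alpha = .4) + 
  facet_wrap( ~ continent)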

To facet our plot by a single variable, we can use facet_wrap().

`facet_wrap(~ VAR)` with `nrow`

ggplot(data = gapminder,
       mapping = 
         aes(x = log10(gdpPercap), 
             y = lifeExp)) + 
  geom_point(alpha = .4) + 
  facet_wrap( ~ continent,
              nrow = 1)

nrow determines the number of rows to use when laying out the facets.

`facet_wrap(~ VAR)` with `ncol`

ggplot(data = gapminder,
       mapping = 
         aes(x = log10(gdpPercap), 
             y = lifeExp)) + 
  geom_point(alpha = .4) + 
  facet_wrap( ~ continent,
              ncol = 1)

ncol determines the number of columns to use when laying out the facets.

`facet_wrap(~ VAR)` with `scales = "free_x"`

ggplot(data = gapminder,
       mapping = 
         aes(x = log10(gdpPercap), 
             y = lifeExp)) + 
  geom_point(alpha = .4) + 
  facet_wrap( ~ continent,
              scales = "free_x")

scales = "free_x" allows each facet to have its own x-axis scale.

`facet_wrap(~ VAR)` with `scales = "free_y"`

ggplot(data = gapminder,
       mapping = 
         aes(x = log10(gdpPercap), 
             y = lifeExp)) + 
  geom_point(alpha = .4) + 
  facet_wrap( ~ continent,
              scales = "free_y")

scales = "free_y" allows each facet to have its own y-axis scale.

`facet_wrap(~ VAR)` with `scales = "free"`

ggplot(data = gapminder,
       mapping = 
         aes(x = log10(gdpPercap), 
             y = lifeExp)) + 
  geom_point(alpha = .4) + 
  facet_wrap( ~ continent,
              scales = "free")

scales = "free" allows each facet to have its own x-axis and y-axis scales.
Try it out → Classwork 3: Color vs. Facet.

`facet_grid(ROW ~ COL)`

ggplot(data = gapminder,
       mapping =
         aes(x = log10(gdpPercap),
             y = lifeExp)) +
  geom_point(alpha = .4) +
  facet_grid(continent ~ year)

To facet by two variables (rows and columns), use facet_grid(ROW ~ COL).
Each row is a level of continent, and each column is a level of year.
The scales option also works with facet_grid(), but nrow and ncol do not: the layout of the grid is fixed by the levels of the row and column variables.

Time Trend `ggplot()`

NVDA Stock Price Data

The nvda data.frame includes NVIDIA’s stock information from 2019-01-02 to 2025-10-30.

Scatterplot for Time Trend?

path <- 
  "https://bcdanl.github.io/data/nvda_2015_2025.csv"
nvda <- read_csv(path)

ggplot( data = nvda,
        mapping = aes(
          x = Date, 
          y = Close) ) + 
  geom_point(size = .5)

Line Chart with `geom_line()`

path <- 
  "https://bcdanl.github.io/data/nvda_2015_2025.csv"
nvda <- read_csv(path)

ggplot( data = nvda,
        mapping = aes(
          x = Date, 
          y = Close) ) + 
  geom_point(size = .5) +
  geom_line()

geom_line() draws a line by connecting data points in order of the variable on the x-axis.
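The ordering behavior can be seen with a tiny example. The data frame `toy` is hypothetical and deliberately unsorted; geom_path(), which connects points in row order, is included only as a contrast.

```r
library(ggplot2)

# Hypothetical toy data, deliberately NOT sorted by x.
toy <- data.frame(x = c(3, 1, 4, 2),
                  y = c(30, 10, 40, 20))

p_line <- ggplot(toy, aes(x = x, y = y)) + geom_line()  # connects in x order
p_path <- ggplot(toy, aes(x = x, y = y)) + geom_path()  # connects in row order

line_x <- layer_data(p_line)$x   # 1 2 3 4
path_x <- layer_data(p_path)$x   # 3 1 4 2
line_x
path_x
```

For time series this means the rows do not need to be sorted by date before calling geom_line().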

The Connection Principle

We tend to think of objects that are physically connected as part of a group.
Look at this figure.
- Your eyes probably pair the shapes connected by lines rather than similar color, size, or shape!
We frequently leverage the connection principle in line charts, to help our eyes see order in the data.

Line Chart with `geom_line()` and `geom_smooth()`

path <- 
  "https://bcdanl.github.io/data/nvda_2015_2025.csv"
nvda <- read_csv(path)

ggplot( data = nvda,
        mapping = aes(
          x = Date, 
          y = Close) ) + 
  geom_point(size = .5) +
  geom_line() +
  geom_smooth()

geom_smooth() helps reveal underlying long-term trends by smoothing out variability in the observations.

Tech Stocks’ Prices in October

The tech_october data.frame includes stock information about AAPL, MSFT, META, and NVDA in October 2025.

Time Trend of Tech Stock Price

tech_october <- 
  read_csv(
    "https://bcdanl.github.io/data/tech_stocks_2025_10.csv"
    )

ggplot( data = tech_october,
        mapping = aes(
          x = Date, 
          y = Close) ) + 
  geom_line()

Something has gone wrong. What happened?

Time Trend of Tech Stock Price

# `ggplot` needs to be 
# explicitly informed that 
# daily observations are grouped 
# by `Ticker`
# for it to understand 
# the grouping structure

ggplot( data = tech_october,
        mapping = aes(
          x = Date, 
          y = Close,
          color = Ticker) ) + 
  geom_line(linewidth = 2) # thicker lines

We can use the group, color, or linetype aesthetic to tell ggplot about the firm-level grouping structure in the dataset.
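The effect of grouping can be reproduced with a small made-up data set. The data frame `toy` and the tickers "AAA" and "BBB" are hypothetical, chosen only to keep the example self-contained.

```r
library(ggplot2)

# Hypothetical toy data: two tickers observed on the same three dates.
toy <- data.frame(
  Date   = rep(as.Date("2025-10-01") + 0:2, times = 2),
  Ticker = rep(c("AAA", "BBB"), each = 3),
  Close  = c(10, 11, 12, 50, 49, 51)
)

# No grouping: all six rows form ONE line that zigzags between tickers.
p_one <- ggplot(toy, aes(x = Date, y = Close)) + geom_line()

# group = Ticker: one line per ticker (color or linetype would also group).
p_two <- ggplot(toy, aes(x = Date, y = Close, group = Ticker)) + geom_line()

n_one <- length(unique(layer_data(p_one)$group))  # 1
n_two <- length(unique(layer_data(p_two)$group))  # 2
c(n_one, n_two)
```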
Try it out → Classwork 4: Time Trend Plots.

Distribution `ggplot()` - Histograms

Histograms

Histograms are used to visualize the distribution of a numeric variable.
Histograms divide data into bins and count the number of observations in each bin.

Titanic Data

Histograms with `geom_histogram()`

titanic <- 
  read_csv(
    "https://bcdanl.github.io/data/titanic_cleaned.csv")

ggplot(data = titanic,
       mapping = 
         aes(x = age)) + 
  geom_histogram()

geom_histogram() creates a histogram.
- We map the x aesthetic to a variable.

`geom_histogram()` with `bins`

ggplot(data = titanic,
       mapping = 
         aes(x = age)) + 
  geom_histogram(bins = 5)

bins: Specifies the number of bins
Be careful: the number of bins can greatly influence the shape of a histogram.

`geom_histogram()` with `binwidth`

ggplot(data = titanic,
       mapping = 
         aes(x = age)) + 
  geom_histogram(binwidth = 1)

binwidth: Specifies the width of each bin
We choose either the bins option or the binwidth option.
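To compare the two options without downloading data, this sketch uses hwy from the built-in mpg data (an assumption made for convenience; the slides use the Titanic data) and inspects the bins that ggplot2 computes.

```r
library(ggplot2)

p_bins  <- ggplot(mpg, aes(x = hwy)) + geom_histogram(bins = 5)
p_width <- ggplot(mpg, aes(x = hwy)) + geom_histogram(binwidth = 2)

bins_data  <- layer_data(p_bins)
width_data <- layer_data(p_width)

nrow(bins_data)                     # number of bars drawn with bins = 5
bar_widths <- width_data$xmax - width_data$xmin
unique(round(bar_widths, 8))        # every bar is 2 units wide
sum(bins_data$count)                # both versions account for
sum(width_data$count)               # every observation
```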

Customizing the `color` and `fill` Aesthetics

ggplot(data = titanic,
       mapping = 
         aes(x = age)) + 
  geom_histogram(
    binwidth = 2,
    fill = 'lightblue',
    color = 'black')

fill: Fills the bars with a specific color.
color: Adds an outline of a specific color to the bars.

Distribution `ggplot()` - `geom_density()`

Kernel Density Plots

Kernel density plots visualize the distribution of a numeric variable, like a smoothed histogram.
Instead of counting observations in bins, geom_density() estimates a smooth curve of the distribution.
The y-axis is density (the area under the curve integrates to 1).
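The "area equals 1" property can be checked numerically with base R's density(), the same kind of kernel estimate that a density plot draws. The built-in faithful data is used here only for illustration.

```r
x <- faithful$eruptions   # built-in data, used only for illustration
d <- density(x)           # kernel density estimate on a grid of points

# Trapezoid rule: grid step times the average height of neighboring points.
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
area                      # very close to 1
```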

Kernel Density with `geom_density()`

ggplot(data = titanic,
       mapping = aes(x = age)) +
  geom_density()

geom_density() creates a kernel density curve.
- We map the x aesthetic to a numeric variable.

Customizing the `color` and `fill` Aesthetics

ggplot(titanic, aes(x = age)) +
  geom_density(
    fill  = "lightblue",
    color = "black",
    linewidth = 0.8
  )

fill: fills the area under the curve
color: controls the outline color of the curve
linewidth controls outline thickness

Mapping `color` (Outline) to a Group Variable

ggplot(titanic, 
       aes(x = age, 
           color = gender)) +
  geom_density(
    linewidth = 0.9
    )

Mapping color = gender draws multiple curves, one per group.

Mapping `fill` to a Group Variable (Overlapping Distributions)

ggplot(titanic, 
       aes(x = age, 
           fill = gender)) +
  geom_density(
    alpha = 0.35
    )

When multiple filled curves overlap, use transparency (alpha) so you can see both.

Mapping `color` to a Group Variable with `geom_freqpoly()`

ggplot(titanic,
       aes(x = age,
           color = gender)) +
  geom_freqpoly(
    bins = 30,
    linewidth = 0.9
  )

A frequency polygon is a line-based version of a histogram that connects the heights of binned counts across the x-axis.
- geom_freqpoly() does not support the fill aesthetic.

Position Adjustments: Why They Matter

When plotting several densities, position controls how the groups are placed:

position = "identity" (default): curves overlap in the same coordinate system
position = "stack": stack densities on top of each other
position = "fill": stack and normalize so the total height equals 1 at each x

For density plots, position affects how overlapping group distributions are visually combined.

`position = "identity"` (Best for Comparison)

ggplot(titanic, 
       aes(x = age, 
           fill = gender)) +
  geom_density(
    position = "identity",  # default
    alpha = 0.35)

Often the best choice when you want to compare shapes across groups.
Use alpha to reduce clutter.

`position = "stack"` (Composition Emphasis)

ggplot(titanic, 
       aes(x = age, 
           fill = gender)) +
  geom_density(
    position = "stack"
    )

Highlights combined “mass” across groups.
Not ideal for comparing group shapes directly (because the baseline shifts).

`position = "fill"` (Relative Composition Across x)

ggplot(titanic, 
       aes(x = age, 
           fill = gender)) +
  geom_density(
    position = "fill"
    )

At each x-value, the stacked areas sum to 1 (100%).
Useful if you want to see which group dominates at each x value.

`geom_density()` with `adjust`

ggplot(titanic, aes(x = age)) +
  geom_density(adjust = 0.6)

ggplot(titanic, aes(x = age)) +
  geom_density(adjust = 1.8)

Density smoothing depends on a bandwidth (how “wide” each kernel is).
You can tune smoothing with adjust, which multiplies the default bandwidth:
- adjust < 1 makes the curve less smooth
- adjust > 1 makes the curve more smooth
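The meaning of adjust as a bandwidth multiplier can be seen with base R's density(), which accepts the same argument. The built-in faithful data is an assumption made to keep the sketch self-contained.

```r
x <- faithful$eruptions            # built-in data, used only for illustration

d_default <- density(x)            # adjust = 1
d_rough   <- density(x, adjust = 0.6)
d_smooth  <- density(x, adjust = 1.8)

# The bandwidth actually used is the default bandwidth times `adjust`.
ratio_rough  <- d_rough$bw  / d_default$bw   # 0.6
ratio_smooth <- d_smooth$bw / d_default$bw   # 1.8
c(ratio_rough, ratio_smooth)
```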

Density vs Histogram with `fill`

Tip

Visualization tip:
To visualize several distributions at once, density plots will generally work better than histograms (because multiple histograms often overlap and get cluttered).

Density + Histogram Together

Sometimes it helps to overlay a density curve on top of a histogram.

ggplot(titanic, aes(x = age)) +
  geom_histogram(
    aes(y = after_stat(density)),
    bins = 20,
    fill = "lightgray",
    color = "white") +
  geom_density(
    linewidth = 1.0)

aes(y = after_stat(density)) rescales the histogram's y-axis from counts to density, so the bars and the density curve share the same scale.
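To confirm what the rescaling does, this sketch checks that the bars of a density-scaled histogram have a total area of 1. It uses hwy from the built-in mpg data as a stand-in for the Titanic ages.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(aes(y = after_stat(density)), bins = 20)

bars <- layer_data(p)

# Bar height is now a density, so height * width summed over bars is 1,
# the same total area that a density curve has.
total_area <- sum(bars$density * (bars$xmax - bars$xmin))
total_area
```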

🎨 Design with Colorblind in Mind

Design with Colorblind in Mind

Types of Colorblindness

About 8% of men and 0.5% of women experience some form of colorblindness.
To make visualizations more accessible and colorblind-friendly, consider:
1. Using colorblind-friendly palettes
2. Adding shape to scatterplots or linetype to line charts
3. Including additional visual cues to highlight important information (e.g., annotations or labels)

`ggthemes` package

install.packages("ggthemes")
library(ggthemes)

The ggthemes package offers a variety of additional themes and color scales for enhancing ggplot2 visualizations:
- Accessible color palettes, including options designed for colorblind-friendly viewing
  - For the color aesthetic mapping: scale_color_colorblind() or scale_color_tableau()
  - For the fill aesthetic mapping: scale_fill_colorblind() or
    scale_fill_tableau()
- Predefined thematic styles inspired by well-known publications and design aesthetics (e.g., theme_economist(), theme_wsj())

`ggthemes::scale_color_colorblind()`

ggplot( data = mpg,
        mapping = 
          aes(x = displ,
              y = hwy, 
              color = class) ) + 
  geom_point(size = 3) +
  scale_color_colorblind()

When color is mapped in aes(), we can choose its palette with a scale_color_*() function.

`ggthemes::scale_color_tableau()`

ggplot( data = mpg,
        mapping = 
          aes(x = displ,
              y = hwy, 
              color = class) ) + 
  geom_point(size = 3) +
  scale_color_tableau()

scale_color_tableau() provides color palettes used in Tableau.

💡 Quick Detour

`ggthemes::theme_economist()`

ggplot( data = mpg,
        mapping = 
          aes(x = displ,
              y = hwy, 
              color = class) ) + 
  geom_point(size = 3) +
  scale_color_colorblind() +
  theme_economist()

theme_economist() approximates the style of The Economist.

💡 Quick Detour

`ggthemes::theme_wsj()`

ggplot( data = mpg,
        mapping = 
          aes(x = displ,
              y = hwy, 
              color = class) ) + 
  geom_point(size = 3) +
  scale_color_colorblind() +
  theme_wsj()

theme_wsj() approximates the style of The Wall Street Journal.

Distribution `ggplot()` - Boxplots

Boxplots

Boxplots visualize how the distribution of a numeric variable varies across levels of a categorical variable.
They display the median, quartiles, and potential outliers, providing a compact summary of the data.

Boxplots with `geom_boxplot()`

ggplot(data = mpg,
       mapping = 
         aes(x = class,
             y = hwy)) + 
  geom_boxplot()

geom_boxplot() creates a boxplot.
- Mappings: map one numeric variable and one categorical variable to the x and y aesthetics.

Horizontal Boxplots

ggplot(data = mpg,
       mapping = 
         aes(x = hwy,
             y = class)) + 
  geom_boxplot()

Boxplots can be displayed horizontally or vertically.
- A horizontal boxplot works well when category names are long.

Customizing the `fill` Aesthetic

# 1. `show.legend = FALSE` turns off 
#     the legend information
# 2. `scale_fill_colorblind()` or
#    `scale_fill_tableau()`
#     applies a color-blind friendly 
#     palette to the `fill` aesthetic
# install.packages("ggthemes")
library(ggthemes) 
ggplot(data = mpg,
       mapping = 
         aes(x = hwy,
             y = class,
             fill = class)) + 
  geom_boxplot(
    show.legend = FALSE) +
  scale_fill_tableau()

fill: Maps a variable to the fill colors used in the boxplot.
ggthemes::scale_fill_tableau(): A colorblind-friendly Tableau-style palette for the fill aesthetic.

Sorted Boxplots with `fct_reorder(CATEGORICAL, NUMERICAL)`

# labs() can label
#   x-axis, y-axis, and more

ggplot(data = mpg,
       mapping = 
        aes(x = hwy,
            y = 
             fct_reorder(class, hwy),
            fill = class)) + 
  geom_boxplot(
    show.legend = FALSE) +
  scale_fill_tableau() +
  labs(x = "Highway MPG",
       y = "Class")

fct_reorder(CATEGORICAL, NUMERICAL): Reorders the categories of the CATEGORICAL by the median of the NUMERICAL.
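A short check of what fct_reorder() does to the category order, comparing its result with medians computed directly. The variable names are illustrative.

```r
library(ggplot2)  # for the mpg data
library(forcats)  # for fct_reorder()

reordered <- fct_reorder(mpg$class, mpg$hwy)   # default summary is the median

levels(factor(mpg$class))   # alphabetical order
levels(reordered)           # ordered from lowest to highest median hwy

# Check against medians computed directly.
medians <- tapply(mpg$hwy, mpg$class, median)
ordered_medians <- medians[levels(reordered)]
ordered_medians             # non-decreasing
```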

Distribution `ggplot()` - Bar Charts

Bar Charts

Bar charts are used to visualize the distribution of a categorical variable.
Bar charts display the count (or proportion) of observations for each category.

Diamond Data

ggplot2::diamonds is a data.frame containing the prices and other attributes of almost 54,000 diamonds.

Bar Charts with `geom_bar()`

ggplot(data = diamonds,
       mapping = aes(x = cut)) + 
  geom_bar()

geom_bar() creates a bar chart.
- We map either the x or y aesthetic to the variable.

Horizontal Bar Charts

ggplot(data = diamonds,
       mapping = aes(y = cut)) + 
  geom_bar()

Bar charts can be horizontal or vertical.
- A horizontal bar chart is a good option for long category names.

Data Transformation - `count()`: Counting Occurrences of Each Category in a Categorical Variable

The figure below demonstrates how the counting process works with geom_bar().

Colorful Bar Charts with the `fill` Aesthetic

ggplot(data = diamonds,
       mapping = 
         aes(x = cut, 
             fill = cut)) + 
  geom_bar(
    show.legend = FALSE
    )

We can color bar charts using the fill aesthetic.

Data Transformation - `count()`: Counting Occurrences Across Two Categorical Variables

DATA.FRAME |> count(CATEGORICAL_VARIABLE_1, CATEGORICAL_VARIABLE_2)

The data transformation function count() calculates the frequency of each unique combination of values across two categorical variables.

diamonds |> count(cut, clarity)

diamonds |> count(cut, clarity) returns the data.frame with the three variables, cut, clarity, and n:
- n: the number of occurrences of each unique combination of values in cut and clarity
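A quick sanity check on the result of count(): the counts should add up to the number of rows in the original data.

```r
library(ggplot2)  # for the diamonds data
library(dplyr)    # for count()

counts <- diamonds |> count(cut, clarity)

names(counts)     # "cut" "clarity" "n"
sum(counts$n)     # equals nrow(diamonds): every diamond is counted once
nrow(diamonds)
```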

Stacked Bar Charts with the `fill` Aesthetic

# Mapping the `fill` aesthetic 
# to other CATEGORICAL variable
# gives a stacked bar chart

ggplot(data = diamonds,
       mapping = 
         aes(x = cut, 
             fill = clarity)) + 
  geom_bar()

This describes how the distribution of clarity varies by cut, with total bar height for overall count and segments for each clarity level.

100% Stacked Bar Charts with the `fill` Aesthetic & `position="fill"`

ggplot(data = diamonds,
       mapping = 
         aes(x = cut, 
             fill = clarity)) + 
  geom_bar(position = "fill") +
  labs(y = "Proportion")

This describes how the distribution of clarity varies by cut, displaying the proportion of each clarity within each cut.

Clustered Bar Charts with the `fill` Aesthetic & `position="dodge"`

ggplot(data = diamonds,
       mapping = 
         aes(x = cut, 
             fill = clarity)) + 
  geom_bar(position = "dodge")

This shows how the distribution of clarity varies by cut, with separate bars for each clarity level within each cut category.

Clustered Bar Charts with the `fill` Aesthetic & `position_dodge2(preserve = "single")`

library(nycflights13)
ggplot(data = flights,
       mapping = 
         aes(y = carrier, 
             fill = origin)) + 
  geom_bar(position = "dodge")

ggplot(data = flights,
       mapping = 
         aes(y = carrier, 
             fill = origin)) + 
 geom_bar(position = position_dodge2(
               preserve = "single"))

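position_dodge2(preserve = "single") is useful when some groups have missing categories: every bar keeps the same width instead of stretching to fill its group, and position_dodge2() adds spacing between bars.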

Stacked Bar Charts using the `fill` Aesthetic and `position = "stack"`

ggplot(data = diamonds,
       mapping = 
         aes(x = cut, 
             fill = clarity)) + 
  geom_bar(position = "stack")

The default position option is position = "stack"

Proportion Bar Charts with `geom_bar()`

ggplot(data = diamonds,
       mapping = 
         aes(x = cut, 
             y = after_stat(prop),
             group = 1)) + 
  geom_bar()

after_stat(prop): Calculates the proportion of the total count.
group = 1: Ensures the proportions are calculated over the entire data.frame, not within each group of cut
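The role of group = 1 is easiest to see by inspecting the computed proportions with and without it. The object names are illustrative.

```r
library(ggplot2)

with_group <- ggplot(diamonds,
                     aes(x = cut, y = after_stat(prop), group = 1)) +
  geom_bar()
no_group <- ggplot(diamonds,
                   aes(x = cut, y = after_stat(prop))) +
  geom_bar()

prop_grouped   <- layer_data(with_group)$prop  # shares of all diamonds
prop_ungrouped <- layer_data(no_group)$prop    # each cut is its own group
prop_grouped     # sums to 1
prop_ungrouped   # every value is 1
```

Without group = 1, every bar would have the same height of 1, which is rarely what is wanted.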

Bar Charts with `geom_col()`

df <- mpg |> 
  count(class)

ggplot(data = df,
       mapping = 
         aes(x = n, 
             y = class)) + 
  geom_col()

geom_col() creates bar charts where the length of each bar directly represents the values in a column of a given data.frame.
- geom_col() requires both the x and y aesthetics.
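The relationship between geom_bar() and geom_col() can be sketched by building the same chart both ways and comparing the bar heights. The object names are illustrative.

```r
library(ggplot2)
library(dplyr)

# geom_bar(): ggplot2 counts the rows of each class for us.
p_bar <- ggplot(mpg, aes(x = class)) + geom_bar()

# geom_col(): we count first, then plot the stored values.
df <- mpg |> count(class)
p_col <- ggplot(df, aes(x = class, y = n)) + geom_col()

bar_heights <- layer_data(p_bar)$count  # heights computed by geom_bar()
col_heights <- layer_data(p_col)$y      # heights read from the column n
bar_heights
col_heights
```

The two plots are identical; geom_col() is the better fit when the values to plot are already summarized in the data.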

Sorted Bar Charts with `fct_reorder(CATEGORICAL, NUMERICAL)`

df <- mpg |> 
  count(class)

ggplot(data = df,
       mapping = 
         aes(x = n,
             y = 
               fct_reorder(class, n))
       ) + 
  geom_col() +
  labs(y = "Class")

fct_reorder(CATEGORICAL, NUMERICAL): Reorders the categories of CATEGORICAL by the median of NUMERICAL within each category, in ascending order by default.
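To see what fct_reorder() changes, it can help to print the level order before and after. The sketch below reuses the counted mpg data; the object names are illustrative.

```r
library(dplyr)
library(forcats)
library(ggplot2)  # provides the mpg data

df <- mpg |>
  count(class)

# Default order: alphabetical
alphabetical <- levels(factor(df$class))

# Ascending by n: the class with the fewest cars comes first
ascending <- levels(fct_reorder(df$class, df$n))

# Descending by n, using the .desc argument
descending <- levels(fct_reorder(df$class, df$n, .desc = TRUE))

alphabetical
ascending
descending
```

In a horizontal bar chart the first level is drawn at the bottom, so the ascending order places the longest bar at the top.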

📊 Effective Distribution Charts for Category Comparisons

Choosing the Right Bar Charts for Comparing Components and Totals

Which type of bar chart is most effective for your data?
Which type of bar chart best meets your visualization goals?

Stacked Bar Charts

Purpose: Show the composition of each category while still conveying the overall total.
When to use: Your primary focus is the total bar height, with a secondary interest in how subcomponents contribute.
Be cautious: Precise comparisons across subcomponents are difficult because segments do not share a common baseline.

Tip: If you need to emphasize totals plus composition, a stacked bar chart is a strong choice.

100% Stacked Bar Charts

Purpose: Display the proportional composition of each category by normalizing all bars to the same height.
When to use: Your focus is on comparing relative percentages across categories rather than absolute totals.
Be cautious: While proportions are clear, absolute sizes or totals are not visible.

Tip: Choose a 100% stacked bar chart when you want to emphasize proportional differences across categories.
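The numbers behind a 100% stacked bar chart are within-category proportions. As a rough sketch using the diamonds data (clarity_share is an illustrative name), they can be computed like this:

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data

# Share of each clarity level within each cut
clarity_share <- diamonds |>
  count(cut, clarity) |>
  group_by(cut) |>
  mutate(prop = n / sum(n)) |>
  ungroup()

clarity_share
```

Within each cut the prop values sum to 1, which is why every bar reaches the same height and the totals are no longer visible.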

Clustered Bar Charts

Purpose: Plot each subcomponent as a separate bar within each category for side-by-side comparison.
When to use: Your focus is on comparing individual subcomponents across categories.
Be cautious: Can become crowded with many categories or subcomponents.

Tip: Choose a clustered bar chart when you want to emphasize precise comparisons between subcomponents—both within each cluster and across clusters.

Pie Charts: Alternative to Bar Charts?

Pie charts display the proportions of a whole.
- Each slice represents a part of the total.
- But is a pie chart an effective alternative to a bar chart?

Pie Charts: Alternative to Bar Charts?

Humans are generally better at judging lengths than angles.

Still, pie charts can be useful in certain cases.

Pie charts work best when there are very few categories—ideally four or fewer.
Pie charts are effective when the goal is to highlight simple, recognizable fractions (e.g., 25%, 50%, 75%).

Pie Charts: Alternative to Bar Charts?

Pie charts are not ideal when the audience needs to compare the size of individual shares across categories.

Pie charts are not ideal when the goal is to compare the overall distribution of categories.
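For reference, ggplot2 has no dedicated pie geom; a common approach is to draw a single stacked bar and switch to polar coordinates. The sketch below assumes the drv variable from mpg (three categories) as an example and builds the matching bar chart for comparison.

```r
library(dplyr)
library(ggplot2)

# Count cars by drive train (three categories)
drv_counts <- mpg |>
  count(drv)

# Pie chart: one stacked bar wrapped around a circle
p_pie <- ggplot(drv_counts, aes(x = "", y = n, fill = drv)) +
  geom_col(width = 1) +
  coord_polar(theta = "y") +
  theme_void()

# Bar chart of the same counts
p_bar <- ggplot(drv_counts, aes(x = drv, y = n)) +
  geom_col()

p_pie
p_bar
```

Viewing both versions of the same counts is a quick way to judge which one makes the comparison easier to read.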

🔢 Understanding and Visualizing Integer Data

Numeric or Categorical?

Treat as Numeric (Continuous)

When the integer reflects a magnitude, intensity, or a meaningful ordered scale.
Examples:
- Age (18, 19, 20, 21, …)
- MPG values (27, 28, 29, 30, …)
- Temperature readings (whole-number data)

Treat as Categorical (Discrete)

When the integer is simply a label for a category, and the numeric order does not represent meaningful numeric differences.
Examples:
- Month (1–12)
- Day of week (1–7)
- ZIP codes
- Student ID numbers
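In ggplot, an integer column is treated as numeric unless it is converted to a factor. The sketch below uses a made-up data frame (orders, with a hypothetical month column) to show the conversion; month.abb is the built-in vector of month abbreviations in R.

```r
library(ggplot2)

# Hypothetical data: month stored as an integer code
orders <- data.frame(month = c(1, 1, 2, 3, 3, 3, 12, 12))

# Numeric treatment: month is placed on a continuous axis
p_numeric <- ggplot(orders, aes(x = month)) +
  geom_bar()

# Categorical treatment: convert the codes to a labelled factor first
orders$month_f <- factor(orders$month, levels = 1:12, labels = month.abb)
p_categorical <- ggplot(orders, aes(x = month_f)) +
  geom_bar()

p_numeric
p_categorical
```

With the factor version, the axis shows one labelled position per month present in the data instead of a number line.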

Histograms or Bar Charts?

Histograms for the `Age` variable

Bar Charts for the `Age` variable

In ggplot, the distribution of an integer variable can look quite similar whether using geom_histogram() or geom_bar().
As shown above, in Python and other tools, these visualizations can behave differently, leading to noticeably different outputs.
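The similarity in ggplot can be checked on any integer-valued column. As a sketch, the code below uses hwy from mpg as a stand-in for Age (an assumption made only so that the example is self-contained) and sets binwidth = 1 so that each histogram bin covers exactly one integer.

```r
library(ggplot2)

# Histogram with one bin per integer value
p_hist <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 1)

# Bar chart: one bar per distinct value
p_bar <- ggplot(mpg, aes(x = hwy)) +
  geom_bar()

# Counts behind each chart; the histogram also keeps empty bins
hist_counts <- layer_data(p_hist)$count
bar_counts  <- layer_data(p_bar)$count

p_hist
p_bar
```

With other bin widths the histogram groups several integers into one bin, and the two charts no longer match.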
