Lecture 10

Data Visualization with ggplot

Byeong-Hak Choe

SUNY Geneseo

November 5, 2025

🚀 Data Visualization with ggplot - First Steps

Grammar of Graphics

  • A grammar of graphics is a tool that enables us to concisely describe the components of a graphic.

Data Visualization - First Steps

library(tidyverse)
mpg
?mpg
  • The mpg data frame, provided by ggplot2, contains observations collected by the US Environmental Protection Agency on 38 models of car.

  • Q. Do cars with big engines use more fuel than cars with small engines?

    • displ: a car’s engine size, in liters.
    • hwy: a car’s fuel efficiency on the highway, in miles per gallon (mpg).
  • What does the relationship between engine size and fuel efficiency look like?

Creating a Scatterplot with ggplot

ggplot( data = mpg,
        mapping = 
          aes(x = displ, 
              y = hwy) ) + 
  geom_point()

  • To plot mpg, run the above code to put displ on the x-axis and hwy on the y-axis.

Components in the Grammar of Graphics

ggplot( data = DATA.FRAME,
        mapping = 
          aes( MAPPINGS ) ) + 
  GEOM_FUNCTION()
  • A ggplot graphic is a mapping of variables in data to aesthetic attributes of geometric objects.

  • Three Essential Components in ggplot() Graphics:

    1. data: data.frame containing the variables of interest.
    2. geom_*(): geometric object in the plot (e.g., point, line, bar, histogram, boxplot).
    3. aes(): aesthetic attributes of the geometric object (e.g., x-axis, y-axis, color, shape, size, fill) mapped to variables in the data.frame.

Creating a Scatterplot with ggplot

ggplot( data = mpg,
        mapping = 
          aes(x = displ, 
              y = hwy) ) + 
  geom_point()
  • Three Essential Components in This Particular ggplot():
    1. data = mpg
    2. geom_point()
    3. aes(x = displ, y = hwy)

Relationship ggplot()

Scatterplot with geom_point()

ggplot( data = mpg,
        mapping = 
          aes(x = displ, 
              y = hwy) ) + 
  geom_point()

Fitted Curve with geom_smooth()

ggplot( data = mpg,
        mapping = 
          aes(x = displ, 
              y = hwy) ) + 
  geom_smooth()

geom_point() with geom_smooth()

# To add a layer of 
# a `ggplot()` component, 
# we can simply add it to 
# the `ggplot()` with `+`.

ggplot( data = mpg,
        mapping = 
          aes(x = displ, 
              y = hwy) ) + 
  geom_point() + 
  geom_smooth()

  • The geometric object geom_smooth() draws a smooth curve fitted to the data.

ggplot() workflow

Common problems in ggplot()

ggplot(data = mpg,
       mapping = 
          aes(x = displ, 
              y = hwy) ) +
 geom_point()
 + geom_smooth()
  • One common problem when creating ggplot2 graphics is to put the + in the wrong place.
    • Correct Approach: Always place the + at the end of the previous line, NOT at the beginning of the next line.
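For comparison, here is a minimal sketch of the same plot with every `+` at the end of a line. The object name `p` is only for illustration.

```r
library(ggplot2)

# Each line that continues the plot ends with `+`,
# so R knows the expression is not finished yet.
p <- ggplot(data = mpg,
            mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

p  # print the plot
```

When a line starts with `+`, R has already treated the previous line as a complete plot, so the leading `+` has nothing to attach to and R reports an error.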

About geom_smooth()

  • geom_smooth() uses regression, a common machine learning method, to visualize the predicted value of the y variable for each value of the x variable.

  • What Does the Grey Ribbon Represent?

    • The grey ribbon illustrates the uncertainty around the estimated prediction curve.
    • We are 95% confident that the actual relationship between x and y variables falls within the grey ribbon.
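The ribbon can be adjusted through two arguments of geom_smooth(), `se` and `level`. The sketch below is one way to try them; the object names (`base`, `p_default`, `p_no_ribbon`, `p_wide`) are illustrative.

```r
library(ggplot2)

base <- ggplot(data = mpg,
               mapping = aes(x = displ, y = hwy)) +
  geom_point()

# Default: a 95% confidence ribbon (se = TRUE, level = 0.95)
p_default <- base + geom_smooth()

# Hide the ribbon entirely
p_no_ribbon <- base + geom_smooth(se = FALSE)

# Ask for a 99% confidence level, which gives a wider ribbon
p_wide <- base + geom_smooth(level = 0.99)
```

A higher confidence level always widens the ribbon, because more certainty requires a larger range.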

geom_point() with geom_smooth(method = lm)

ggplot( data = mpg,
        mapping = 
          aes(x = displ, 
              y = hwy) ) + 
  geom_point() + 
  geom_smooth(method = "lm")

  • method = "lm" specifies that a linear model (lm), also called a linear regression model, is used to fit a straight line to the data.
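To see the numbers behind the straight line, the same model can be fitted directly with base R's lm(). This is a small sketch; `fit` is an illustrative name.

```r
library(ggplot2)  # for the mpg data

# Fit the same straight line that geom_smooth(method = "lm") draws
fit <- lm(hwy ~ displ, data = mpg)

coef(fit)  # intercept and slope of the fitted line
```

The slope tells us how much hwy is predicted to change when displ increases by one liter.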

Relationship ggplot()

  • How many points are in this plot?
  • How many observations are in the mpg data.frame?

Overplotting problem

  • Many points overlap each other.

    • This problem is known as overplotting.
  • When points overlap, it’s hard to know how many data points are at a particular location.

  • Overplotting can obscure patterns and outliers, leading to potentially misleading conclusions.
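One way to check how much overplotting there is: compare the number of observations with the number of distinct (displ, hwy) locations. The sketch below also uses geom_count(), which is not covered elsewhere in this lecture and is shown only as one possible alternative; `n_obs` and `n_points` are illustrative names.

```r
library(ggplot2)
library(dplyr)

n_obs    <- nrow(mpg)                               # observations in the data
n_points <- mpg |> distinct(displ, hwy) |> nrow()   # distinct plotted locations

n_obs
n_points

# geom_count() sizes each point by how many observations share its location
ggplot(data = mpg,
       mapping = aes(x = displ, y = hwy)) +
  geom_count()
```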

Overplotting and Transparency with alpha

# alpha = 0.33 should be located
# within the geom function,
# NOT within the aesthetic function

ggplot( data = mpg,
        mapping = 
          aes(x = displ, 
              y = hwy) ) + 
  geom_point( alpha = 0.33 ) 

  • We can set a transparency level (alpha) between 0 (full transparency) and 1 (no transparency) manually.

Overplotting and Transparency with alpha

ggplot( data = mpg,
        mapping = 
          aes(x = displ, 
              y = hwy) ) + 
  geom_point( alpha = .33 )

  • We can set an aesthetic property manually, as seen above, not within the aes() function but within the geom_*() function.

🎨✨ Aesthetic Mappings

Aesthetic Mappings

  • In the plot above, one group of points (highlighted in red) seems to fall outside of the linear trend.

    • How can you explain these cars? Are those hybrids?

Aesthetic Mappings

  • An aesthetic is a visual property (e.g., size, shape, color) of the objects (e.g., points) in your plot.

  • You can display a point in different ways by changing the values of its aesthetic properties.

Adding a color to the Plot

ggplot( data = mpg,
        mapping = 
          aes(x = displ,
              y = hwy, 
              color = class) ) + 
  geom_point()

Adding a shape to the Plot

ggplot( data = mpg,
        mapping = 
          aes(x = displ,
              y = hwy, 
              shape = class) ) + 
  geom_point()

Adding a size to the Plot

ggplot( data = mpg,
        mapping =
          aes(x = displ,
              y = hwy,
              size = class) ) +
  geom_point()

Specifying a color to the Plot, Manually

ggplot(data = mpg,
       mapping = 
         aes(x = displ, 
             y = hwy) ) + 
  geom_point(color = "blue")

Specifying a color to the Plot, Manually

ggplot(data = mpg,
       mapping = 
         aes(x = displ, 
             y = hwy) ) + 
  geom_smooth(color = "darkorange") 

Specifying a fill to the Plot, Manually

ggplot(data = mpg,
       mapping = 
         aes(x = displ, 
             y = hwy) ) + 
  geom_smooth(color = "darkorange",
              fill = "darkorange") 

  • In general, each geom_*() has a different set of aesthetic parameters.
    • E.g., fill colors the ribbon in geom_smooth(), but it has no visible effect on the default point shape in geom_point().

Specifying a size to the Plot, Manually

ggplot(data = mpg,
       mapping =
         aes(x = displ,
             y = hwy) ) +
  geom_point(size = 3)  # in *mm*

Specifying an alpha to the Plot, Manually

ggplot(data = mpg,
       mapping = 
         aes(x = displ, 
             y = hwy) ) + 
  geom_point(alpha = .3) 

  • We’ve done this to address the issue of overplotting in the scatterplot.

Mapping Aesthetics vs. Setting Them Manually

Aesthetic Mapping

  • Links data variables to visible properties on the graph
  • Different categories → different colors or shapes

Setting Aesthetics Manually

  • Customize visual properties directly in geom_*() outside of aes()
  • Useful for setting fixed colors, sizes, or transparency unrelated to data variables
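A common mix-up is to put a fixed color inside aes(). The sketch below contrasts the two cases and uses layer_data() to inspect the colors that ggplot2 actually assigned; `p_set` and `p_mapped` are illustrative names.

```r
library(ggplot2)

# Setting: every point is drawn in blue
p_set <- ggplot(data = mpg,
                mapping = aes(x = displ, y = hwy)) +
  geom_point(color = "blue")

# Mapping by mistake: "blue" is treated as a variable with one value,
# so ggplot2 picks a color from its default palette and adds a legend
p_mapped <- ggplot(data = mpg,
                   mapping = aes(x = displ, y = hwy, color = "blue")) +
  geom_point()

unique(layer_data(p_set)$colour)     # "blue"
unique(layer_data(p_mapped)$colour)  # a default palette color, not "blue"
```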

🧹 Clutter is Your Enemy!

Cognitive Load

  • Every element on a slide or screen adds to the viewer’s cognitive load: the mental effort required to process information.
  • The more elements you include, the more brainpower your audience must spend deciphering the message.
  • Example: Complex graphs, dense text, or excessive decoration can easily overwhelm viewers.
  • When cognitive load is too high, people lose focus or misinterpret the main idea.
  • 🎯 Goal: Communicate as much insight as possible with the least mental effort required from the audience.

Why Reduce Clutter?

  • Clutter: Any visual element that takes up space without improving understanding.
  • Clutter distracts, slows comprehension, and obscures your main point.
  • Strive for clarity: Clean, purposeful visuals promote focus and engagement.
    • Less clutter → clearer message → more attention to what matters.
  • Practical Tips
    • Keep data visually balanced — avoid crowding one side of the graph.
    • Limit overlapping or superimposed elements (e.g., no more than 3–4 lines per plot).
    • Use whitespace strategically to let key information breathe.
    • Eliminate anything that doesn’t support your story.

Clutter is Your Enemy!

  • Which one do you prefer?

Log Transformation: Reducing Clutter

  • Problem: When data points are densely packed, patterns become hard to see.
    • This often happens with extreme values or skewed distributions.
    • Dense clusters create visual clutter, hiding meaningful relationships.
  • Solution: Apply a log transformation!
    • 📏 Spreads out data — makes points more evenly distributed across the plot.
    • 🔍 Reduces outlier impact, revealing clearer patterns and trends.
    • 🧠 Improves interpretability — helps audiences grasp relationships faster.
    • 🎯 Promotes focused, informative visualization without unnecessary complexity.
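As a rough illustration, the sketch below uses the diamonds data from ggplot2 (a stand-in chosen here, not the dataset shown on the slide), where carat and price are both right-skewed. scale_x_log10() and scale_y_log10() put the axes on a base-10 log scale.

```r
library(ggplot2)

# Raw scale: most diamonds crowd into the lower-left corner
p_raw <- ggplot(data = diamonds,
                mapping = aes(x = carat, y = price)) +
  geom_point(alpha = 0.1)

# Log scale on both axes: the same points spread out more evenly
p_log <- p_raw +
  scale_x_log10() +
  scale_y_log10()
```

An alternative is to transform inside aes(), e.g. aes(x = log10(carat)); the scale functions keep the axis labels in the original units.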

A Little Bit of Math for Logarithm

  • The logarithm function, \(y = \log_{b}\,(\,x\,)\), looks like ….

🧮 A Little Bit of Math for Logarithms

Common Logarithm

💻 In R

log10(x)  # Common log (base 10)
  • \(\color{blue}{\log_{10}(x)}\) The base-\(10\) logarithm is called the common logarithm.

  • \(\log_{10}(100)\)
    The base-10 logarithm of 100 is 2, because
    \(10^{2} = 100\).

Natural Logarithm

💻 In R

log(x)    # Natural log (base e)
  • \(\log_{e}(x)\)
    The base-\(e\) logarithm is called the natural logarithm,
    where \(e = 2.718\ldots\) is Euler’s number.

  • \(\color{blue}{\log(x)}\) or \(\ln(x)\)
    Both denote the natural log of \(x\).

  • \(\log_{e}(7.389\ldots)\)
    The natural log of 7.389⋯ is 2, because
    \(e^{2} = 7.389\ldots\).
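The two worked examples above can be checked directly in R. The `base` argument of log() is included as an extra for other bases.

```r
log10(100)         # 2, because 10^2 = 100
log(exp(2))        # 2, because e^2 = 7.389...
exp(1)             # Euler's number, e = 2.718...
log(8, base = 2)   # 3, because 2^3 = 8
```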

The Use of Logarithms: Handling Skewed Data

  1. Consider a logarithmic scale when a variable is heavily skewed.
    • It helps visualize both small and large values effectively.

Without Log Transformation

With Log Transformation

The Use of Logarithms: Focusing on Percentage Change

  2. Consider a logarithmic scale when percentage changes are more meaningful than absolute changes.
  • Percentage changes are widely used to interpret relative differences:
    • Stock prices — show proportional gains or losses relative to initial value.
    • Housing prices — highlight comparable market trends across regions.
    • GDP growth — expressed as a percentage to reflect economic performance.
    • Income levels — a $1,000 increase matters more to lower-income individuals.

🧮 Logarithms and Percentage Change

For a small change in a variable \(x\) from \(x_{0}\) to \(x_{1}\):

\[ \Delta \log(x) = \log(x_{1}) - \log(x_{0}) \approx \frac{x_{1} - x_{0}}{x_{0}} = \frac{\Delta x}{x_{0}} \]

  • This shows that a log transformation approximates percentage change when the change is small.
    • Logs emphasize relative rather than absolute differences.
    • Interpretation: An increase in log(x) of 0.01 corresponds to approximately a 1% increase in x.
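A quick numeric check of the approximation, using made-up values (`x0` = 100 and `x1` = 101 are illustrative):

```r
x0 <- 100
x1 <- 101                      # a 1% increase

log_change <- log(x1) - log(x0)   # change in the natural log
pct_change <- (x1 - x0) / x0      # exact proportional change, 0.01

log_change
pct_change

# The approximation gets worse as the change grows:
log(150) - log(100)   # about 0.405, while the true change is 0.50
```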

🌍 Example: GDP per Capita vs. Life Expectancy

Linear Scale

Log Scale

  • Slope = 8.4 : \(\;\) A 1-unit increase of log(GDP per capita) is associated with an increase in life expectancy of about 8.4 years.
    • A 1% increase in GDP per capita is associated with an increase in life expectancy of about 0.084 years (≈ 30.7 days).
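The arithmetic behind this interpretation can be sketched as follows, assuming the slope of 8.4 was estimated with the natural log of GDP per capita (the slide does not state the base). The variable names are illustrative.

```r
slope <- 8.4                  # slope reported above

# A 1% increase in GDP per capita raises log(GDP per capita) by log(1.01)
years <- slope * log(1.01)    # close to 8.4 * 0.01 = 0.084
days  <- years * 365

years
days
```

The shortcut "divide the slope by 100" works because log(1.01) is very close to 0.01.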

Relationship ggplot()

Try it out: Classwork 11 (Relationship Plots).

🖼️🖼️ Facets

Facets

  • Adding too many aesthetics (e.g., color, shape, size) can sometimes make a plot cluttered and hard to interpret.
  • To incorporate an additional variable, especially a categorical one, we can use facets — separate subplots that each show a subset of the data.
    • Faceting helps reveal patterns within groups while keeping each plot clean and focused.

facet_wrap(~ VAR)

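# Assumes the gapminder data.frame comes from the gapminder package
# install.packages("gapminder")
library(gapminder)

ggplot(data = gapminder,
       mapping = 
         aes(x = log10(gdpPercap), 
             y = lifeExp)) + 
  geom_point(alpha = .4) + 
  facet_wrap( ~ continent)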

  • To facet our plot by a single variable, we can use facet_wrap().

facet_wrap(~ VAR) with nrow

ggplot(data = gapminder,
       mapping = 
         aes(x = log10(gdpPercap), 
             y =lifeExp)) + 
  geom_point(alpha = .4) + 
  facet_wrap( ~ continent,
              nrow = 1)

  • nrow determines the number of rows to use when laying out the facets.

facet_wrap(~ VAR) with ncol

ggplot(data = gapminder,
       mapping = 
         aes(x = log10(gdpPercap), 
             y =lifeExp)) + 
  geom_point(alpha = .4) + 
  facet_wrap( ~ continent,
              ncol = 1)

  • ncol determines the number of columns to use when laying out the facets.

facet_wrap(~ VAR) with scales = "free_x"

ggplot(data = gapminder,
       mapping = 
         aes(x = log10(gdpPercap), 
             y =lifeExp)) + 
  geom_point(alpha = .4) + 
  facet_wrap( ~ continent,
              scales = "free_x")

  • scales = "free_x" allows each facet to have its own x-axis scale.

facet_wrap(~ VAR) with scales = "free_y"

ggplot(data = gapminder,
       mapping = 
         aes(x = log10(gdpPercap), 
             y =lifeExp)) + 
  geom_point(alpha = .4) + 
  facet_wrap( ~ continent,
              scales = "free_y")

  • scales = "free_y" allows each facet to have its own y-axis scale.

facet_wrap(~ VAR) with scales = "free"

ggplot(data = gapminder,
       mapping = 
         aes(x = log10(gdpPercap), 
             y =lifeExp)) + 
  geom_point(alpha = .4) + 
  facet_wrap( ~ continent,
              scales = "free")

Time Trend ggplot()

NVDA Stock Price Data

  • The nvda data.frame includes NVIDIA’s stock information from 2019-01-02 to 2025-10-30.

Scatterplot for Time Trend?

path <- 
  "https://bcdanl.github.io/data/nvda_2015_2025.csv"
nvda <- read_csv(path)

ggplot( data = nvda,
        mapping = aes(
          x = Date, 
          y = Close) ) + 
  geom_point(size = .5)

Line Chart with geom_line()

path <- 
  "https://bcdanl.github.io/data/nvda_2015_2025.csv"
nvda <- read_csv(path)

ggplot( data = nvda,
        mapping = aes(
          x = Date, 
          y = Close) ) + 
  geom_point(size = .5) +
  geom_line()

  • geom_line() draws a line by connecting data points in order of the variable on the x-axis.

The Connection Principle

  • We tend to think of objects that are physically connected as part of a group.
  • Look at this figure.
    • Your eyes probably pair the shapes connected by lines rather than similar color, size, or shape!
  • We frequently leverage the connection principle in line charts, to help our eyes see order in the data.

Line Chart with geom_line() and geom_smooth()

path <- 
  "https://bcdanl.github.io/data/nvda_2015_2025.csv"
nvda <- read_csv(path)

ggplot( data = nvda,
        mapping = aes(
          x = Date, 
          y = Close) ) + 
  geom_point(size = .5) +
  geom_line() +
  geom_smooth()

  • geom_smooth() helps reveal underlying long-term trends by smoothing out variability in the observations.

Tech Stocks’ Prices in October

  • The tech_october data.frame includes stock information about AAPL, MSFT, META, and NVDA in October 2025.

Time Trend of Tech Stock Price

tech_october <- 
  read_csv(
    "https://bcdanl.github.io/data/tech_stocks_2025_10.csv"
    )

ggplot( data = tech_october,
        mapping = aes(
          x = Date, 
          y = Close) ) + 
  geom_line() 

  • Something has gone wrong. What happened?

Time Trend of Tech Stock Price

# `ggplot` needs to be 
# explicitly informed that 
# daily observations are grouped 
# by `Ticker`
# for it to understand 
# the grouping structure

ggplot( data = tech_october,
        mapping = aes(
          x = Date, 
          y = Close,
          color = Ticker) ) + 
  geom_line(linewidth = 2) # thicker lines (`linewidth` replaces `size` for lines in ggplot2 3.4.0+)

  • We can use the group, color, or linetype aesthetic to tell ggplot about the firm-level grouping structure in the dataset.

  • Try it out: Classwork 13 (Time Trend Plots).
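The grouping idea can be tried offline with a small made-up data.frame. The sketch below uses `toy`, with hypothetical tickers "AAA" and "BBB", which is not part of the course data.

```r
library(ggplot2)

# Hypothetical toy data: two tickers observed on the same three days
toy <- data.frame(
  Date   = rep(as.Date("2025-10-01") + 0:2, times = 2),
  Ticker = rep(c("AAA", "BBB"), each = 3),
  Close  = c(10, 11, 12, 50, 49, 51)
)

# Without grouping, geom_line() connects all six points as one zig-zag path
p_one <- ggplot(data = toy,
                mapping = aes(x = Date, y = Close)) +
  geom_line()

# `group` draws one line per ticker without adding color or a legend
p_grouped <- ggplot(data = toy,
                    mapping = aes(x = Date, y = Close, group = Ticker)) +
  geom_line()
```

Mapping color or linetype to Ticker groups the data in the same way and also adds a legend.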

Distribution ggplot() - Histograms

Histograms

  • Histograms are used to visualize the distribution of a numeric variable.

  • Histograms divide data into bins and count the number of observations in each bin.

Titanic Data

Histograms with geom_histogram()

titanic <- 
  read_csv(
    "https://bcdanl.github.io/data/titanic_cleaned.csv")

ggplot(data = titanic,
       mapping = 
         aes(x = age)) + 
  geom_histogram()

  • geom_histogram() creates a histogram.
    • We map the x aesthetic to a variable.

geom_histogram() with bins

ggplot(data = titanic,
       mapping = 
         aes(x = age)) + 
  geom_histogram(bins = 5)

  • bins: Specifies the number of bins
  • Be careful: the number of bins can greatly influence the shape of a histogram.

geom_histogram() with binwidth

ggplot(data = titanic,
       mapping = 
         aes(x = age)) + 
  geom_histogram(binwidth = 1)

  • binwidth: Specifies the width of each bin
  • We choose either the bins option or the binwidth option.
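To see what each option does to the bins, we can inspect the data that ggplot2 computes with layer_data(). This sketch uses hwy from mpg so that it runs without downloading the Titanic file; the object names are illustrative.

```r
library(ggplot2)

p_bins  <- ggplot(data = mpg, mapping = aes(x = hwy)) +
  geom_histogram(bins = 10)        # fix the number of bins
p_width <- ggplot(data = mpg, mapping = aes(x = hwy)) +
  geom_histogram(binwidth = 5)     # fix the width of each bin

# Bin edges and counts that ggplot2 actually computed
bins_a <- layer_data(p_bins)[,  c("xmin", "xmax", "count")]
bins_b <- layer_data(p_width)[, c("xmin", "xmax", "count")]

bins_a
bins_b
```

Either way, every observation falls into exactly one bin, so the counts add up to the number of rows.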

Customizing the color and fill Aesthetics

ggplot(data = titanic,
       mapping = 
         aes(x = age)) + 
  geom_histogram(
    binwidth = 2,
    fill = 'lightblue',
    color = 'black')

  • fill: Fills the bars with a specific color.
  • color: Adds an outline of a specific color to the bars.

🎨 Design with Colorblind in Mind

Design with Colorblind in Mind

Types of Colorblindness

  • About 8% of men and 0.5% of women experience some form of colorblindness.
  • To make visualizations more accessible and colorblind-friendly, consider:
    1. Using colorblind-friendly palettes
    2. Adding shape to scatterplots or linetype to line charts
    3. Including additional visual cues to highlight important information (e.g., annotations or labels)
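Tip 2 can be sketched by mapping the same variable to both color and shape, so that groups remain distinguishable even when colors are hard to tell apart. The choice of drv from mpg is only for illustration.

```r
library(ggplot2)

# drv has three levels (4, f, r), so three shapes are easy to tell apart
p <- ggplot(data = mpg,
            mapping = aes(x = displ,
                          y = hwy,
                          color = drv,
                          shape = drv)) +
  geom_point(size = 3)

p
```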

ggthemes package

install.packages("ggthemes")
library(ggthemes)
  • The ggthemes package offers a variety of additional themes and color scales for enhancing ggplot2 visualizations:
    • Accessible color palettes, including options designed for colorblind-friendly viewing
      • For the color aesthetic mapping: scale_color_colorblind() or scale_color_tableau()
      • For the fill aesthetic mapping: scale_fill_colorblind() or
        scale_fill_tableau()
    • Predefined thematic styles inspired by well-known publications and design aesthetics (e.g., theme_economist(), theme_wsj())

ggthemes::scale_color_colorblind()

ggplot( data = mpg,
        mapping = 
          aes(x = displ,
              y = hwy, 
              color = class) ) + 
  geom_point(size = 3) +
  scale_color_colorblind()

  • When mapping color in aes(), we can use scale_color_*() to choose which colors are used.

ggthemes::scale_color_tableau()

ggplot( data = mpg,
        mapping = 
          aes(x = displ,
              y = hwy, 
              color = class) ) + 
  geom_point(size = 3) +
  scale_color_tableau()

  • scale_color_tableau() provides color palettes used in Tableau.

💡 Quick Detour

ggthemes::theme_economist()

ggplot( data = mpg,
        mapping = 
          aes(x = displ,
              y = hwy, 
              color = class) ) + 
  geom_point(size = 3) +
  scale_color_colorblind() +
  theme_economist()

  • theme_economist() approximates the style of The Economist.

💡 Quick Detour

ggthemes::theme_wsj()

ggplot( data = mpg,
        mapping = 
          aes(x = displ,
              y = hwy, 
              color = class) ) + 
  geom_point(size = 3) +
  scale_color_colorblind() +
  theme_wsj()

  • theme_wsj() approximates the style of The Wall Street Journal.

Distribution ggplot() - Boxplots

Boxplots

  • Boxplots visualize how the distribution of a numeric variable varies across levels of a categorical variable.
  • They display the median, quartiles, and potential outliers, providing a compact summary of the data.

Boxplots with geom_boxplot()

ggplot(data = mpg,
       mapping = 
         aes(x = class,
             y = hwy)) + 
  geom_boxplot() 

  • geom_boxplot() creates a boxplot.
    • Mappings: one numeric variable and one categorical variable to the x and y aesthetics

Horizontal Boxplots

ggplot(data = mpg,
       mapping = 
         aes(x = hwy,
             y = class)) + 
  geom_boxplot() 

  • Boxplots can be displayed horizontally or vertically.
    • A horizontal boxplot works well when category names are long.

Customizing the fill Aesthetic

# 1. `show.legend = FALSE` turns off 
#     the legend information
# 2. `scale_fill_colorblind()` or
#    `scale_fill_tableau()`
#     applies a color-blind friendly 
#     palette to the `fill` aesthetic
# install.packages("ggthemes")
library(ggthemes) 
ggplot(data = mpg,
       mapping = 
         aes(x = hwy,
             y = class,
             fill = class)) + 
  geom_boxplot(
    show.legend = FALSE) +
  scale_fill_tableau() 

  • fill: Maps a variable to the fill colors used in the boxplot.
  • ggthemes::scale_fill_tableau(): A colorblind-friendly Tableau-style palette for the fill aesthetic.

Sorted Boxplots with fct_reorder(CATEGORICAL, NUMERICAL)

# labs() can label
#   x-axis, y-axis, and more

ggplot(data = mpg,
       mapping = 
        aes(x = hwy,
            y = 
             fct_reorder(class, hwy),
            fill = class)) + 
  geom_boxplot(
    show.legend = FALSE) +
  scale_fill_tableau() +
  labs(x = "Highway MPG",
       y = "Class") 

  • fct_reorder(CATEGORICAL, NUMERICAL): Reorders the categories of the CATEGORICAL by the median of the NUMERICAL.
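The effect of fct_reorder() is easiest to see by comparing the category order before and after. A minimal sketch, where `reordered` is an illustrative name:

```r
library(ggplot2)   # for the mpg data
library(forcats)

levels(factor(mpg$class))        # default: alphabetical order

reordered <- fct_reorder(mpg$class, mpg$hwy)
levels(reordered)                # ordered by median hwy, lowest first

# .desc = TRUE reverses the order (highest median first)
levels(fct_reorder(mpg$class, mpg$hwy, .desc = TRUE))
```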

Distribution ggplot() - Bar Charts

Bar Charts

  • Bar charts are used to visualize the distribution of a categorical variable.

  • Bar charts display the count (or proportion) of observations for each category.

Diamond Data

  • ggplot2::diamonds is a data.frame containing the prices and other attributes of almost 54,000 diamonds.

Bar Charts with geom_bar()

ggplot(data = diamonds,
       mapping = aes(x = cut)) + 
  geom_bar()

  • geom_bar() creates a bar chart.
    • We map either the x or y aesthetic to the variable.

Horizontal Bar Charts

ggplot(data = diamonds,
       mapping = aes(y = cut)) + 
  geom_bar()

  • Bar charts can be horizontal or vertical.
    • A horizontal bar chart is a good option for long category names.

Data Transformation - count(): Counting Occurrences of Each Category in a Categorical Variable

  • The figure below demonstrates how the counting process works with geom_bar().
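The counting step can also be reproduced by hand. In this sketch, count() from dplyr gives the same numbers that geom_bar() computes internally; `counts` and `bar_data` are illustrative names.

```r
library(ggplot2)
library(dplyr)

counts <- diamonds |> count(cut)   # one row per cut, with the count in n
counts

# geom_bar() computes the same counts behind the scenes
bar_data <- layer_data(
  ggplot(data = diamonds, mapping = aes(x = cut)) +
    geom_bar()
)
bar_data$count
```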

Colorful Bar Charts with the fill Aesthetic

ggplot(data = diamonds,
       mapping = 
         aes(x = cut, 
             fill = cut)) + 
  geom_bar(
    show.legend = FALSE
    ) 

  • We can color bar charts using the fill aesthetic.

Data Transformation - count(): Counting Occurrences Across Two Categorical Variables

DATA.FRAME |> count(CATEGORICAL_VARIABLE_1, CATEGORICAL_VARIABLE_2)
  • The data transformation function count() calculates the frequency of each unique combination of values across two categorical variables.
diamonds |> count(cut, clarity)
  • diamonds |> count(cut, clarity) returns the data.frame with the three variables, cut, clarity, and n:
    • n: the number of occurrences of each unique combination of values in cut and clarity
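A short sketch for inspecting the result; `cut_clarity` is an illustrative name.

```r
library(ggplot2)   # for the diamonds data
library(dplyr)

cut_clarity <- diamonds |> count(cut, clarity)

head(cut_clarity)    # first few (cut, clarity) combinations
names(cut_clarity)   # "cut" "clarity" "n"

# Every diamond is counted exactly once
sum(cut_clarity$n) == nrow(diamonds)
```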

Stacked Bar Charts with the fill Aesthetic

# Mapping the `fill` aesthetic 
# to another CATEGORICAL variable
# gives a stacked bar chart

ggplot(data = diamonds,
       mapping = 
         aes(x = cut, 
             fill = clarity)) + 
  geom_bar()

  • This describes how the distribution of clarity varies by cut, with total bar height for overall count and segments for each clarity level.

100% Stacked Bar Charts with the fill Aesthetic & the position="fill"

ggplot(data = diamonds,
       mapping = 
         aes(x = cut, 
             fill = clarity)) + 
  geom_bar(position = "fill") +
  labs(y = "Proportion")

  • This describes how the distribution of clarity varies by cut, displaying the proportion of each clarity within each cut.

Clustered Bar Charts with the fill Aesthetic & the position="dodge"

ggplot(data = diamonds,
       mapping = 
         aes(x = cut, 
             fill = clarity)) + 
  geom_bar(position = "dodge")

  • This shows how the distribution of clarity varies by cut, with separate bars for each clarity level within each cut category.

Stacked Bar Charts using the fill Aesthetic and the position = "stack"

ggplot(data = diamonds,
       mapping = 
         aes(x = cut, 
             fill = clarity)) + 
  geom_bar(position = "stack")

  • The default position option is position = "stack"

Proportion Bar Charts with geom_bar()

ggplot(data = diamonds,
       mapping = 
         aes(x = cut, 
             y = after_stat(prop),
             group = 1)) + 
  geom_bar()

  • after_stat(prop): Calculates the proportion of the total count.
  • group = 1: Ensures the proportions are calculated over the entire data.frame, not within each group of cut
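To see why group = 1 matters, compare the computed proportions with and without it. The object names in this sketch are illustrative.

```r
library(ggplot2)

p_grouped <- ggplot(data = diamonds,
                    mapping = aes(x = cut,
                                  y = after_stat(prop),
                                  group = 1)) +
  geom_bar()

p_ungrouped <- ggplot(data = diamonds,
                      mapping = aes(x = cut,
                                    y = after_stat(prop))) +
  geom_bar()

layer_data(p_grouped)$prop     # shares of the whole data; they sum to 1
layer_data(p_ungrouped)$prop   # each bar is its own group, so every prop is 1
```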

Bar Charts with geom_col()

df <- mpg |> 
  count(class)

ggplot(data = df,
       mapping = 
         aes(x = n, 
             y = class)) + 
  geom_col()

  • geom_col() creates bar charts where the height of bars directly represents values in a column in a given data.frame.
    • geom_col() requires both the x and y aesthetics.

Sorted Bar Charts with fct_reorder(CATEGORICAL, NUMERICAL)

df <- mpg |> 
  count(class)

ggplot(data = df,
       mapping = 
         aes(x = n,
             y = 
               fct_reorder(class, n))
       ) + 
  geom_col() +
  labs(y = "Class")

  • fct_reorder(CATEGORICAL, NUMERICAL): Reorders the categories of the CATEGORICAL by the median of the NUMERICAL.

📊 Effective Distribution Charts for Category Comparisons

Choosing the Right Bar Charts for Comparing Components and Totals

  • Which type of bar chart is most effective for your data?

  • Which type of bar chart best meets your visualization goals?

Stacked Bar Charts

  • Purpose: Show the composition of each category while still conveying the overall total.
  • When to use: Your primary focus is the total bar height, with a secondary interest in how subcomponents contribute.
  • Be cautious: Precise comparisons across subcomponents are difficult because segments do not share a common baseline.
  • Tip: If you need to emphasize totals plus composition, a stacked bar chart is a strong choice.

100% Stacked Bar Charts

  • Purpose: Display the proportional composition of each category by normalizing all bars to the same height.
  • When to use: Your focus is on comparing relative percentages across categories rather than absolute totals.
  • Be cautious: While proportions are clear, absolute sizes or totals are not visible.
  • Tip: Choose a 100% stacked bar chart when you want to emphasize proportional differences across categories.

Clustered Bar Charts

  • Purpose: Plot each subcomponent as a separate bar within each category for side-by-side comparison.
  • When to use: Your focus is on comparing individual subcomponents across categories.
  • Be cautious: Can become crowded with many categories or subcomponents.
  • Tip: Choose a clustered bar chart when you want to emphasize precise comparisons between subcomponents—both within each cluster and across clusters.

Pie Charts: Alternative to Bar Charts?

  • Pie charts display the proportions of a whole.
    • Each slice represents a part of the total.
    • But is a pie chart an effective alternative to a bar chart?

Pie Charts: Alternative to Bar Charts?

Humans are generally better at judging lengths than angles.

  • Still, pie charts can be useful in certain cases.

  1. Pie charts work best when there are very few categories—ideally four or fewer.
  2. Pie charts are effective when the goal is to highlight simple, recognizable fractions (e.g., 25%, 50%, 75%).

Pie Charts: Alternative to Bar Charts?

  3. Pie charts are not ideal when the audience needs to compare the size of individual shares across categories.

  4. Pie charts are not ideal when the goal is to compare the overall distribution of categories.

🔢 Understanding and Visualizing Integer Data

Numeric or Categorical?

Treat as Numeric (Continuous)

  • When the integer reflects a magnitude, intensity, or a meaningful ordered scale.
  • Examples:
    • Age (18, 19, 20, 21, …)
    • MPG values (27, 28, 29, 30, …)
    • Temperature readings (whole-number data)

Treat as Categorical (Discrete)

  • When the integer is simply a label for a category, and the numeric order does not represent meaningful numeric differences.

  • Examples:

    • Month (1–12)
    • Day of week (1–7)
    • ZIP codes
    • Student ID numbers

Histograms or Bar Charts?

Histograms for the Age variable

Bar Charts for the Age variable

  • In ggplot, the distribution of an integer variable can look quite similar whether using geom_histogram() or geom_bar().

  • As shown above, in Python and other tools, these visualizations can behave differently, leading to noticeably different outputs.
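The similarity in ggplot can be checked with an integer variable. This sketch uses cty from mpg as a stand-in for the Age variable shown on the slides, with binwidth = 1 so that each bin covers one integer value.

```r
library(ggplot2)

# cty (city MPG) is stored as an integer in mpg
p_hist <- ggplot(data = mpg, mapping = aes(x = cty)) +
  geom_histogram(binwidth = 1)
p_bar  <- ggplot(data = mpg, mapping = aes(x = cty)) +
  geom_bar()

hist_counts <- layer_data(p_hist)$count
bar_counts  <- layer_data(p_bar)$count

# The histogram may include empty bins; the non-empty ones match the bars
hist_counts[hist_counts > 0]
bar_counts
```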