Lecture 10

Data Visualization with ggplot

Byeong-Hak Choe

SUNY Geneseo

November 5, 2025

Data Visualization with ggplot - First Steps

Grammar of Graphics

  • A grammar of graphics is a tool that enables us to concisely describe the components of a graphic.

Data Visualization - First Steps

library(tidyverse)
mpg
?mpg
  • The mpg data frame, provided by ggplot2, contains observations collected by the US Environmental Protection Agency on 38 models of car.

  • Q. Do cars with big engines use more fuel than cars with small engines?

    • displ: a car’s engine size, in liters.
    • hwy: a car’s fuel efficiency on the highway, in miles per gallon (mpg).
  • What does the relationship between engine size and fuel efficiency look like?

Creating a Scatterplot with ggplot

ggplot( data = mpg,
        mapping = 
          aes(x = displ, 
              y = hwy) ) + 
  geom_point()

  • To plot mpg, run the above code to put displ on the x-axis and hwy on the y-axis.

Components in the Grammar of Graphics

ggplot( data = DATA.FRAME,
        mapping = 
          aes( MAPPINGS ) ) + 
  GEOM_FUNCTION()
  • A ggplot graphic is a mapping of variables in data to aesthetic attributes of geometric objects.

  • Three Essential Components in ggplot() Graphics:

    1. data: data.frame containing the variables of interest.
    2. geom_*(): geometric object in the plot (e.g., point, line, bar, histogram, boxplot).
    3. aes(): aesthetic attributes of the geometric object (e.g., x-axis, y-axis, color, shape, size, fill) mapped to variables in the data.frame.

Creating a Scatterplot with ggplot

ggplot( data = mpg,
        mapping = 
          aes(x = displ, 
              y = hwy) ) + 
  geom_point()
  • Three Essential Components in This Particular ggplot():
    1. data = mpg
    2. geom_point()
    3. aes(x = displ, y = hwy)

Relationship ggplot()

Scatterplot with geom_point()

ggplot( data = mpg,
        mapping = 
          aes(x = displ, 
              y = hwy) ) + 
  geom_point()

Fitted Curve with geom_smooth()

ggplot( data = mpg,
        mapping = 
          aes(x = displ, 
              y = hwy) ) + 
  geom_smooth()

geom_point() with geom_smooth()

# To add a layer of 
# a `ggplot()` component, 
# we can simply add it to 
# the `ggplot()` with `+`.

ggplot( data = mpg,
        mapping = 
          aes(x = displ, 
              y = hwy) ) + 
  geom_point() + 
  geom_smooth()

  • The geometric object geom_smooth() draws a smooth curve fitted to the data.

ggplot() workflow

Common problems in ggplot()

ggplot(data = mpg,
       mapping = 
          aes(x = displ, 
              y = hwy) ) +
 geom_point()
 + geom_smooth()
  • One common problem when creating ggplot2 graphics is to put the + in the wrong place.
    • Correct Approach: Always place the + at the end of the previous line, NOT at the beginning of the next line.

About geom_smooth()

  • geom_smooth() uses regression, one of the machine learning methods, to visualize the predicted value of the y variable for a given value of the x variable.

  • What Does the Grey Ribbon Represent?

    • The grey ribbon illustrates the uncertainty around the estimated prediction curve.
    • We are 95% confident that the actual relationship between x and y variables falls within the grey ribbon.
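
The ribbon can be adjusted through arguments of geom_smooth(). The sketch below assumes the default settings (se = TRUE, level = 0.95) and shows one way to hide the ribbon or widen it to a 99% level; the object names p_no_ribbon and p_99 are illustrative only.

```r
library(ggplot2)

# Hide the confidence ribbon entirely
p_no_ribbon <- ggplot(data = mpg,
                      mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(se = FALSE)

# Widen the ribbon from the default 95% to a 99% confidence level
p_99 <- ggplot(data = mpg,
               mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(level = 0.99)

p_no_ribbon
p_99
```

A higher level gives a wider ribbon, because a wider band is needed to be more confident that it covers the underlying relationship.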

geom_point() with geom_smooth(method = lm)

ggplot( data = mpg,
        mapping = 
          aes(x = displ, 
              y = hwy) ) + 
  geom_point() + 
  geom_smooth(method = "lm")

  • method = "lm" specifies that a linear model (lm), also called a linear regression model, is used to fit the line.
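
The straight line drawn by method = "lm" can also be inspected numerically. The following is a minimal sketch that fits the same kind of model with base R's lm(); the name fit and the 3.0-liter engine used for the prediction are assumptions made for illustration.

```r
library(ggplot2)  # provides the mpg data.frame

# Fit a linear regression of highway mpg on engine size
fit <- lm(hwy ~ displ, data = mpg)

# Intercept and slope of the fitted line
coef(fit)

# Predicted highway mpg for a hypothetical 3.0-liter engine
predict(fit, newdata = data.frame(displ = 3.0))
```

A negative slope here matches the downward-sloping line in the plot: larger engines are associated with lower highway fuel efficiency.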

Relationship ggplot()

  • How many points are in this plot?
  • How many observations are in the mpg data.frame?

Overplotting problem

  • Many points overlap each other.

    • This problem is known as overplotting.
  • When points overlap, it’s hard to know how many data points are at a particular location.

  • Overplotting can obscure patterns and outliers, leading to potentially misleading conclusions.
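
One way to see how much overplotting there is in the displ/hwy scatterplot is to compare the number of observations with the number of distinct (displ, hwy) locations. This is a sketch using dplyr; the names n_obs and n_locations are illustrative.

```r
library(ggplot2)  # provides the mpg data.frame
library(dplyr)

n_obs <- nrow(mpg)                                     # total observations
n_locations <- mpg |> distinct(displ, hwy) |> nrow()   # distinct plotted locations

n_obs
n_locations

# Locations shared by more than one car, most crowded first
mpg |>
  count(displ, hwy, sort = TRUE) |>
  filter(n > 1)
```

Whenever n_locations is smaller than n_obs, some points in the scatterplot sit exactly on top of others.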

Overplotting and Transparency with alpha

# alpha = 0.33 should be located
# within the geom function,
# NOT within the aesthetic function

ggplot( data = mpg,
        mapping = 
          aes(x = displ, 
              y = hwy) ) + 
  geom_point( alpha = 0.33 ) 

  • We can set a transparency level (alpha) between 0 (full transparency) and 1 (no transparency) manually.

Overplotting and Transparency with alpha

ggplot( data = mpg,
        mapping = 
          aes(x = displ, 
              y = hwy) ) + 
  geom_point( alpha = .33 )

  • We can set an aesthetic property manually, as seen above, not within the aes() function but within the geom_*() function.

Aesthetic Mappings

  • In the plot above, one group of points (highlighted in red) seems to fall outside of the linear trend.

    • How can you explain these cars? Are those hybrids?

Aesthetic Mappings

  • An aesthetic is a visual property (e.g., size, shape, color) of the objects (e.g., points) in your plot.

  • You can display a point in different ways by changing the values of its aesthetic properties.

Adding a color to the Plot

ggplot( data = mpg,
        mapping = 
          aes(x = displ,
              y = hwy, 
              color = class) ) + 
  geom_point()

Adding a shape to the Plot

ggplot( data = mpg,
        mapping = 
          aes(x = displ,
              y = hwy, 
              shape = class) ) + 
  geom_point()

Adding a size to the Plot

ggplot( data = mpg,
        mapping =
          aes(x = displ,
              y = hwy,
              size = class) ) +
  geom_point()

Specifying a color to the Plot, Manually

ggplot(data = mpg,
       mapping = 
         aes(x = displ, 
             y = hwy) ) + 
  geom_point(color = "blue")

Specifying a color to the Plot, Manually

ggplot(data = mpg,
       mapping = 
         aes(x = displ, 
             y = hwy) ) + 
  geom_smooth(color = "darkorange") 

Specifying a fill to the Plot, Manually

ggplot(data = mpg,
       mapping = 
         aes(x = displ, 
             y = hwy) ) + 
  geom_smooth(color = "darkorange",
              fill = "darkorange") 

  • In general, each geom_*() has a different set of aesthetic parameters.
    • E.g., fill colors the ribbon of geom_smooth(), but has no visible effect on geom_point() with the default point shape.

Specifying a size to the Plot, Manually

ggplot(data = mpg,
       mapping =
         aes(x = displ,
             y = hwy) ) +
  geom_point(size = 3)  # in *mm*

Specifying an alpha to the Plot, Manually

ggplot(data = mpg,
       mapping = 
         aes(x = displ, 
             y = hwy) ) + 
  geom_point(alpha = .3) 

  • We’ve done this to address the issue of overplotting in the scatterplot.

Mapping Aesthetics vs. Setting Them Manually

Aesthetic Mapping

  • Links data variables to visible properties on the graph
  • Different categories → different colors or shapes

Setting Aesthetics Manually

  • Customize visual properties directly in geom_*() outside of aes()
  • Useful for setting fixed colors, sizes, or transparency unrelated to data variables
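
The difference between mapping and setting is easiest to see side by side. The sketch below uses the mpg data from earlier slides; the object names p_set, p_map, and p_slip are illustrative, and the third plot shows a common slip rather than a recommended pattern.

```r
library(ggplot2)

# Setting: every point is drawn in blue; no legend is needed
p_set <- ggplot(data = mpg,
                mapping = aes(x = displ, y = hwy)) +
  geom_point(color = "blue")

# Mapping: color varies with the `class` variable, and a legend is added
p_map <- ggplot(data = mpg,
                mapping = aes(x = displ, y = hwy, color = class)) +
  geom_point()

# Common slip: "blue" inside aes() is treated as a data value
# (one category labeled "blue"), so ggplot chooses its own default
# color and draws a legend for it
p_slip <- ggplot(data = mpg,
                 mapping = aes(x = displ, y = hwy, color = "blue")) +
  geom_point()

p_set
p_map
p_slip
```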

Clutter is Your Enemy!

Cognitive Load

  • Every element on a slide or screen adds to the viewer’s cognitive load: the mental effort required to process information.
  • The more elements you include, the more brainpower your audience must spend deciphering the message.
  • Example: Complex graphs, dense text, or excessive decoration can easily overwhelm viewers.
  • When cognitive load is too high, people lose focus or misinterpret the main idea.
  • 🎯 Goal: Communicate as much insight as possible with the least mental effort required from the audience.

Why Reduce Clutter?

  • Clutter: Any visual element that takes up space without improving understanding.
  • Clutter distracts, slows comprehension, and obscures your main point.
  • Strive for clarity: Clean, purposeful visuals promote focus and engagement.
    • Less clutter → clearer message → more attention to what matters.
  • ✨ Practical Tips
    • Keep data visually balanced — avoid crowding one side of the graph.
    • Limit overlapping or superimposed elements (e.g., no more than 3–4 lines per plot).
    • Use whitespace strategically to let key information breathe.
    • Eliminate anything that doesn’t support your story.

Clutter is Your Enemy!

  • Which one do you prefer?

Log Transformation: Reducing Clutter

  • Problem: When data points are densely packed, patterns become hard to see.
    • This often happens with extreme values or skewed distributions.
    • Dense clusters create visual clutter, hiding meaningful relationships.
  • Solution: Apply a log transformation!
    • 📏 Spreads out data — makes points more evenly distributed across the plot.
    • 🔍 Reduces outlier impact, revealing clearer patterns and trends.
    • 🧠 Improves interpretability — helps audiences grasp relationships faster.
    • 🎯 Promotes focused, informative visualization without unnecessary complexity.

A Little Bit of Math for Logarithm

  • The logarithm function, \(y = \log_{b}\,(\,x\,)\), looks like ….

🧮 A Little Bit of Math for Logarithms

  • \(\log_{10}(100)\) — the base-10 logarithm of 100 is 2, because
    \(10^{2} = 100\).

  • \(\log_{e}(x)\) — the base-\(e\) logarithm is called the natural logarithm,
    where \(e = 2.718\ldots\) is Euler’s number.

  • \(\log(x)\) or \(\ln(x)\) — both denote the natural log of \(x\).

  • \(\log_{e}(7.389\ldots)\) — the natural log of 7.389⋯ is 2, because
    \(e^{2} = 7.389\ldots\).

💻 In R

log(x)    # Natural log (base e)
log10(x)  # Common log (base 10)
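
The identities from the previous slide can be checked directly. This is a small sketch; the vector name vals is illustrative.

```r
vals <- c(
  common  = log10(100),        # base-10 log of 100
  natural = log(exp(2)),       # natural log of e^2
  base2   = log(8, base = 2)   # log with an explicit base
)
vals

# exp() undoes log()
exp(log(5))
```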

📉 The Use of Logarithms: Handling Skewed Data

  1. Consider a logarithmic scale when a variable is heavily skewed.
    • It helps visualize both small and large values effectively.

Without Log Transformation

With Log Transformation

📈 The Use of Logarithms: Focusing on Percentage Change

  2. Consider a logarithmic scale when percentage changes are more meaningful than absolute changes.
  • Percentage changes are widely used to interpret relative differences:
    • Stock prices — show proportional gains or losses relative to initial value.
    • Housing prices — highlight comparable market trends across regions.
    • GDP growth — expressed as a percentage to reflect economic performance.
    • Income levels — a $1,000 increase matters more to lower-income individuals.

🧮 Logarithms and Percentage Change

For a small change in a variable \(x\) from \(x_{0}\) to \(x_{1}\): \[ \begin{aligned} \Delta \log(x) &= \log(x_{1}) - \log(x_{0})\\ &\approx \frac{x_{1} - x_{0}}{x_{0}}\\ &= \frac{\Delta x}{x_{0}} \end{aligned} \]

  • This shows that a log transformation effectively represents percentage change!
    • Log transformation is useful for interpreting relative rather than absolute differences.
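
The approximation can be checked numerically, along with how it behaves once the change is no longer small. The values 100, 101, and 200 below are arbitrary numbers chosen for illustration.

```r
x0 <- 100

# Small change: 100 -> 101 (a 1% increase)
x1 <- 101
small_log_diff <- log(x1) - log(x0)
small_pct      <- (x1 - x0) / x0

# Large change: 100 -> 200 (a 100% increase)
x2 <- 200
large_log_diff <- log(x2) - log(x0)
large_pct      <- (x2 - x0) / x0

round(c(log_diff = small_log_diff, pct = small_pct), 4)  # nearly equal
round(c(log_diff = large_log_diff, pct = large_pct), 4)  # noticeably different
```

For the large change, the log difference equals log(2), which is about 0.69 rather than 1, so the percentage-change reading of a log difference is reliable only for small changes.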

🌍 Example: GDP per Capita vs. Life Expectancy

Linear Scale

Log Scale

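  • Interpretation: With the natural log, a 1-unit increase in log(gdpPercap) multiplies gdpPercap by e ≈ 2.72 (about a 172% increase), not by 2; reading a log difference as a percentage change is accurate only for small changes.
    • If the fitted slope is about 8.4 years per 1-unit increase in log(gdpPercap), a 1% increase in GDP per capita is associated with about 0.084 more years of life expectancy, and a doubling (a log difference of log(2) ≈ 0.69) with about 5.8 more years.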
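
A log-scale version of this plot and the corresponding regression slope can be sketched as follows. This assumes the gapminder data.frame comes from the gapminder package (the slides do not show how it is loaded); the names p_log, fit, and slope are illustrative.

```r
library(ggplot2)
library(gapminder)  # assumed source of the `gapminder` data.frame

# Log scale on the x-axis; scale_x_log10() keeps tick labels in original units
p_log <- ggplot(data = gapminder,
                mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = .3) +
  geom_smooth(method = "lm") +
  scale_x_log10()
p_log

# Regression of life expectancy on the natural log of GDP per capita
fit <- lm(lifeExp ~ log(gdpPercap), data = gapminder)
slope <- coef(fit)[["log(gdpPercap)"]]

slope               # change in lifeExp per 1-unit increase in log(gdpPercap)
slope * log(2)      # implied change in lifeExp when gdpPercap doubles
slope * log(1.01)   # implied change in lifeExp for a 1% increase in gdpPercap
```

Multiplying the slope by log(2) or log(1.01) converts a proportional change in GDP per capita into the corresponding change on the log scale.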

Facets

  • Adding too many aesthetics (e.g., color, shape, size) can sometimes make a plot cluttered and hard to interpret.
  • To incorporate an additional variable, especially a categorical one, we can use facets — separate subplots that each show a subset of the data.
  • Faceting helps reveal patterns within groups while keeping each plot clean and focused.

facet_wrap(~ VAR)

library(gapminder)  # assumed source of the `gapminder` data.frame
ggplot(data = gapminder,
       mapping = 
         aes(x = log10(gdpPercap), 
             y =lifeExp)) + 
  geom_point(alpha = .4) + 
  facet_wrap( ~ continent)

  • To facet our plot by a single variable, we can use facet_wrap().

facet_wrap(~ VAR) with nrow

ggplot(data = gapminder,
       mapping = 
         aes(x = log10(gdpPercap), 
             y =lifeExp)) + 
  geom_point(alpha = .4) + 
  facet_wrap( ~ continent,
              nrow = 1)

  • nrow determines the number of rows to use when laying out the facets.

facet_wrap(~ VAR) with ncol

ggplot(data = gapminder,
       mapping = 
         aes(x = log10(gdpPercap), 
             y =lifeExp)) + 
  geom_point(alpha = .4) + 
  facet_wrap( ~ continent,
              ncol = 1)

  • ncol determines the number of columns to use when laying out the facets.

facet_wrap(~ VAR) with scales = "free_x"

ggplot(data = gapminder,
       mapping = 
         aes(x = log10(gdpPercap), 
             y =lifeExp)) + 
  geom_point(alpha = .4) + 
  facet_wrap( ~ continent,
              scales = "free_x")

  • scales = "free_x" allows each facet to have its own x-axis scale.

facet_wrap(~ VAR) with scales = "free_y"

ggplot(data = gapminder,
       mapping = 
         aes(x = log10(gdpPercap), 
             y =lifeExp)) + 
  geom_point(alpha = .4) + 
  facet_wrap( ~ continent,
              scales = "free_y")

  • scales = "free_y" allows each facet to have its own y-axis scale.

facet_wrap(~ VAR) with scales = "free"

ggplot(data = gapminder,
       mapping = 
         aes(x = log10(gdpPercap), 
             y =lifeExp)) + 
  geom_point(alpha = .4) + 
  facet_wrap( ~ continent,
              scales = "free")

  • scales = "free" allows each facet to have its own x-axis and y-axis scales.

Time Trend ggplot()

NVDA Stock Price Data

  • The nvda data.frame includes NVIDIA’s stock information from 2019-01-02 to 2024-10-18.

Scatterplot for Time Trend?

path <- 
  "https://bcdanl.github.io/data/nvda_2015_2025.csv"
nvda <- read_csv(path)

ggplot( data = nvda,
        mapping = aes(
          x = Date, 
          y = Close) ) + 
  geom_point(size = .5)

Line Chart with geom_line()

path <- 
  "https://bcdanl.github.io/data/nvda_2015_2025.csv"
nvda <- read_csv(path)

ggplot( data = nvda,
        mapping = aes(
          x = Date, 
          y = Close) ) + 
  geom_point(size = .5) +
  geom_line()

  • geom_line() draws a line by connecting data points in order of the variable on the x-axis.

The Connection Principle

  • We tend to think of objects that are physically connected as part of a group.
  • Look at this figure.
    • Your eyes probably pair the shapes connected by lines rather than similar color, size, or shape!
  • We frequently leverage the connection principle in line charts to help our eyes see order in the data.

Line Chart with geom_line() and geom_smooth()

path <- 
  "https://bcdanl.github.io/data/nvda_2015_2025.csv"
nvda <- read_csv(path)

ggplot( data = nvda,
        mapping = aes(
          x = Date, 
          y = Close) ) + 
  geom_point(size = .5) +
  geom_line() +
  geom_smooth()

  • geom_smooth() can also be useful for illustrating overall time trends.

Tech Stocks’ Prices in October

  • The tech_october data.frame includes stock information about AAPL, MSFT, META, and NVDA in October 2025.

Time Trend ggplot()

Tech Stock Price

tech_october <- 
  read_csv(
    "https://bcdanl.github.io/data/tech_stocks_2025_10.csv"
    )

ggplot( data = tech_october,
        mapping = aes(
          x = Date, 
          y = Close) ) + 
  geom_line() 

  • Something has gone wrong. What happened?

Time Trend ggplot()

Tech Stock Price

# `ggplot` needs to be 
# explicitly informed that 
# daily observations are grouped 
# by `Ticker`
# for it to understand 
# the grouping structure

ggplot( data = tech_october,
        mapping = aes(
          x = Date, 
          y = Close,
          color = Ticker) ) + 
  geom_line(linewidth = 2) # thicker lines; `linewidth` replaces `size` for lines in ggplot2 3.4.0+

  • We can use the group, color, or linetype aesthetic to tell ggplot explicitly about this firm-level structure.
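
The group and linetype aesthetics work the same way as color here. The sketch below uses a small made-up data.frame (the tickers AAA and BBB and their prices are hypothetical) so that it runs without downloading the stock data.

```r
library(ggplot2)

# Hypothetical prices for two made-up tickers (illustration only)
toy <- data.frame(
  Date   = rep(as.Date("2025-10-01") + 0:3, times = 2),
  Ticker = rep(c("AAA", "BBB"), each = 4),
  Close  = c(10, 11, 12, 11, 20, 19, 21, 22)
)

# `group` draws one line per ticker without changing their appearance
p_group <- ggplot(data = toy,
                  mapping = aes(x = Date, y = Close, group = Ticker)) +
  geom_line()

# `linetype` draws one line per ticker and also distinguishes them visually
p_linetype <- ggplot(data = toy,
                     mapping = aes(x = Date, y = Close, linetype = Ticker)) +
  geom_line()

p_group
p_linetype
```

With group alone there is no legend, so it suits cases where the individual lines do not need to be identified.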

Distribution ggplot() - Histogram

Histogram with geom_histogram()

  • Histograms are used to visualize the distribution of a numeric variable.

  • Histograms divide data into bins and count the number of observations in each bin.

Titanic Data

Histogram with geom_histogram()

titanic <- 
  read_csv(
    "https://bcdanl.github.io/data/titanic_cleaned.csv")

ggplot(data = titanic,
       mapping = 
         aes(x = age)) + 
  geom_histogram()

  • geom_histogram() creates a histogram.
    • We map the x aesthetic to the variable.

Histogram with geom_histogram() with bins

ggplot(data = titanic,
       mapping = 
         aes(x = age)) + 
  geom_histogram(bins = 5)

  • bins: Specifies the number of bins
  • The shape of a histogram can be sensitive to the number of bins!

Histogram with geom_histogram() with binwidth

ggplot(data = titanic,
       mapping = 
         aes(x = age)) + 
  geom_histogram(binwidth = 1)

  • binwidth: Specifies the width of each bin
  • We choose either the bins option or the binwidth option.

Customizing the Aesthetics

ggplot(data = titanic,
       mapping = 
         aes(x = age)) + 
  geom_histogram(
    binwidth = 2,
    fill = 'lightblue',
    color = 'black')

  • fill: Fills the bars with a specific color.
  • color: Adds an outline of a specific color to the bars.

Design with Colorblind in Mind

Types of Colorblindness

  • Roughly 8% of men and half a percent of women are colorblind.

  • There are several techniques to make visualization more colorblind-friendly:

    1. Use color palettes that are colorblind-friendly
    2. Use shape for scatterplots and linetype for line charts
    3. Have some additional visual cue to set the important numbers apart
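
The second technique can be sketched by encoding the same variable with both color and shape, so that the groups stay distinguishable even when the colors are hard to tell apart. The choice of the drv variable from mpg is an assumption made for illustration.

```r
library(ggplot2)

# Map `drv` to both color and shape (redundant encoding)
p <- ggplot(data = mpg,
            mapping = aes(x = displ,
                          y = hwy,
                          color = drv,
                          shape = drv)) +
  geom_point(size = 3)
p
```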

ggthemes package

  • The ggthemes package provides various themes for ggplot2 visualization:
    • Accessible color palettes, including those optimized for colorblind viewers.
      • E.g., scale_color_colorblind(), scale_color_tableau()
    • Unique, predefined themes for specific styles
      • E.g., theme_economist(), theme_wsj()

ggthemes::scale_color_colorblind()

library(ggthemes)  # provides scale_color_colorblind()
ggplot( data = mpg,
        mapping = 
          aes(x = displ,
              y = hwy, 
              color = class) ) + 
  geom_point(size = 3) +
  scale_color_colorblind()

  • When color is mapped in aes(), a scale_color_*() function controls which colors are used.

ggthemes::scale_color_tableau()

ggplot( data = mpg,
        mapping = 
          aes(x = displ,
              y = hwy, 
              color = class) ) + 
  geom_point(size = 3) +
  scale_color_tableau()

  • scale_color_tableau() provides color palettes used in Tableau.

💡 Quick Detour

ggthemes::theme_economist()

ggplot( data = mpg,
        mapping = 
          aes(x = displ,
              y = hwy, 
              color = class) ) + 
  geom_point(size = 3) +
  scale_color_colorblind() +
  theme_economist()

  • theme_economist() approximates the style of The Economist.

💡 Quick Detour

ggthemes::theme_wsj()

ggplot( data = mpg,
        mapping = 
          aes(x = displ,
              y = hwy, 
              color = class) ) + 
  geom_point(size = 3) +
  scale_color_colorblind() +
  theme_wsj()

  • theme_wsj() approximates the style of The Wall Street Journal.

Distribution ggplot() - Boxplot

Boxplot with geom_boxplot()

  • Boxplots can be used to visualize how the distribution of a numeric variable varies by a categorical variable.
  • Boxplots display the median, quartiles, and potential outliers in the data.

Boxplot with geom_boxplot()

ggplot(data = mpg,
       mapping = 
         aes(x = class,
             y = hwy)) + 
  geom_boxplot() 

  • geom_boxplot() creates a boxplot.
    • Mappings: map one numeric variable and one categorical variable to the x and y aesthetics.

Horizontal Boxplots

ggplot(data = mpg,
       mapping = 
         aes(x = hwy,
             y = class)) + 
  geom_boxplot() 

  • Boxplots can be horizontal or vertical.
    • A horizontal boxplot is a good option for long category names.

Customizing the Aesthetics

# 1. `show.legend = FALSE` turns off 
#     the legend information
# 2. `scale_fill_colorblind()` or
#    `scale_fill_tableau()`
#     applies a color-blind friendly 
#     palette to the `fill` aesthetic
# To use the scale_fill_tableau():
library(ggthemes) 
ggplot(data = mpg,
       mapping = 
         aes(x = hwy,
             y = class,
             fill = class)) + 
  geom_boxplot(
    show.legend = FALSE) +
  scale_fill_tableau() 

  • fill: Maps a variable to the fill color of the boxes.
  • scale_fill_tableau(): Applies a color-blind friendly palette to the fill aesthetic.

Sorted Boxplot with fct_reorder(CATEGORICAL, NUMERICAL)

# labs() can label
#   x-axis, y-axis, and more

ggplot(data = mpg,
       mapping = 
        aes(x = hwy,
            y = 
             fct_reorder(class, hwy),
            fill = class)) + 
  geom_boxplot(
    show.legend = FALSE) +
  scale_fill_tableau() +
  labs(x = "Highway MPG",
       y = "Class") 

  • fct_reorder(CATEGORICAL, NUMERICAL): Reorders the categories of the CATEGORICAL by the median of the NUMERICAL.

Distribution ggplot() - Bar Chart

Bar Chart with geom_bar()

  • Bar charts are used to visualize the distribution of a categorical variable.

  • geom_bar() groups the data by category and counts the number of observations in each category.

Diamond Data

Bar Chart with geom_bar()

ggplot(data = diamonds,
       mapping = aes(x = cut)) + 
  geom_bar()

  • geom_bar() creates a bar chart.
    • We map either the x or y aesthetic to the variable.

Horizontal Bar Chart

ggplot(data = diamonds,
       mapping = aes(y = cut)) + 
  geom_bar()

  • Bar charts can be horizontal or vertical.
    • A horizontal bar chart is a good option for long category names.

Counting Occurrences of Each Category in a Categorical Variable

  • The figure below demonstrates how the counting process works with geom_bar().

Colorful Bar Chart with the fill Aesthetic

ggplot(data = diamonds,
       mapping = 
         aes(x = cut, 
             fill = cut)) + 
  geom_bar(
    show.legend = FALSE
    ) 

  • We can color a bar chart using the fill aesthetic.

count(): Counting Occurrences Across Two Categorical Variables

DATA.FRAME |> count(CATEGORICAL_VARIABLE_1, CATEGORICAL_VARIABLE_2)
  • The data transformation function count() calculates the frequency of each unique combination of values across two categorical variables.
diamonds |> count(cut, clarity)
  • diamonds |> count(cut, clarity) returns the data.frame with the three variables, cut, clarity, and n:
    • n: the number of occurrences of each unique combination of values in cut and clarity
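
A quick way to confirm what count() returns is to check that the counts add back up to the number of rows. This is a sketch; the name counts is illustrative.

```r
library(ggplot2)  # provides the diamonds data.frame
library(dplyr)

counts <- diamonds |> count(cut, clarity)
counts

# The counts add back up to the number of rows in `diamonds`
sum(counts$n) == nrow(diamonds)

# Summing n within each cut recovers the one-variable counts
counts |> count(cut, wt = n)
```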

Stacked Bar Chart with the fill Aesthetic

# Mapping the `fill` aesthetic 
# to another CATEGORICAL variable
# gives a stacked bar chart

ggplot(data = diamonds,
       mapping = 
         aes(x = cut, 
             fill = clarity)) + 
  geom_bar()

  • This describes how the distribution of clarity varies by cut, with total bar height for overall count and segments for each clarity level.

100% Stacked Bar Chart with the fill Aesthetic & the position="fill"

ggplot(data = diamonds,
       mapping = 
         aes(x = cut, 
             fill = clarity)) + 
  geom_bar(position = "fill") +
  labs(y = "Proportion")

  • This describes how the distribution of clarity varies by cut, displaying the proportion of each clarity within each cut.

Clustered Bar Charts with the fill Aesthetic & the position="dodge"

ggplot(data = diamonds,
       mapping = 
         aes(x = cut, 
             fill = clarity)) + 
  geom_bar(position = "dodge")

  • This shows how the distribution of clarity varies by cut, with separate bars for each clarity level within each cut category.

Choosing the Right Bar Chart for Comparing Components and Totals

  • Which type of bar chart is most effective for your data?

  • Which type of bar chart best meets your visualization goals?

Stacked Bar Chart

  • Best for showing the breakdown of subcomponents within a category alongside the overall total.
  • Useful when your primary focus is on the total bar height while also showing subcomponent contributions.
  • Tip: Be cautious using stacked bars if the goal is to make precise comparisons between subcomponents across different bars, as they don’t all start from the same baseline.

  • If you need to emphasize total values along with component contributions, use a Stacked Bar Chart

100% Stacked Bar Chart

  • Shows the proportion of subcomponents within each category as a percentage of the total, normalizing all bars to the same height.
  • Allows for easier comparison of subcomponent proportions between categories, as every bar has the same total height; note that only the bottom and top segments share a common baseline across bars.
  • Ideal Use: When you want to compare relative percentages rather than absolute totals.

  • For comparing relative proportions between categories, a 100% Stacked Bar Chart is ideal.

Clustered Bar Chart

  • Plots each subcomponent as a separate bar, grouped by category, allowing for precise comparisons of each component across categories.
  • Ideal Use: When your main goal is to compare individual subcomponents side by side rather than focusing on an overall total or relative percentage.

  • When directly comparing individual subcomponents across categories, opt for a Clustered Bar Chart.

Stacked Bar Charts using the fill Aesthetic and the position = "stack"

ggplot(data = diamonds,
       mapping = 
         aes(x = cut, 
             fill = clarity)) + 
  geom_bar(position = "stack")

  • The default position option is position = "stack"

Proportion Bar Chart with geom_bar()

ggplot(data = diamonds,
       mapping = 
         aes(x = cut, 
             y = after_stat(prop),
             group = 1)) + 
  geom_bar()

  • after_stat(prop): Calculates the proportion of the total count.
  • group = 1: Ensures the proportions are calculated over the entire data.frame, not within each group of cut
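
The proportions on the y-axis can be reproduced by hand, which may help clarify what group = 1 does. This is a sketch with dplyr; the name props is illustrative.

```r
library(ggplot2)  # provides the diamonds data.frame
library(dplyr)

# Share of all diamonds that fall in each cut
props <- diamonds |>
  count(cut) |>
  mutate(prop = n / sum(n))

props
sum(props$prop)  # shares computed over the whole data.frame add up to 1
```

Without group = 1, each cut is treated as its own group, so every bar's proportion is computed within that cut alone and comes out as 1.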

Bar Chart with geom_col()

df <- mpg |> 
  count(class)

ggplot(data = df,
       mapping = 
         aes(x = n, 
             y = class)) + 
  geom_col()

  • geom_col() creates bar charts where the height of bars directly represents values in a column in a given data.frame.
    • geom_col() requires both x- and y- aesthetics.

Sorted Bar Chart with fct_reorder(CATEGORICAL, NUMERICAL)

df <- mpg |> 
  count(class)

ggplot(data = df,
       mapping = 
         aes(x = n,
             y = 
               fct_reorder(class, n))
       ) + 
  geom_col() +
  labs(y = "Class")

  • fct_reorder(CATEGORICAL, NUMERICAL): Reorders the categories of the CATEGORICAL by the median of the NUMERICAL.
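
To see what fct_reorder() changes, the category order can be inspected before and after. The sketch below reuses the df from this slide; the name reordered is illustrative, and the .desc = TRUE line shows an optional argument not used in the slide.

```r
library(ggplot2)  # provides the mpg data.frame
library(dplyr)
library(forcats)

df <- mpg |>
  count(class)

# Default order: alphabetical
levels(factor(df$class))

# After fct_reorder(): ordered from the smallest n to the largest
reordered <- fct_reorder(df$class, df$n)
levels(reordered)

# .desc = TRUE reverses the order
levels(fct_reorder(df$class, df$n, .desc = TRUE))
```

Because each class has a single n in df, ordering by the median of n is the same as ordering by n itself.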

Pie Charts: Alternative to Bar Charts?

  • Pie charts show the proportions of a whole.
    • Each slice represents a part of the total.
    • Is a pie chart a proper alternative to a bar chart?

Pie Charts: Alternative to Bar Charts?

  • Humans are better at judging lengths than angles.
    • Occasionally, using a pie chart can be a good idea.
  1. Pie charts work well only if you have a few categories (four at most).

  2. Pie charts work well if the goal is to emphasize simple fractions (e.g., 25%, 50%, or 75%).

Pie Charts: Alternative to Bar Charts?

  1. Pie charts are not the best choice if you want audiences to compare the size of shares.

  2. Pie charts are not the best choice if you want audiences to compare the distribution across categories.

Distribution of an Integer Variable

  • For data visualization, integer-type variables could be treated as either categorical (discrete) or numeric (continuous), depending on the context of analysis.

  • If the values of an integer-type variable represent an intensity or an order, the integer variable could be numeric.

    • A variable of age integers (18, 19, 20, 21, …) could be numeric.
    • A variable of integer-valued MPG (27, 28, 29, 30, …) could be numeric.
  • If not, the integer variable is categorical.

    • A variable of month integers (1, 2, …, 12) could be categorical.

Distribution of an Integer Variable

Bar Chart for the Age variable

Histogram for the Age variable

  • In ggplot, the distribution of an integer variable can appear quite similar when using geom_bar() and geom_histogram().
    • In Python or others, they can be quite different, as shown above.
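
The comparison can be tried on any integer variable. The sketch below uses hwy from mpg, which is stored as an integer, instead of the Age variable from the figures; the object names are illustrative.

```r
library(ggplot2)

# Bar chart: one bar per distinct integer value
p_bar <- ggplot(data = mpg,
                mapping = aes(x = hwy)) +
  geom_bar()

# Histogram with one-unit bins: looks very similar to the bar chart
p_hist <- ggplot(data = mpg,
                 mapping = aes(x = hwy)) +
  geom_histogram(binwidth = 1)

# Treating the integers as categories: values that never occur get no
# slot on the axis, so gaps in the distribution disappear
p_cat <- ggplot(data = mpg,
                mapping = aes(x = factor(hwy))) +
  geom_bar()

p_bar
p_hist
p_cat
```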