Data Visualization with ggplot
January 26, 2026
ggplot - First Steps
The mpg data frame, provided by ggplot2, contains observations collected by the US Environmental Protection Agency on 38 models of car.
Q. Do cars with big engines use more fuel than cars with small engines?
displ: a car’s engine size, in liters.
hwy: a car’s fuel efficiency on the highway, in miles per gallon (mpg).
What does the relationship between engine size and fuel efficiency look like?
To plot mpg, put displ on the x-axis and hwy on the y-axis.
A ggplot graphic is a mapping of variables in data to aesthetic attributes of geometric objects.
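The scatterplot described here can be sketched as follows. This is a minimal example, assuming the ggplot2 package is installed; the object name `p_first` is only illustrative.

```r
library(ggplot2)

# Scatterplot: engine size (displ) on the x-axis, highway mpg (hwy) on the y-axis
p_first <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

p_first
```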
Three Essential Components in ggplot() Graphics:
data: data.frame containing the variables of interest.
geom_*(): geometric object in the plot (e.g., point, line, bar, histogram, boxplot).
aes(): aesthetic attributes of the geometric object (e.g., x-axis, y-axis, color, shape, size, fill) mapped to variables in the data.frame.
In the scatterplot of mpg, the three components are:
data = mpg
geom_point()
aes(x = displ, y = hwy)
geom_point() with geom_smooth()
geom_smooth() draws a smooth curve fitted to the data.
ggplot() workflow
A common problem when creating ggplot2 graphics is to put the + in the wrong place.
Put the + at the end of the previous line, NOT at the beginning of the next line.
geom_smooth()
Using regression, one of the machine learning methods, geom_smooth() visualizes the predicted value of the y variable for a given value of the x variable.
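A minimal sketch of layering geom_smooth() on top of geom_point(), with each + placed at the end of a line. The object name `p_smooth` is an illustrative assumption; the smoothing method is left at the ggplot2 default.

```r
library(ggplot2)

p_smooth <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +  # `+` ends the line
  geom_point() +                                                      # `+` ends the line
  geom_smooth()  # smooth curve with the grey ribbon drawn by default

p_smooth
```

Starting a line with + instead (for example, `+ geom_point()` on its own line) makes R treat the previous line as a complete expression, so the layer is not added to the plot.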
What Does the Grey Ribbon Represent?
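By default, the grey ribbon is the 95% confidence interval around the fitted curve: we can be reasonably confident that the true relationship between the x and y variables falls within the grey ribbon.
geom_point() with geom_smooth(method = lm)
method = "lm" specifies that a linear model (lm), also called a linear regression model, is fitted.
In the mpg data.frame, many points overlap each other.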
When points overlap, it’s hard to know how many data points are at a particular location.
Overplotting can obscure patterns and outliers, leading to potentially misleading conclusions.
alpha
We can set the transparency level (alpha) between 0 (full transparency) and 1 (no transparency) manually.
To set alpha manually, specify it not within the aes() function but within the geom_*() function.
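As a sketch of how transparency can reduce overplotting, the example below sets alpha inside geom_point() rather than inside aes(). The value 0.3 and the object name `p_alpha` are illustrative choices, not values prescribed by the slides.

```r
library(ggplot2)

p_alpha <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.3) +     # outside aes(): the same transparency for every point
  geom_smooth(method = "lm")    # straight-line (linear regression) fit

p_alpha
```

Locations where many points overlap appear darker, which makes the density of observations easier to see.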
In the plot above, one group of points (highlighted in red) seems to fall outside of the linear trend.
An aesthetic is a visual property (e.g., size, shape, color) of the objects (e.g., points) in your plot.
You can display a point in different ways by changing the values of its aesthetic properties.
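For instance, mapping the class variable of mpg to the color aesthetic gives each car class its own point color. This is a minimal sketch; `p_class` is an illustrative name.

```r
library(ggplot2)

# Inside aes(): color varies with the value of `class`, and a legend is added
p_class <- ggplot(data = mpg,
                  mapping = aes(x = displ, y = hwy, color = class)) +
  geom_point()

p_class
```

Replacing `color = class` with `shape = class` or `size = class` maps the same variable to a different aesthetic.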
Adding color to the Plot
Adding shape to the Plot
Adding size to the Plot
Adding color to the Plot, Manually
Adding fill to the Plot, Manually
Each geom_*() has a different set of aesthetic parameters.
fill is available for geom_smooth(); for geom_point(), it has an effect only for the fillable shapes (21–25).
Adding size to the Plot, Manually
Adding shape to the Plot, Manually
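Setting an aesthetic manually means giving it a fixed value outside aes(). The sketch below uses arbitrary illustrative values ("blue", "orange", size 3, shapes 17 and 21); only the pattern matters.

```r
library(ggplot2)

# Fixed values outside aes(): every point gets the same color, size, and shape
p_manual <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(color = "blue", size = 3, shape = 17)

# Shape 21 is a fillable shape: color draws the border, fill colors the inside
p_filled <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(shape = 21, color = "black", fill = "orange", size = 3)

p_manual
p_filled
```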
Point shapes are identified by the numbers 0–25.
How color and fill interact with shapes
Some shapes look similar (e.g., 0, 15, and 22 are all squares), but they behave differently depending on whether the shape supports color and/or fill.
Hollow shapes (0–14)
color for the outline
fill is ignored
Solid shapes (15–20)
color for the entire point
Filled shapes (21–25)
color for the border
fill for the inside
Adding alpha to the Plot, Manually
Set it in geom_*() outside of aes()
Global and local data.frames and aes() in ggplot
data.frame in ggplot Layers
Data can be set globally in ggplot() (they apply to all layers), or locally in a geom_*() layer.
ggplot(data = mpg, ...) sets the global data to the full mpg data.frame.
geom_point(...) uses the full dataset (mpg).
geom_smooth(data = mpg |> filter(class == "subcompact"), ...) uses local data, so the smooth line is fit only to subcompact cars.
data = ... inside a geom_*() layer overrides the global data = ...
aes() in ggplot Layers
Aesthetic mappings can be set globally in ggplot() (they apply to all layers), or locally in a geom_*() layer.
ggplot(mapping = aes(...)) sets the global aesthetic mapping to all layers.
geom_point(aes(...), alpha = .3) adds local aesthetics that apply only to the points.
geom_smooth(se = F) adds a local setting (se = F) that applies only to the smooth line.
aes(...) inside a geom_*() layer adds to or overrides the global aes(...)
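The global and local settings described above can be combined in one plot. This is a sketch assembled from the fragments quoted in the text; the choice of `color = class` as the local point aesthetic is an assumption, and dplyr is assumed to be installed for filter().

```r
library(ggplot2)
library(dplyr)

p_layers <- ggplot(data = mpg,                                # global data
                   mapping = aes(x = displ, y = hwy)) +       # global aesthetics
  geom_point(aes(color = class), alpha = .3) +                # local aesthetic for points only
  geom_smooth(data = mpg |> filter(class == "subcompact"),    # local data for this layer only
              se = FALSE)                                     # no confidence ribbon

p_layers
```

The points show all cars, while the smooth line is fitted only to the subcompact cars.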

\(\color{blue}{\log_{10}(x)}\) The base-\(10\) logarithm is called the common logarithm.
\(\log_{10}(100)\)
The base-10 logarithm of 100 is 2, because
\(10^{2} = 100\).
\(\log_{e}(x)\)
The base-\(e\) logarithm is called the natural logarithm,
where \(e = 2.718\ldots\) is Euler’s number.
\(\color{blue}{\log(x)}\) or \(\ln(x)\)
Both denote the natural log of \(x\).
\(\log_{e}(7.389\ldots)\)
The natural log of 7.389⋯ is 2, because
\(e^{2} = 7.389\ldots\).


For a small change in a variable \(x\) from \(x_{0}\) to \(x_{1}\): \[ \Delta \log(x) = \log(x_{1}) - \log(x_{0}) \approx \frac{x_{1} - x_{0}}{x_{0}} = \frac{\Delta x}{x_{0}} \]
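The logarithm facts and the small-change approximation can be checked numerically in R. The values 100 and 102 are arbitrary illustrative numbers chosen to represent a 2% change.

```r
# Common (base-10) and natural (base-e) logarithms
common_log  <- log10(100)    # 2, because 10^2 = 100
natural_log <- log(exp(2))   # log() in R is the natural log

# A small change in x: from x0 = 100 to x1 = 102
x0 <- 100
x1 <- 102
delta_log  <- log(x1) - log(x0)   # difference in logs, about 0.0198
pct_change <- (x1 - x0) / x0      # proportional change, 0.02

c(delta_log = delta_log, pct_change = pct_change)
```

The two numbers are close for small changes and drift apart as the change gets larger.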


Try it out → Classwork 2: Relationship Plots.
Adding too many aesthetics (e.g., color, shape, size) can sometimes make a plot cluttered and hard to interpret.
facet_wrap(~ VAR)
To facet a plot by a single variable, use facet_wrap().
facet_wrap(~ VAR) with nrow
nrow determines the number of rows to use when laying out the facets.
facet_wrap(~ VAR) with ncol
ncol determines the number of columns to use when laying out the facets.
facet_wrap(~ VAR) with scales = "free_x"
scales = "free_x" allows for different x-axis scales across facets.
facet_wrap(~ VAR) with scales = "free_y"
scales = "free_y" allows for different y-axis scales across facets.
facet_wrap(~ VAR) with scales = "free"
scales = "free" allows for different scales on both the x-axis and the y-axis.
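A minimal facet_wrap() sketch, assuming mpg is faceted by class; the choices `nrow = 2` and `scales = "free_y"` are only illustrative.

```r
library(ggplot2)

p_facet <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class,             # one panel per car class
             nrow = 2,            # lay the panels out in two rows
             scales = "free_y")   # each panel gets its own y-axis range

p_facet
```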
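facet_grid(ROW ~ COL)
To facet a plot by the combination of two variables, use facet_grid(ROW ~ COL).
The scales option also works with facet_grid(); nrow and ncol do not apply, because the ROW and COL variables determine the layout.
The nvda data.frame includes NVIDIA’s stock information from 2019-01-02 to 2025-10-30.
geom_line()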
geom_line() draws a line by connecting data points in order of the variable on the x-axis.
geom_line() and geom_smooth()
geom_smooth() helps reveal underlying long-term trends by smoothing out variability in the observations.
The tech_october data.frame includes stock information about AAPL, MSFT, META, and NVDA in October 2025.

We can use the group, color, or linetype aesthetic to tell ggplot about the firm-level grouping structure in the dataset.
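The nvda and tech_october data.frames are course data and are not reproduced here, so the sketch below substitutes two datasets that ship with ggplot2 as stand-ins: economics (a single series) and economics_long (several series stacked in one data.frame, identified by `variable`). The same pattern would apply to stock data with a date column and a firm identifier.

```r
library(ggplot2)

# One series: connect observations in date order, then overlay a smoothed trend
p_trend <- ggplot(data = economics, mapping = aes(x = date, y = unemploy)) +
  geom_line() +
  geom_smooth()

# Several series: the color aesthetic tells ggplot which rows belong to which line
p_groups <- ggplot(data = economics_long,
                   mapping = aes(x = date, y = value01, color = variable)) +
  geom_line()

p_trend
p_groups
```

Without a grouping aesthetic, geom_line() would connect all rows into one zigzag line instead of one line per series.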
Try it out → Classwork 4: Time Trend Plots.
ggplot() - Histograms
Histograms are used to visualize the distribution of a numeric variable.
Histograms divide data into bins and count the number of observations in each bin.
geom_histogram()
geom_histogram() creates a histogram.
Map the x aesthetic to a variable.
geom_histogram() with bins
bins: Specifies the number of bins
geom_histogram() with binwidth
binwidth: Specifies the width of each bin
Use either the bins option or the binwidth option.
color and fill Aesthetics
fill: Fills the bars with a specific color.
color: Adds an outline of a specific color to the bars.
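A short sketch of the histogram options above, assuming the hwy variable of mpg as the numeric variable; the bin settings and colors are arbitrary illustrative values.

```r
library(ggplot2)

# Control the number of bins
p_bins <- ggplot(data = mpg, mapping = aes(x = hwy)) +
  geom_histogram(bins = 20, fill = "steelblue", color = "white")

# Or control the width of each bin (here, 2 mpg per bin)
p_width <- ggplot(data = mpg, mapping = aes(x = hwy)) +
  geom_histogram(binwidth = 2, fill = "steelblue", color = "white")

p_bins
p_width
```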
Types of Colorblindness
Add shape to scatterplots or linetype to line charts, so that groups can be told apart without relying on color alone.
ggthemes package
For a color aesthetic mapping: scale_color_colorblind() or scale_color_tableau()
For a fill aesthetic mapping: scale_fill_colorblind() or scale_fill_tableau()
Themes (e.g., theme_economist(), theme_wsj())
ggthemes::scale_color_colorblind()
When a variable is mapped to color in aes(), we can use scale_color_*().
ggthemes::scale_color_tableau()
scale_color_tableau() provides color palettes used in Tableau.
ggthemes::theme_economist()
theme_economist() approximates the style of The Economist.
ggthemes::theme_wsj()
theme_wsj() approximates the style of The Wall Street Journal.
ggplot() - geom_density()
Kernel density plots visualize the distribution of a numeric variable, like a smoothed histogram.
Instead of counting observations in bins, geom_density() estimates a smooth curve of the distribution.
The y-axis is density (the curve area integrates to 1).
geom_density()
geom_density() creates a kernel density curve.
Map the x aesthetic to a numeric variable.
color and fill Aesthetics
fill: fills the area under the curve
color: controls the outline color of the curve
linewidth controls outline thickness
Mapping color (Outline) to a Group Variable
color = gender draws multiple curves, one per group.
Mapping fill to a Group Variable (Overlapping Distributions)
Use transparency (alpha) so you can see both.
When plotting several densities, position controls how the groups are placed:
position = "identity" (default): curves overlap in the same coordinate system
position = "stack": stack densities on top of each other
position = "fill": stack and normalize so the total height equals 1 at each x
For density plots, position affects how overlapping group distributions are visually combined.
position = "identity" (Best for Comparison)
Use alpha to reduce clutter.
position = "stack" (Composition Emphasis)
position = "fill" (Relative Composition Across x)
geom_density() with adjust
Control the smoothness with adjust (multiplies the default bandwidth)
adjust < 1 makes the curve less smooth
adjust > 1 makes the curve more smooth
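The density options above can be sketched with mpg as a stand-in dataset. The slides use a gender variable from data not shown here, so this example assumes drv (drive type) as the group variable instead; the alpha and adjust values are illustrative.

```r
library(ggplot2)

# Overlapping group densities: fill by group, with transparency to see all curves
p_density <- ggplot(data = mpg, mapping = aes(x = hwy, fill = drv)) +
  geom_density(alpha = 0.4, position = "identity")

# Same plot with a smaller bandwidth: adjust < 1 gives a wigglier curve
p_wiggly <- ggplot(data = mpg, mapping = aes(x = hwy, fill = drv)) +
  geom_density(alpha = 0.4, adjust = 0.5)

# Stacked and normalized: shares of each group at each value of x
p_share <- ggplot(data = mpg, mapping = aes(x = hwy, fill = drv)) +
  geom_density(position = "fill")

p_density
p_wiggly
p_share
```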

Visualization tip:
To visualize several distributions at once, kernel density plots will generally work better than histograms (because multiple histograms often overlap and get cluttered).
Sometimes it helps to overlay a density curve on top of a histogram.
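A minimal sketch of such an overlay, assuming the hwy variable of mpg; the bin count and colors are illustrative.

```r
library(ggplot2)

p_overlay <- ggplot(data = mpg, mapping = aes(x = hwy)) +
  geom_histogram(aes(y = after_stat(density)),   # bar heights as densities, not counts
                 bins = 20, fill = "grey80", color = "white") +
  geom_density(linewidth = 1)                    # density curve drawn on the same scale

p_overlay
```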

aes(y = after_stat(density)) rescales the histogram so the density curve matches the histogram scale.
ggplot() - Boxplots
geom_boxplot()
geom_boxplot() creates a boxplot; map variables to the x and y aesthetics.
fill Aesthetic
# 1. `show.legend = FALSE` turns off
#    the legend information
# 2. `scale_fill_colorblind()` or
#    `scale_fill_tableau()`
#    applies a color-blind friendly
#    palette to the `fill` aesthetic
# install.packages("ggthemes")
library(ggplot2)
library(ggthemes)
ggplot(data = mpg,
       mapping =
         aes(x = hwy,
             y = class,
             fill = class)) +
  geom_boxplot(
    show.legend = FALSE) +
  scale_fill_tableau()
fill: Maps a variable to the fill colors used in the boxplot.
ggthemes::scale_fill_tableau(): A colorblind-friendly Tableau-style palette for the fill aesthetic.
fct_reorder(CATEGORICAL, NUMERICAL)
fct_reorder(CATEGORICAL, NUMERICAL): Reorders the categories of the CATEGORICAL by the median of the NUMERICAL.
ggplot() - Bar Charts
Bar charts are used to visualize the distribution of a categorical variable.
Bar charts display the count (or proportion) of observations for each category.
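Both versions, counts and proportions, can be sketched with the diamonds data.frame that ships with ggplot2, using cut as the categorical variable. The object names are illustrative.

```r
library(ggplot2)

# Counts: geom_bar() counts the rows in each category of `cut`
p_count <- ggplot(data = diamonds, mapping = aes(x = cut)) +
  geom_bar()

# Proportions: after_stat(prop) with group = 1 divides by the overall total
p_prop <- ggplot(data = diamonds,
                 mapping = aes(x = cut, y = after_stat(prop), group = 1)) +
  geom_bar()

p_count
p_prop
```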
ggplot2::diamonds is a data.frame containing the prices and other attributes of almost 54,000 diamonds.
geom_bar()
geom_bar() creates a bar chart.
Map the x or y aesthetic to the variable.
count(): Counting Occurrences of Each Category in a Categorical Variable
count() computes the same per-category counts that are displayed by geom_bar().
fill Aesthetic
Map a variable to the fill aesthetic.
count(): Counting Occurrences Across Two Categorical Variables
count() calculates the frequency of each unique combination of values across two categorical variables.
diamonds |> count(cut, clarity) returns the data.frame with the three variables, cut, clarity, and n:
n: the number of occurrences of each unique combination of values in cut and clarity
fill Aesthetic
The stacked bars show how clarity varies by cut, with total bar height for overall count and segments for each clarity level.
fill Aesthetic & position="fill"
The bars show how clarity varies by cut, displaying the proportion of each clarity within each cut.
fill Aesthetic & position="dodge"
The bars show how clarity varies by cut, with separate bars for each clarity level within each cut category.
fill Aesthetic & position_dodge2(preserve = "single")
position_dodge2() is useful when some groups have missing categories and you want spacing between bars.
fill Aesthetic and position = "stack"
The default position option is position = "stack"
Proportions with geom_bar()
after_stat(prop): Calculates the proportion of the total count.
group = 1: Ensures the proportions are calculated over the entire data.frame, not within each group of cut
geom_col()
geom_col() creates bar charts where the height of bars directly represents values in a column in a given data.frame.
geom_col() requires both x- and y- aesthetics.
fct_reorder(CATEGORICAL, NUMERICAL)
fct_reorder(CATEGORICAL, NUMERICAL): Reorders the categories of the CATEGORICAL by the median of the NUMERICAL.
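A sketch that combines count(), geom_col(), and fct_reorder(), assuming the dplyr and forcats packages are installed. The use of clarity as the categorical variable and the name `clarity_counts` are illustrative choices.

```r
library(ggplot2)
library(dplyr)
library(forcats)

# Pre-compute the counts: one row per clarity level, with the count in column n
clarity_counts <- diamonds |> count(clarity)

# geom_col() draws bar lengths directly from n;
# fct_reorder() sorts the clarity levels by n so the bars appear in order
p_col <- ggplot(data = clarity_counts,
                mapping = aes(x = n, y = fct_reorder(clarity, n))) +
  geom_col()

p_col
```

Because each clarity level has a single n value here, ordering by the median of n is the same as ordering by n itself.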


Which type of bar chart is most effective for your data?
Which type of bar chart best meets your visualization goals?






Humans are generally better at judging lengths than angles.
When the integer is simply a label for a category, and the numeric order does not represent meaningful numeric differences.
Examples:
Age variable
In ggplot, the distribution of an integer variable can look quite similar whether using geom_histogram() or geom_bar().
As shown above, in Python and other tools, these visualizations can behave differently, leading to noticeably different outputs.