Data Visualization with ggplot
November 5, 2025
ggplot - First Steps
The mpg data frame, provided by ggplot2, contains observations collected by the US Environmental Protection Agency on 38 models of car.
Q. Do cars with big engines use more fuel than cars with small engines?
displ: a car’s engine size, in liters.
hwy: a car’s fuel efficiency on the highway, in miles per gallon (mpg).
What does the relationship between engine size and fuel efficiency look like?
To plot mpg, put displ on the x-axis and hwy on the y-axis.
A ggplot graphic is a mapping of variables in data to aesthetic attributes of geometric objects.
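A minimal sketch of this scatterplot, assuming the ggplot2 package is installed. The object name `p_scatter` is only illustrative and does not come from the slides.

```r
library(ggplot2)

# Scatterplot of the built-in mpg data:
# engine size (displ) on the x-axis, highway fuel efficiency (hwy) on the y-axis
p_scatter <- ggplot(data = mpg,
                    mapping = aes(x = displ, y = hwy)) +
  geom_point()

p_scatter
```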
Three Essential Components in ggplot() Graphics:
data: data.frame containing the variables of interest.
geom_*(): geometric object in the plot (e.g., point, line, bar, histogram, boxplot).
aes(): aesthetic attributes of the geometric object (e.g., x-axis, y-axis, color, shape, size, fill) mapped to variables in the data.frame.
For the scatterplot of mpg, the three components in ggplot() are:
data = mpg
geom_point()
aes(x = displ, y = hwy)
geom_point() with geom_smooth()
geom_smooth() draws a smooth curve fitted to the data.
ggplot() workflow
A common problem when creating ggplot2 graphics is to put the + in the wrong place.
Put the + at the end of the previous line, NOT at the beginning of the next line.
geom_smooth()
Using regression, one of the machine learning methods, geom_smooth() visualizes the predicted value of the y variable for a given value of the x variable.
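A short sketch of layering geom_smooth() on top of geom_point(), written with each + at the end of a line. The object name `p_fit` is an illustrative assumption, and the smoother is left at its default settings.

```r
library(ggplot2)

# Each `+` sits at the END of a line, never at the start of the next one
p_fit <- ggplot(data = mpg,
                mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()  # default smoother, drawn with its grey ribbon

p_fit
```

If the + were moved to the start of the `geom_point()` line, R would treat the first line as a complete expression and the layers would not be added.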
What Does the Grey Ribbon Represent?
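By default, the grey ribbon is a 95% confidence interval: we can be 95% confident that the relationship between the x and y variables falls within the grey ribbon.
geom_point() with geom_smooth(method = lm)
method = "lm" specifies a linear model (lm), also called a linear regression model, as the smoothing method.
In the scatterplot of the mpg data.frame, many points overlap each other.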
When points overlap, it’s hard to know how many data points are at a particular location.
Overplotting can obscure patterns and outliers, leading to potentially misleading conclusions.
alpha
We can set the transparency level (alpha) between 0 (full transparency) and 1 (no transparency) manually.
alpha is set not within the aes() function but within the geom_*() function.
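A minimal sketch that combines a manual transparency level with a linear fit. The value 0.3 and the name `p_alpha` are arbitrary choices for illustration.

```r
library(ggplot2)

p_alpha <- ggplot(data = mpg,
                  mapping = aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.3) +    # alpha is set inside geom_point(), outside aes()
  geom_smooth(method = "lm")   # straight-line (linear regression) fit

p_alpha
```

With semi-transparent points, locations where many observations overlap appear darker.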
In the plot above, one group of points (highlighted in red) seems to fall outside of the linear trend.
An aesthetic is a visual property (e.g., size, shape, color) of the objects (e.g., points) in your plot.
You can display a point in different ways by changing the values of its aesthetic properties.
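As a sketch of mapping an aesthetic to a variable, the example below colors each point by the class variable of mpg. The choice of class and the name `p_color` are assumptions made for illustration.

```r
library(ggplot2)

# Mapping color to a variable happens INSIDE aes()
p_color <- ggplot(data = mpg,
                  mapping = aes(x = displ, y = hwy, color = class)) +
  geom_point()

p_color
```

Replacing `color = class` with `shape = class` or `size = class` maps the same variable to a different aesthetic.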
Adding color to the Plot
Adding shape to the Plot
Adding size to the Plot
Adding color to the Plot, Manually
Adding fill to the Plot, Manually
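Each geom_*() has a different set of aesthetic parameters.
E.g., fill colors the ribbon of geom_smooth(), but has no effect on the default point shape of geom_point().
Adding size to the Plot, Manually
Adding alpha to the Plot, Manually
To set an aesthetic manually, pass it as an argument to geom_*() outside of aes().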
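A minimal sketch of setting aesthetics manually. The specific colors, size, and alpha values, as well as the name `p_manual`, are arbitrary illustrative choices.

```r
library(ggplot2)

p_manual <- ggplot(data = mpg,
                   mapping = aes(x = displ, y = hwy)) +
  # Fixed values go in the geom function, outside aes()
  geom_point(color = "blue", size = 3, alpha = 0.5) +
  # fill sets the color of the ribbon around the fitted curve
  geom_smooth(color = "darkorange", fill = "darkorange")

p_manual
```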

\(\color{blue}{\log_{10}(x)}\) The base-\(10\) logarithm is called the common logarithm.
\(\log_{10}(100)\)
The base-10 logarithm of 100 is 2, because
\(10^{2} = 100\).
\(\log_{e}(x)\)
The base-\(e\) logarithm is called the natural logarithm,
where \(e = 2.718\ldots\) is Euler’s number.
\(\color{blue}{\log(x)}\) or \(\ln(x)\)
Both denote the natural log of \(x\).
\(\log_{e}(7.389\ldots)\)
The natural log of 7.389⋯ is 2, because
\(e^{2} = 7.389\ldots\).
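These values can be checked in base R, where `log10()` is the common logarithm and `log()` is the natural logarithm. The variable names are illustrative.

```r
# Common (base-10) logarithm
common_log <- log10(100)

# Natural (base-e) logarithm; log() in R is the natural log
e_squared   <- exp(2)
natural_log <- log(e_squared)

print(common_log)
print(e_squared)
print(natural_log)
```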


For a small change in a variable \(x\) from \(x_{0}\) to \(x_{1}\): \[ \begin{aligned} \Delta \log(x) &= \log(x_{1}) - \log(x_{0}) \\ &\approx \frac{x_{1} - x_{0}}{x_{0}} \\ &= \frac{\Delta x}{x_{0}} \end{aligned} \]
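A quick numerical check of this approximation in base R, using the made-up values \(x_{0} = 100\) and \(x_{1} = 101\) (a 1% change). The variable names are illustrative.

```r
x0 <- 100
x1 <- 101

delta_log    <- log(x1) - log(x0)   # change in the natural log
prop_change  <- (x1 - x0) / x0      # proportional change, Delta x / x0

print(delta_log)
print(prop_change)
print(abs(delta_log - prop_change)) # small for a small change in x
```

The two quantities are close here because the change is small; the gap widens as \(x_{1}\) moves further from \(x_{0}\).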


Try it out → Classwork 11: Relationship Plots.
Adding too many aesthetics (color, shape, size) can sometimes make a plot cluttered and hard to interpret.
facet_wrap(~ VAR)
To split a plot into subplots by a single variable, use facet_wrap().
facet_wrap(~ VAR) with nrow
nrow determines the number of rows to use when laying out the facets.
facet_wrap(~ VAR) with ncol
ncol determines the number of columns to use when laying out the facets.
facet_wrap(~ VAR) with scales = "free_x"
scales = "free_x" allows for different x-axis scales across facets.
facet_wrap(~ VAR) with scales = "free_y"
scales = "free_y" allows for different y-axis scales across facets.
facet_wrap(~ VAR) with scales = "free"
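A minimal faceting sketch on the mpg data, assuming class as the faceting variable. It combines nrow with scales = "free"; the name `p_facet` is illustrative.

```r
library(ggplot2)

p_facet <- ggplot(data = mpg,
                  mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  # One panel per class, laid out in 2 rows, each with its own x and y scales
  facet_wrap(~ class, nrow = 2, scales = "free")

p_facet
```

Swapping `scales = "free"` for `"free_x"` or `"free_y"` frees only one axis, and `ncol` can be used in place of `nrow`.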
scales = "free" allows for different scales on both the x-axis and the y-axis across facets.
The nvda data.frame includes NVIDIA’s stock information from 2019-01-02 to 2025-10-30.
geom_line()
geom_line() draws a line by connecting data points in order of the variable on the x-axis.
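The nvda data.frame is not included in this text, so the sketch below uses ggplot2's built-in economics data as a stand-in time series. That substitution and the name `p_line` are assumptions made only for illustration.

```r
library(ggplot2)

# economics is a built-in monthly US time series; date goes on the x-axis
p_line <- ggplot(data = economics,
                 mapping = aes(x = date, y = unemploy)) +
  geom_line()   # connects observations in order of date

p_line
```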
geom_line() and geom_smooth()
geom_smooth() helps reveal underlying long-term trends by smoothing out variability in the observations.
The tech_october data.frame includes stock information about AAPL, MSFT, META, and NVDA in October 2025.

We can use the group, color, or linetype aesthetic to tell ggplot about the firm-level grouping structure in the dataset.
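Because tech_october is not available here, this sketch uses ggplot2's built-in economics_long data, where the variable column plays the role of the firm identifier. The stand-in dataset and the name `p_group` are assumptions.

```r
library(ggplot2)

# economics_long stacks several series; `variable` identifies each one
p_group <- ggplot(data = economics_long,
                  mapping = aes(x = date, y = value01, color = variable)) +
  geom_line()   # one line per series, because color defines the groups

p_group
```

Using `group = variable` instead draws separate lines without distinguishing them by color, and `linetype = variable` distinguishes them by line pattern.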
Try it out → Classwork 13: Time Trend Plots.
ggplot() - Histograms
Histograms are used to visualize the distribution of a numeric variable.
Histograms divide data into bins and count the number of observations in each bin.
geom_histogram()
geom_histogram() creates a histogram.
Map the x aesthetic to a variable.
geom_histogram() with bins
bins: Specifies the number of bins.
geom_histogram() with binwidth
binwidth: Specifies the width of each bin.
Use either the bins option or the binwidth option.
color and fill Aesthetics
fill: Fills the bars with a specific color.
color: Adds an outline of a specific color to the bars.
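A minimal histogram sketch on the hwy variable of mpg. The bin width of 2 and the two colors are arbitrary illustrative choices, as is the name `p_hist`.

```r
library(ggplot2)

p_hist <- ggplot(data = mpg,
                 mapping = aes(x = hwy)) +
  geom_histogram(binwidth = 2,        # width of each bin (alternatively, bins = 20)
                 fill = "steelblue",  # color inside the bars
                 color = "white")     # outline of the bars

p_hist
```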
Types of Colorblindness
Use cues other than color, e.g., add shape to scatterplots or linetype to line charts.
ggthemes package
color aesthetic mapping: scale_color_colorblind() or scale_color_tableau()
fill aesthetic mapping: scale_fill_colorblind() or scale_fill_tableau()
Themes (e.g., theme_economist(), theme_wsj())
ggthemes::scale_color_colorblind()
After mapping a variable to color in aes(), we can use scale_color_*() to choose the palette.
ggthemes::scale_color_tableau()
scale_color_tableau() provides color palettes used in Tableau.
ggthemes::theme_economist()
theme_economist() approximates the style of The Economist.
ggthemes::theme_wsj()
theme_wsj() approximates the style of The Wall Street Journal.
ggplot() - Boxplots
geom_boxplot()
geom_boxplot() creates a boxplot; map variables to the x and y aesthetics.
fill Aesthetic

    # 1. `show.legend = FALSE` turns off the legend information
    # 2. `scale_fill_colorblind()` or `scale_fill_tableau()` applies
    #    a color-blind friendly palette to the `fill` aesthetic
    # install.packages("ggthemes")
    library(ggplot2)
    library(ggthemes)
    ggplot(data = mpg,
           mapping = aes(x = hwy,
                         y = class,
                         fill = class)) +
      geom_boxplot(show.legend = FALSE) +
      scale_fill_tableau()

fill: Maps a variable to the fill colors used in the boxplot.
ggthemes::scale_fill_tableau(): A colorblind-friendly Tableau-style palette for the fill aesthetic.
fct_reorder(CATEGORICAL, NUMERICAL)
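A sketch of ordering the boxplots with fct_reorder(), assuming the forcats package is installed. By default fct_reorder() orders the categories by the median of the numeric variable; the names `class_by_hwy` and `p_box` are illustrative.

```r
library(ggplot2)
library(forcats)

# Reordered factor: same categories, ordered by median hwy
class_by_hwy <- fct_reorder(mpg$class, mpg$hwy)
print(levels(class_by_hwy))

# The same reordering applied directly inside aes()
p_box <- ggplot(data = mpg,
                mapping = aes(x = hwy, y = fct_reorder(class, hwy))) +
  geom_boxplot() +
  labs(y = "class")

p_box
```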
fct_reorder(CATEGORICAL, NUMERICAL): Reorders the categories of the CATEGORICAL by the median of the NUMERICAL.
ggplot() - Bar Charts
Bar charts are used to visualize the distribution of a categorical variable.
Bar charts display the count (or proportion) of observations for each category.
ggplot2::diamonds is a data.frame containing the prices and other attributes of almost 54,000 diamonds.
geom_bar()
geom_bar() creates a bar chart.
Map the x or y aesthetic to the variable.
count(): Counting Occurrences of Each Category in a Categorical Variable
count() returns the per-category counts that are displayed by geom_bar().
fill Aesthetic
We can color the bars using the fill aesthetic.
count(): Counting Occurrences Across Two Categorical Variables
count() calculates the frequency of each unique combination of values across two categorical variables.
diamonds |> count(cut, clarity) returns a data.frame with three variables, cut, clarity, and n:
n: the number of occurrences of each unique combination of values in cut and clarity
fill Aesthetic
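A minimal sketch that pairs the two-variable count with the corresponding stacked bar chart, assuming dplyr is installed. The names `cut_clarity_counts` and `p_stack` are illustrative.

```r
library(ggplot2)
library(dplyr)

# Frequency of each (cut, clarity) combination
cut_clarity_counts <- diamonds |> count(cut, clarity)
print(head(cut_clarity_counts))

# geom_bar() computes the same counts and stacks them by clarity
p_stack <- ggplot(data = diamonds,
                  mapping = aes(x = cut, fill = clarity)) +
  geom_bar()

p_stack
```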
The stacked bar chart shows how clarity varies by cut, with the total bar height for the overall count and segments for each clarity level.
fill Aesthetic & the position = "fill"
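A sketch of the position argument of geom_bar(). The position = "dodge" variant is included next to position = "fill" for comparison; the object names are illustrative.

```r
library(ggplot2)

base_plot <- ggplot(data = diamonds,
                    mapping = aes(x = cut, fill = clarity))

# Every bar has height 1; segments show proportions within each cut
p_fill  <- base_plot + geom_bar(position = "fill")

# One bar per clarity level, placed side by side within each cut
p_dodge <- base_plot + geom_bar(position = "dodge")

p_fill
p_dodge
```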
With position = "fill", the chart shows how clarity varies by cut, displaying the proportion of each clarity within each cut.
fill Aesthetic & the position = "dodge"
With position = "dodge", the chart shows how clarity varies by cut, with separate bars for each clarity level within each cut category.
fill Aesthetic and the position = "stack"
The default position option is position = "stack".
geom_bar()
after_stat(prop): Calculates the proportion of the total count.
group = 1: Ensures the proportions are calculated over the entire data.frame, not within each group of cut.
geom_col()
geom_col() creates bar charts where the height of bars directly represents values in a column in a given data.frame.
geom_col() requires both the x and y aesthetics.
fct_reorder(CATEGORICAL, NUMERICAL)
fct_reorder(CATEGORICAL, NUMERICAL): Reorders the categories of the CATEGORICAL by the median of the NUMERICAL.
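A minimal geom_col() sketch on pre-computed counts, with the bars ordered by fct_reorder(). It assumes dplyr and forcats are installed; `cut_counts` and `p_col` are illustrative names.

```r
library(ggplot2)
library(dplyr)
library(forcats)

# One row per cut, with its count in column n
cut_counts <- diamonds |> count(cut)

# geom_col() uses n directly as the bar length; bars are ordered by n
p_col <- ggplot(data = cut_counts,
                mapping = aes(x = n, y = fct_reorder(cut, n))) +
  geom_col() +
  labs(y = "cut")

p_col
```

Since each category has a single value of n here, ordering by the median of n is the same as ordering by n itself.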


Which type of bar chart is most effective for your data?
Which type of bar chart best meets your visualization goals?
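One way to weigh these questions is to draw the same variable both ways. The sketch below compares a count bar chart with a proportion bar chart built with after_stat(prop) and group = 1; the object names are illustrative.

```r
library(ggplot2)

# Counts of diamonds in each cut
p_count <- ggplot(data = diamonds,
                  mapping = aes(x = cut)) +
  geom_bar()

# Proportions of the total; group = 1 treats the whole data.frame as one group
p_prop <- ggplot(data = diamonds,
                 mapping = aes(x = cut, y = after_stat(prop), group = 1)) +
  geom_bar()

p_count
p_prop
```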






Humans are generally better at judging lengths than angles.
When the integer is simply a label for a category, and the numeric order does not represent meaningful numeric differences.
Examples:
Age variable
In ggplot, the distribution of an integer variable can look quite similar whether using geom_histogram() or geom_bar().
As shown above, in Python and other tools, these visualizations can behave differently, leading to noticeably different outputs.
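For the ggplot case, a minimal sketch using the integer variable cyl from mpg as a stand-in (the Age data from the slides is not included in this text). The bin width of 1 and the object names are assumptions.

```r
library(ggplot2)

# Histogram with one bin per integer value
p_int_hist <- ggplot(data = mpg, mapping = aes(x = cyl)) +
  geom_histogram(binwidth = 1)

# Bar chart: one bar per distinct value
p_int_bar <- ggplot(data = mpg, mapping = aes(x = cyl)) +
  geom_bar()

p_int_hist
p_int_bar
```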