Data Visualization with ggplot
November 5, 2025
ggplot - First Steps
The mpg data frame, provided by ggplot2, contains observations collected by the US Environmental Protection Agency on 38 models of car.
Q. Do cars with big engines use more fuel than cars with small engines?
displ: a car’s engine size, in liters.
hwy: a car’s fuel efficiency on the highway, in miles per gallon (mpg).
What does the relationship between engine size and fuel efficiency look like?
To plot mpg, put displ on the x-axis and hwy on the y-axis.
A ggplot graphic is a mapping of variables in data to aesthetic attributes of geometric objects.
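A minimal sketch of that plot, assuming the ggplot2 package is installed (the object name p is only for illustration):

```r
library(ggplot2)

# Scatterplot of engine size (displ) against highway fuel efficiency (hwy)
p <- ggplot(data = mpg,
            mapping = aes(x = displ, y = hwy)) +
  geom_point()

# Typing `p` at the console (or calling print(p)) draws the plot.
```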
Three Essential Components in ggplot() Graphics:
data: data.frame containing the variables of interest.
geom_*(): geometric object in the plot (e.g., point, line, bar, histogram, boxplot).
aes(): aesthetic attributes of the geometric object (e.g., x-axis, y-axis, color, shape, size, fill) mapped to variables in the data.frame.
For the mpg scatterplot, the three components in ggplot() are:
data = mpg
geom_point()
aes(x = displ, y = hwy)
geom_point()
geom_smooth()
geom_point() with geom_smooth()
geom_smooth() draws a smooth curve fitted to the data.
ggplot() workflow
A common problem when creating ggplot2 graphics is to put the + in the wrong place.
Put the + at the end of the previous line, NOT at the beginning of the next line.
geom_smooth()
Using regression, one of the machine learning methods, geom_smooth() visualizes the predicted value of the y variable for a given value of the x variable.
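The placement of + and the fitted curve can be sketched together on the mpg data. This is an illustrative example, not the original slide code; the name p_smooth is made up.

```r
library(ggplot2)

# Every line that continues the plot ends with `+`.
p_smooth <- ggplot(data = mpg,
                   mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()   # default smoother; a loess curve for a data set this small

# Not this: R treats the first line as a complete expression,
# so the layer on the next line is never added to the plot.
# ggplot(data = mpg, mapping = aes(x = displ, y = hwy))
#   + geom_point()
```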
What Does the Grey Ribbon Represent?
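With the default settings of geom_smooth(), the grey ribbon is a 95% confidence band: it shows the range within which the fitted relationship between the x and y variables plausibly falls.
geom_point() with geom_smooth(method = lm)
method = "lm" specifies that a linear model (lm), also called a linear regression model, is fitted.
In the scatterplot of the mpg data.frame, many points overlap each other.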
When points overlap, it’s hard to know how many data points are at a particular location.
Overplotting can obscure patterns and outliers, leading to potentially misleading conclusions.
alpha
We can set a transparency level (alpha) between 0 (full transparency) and 1 (no transparency) manually.
To set alpha manually, put it not within the aes() function but within the geom_*() function.
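A short sketch of both ideas, a manually set alpha and a linear fit with method = "lm". The value 0.3 is an arbitrary choice for illustration.

```r
library(ggplot2)

p_alpha <- ggplot(data = mpg,
                  mapping = aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.3) +      # alpha set manually, outside aes()
  geom_smooth(method = "lm")     # straight-line (linear regression) fit
```

With semi-transparent points, locations where many observations overlap appear darker than locations with a single observation.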
In the scatterplot of displ and hwy, one group of points (highlighted in red in the figure) seems to fall outside of the linear trend.
An aesthetic is a visual property (e.g., size, shape, color) of the objects (e.g., points) in your plot.
You can display a point in different ways by changing the values of its aesthetic properties.
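Adding color to the Plot
Adding shape to the Plot
Adding size to the Plot
Adding color to the Plot, Manually
Adding fill to the Plot, Manually
Each geom_*() has a different set of aesthetic parameters.
fill is available for geom_smooth(), but not for geom_point() with its default shape (fill only affects point shapes 21 to 25).
Adding size to the Plot, Manually
Adding alpha to the Plot, Manually
To set an aesthetic manually, set it as an argument of geom_*() outside of aes().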
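The difference between mapping an aesthetic to a variable and setting it manually can be sketched as follows. The specific values ("blue", 3, 0.5) and object names are illustrative choices.

```r
library(ggplot2)

# Mapping: color varies with a variable, so it goes inside aes()
p_mapped <- ggplot(data = mpg,
                   mapping = aes(x = displ, y = hwy, color = class)) +
  geom_point()

# Setting: one fixed value for every point, so it goes outside aes()
p_manual <- ggplot(data = mpg,
                   mapping = aes(x = displ, y = hwy)) +
  geom_point(color = "blue", size = 3, alpha = 0.5)
```

Only the mapped version produces a legend, because only there does color carry information about the data.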

\(\log_{10}(100)\): the base-10 logarithm of 100 is 2, because \(10^{2} = 100\).
\(\log_{e}(x)\): the base-\(e\) logarithm is called the natural logarithm, where \(e = 2.718\ldots\) is Euler’s number.
\(\log(x)\) or \(\ln(x)\): both denote the natural log of \(x\).
\(\log_{e}(7.389\ldots)\): the natural log of \(7.389\ldots\) is 2, because \(e^{2} = 7.389\ldots\).


For a small change in a variable \(x\) from \(x_{0}\) to \(x_{1}\): \[ \begin{aligned} \Delta \log(x) &= \log(x_{1}) - \log(x_{0})\\ &\approx \frac{x_{1} - x_{0}}{x_{0}}\\ &= \frac{\Delta x}{x_{0}} \end{aligned} \]
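As a quick numerical check of this approximation, take \(x_{0} = 100\) and \(x_{1} = 105\) (values made up for illustration): \[ \begin{aligned} \Delta \log(x) &= \log(105) - \log(100) = \log(1.05) \approx 0.0488\\ \frac{\Delta x}{x_{0}} &= \frac{105 - 100}{100} = 0.05 \end{aligned} \]

The two numbers agree closely, so a small change in \(\log(x)\) can be read as an approximate proportional change in \(x\). The approximation weakens as the change grows: for \(x_{1} = 150\), \(\log(1.5) \approx 0.405\) while \(\Delta x / x_{0} = 0.5\).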


An increase of log(2) ≈ 0.693 in log(gdpPercap) corresponds to a 100% increase (doubling) in gdpPercap.
facet_wrap(~ VAR)
To facet your plot by a single variable, use facet_wrap().
facet_wrap(~ VAR) with nrow
nrow determines the number of rows to use when laying out the facets.
facet_wrap(~ VAR) with ncol
ncol determines the number of columns to use when laying out the facets.
facet_wrap(~ VAR) with scales = "free_x"
scales = "free_x" allows the x-axis scale to differ across facets.
facet_wrap(~ VAR) with scales = "free_y"
scales = "free_y" allows the y-axis scale to differ across facets.
facet_wrap(~ VAR) with scales = "free"
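The facet_wrap() options in this section can be sketched on the mpg data, assuming class as the faceting variable in place of VAR (the choice of variable and the object names are illustrative):

```r
library(ggplot2)

base <- ggplot(data = mpg,
               mapping = aes(x = displ, y = hwy)) +
  geom_point()

p_rows <- base + facet_wrap(~ class, nrow = 2)          # lay the panels out in 2 rows
p_cols <- base + facet_wrap(~ class, ncol = 2)          # lay the panels out in 2 columns
p_free <- base + facet_wrap(~ class, scales = "free")   # each panel gets its own x and y scales
```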
scales = "free" allows both the x-axis and y-axis scales to differ across facets.
The nvda data.frame includes NVIDIA’s stock information from 2019-01-02 to 2024-10-18.
geom_line()
geom_line() draws a line by connecting data points in order of the variable on the x-axis.
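A minimal line-chart sketch. The nvda data are not reproduced here, so this assumes the economics data set that ships with ggplot2 as a stand-in time series (date on the x-axis, unemploy on the y-axis).

```r
library(ggplot2)

# `economics` stands in for the nvda data.frame in this sketch.
p_line <- ggplot(data = economics,
                 mapping = aes(x = date, y = unemploy)) +
  geom_line()   # connects observations in order of date
```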
geom_line() and geom_smooth()
geom_smooth() can also be useful for illustrating overall time trends.
The tech_october data.frame includes stock information about AAPL, MSFT, META, and NVDA in October 2025.
Use the group, color, or linetype aesthetic to tell ggplot explicitly about this firm-level structure.
ggplot() - Histogram
geom_histogram()
Histograms are used to visualize the distribution of a numeric variable.
Histograms divide data into bins and count the number of observations in each bin.
geom_histogram()
geom_histogram() creates a histogram.
Map the x aesthetic to the variable.
geom_histogram() with bins
bins: Specifies the number of bins.
geom_histogram() with binwidth
binwidth: Specifies the width of each bin.
Use either the bins option or the binwidth option; if both are given, binwidth overrides bins.
fill: Fills the bars with a specific color.
color: Adds an outline of a specific color to the bars.
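These histogram options can be sketched on the hwy variable of mpg. The bin settings and colors below are arbitrary illustrative values.

```r
library(ggplot2)

# Choose the number of bins ...
p_bins <- ggplot(data = mpg, mapping = aes(x = hwy)) +
  geom_histogram(bins = 20, fill = "lightblue", color = "black")

# ... or the width of each bin (here, 2 miles per gallon per bin)
p_width <- ggplot(data = mpg, mapping = aes(x = hwy)) +
  geom_histogram(binwidth = 2, fill = "lightblue", color = "black")
```

Trying a few values is usually worthwhile, since the apparent shape of the distribution can change with the bin size.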
Types of Colorblindness
Roughly 8% of men and half a percent of women are colorblind.
There are several techniques to make visualization more colorblind-friendly:
Use visual cues other than color, such as shape for scatterplots and linetype for line charts.
ggthemes package
The ggthemes package provides various scales and themes for ggplot2 visualization:
Scales: scale_color_colorblind(), scale_color_tableau()
Themes: theme_economist(), theme_wsj()
ggthemes::scale_color_colorblind()
After mapping a variable to color in aes(), we can use scale_color_*() to set the color palette.
ggthemes::scale_color_tableau()
scale_color_tableau() provides color palettes used in Tableau.
ggthemes::theme_economist()
theme_economist() approximates the style of The Economist.
ggthemes::theme_wsj()
theme_wsj() approximates the style of The Wall Street Journal.
ggplot() - Boxplot
geom_boxplot()
geom_boxplot() creates a boxplot; map one numeric variable and one categorical variable to the x and y aesthetics.
# 1. `show.legend = FALSE` turns off the legend.
# 2. `scale_fill_colorblind()` or `scale_fill_tableau()` applies a
#    color-blind friendly palette to the `fill` aesthetic.
# scale_fill_tableau() comes from the ggthemes package:
library(ggplot2)
library(ggthemes)

ggplot(data = mpg,
       mapping = aes(x = hwy,
                     y = class,
                     fill = class)) +
  geom_boxplot(show.legend = FALSE) +
  scale_fill_tableau()
fill: Maps a variable to the fill color of the boxes.
scale_fill_tableau(): Applies a color-blind friendly palette to the fill aesthetic.
fct_reorder(CATEGORICAL, NUMERICAL)
fct_reorder(CATEGORICAL, NUMERICAL): Reorders the categories of the CATEGORICAL by the median of the NUMERICAL.
ggplot() - Bar Chart
geom_bar()
Bar charts are used to visualize the distribution of a categorical variable.
geom_bar() groups the data by category and counts the number of observations in each category.
geom_bar()
geom_bar() creates a bar chart.
Map the x or y aesthetic to the variable.
fill Aesthetic
Map a second categorical variable to the fill aesthetic.
count(): Counting Occurrences Across Two Categorical Variables
count() calculates the frequency of each unique combination of values across two categorical variables.
diamonds |> count(cut, clarity) returns the data.frame with the three variables, cut, clarity, and n:
n: the number of occurrences of each unique combination of values in cut and clarity
fill Aesthetic
The bars show how clarity varies by cut, with total bar height for overall count and segments for each clarity level.
fill Aesthetic & the position="fill"
The bars show how clarity varies by cut, displaying the proportion of each clarity within each cut.
fill Aesthetic & the position="dodge"
The bars show how clarity varies by cut, with separate bars for each clarity level within each cut category.
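The count() call and the three bar-chart variants described here can be sketched on the diamonds data. This assumes the dplyr package for count(); the object names are made up for illustration.

```r
library(ggplot2)
library(dplyr)

# Frequency of each (cut, clarity) combination: columns cut, clarity, n
cut_clarity <- diamonds |> count(cut, clarity)

base <- ggplot(data = diamonds,
               mapping = aes(x = cut, fill = clarity))

p_stack <- base + geom_bar()                     # stacked counts (the default position)
p_fill  <- base + geom_bar(position = "fill")    # proportions of clarity within each cut
p_dodge <- base + geom_bar(position = "dodge")   # side-by-side bars for each clarity level
```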


Which type of bar chart is most effective for your data?
Which type of bar chart best meets your visualization goals?



fill Aesthetic and the position = "stack"
The default position option is position = "stack".
geom_bar()
after_stat(prop): Calculates the proportion of the total count.
group = 1: Ensures the proportions are calculated over the entire data.frame, not within each group of cut.
geom_col()
geom_col() creates bar charts where the height of bars directly represents values in a column in a given data.frame.
geom_col() requires both x- and y- aesthetics.
fct_reorder(CATEGORICAL, NUMERICAL)
fct_reorder(CATEGORICAL, NUMERICAL): Reorders the categories of the CATEGORICAL by the median of the NUMERICAL.
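A sketch of a proportion bar chart with after_stat(prop), followed by geom_col() with fct_reorder(). It assumes the dplyr and forcats packages; the summary table class_counts and the plot names are hypothetical.

```r
library(ggplot2)
library(dplyr)
library(forcats)

# Proportion of all diamonds in each cut (group = 1 pools every cut together)
p_prop <- ggplot(data = diamonds,
                 mapping = aes(x = cut, y = after_stat(prop), group = 1)) +
  geom_bar()

# geom_col(): bar length comes straight from a column (n), not from counting rows
class_counts <- mpg |> count(class)

p_col <- ggplot(data = class_counts,
                mapping = aes(x = n, y = fct_reorder(class, n))) +
  geom_col()
```

Because class_counts has one row per class, the median of n within each class is just n itself, so the bars end up sorted by count.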


Pie charts work well only if you have a few categories (four at most).
Pie charts work well if the goal is to emphasize simple fractions (e.g., 25%, 50%, or 75%).
For data visualization, integer-type variables could be treated as either categorical (discrete) or numeric (continuous), depending on the context of analysis.
If the values of an integer-type variable mean an intensity or an order, the integer variable could be numeric.
If not, the integer variable is categorical.
Age variable
geom_bar() and geom_histogram().