ggplot basics
January 29, 2025
ggplot grammar
ggplot()
A grammar of graphics is a tool that enables us to concisely describe the components of a graphic.
An aesthetic is a visual property (e.g., size, shape, color) of the objects (e.g., class) in your plot.
You can display a point in different ways by changing the values of its aesthetic properties.
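As a minimal sketch of mapping a variable to an aesthetic, the block below colors points by class. It assumes the mpg data.frame that ships with ggplot2; the choice of displ and hwy for the axes is an assumption made for illustration, not something the slides specify.

```r
library(ggplot2)

# Map the class variable to the color aesthetic inside aes().
# ggplot2 assigns one color per class and adds a legend.
# displ and hwy are assumed axis variables, chosen for illustration.
p_color <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
  geom_point()

p_color
```

Swapping color for shape, size, or alpha inside aes() maps the same variable to a different visual property.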
Aesthetics added to the plot in turn: color, shape, size, and alpha (transparency).
Many points overlap each other.
When points overlap, it’s hard to know how many data points are at a particular location.
Overplotting can obscure patterns and outliers, leading to potentially misleading conclusions.
We can set a transparency level (alpha) between 0 (full transparency) and 1 (no transparency).
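One way to see the effect of transparency on overplotting is sketched below. It assumes the mpg data.frame from ggplot2 with displ and hwy on the axes; the value 0.3 is an arbitrary choice for illustration.

```r
library(ggplot2)

# alpha is given as a fixed value outside aes(), so every point
# gets the same transparency. Locations where many points overlap
# appear darker than locations with a single point.
p_alpha <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.3)

p_alpha
```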
alpha
size
size smaller than 1.
color to the plot
To set an aesthetic manually, pass it by name as an argument of the geom_*() function; i.e., it goes outside of aes().
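A short sketch of setting aesthetics manually follows. The data set (mpg), the axis variables, and the particular values ("blue", 3, 17) are assumptions chosen for illustration.

```r
library(ggplot2)

# color, size, and shape are fixed values here, so they are passed
# to geom_point() directly instead of being placed inside aes().
p_manual <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "blue", size = 3, shape = 17)

p_manual
```

Placing color = "blue" inside aes() instead would map the constant string "blue" as if it were a variable, which is usually not what we want.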
color as a character string.
size of a point in mm.
shape of a point as a number, as shown below.
color to the plot?
ggplot()
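A common problem when creating ggplot2 graphics is to put the + in the wrong place. It has to come at the end of a line, not at the start of the next one.
To facet a plot by a single variable, use facet_wrap().
To facet a plot by two variables, add facet_grid( VAR_ROW ~ VAR_COL ) to our plot call.
The first argument of facet_grid() is also a formula.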
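The two faceting calls can be sketched as follows. The mpg data.frame and the faceting variables (class, drv, cyl) are assumptions chosen for illustration.

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# One panel per value of a single variable.
p_wrap <- base + facet_wrap(~ class)

# A grid of panels: rows by drv, columns by cyl.
p_grid <- base + facet_grid(drv ~ cyl)

p_wrap
p_grid
```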
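The formula contains two variable names separated by ~.
The scales argument in facet_*() controls whether scales are shared across all facets ("fixed", the default), free in one dimension ("free_x", "free_y"), or free in both dimensions ("free").
How are these two plots similar?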
A geom_*() is the geometrical object that a plot uses to represent data.
Bar charts use geom_bar();
line charts use geom_line();
boxplots use geom_boxplot();
scatterplots use geom_point();
smoothed lines use geom_smooth();
We can use different geom_*() to plot the same data.
Every geom function in ggplot2 takes a mapping argument.
However, not every aesthetic works with every geom.
You could set the shape of a point, but you couldn’t set the shape of a line; on the other hand, you could set the linetype of a line.
Setting method = lm manually in geom_smooth() gives a straight line fitted to the data points.
Set the group aesthetic to a categorical variable to draw multiple objects.
ggplot2 will draw a separate object for each unique value of the grouping variable.
ggplot2 will automatically group the data for these geoms whenever we map an aesthetic to a discrete variable (as in the linetype example).
To display multiple geoms in the same plot, add multiple geom_*() functions to ggplot():
Using geom_point(), geom_smooth(), and geom_smooth(method = lm) together is an excellent option to visualize the relationship between the two variables.
If we place mappings in a geom function, ggplot2 will treat them as local mappings for the layer.
We can use the same idea to specify different data for each layer.
Here, our smooth line displays just a subset of the mpg data.frame, the subcompact cars.
filter() is the tidyverse way to filter observations in a data.frame.
The local data argument in geom_smooth() overrides the global data argument in ggplot() for that layer only.
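A minimal sketch of a layered plot with a local mapping and local data is shown below. It assumes dplyr is installed for filter(); the axis variables (displ, hwy) and the choice of method = "lm" for the smooth layer are assumptions made for illustration.

```r
library(ggplot2)
library(dplyr)

p_layers <- ggplot(mpg, aes(x = displ, y = hwy)) +
  # Local mapping: color applies to the point layer only.
  geom_point(aes(color = class)) +
  # Local data: the smooth layer sees only the subcompact cars.
  geom_smooth(
    data = filter(mpg, class == "subcompact"),
    method = "lm"
  )

p_layers
```

The point layer still uses the full mpg data inherited from ggplot(), while the fitted line spans only the range of displ observed among subcompact cars.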
The standard error (se) tells us how much the predicted values from a model might differ from the actual values we’re trying to predict.
Many graphs, including bar charts, calculate new values to plot:
geom_bar(), geom_histogram(), and geom_freqpoly() bin our data and then plot bin counts, the number of observations that fall in each bin.
geom_boxplot() computes a summary of the distribution and then displays a specially formatted box.
geom_smooth() fits a model to our data and then plots predictions from the model.
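The computed values behind such a plot can be inspected with layer_data(), which returns the data ggplot2 actually draws for a layer. The sketch below assumes the mpg data.frame and an arbitrary binwidth of 2 for the hwy variable.

```r
library(ggplot2)

p_hist <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2)

# One row per bin; count holds the number of observations in the bin.
bins <- layer_data(p_hist)
head(bins[, c("xmin", "xmax", "count")])

# Every observation falls in exactly one bin,
# so the bin counts add up to the number of rows.
sum(bins$count) == nrow(mpg)
```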
geom_histogram() is a continuous version of a bar chart.
With geom_histogram(), we should experiment with either bins or binwidth.
geom_freqpoly() is a line version of a histogram.
log()
\(\log_{10}\,(\,100\,)\): the base \(10\) logarithm of \(100\) is \(2\), because \(10^{2} = 100\).
\(\log_{e}\,(\,x\,)\): the base \(e\) logarithm is called the natural log, where \(e = 2.718\cdots\) is the mathematical constant known as Euler’s number.
\(\log\,(\,x\,)\) or \(\ln\,(\,x\,)\): the natural log of \(x\).
\(\log_{e}\,(\,7.389\cdots\,)\): the natural log of \(7.389\cdots\) is \(2\), because \(e^{2} = 7.389\cdots\).
log()
sale_df contains data for residential property sales between September 2017 and August 2018 in NYC.
sale.price, a property’s sales price.
log()
\[\Delta \log(x) \,= \, \log(x_{1}) \,-\, \log(x_{0}) \approx\, \frac{x_{1} \,-\, x_{0}}{x_{0}} \,=\, \frac{\Delta\, x}{x_{0}}.\]
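The approximation above can be checked numerically. The prices below are hypothetical values chosen for illustration, not figures from sale_df.

```r
x0 <- 100000  # hypothetical starting price
x1 <- 102000  # hypothetical price after a 2% increase

log_diff   <- log(x1) - log(x0)  # difference in natural logs, about 0.0198
pct_change <- (x1 - x0) / x0     # exact proportional change, 0.02

c(log_diff = log_diff, pct_change = pct_change)

# The approximation is only good for small changes:
# for a doubling, the log difference is log(2), about 0.693,
# while the proportional change is 1.
big_log_diff <- log(2 * x0) - log(x0)
big_log_diff
```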
A change in sale.price of $10,000 means something very different across people with different income/wealth levels.
log()
Bar charts seem simple, but they are interesting because they reveal something subtle about plots.
Consider a basic bar chart, as drawn with geom_bar().
The following bar chart displays the total number of diamonds in the ggplot2::diamonds data.frame, grouped by cut.
The diamonds data.frame comes with ggplot2 and contains information about ~54,000 diamonds, including the price, carat, color, clarity, and cut of each diamond.
geom_bar() bins our data and then plots bin counts, the number of observations that fall in each bin.
The algorithm used to calculate new values for a graph is called a stat, short for statistical transformation.
The figure below describes how this process works with geom_bar().
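As a sketch of how a geom and its stat relate, the block below draws the same bar chart twice: once with geom_bar(), whose default stat is "count", and once by calling stat_count() directly. Comparing the computed counts with layer_data() is an illustrative check added here, not part of the original slides.

```r
library(ggplot2)

# geom_bar() uses stat = "count" by default.
p_geom <- ggplot(diamonds, aes(x = cut)) +
  geom_bar()

# stat_count() uses geom = "bar" by default, so the plot is the same.
p_stat <- ggplot(diamonds, aes(x = cut)) +
  stat_count()

counts_geom <- layer_data(p_geom)$count
counts_stat <- layer_data(p_stat)$count

# Both layers compute the same number of diamonds per cut.
identical(counts_geom, counts_stat)
```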
stat explicitly:
color and fill aesthetic
We can color a bar chart using either the color aesthetic, or, more usefully, fill.
fill aesthetic
Map the fill aesthetic to another variable.
The stacking is performed automatically by the position adjustment specified by the position argument.
If we don’t want a stacked bar chart with counts, we can use one of two other position options: fill or dodge.
position = "fill" works like stacking, but makes each set of stacked bars the same height.
position = "dodge" places overlapping objects directly beside one another.
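The position adjustments can be compared side by side with a sketch like the one below. It assumes the diamonds data.frame with cut on the x-axis and fill mapped to clarity; the choice of clarity is an assumption made for illustration.

```r
library(ggplot2)

base <- ggplot(diamonds, aes(x = cut, fill = clarity))

# Default: bars for each clarity level are stacked, heights are counts.
p_stack <- base + geom_bar()

# "fill": stacked, but every bar is rescaled to a total height of 1,
# which makes proportions comparable across cuts.
p_fill <- base + geom_bar(position = "fill")

# "dodge": bars for each clarity level are placed beside one another.
p_dodge <- base + geom_bar(position = "dodge")

p_stack
p_fill
p_dodge
```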
position = "fill"
position = "dodge"
What does geom_abline() do?
What does coord_fixed() do?
ggplot Grammar