ggplot basics
January 29, 2025
ggplot grammar
ggplot()
A grammar of graphics is a tool that enables us to concisely describe the components of a graphic.
An aesthetic is a visual property (e.g., size, shape, color) of the objects (e.g., points) in your plot.
You can display a point in different ways by changing the values of its aesthetic properties.
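As a minimal sketch of mapping an aesthetic to a variable, the example below assumes the mpg data.frame that ships with ggplot2 and maps color to its class column. The choice of displ and hwy for the axes is an illustrative assumption, not something fixed by the text above.

```r
library(ggplot2)

# mpg ships with ggplot2; class is a categorical variable (type of car).
# Mapping color to class inside aes() gives each car type its own color.
p_color <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

p_color
```

Swapping color for shape, size, or alpha inside aes() maps the same variable to a different visual property.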
Adding color to the plot
Adding shape to the plot
Adding size to the plot
Adding alpha (transparency) to the plot
Many points overlap each other.
When points overlap, it’s hard to know how many data points are at a particular location.
Overplotting can obscure patterns and outliers, leading to potentially misleading conclusions.
We can set a transparency level (alpha) between 0 (full transparency) and 1 (no transparency).
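A small sketch of using transparency against overplotting, again assuming the mpg data. The value 0.3 is an arbitrary illustrative choice.

```r
library(ggplot2)

# alpha = 0.3 is an arbitrary illustrative value: each point is mostly
# transparent, so locations where many points overlap appear darker.
p_alpha <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), alpha = 0.3)

p_alpha
```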
alpha
size
size smaller than 1.
Adding color to the plot
To set an aesthetic manually, pass it by name as an argument of the geom_ function; i.e., it goes outside of aes().
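To illustrate the difference between mapping and setting, here is a minimal sketch that sets the color of every point manually. The mpg data and the value "blue" are assumptions chosen for illustration.

```r
library(ggplot2)

# color is given outside aes(), so it is a fixed setting for the whole layer
# rather than a mapping from a variable.
p_manual <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

p_manual
```

If color = "blue" were placed inside aes() instead, ggplot2 would treat "blue" as a data value to be mapped, and the points would not be drawn in blue.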
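The name of a color as a character string.
The size of a point in mm.
The shape of a point as a number, as shown below.
color to the plot?
ggplot()
One common problem when creating ggplot2 graphics is to put the + in the wrong place: it has to come at the end of a line, not the start.
To facet a plot by a single variable, use facet_wrap().
To facet a plot on the combination of two variables, add facet_grid( VAR_ROW ~ VAR_COL ) to our plot call.
The first argument of facet_grid() is also a formula.
The formula should contain two variable names separated by a ~.
The scales option in facet_*() controls whether scales are
fixed ("fixed", the default), free in one dimension ("free_x", "free_y"), or free in both dimensions ("free").
How are these two plots similar?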
geom_*() is the geometrical object that a plot uses to represent data.
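As a sketch of what "geometrical object" means in practice, the two plots below use the same data and the same x and y mappings and differ only in the geom. The mpg data and the displ and hwy variables are assumptions for illustration.

```r
library(ggplot2)

# Same data, same mappings: only the geom changes.
p_point <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()    # one point per observation

p_smooth <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth()   # a smooth line fitted to the observations

p_point
p_smooth
```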
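Bar charts use geom_bar();
line charts use geom_line();
boxplots use geom_boxplot();
scatterplots use geom_point();
fitted lines use geom_smooth();
We can use different geom_*() to plot the same data.
Every geom function in ggplot2 takes a mapping argument.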
However, not every aesthetic works with every geom.
You could set the shape of a point, but you couldn’t set the shape of a line;
on the other hand, you could set the linetype of a line.
Specifying method = lm manually in geom_smooth() gives a straight line fitted to the data points.
Set the group aesthetic to a categorical variable to draw multiple objects.
ggplot2 will draw a separate object for each unique value of the grouping variable.
In practice, ggplot2 will automatically group the data for these geoms whenever we map an aesthetic to a discrete variable (as in the linetype example).
To display multiple geoms in the same plot, add multiple geom_*() functions to ggplot().
Using geom_point(), geom_smooth(), and geom_smooth(method = lm) together is an excellent option to visualize the relationship between the two variables.
If we place mappings in a geom function, ggplot2 will treat them as local mappings for the layer.
We can use the same idea to specify different data for each layer.
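A minimal sketch of layer-specific mappings and data, assuming the mpg data and that the dplyr package is installed. The points use a local color mapping, and the smooth layer is given its own subset of the data.

```r
library(ggplot2)
library(dplyr)

p_layers <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  # Local mapping: color applies to the point layer only.
  geom_point(mapping = aes(color = class)) +
  # Local data: this layer sees only the subcompact cars.
  geom_smooth(data = filter(mpg, class == "subcompact"), se = FALSE)

p_layers
```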
Here, our smooth line displays just a subset of the mpg data.frame, the subcompact cars.
filter() is the tidyverse way to filter observations in a data.frame.
The local data argument in geom_smooth() overrides the global data argument in ggplot() for that layer only.
The standard error (se) tells us how much the predicted values from a model might differ from the actual values we’re trying to predict.
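In geom_smooth(), the se argument controls whether a confidence band based on the standard error is drawn around the fitted line. The sketch below, assuming the mpg data and a linear fit, compares the two settings.

```r
library(ggplot2)

base <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

# se = TRUE (the default) draws a shaded confidence band around the line.
p_band <- base + geom_smooth(method = lm, formula = y ~ x, se = TRUE)

# se = FALSE draws the fitted line only.
p_noband <- base + geom_smooth(method = lm, formula = y ~ x, se = FALSE)

p_band
p_noband
```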
Many graphs, including bar charts, calculate new values to plot:
geom_bar(), geom_histogram(), and geom_freqpoly() bin our data and then plot bin counts, the number of observations that fall in each bin.
geom_boxplot() computes a summary of the distribution and then displays a specially formatted box.
geom_smooth() fits a model to our data and then plots predictions from the model.
geom_histogram() is a continuous version of a bar chart.
When using geom_histogram(), we should experiment with either bins or binwidth.
geom_freqpoly() is a line version of a histogram.
log()
\(\log_{10}\,(\,100\,)\): the base \(10\) logarithm of \(100\) is \(2\), because \(10^{2} = 100\)
\(\log_{e}\,(\,x\,)\): the base \(e\) logarithm is called the natural log, where \(e = 2.718\cdots\) is the mathematical constant known as Euler’s number.
\(\log\,(\,x\,)\) or \(\ln\,(\,x\,)\): the natural log of \(x\) .
\(\log_{e}\,(\,7.389\cdots\,)\): the natural log of \(7.389\cdots\) is \(2\), because \(e^{2} = 7.389\cdots\).
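The identities above can be checked directly in base R, where log() is the natural log and log10() is the base-10 log. This is only a quick numerical check, not part of the original slides.

```r
# Base-10 logarithm: 10^2 = 100
log_base10 <- log10(100)

# Natural logarithm undoes exp(): e^2 = 7.389...
e_squared <- exp(2)
log_natural <- log(e_squared)

# log() also accepts an explicit base.
log_explicit <- log(100, base = 10)

log_base10
e_squared
log_natural
log_explicit
```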
log()
sale_df contains data for residential property sales between September 2017 and August 2018 in NYC.
sale.price: a property’s sales price.
log()
\[\Delta \log(x) \,=\, \log(x_{1}) \,-\, \log(x_{0}) \,\approx\, \frac{x_{1} \,-\, x_{0}}{x_{0}} \,=\, \frac{\Delta\, x}{x_{0}}.\]
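A short numerical check of this approximation, using hypothetical prices chosen only for illustration (they are not taken from sale_df). The approximation is close for small relative changes and breaks down for large ones.

```r
# Hypothetical prices, for illustration only.
x0 <- 500000
x1 <- 510000    # a 2% increase
x2 <- 1000000   # a 100% increase

# Small change: the log difference is close to the relative change.
log_diff   <- log(x1) - log(x0)
pct_change <- (x1 - x0) / x0

# Large change: the log difference equals log(2), far from the relative change of 1.
log_diff_large   <- log(x2) - log(x0)
pct_change_large <- (x2 - x0) / x0

c(log_diff = log_diff, pct_change = pct_change)
c(log_diff_large = log_diff_large, pct_change_large = pct_change_large)
```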
A change in sale.price of $10,000 means something very different across people with different income/wealth levels.
log()
Bar charts seem simple, but they are interesting because they reveal something subtle about plots.
Consider a basic bar chart, as drawn with geom_bar().
The following bar chart displays the total number of diamonds in the ggplot2::diamonds data.frame, grouped by cut.
The diamonds data.frame comes in ggplot2 and contains information about ~54,000 diamonds, including the price, carat, color, clarity, and cut of each diamond.
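The bar chart described above can be drawn with a call along these lines. This is a minimal sketch using only the diamonds data and the cut variable named in the text.

```r
library(ggplot2)

# Only x is mapped; the bar heights (counts per cut) are computed by ggplot2.
p_bar <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

p_bar
```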
geom_bar() bins our data and then plots bin counts, the number of observations that fall in each bin.
The algorithm used to calculate new values for a graph is called a stat, short for statistical transformation.
The figure below describes how this process works with geom_bar().
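One way to see the stat at work is to compare the values ggplot2 computes for the bars with counts tabulated by hand. The sketch below uses layer_data() to pull the computed values out of the plot, and also draws the same chart with stat_count() in place of geom_bar().

```r
library(ggplot2)

p_bar  <- ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut))
p_stat <- ggplot(data = diamonds) + stat_count(mapping = aes(x = cut))

# Counts computed by the stat behind geom_bar(), one per cut level.
bar_counts <- layer_data(p_bar, 1)$count

# The same counts tabulated directly from the data.
manual_counts <- as.vector(table(diamonds$cut))

bar_counts
manual_counts
```

The two vectors match because geom_bar() uses the count stat by default, which is also why p_stat draws the same chart as p_bar.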
stat explicitly:
color and fill aesthetic
We can color a bar chart using either the color aesthetic, or, more usefully, fill.
fill
fill aesthetic
Map the fill aesthetic to another variable.
fill aesthetic
The stacking is performed automatically by the position adjustment specified by the position argument.
If we don’t want a stacked bar chart with counts, we can use one of two other position options: fill or dodge.
position = "fill" works like stacking, but makes each set of stacked bars the same height.
position = "dodge" places overlapping objects directly beside one another.
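A sketch comparing the three position adjustments on the diamonds data. Mapping fill to clarity is an assumption chosen for illustration.

```r
library(ggplot2)

base <- ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity))

# Default: bars for each clarity are stacked, heights are counts.
p_stack <- base + geom_bar()

# "fill": stacked, but every bar is rescaled to the same height (proportions).
p_fill <- base + geom_bar(position = "fill")

# "dodge": bars for each clarity are placed side by side.
p_dodge <- base + geom_bar(position = "dodge")

p_stack
p_fill
p_dodge
```

position = "fill" makes it easier to compare proportions across groups, while position = "dodge" makes it easier to compare individual values.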
position = "fill"
position = "dodge"
geom_abline()
What does geom_abline() do?
coord_fixed()
What does coord_fixed() do?
ggplot Grammar