Lecture 5

Byeong-Hak Choe

SUNY Geneseo

February 5, 2025

Grouped data and the group aesthetic

  • Let’s draw a line plot that traces the trajectory of GDP per capita over time for each country in the gapminder data.frame.
p <- ggplot(data = gapminder,
            mapping = 
              aes( x = year,
                   y = gdpPercap ) )
p + 
  geom_line()

Grouped data and the group aesthetic

  • What happened?

  • geom_line() treats all the rows as a single series: it orders them by year and joins them up, so within each year it connects every country’s value in the order they appear in the dataset.

Grouped data and the group aesthetic

  • Without a grouping aesthetic, ggplot() does not know that the yearly observations in the data are grouped by country.
p <- ggplot(data = gapminder,
            mapping = 
              aes( x = year,
                   y = gdpPercap ) )

p + 
  geom_line( aes( group = country ) ) 

Grouped data and the group aesthetic

  • The group aesthetic is usually only needed when the grouping information we need to tell ggplot() about is not built into the variables being mapped.
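
To see what the group aesthetic changes, here is a minimal sketch on a small made-up data frame. The name toy and its columns unit, year, and value are hypothetical stand-ins, not part of gapminder. layer_data() returns the data ggplot2 actually draws for a layer, so it shows how many separate lines result.

```r
library(ggplot2)

# Hypothetical toy data: two units, each observed in three years.
toy <- data.frame(
  unit  = rep(c("A", "B"), each = 3),
  year  = rep(c(2000, 2005, 2010), times = 2),
  value = c(1, 2, 3, 6, 5, 4)
)

p_toy <- ggplot(data = toy, mapping = aes(x = year, y = value))

# No grouping information: all six rows are drawn as one path.
one_path <- p_toy + geom_line()

# group = unit: one line per unit.
two_paths <- p_toy + geom_line(aes(group = unit))

# Number of distinct groups in each drawn layer.
n_groups_without <- length(unique(layer_data(one_path)$group))
n_groups_with    <- length(unique(layer_data(two_paths)$group))
```

Comparing n_groups_without and n_groups_with shows whether ggplot2 split the rows into separate lines.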

Grouped data and the group aesthetic

  • What about using the color aesthetic instead of group?
p <- ggplot(data = gapminder,
            mapping = 
              aes( x = year,
                   y = gdpPercap ) )

p + 
  geom_line( aes( color = country ) ) 

Grouped data and the group aesthetic

p <- ggplot(data = gapminder,
            mapping = 
              aes( x = year,
                   y = gdpPercap ) )

p + 
  geom_line( aes( color = country ),
             show.legend = FALSE ) 

Facet to make small multiples

  • Making a “small multiple” plot by faceting data based on a categorical variable allows a lot of information to be presented compactly and in a consistently comparable way.
    • facet_wrap( VAR1 ~ . ) or facet_wrap( . ~ VAR1 )
    • facet_grid( VAR1 ~ . ): row-wise split
    • facet_grid( . ~ VAR1 ): column-wise split
    • facet_grid( VAR1 ~ VAR2 )
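
The formulas above can be tried on the mpg data that ships with ggplot2. This is only a sketch: drv and cyl are used as stand-ins for VAR1 and VAR2, and the object names are illustrative.

```r
library(ggplot2)

# mpg ships with ggplot2; drv and cyl stand in for VAR1 and VAR2.
base <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

wrapped <- base + facet_wrap(~ drv)      # panels wrap into rows and columns as needed
by_row  <- base + facet_grid(drv ~ .)    # one row of panels per level of drv
by_col  <- base + facet_grid(. ~ drv)    # one column of panels per level of drv
crossed <- base + facet_grid(drv ~ cyl)  # rows by drv, columns by cyl

# Number of panels that receive data in the wrapped and row-wise versions.
n_wrapped <- length(unique(layer_data(wrapped)$PANEL))
n_by_row  <- length(unique(layer_data(by_row)$PANEL))
```

Printing wrapped, by_row, by_col, or crossed draws the corresponding layout.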

Facet to make small multiples

p + 
  geom_line( aes( group = country ) ) + 
  facet_wrap(~ continent)

Facet to make small multiples

  • Let’s have all the faceted plots in a single row:
p + 
  geom_line(aes(group = country)) +
  geom_smooth(size = 1.1, 
              method = "loess", 
              se = FALSE) +
  facet_wrap(.~ continent, nrow = 1) +
  scale_y_log10(labels=scales::dollar) + 
  theme(axis.text.x = 
          element_text(angle = 45),
        axis.title.x = 
          element_text(margin = margin(t = 25))) +
  labs(x = "Year", 
       y = "GDP per capita",
       title = "GDP per capita on Five Continents")

The 2016 General Social Survey data

  • The socviz package includes the gss_sm data frame.
    • gss_sm is a dataset containing an extract from the 2016 General Social Survey.
# install.packages("socviz")
library(socviz)
gss_sm <- gss_sm

Facet to make small multiples

  • Describe the relationship between the age of the respondent and the number of children they have using a scatterplot and a fitted curve.
p <- ggplot(data = gss_sm,
            mapping = 
              aes( x = age, 
                   y = childs ))

p + 
  geom_point(alpha = 0.2) +
  geom_smooth()

Facet to make small multiples

  • Describe how the relationship between the age of the respondent and the number of children they have varies by sex and race.
p <- ggplot(data = gss_sm,
            mapping = 
              aes( x = age, 
                   y = childs ))

p + 
  geom_point(alpha = 0.2) +
  geom_smooth() +
  facet_grid(sex ~ race)

Facet to make small multiples

  • The facet_grid() function is best used when you cross-classify some data by two categorical variables.
p <- ggplot(data = gss_sm,
            mapping = 
              aes( x = age, 
                   y = childs ))

p + 
  geom_point(alpha = 0.2) +
  geom_smooth() +
  facet_grid(sex ~ race + degree)

Geoms can transform data

  • Let’s plot a bar chart:
p <- ggplot(data = gss_sm,
            mapping = 
              aes(x = bigregion))
p + 
  geom_bar()

Geoms can transform data

  • Where does count come from?
    • Bar charts, histograms, and frequency polygons bin your data and then plot bin counts, the number of points that fall in each bin.
    • Smoothers fit a model to your data and then plot predictions from the model.
    • Boxplots compute a robust summary of the distribution and then display a specially formatted box.
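
One way to see where count comes from is to pull the computed values out of a bar layer with layer_data() and compare them with counts tabulated by hand. The sketch below uses the mpg data that ships with ggplot2 rather than gss_sm, and the object names are illustrative.

```r
library(ggplot2)

p_bar <- ggplot(data = mpg, mapping = aes(x = class)) +
  geom_bar()

# Values computed behind the bars: one row per category,
# including the count and prop columns produced by stat_count().
computed <- layer_data(p_bar)
computed[, c("x", "count", "prop")]

# The same counts tabulated by hand with base R.
by_hand <- as.vector(table(mpg$class))
```

The count column in computed is what geom_bar() maps to the y-axis by default.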

Geoms can transform data

  • If we want a chart of relative frequencies rather than counts, we will need to get the prop statistic instead.

  • Inside aes(), a reference to a computed statistic generically takes one of these forms:

    • <mapping> = <..statistic..>;
    • <mapping> = stat(statistic); or
    • <mapping> = after_stat(statistic). Recent ggplot2 releases recommend this form and deprecate the first two.
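
As a small illustration of the notation, the sketch below writes out the mapping that geom_bar() applies by default. It uses the mpg data from ggplot2 and assumes a ggplot2 version that provides after_stat() (3.3.0 or later).

```r
library(ggplot2)

base <- ggplot(data = mpg, mapping = aes(x = class))

implicit <- base + geom_bar()                            # y defaults to the computed count
explicit <- base + geom_bar(aes(y = after_stat(count)))  # the same mapping, written out

# Compare the bar heights of the two layers.
same_heights <- all(layer_data(implicit)$y == layer_data(explicit)$y)
```

Swapping count for another computed variable, such as prop, is how a different statistic is requested.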

Geoms can transform data

p <- ggplot(data = gss_sm,
            mapping = 
              aes(x = bigregion))
p + 
  geom_bar(mapping = 
             aes(y = ..prop..))

Geoms can transform data

  • We need to tell ggplot to ignore the x-categories when calculating the denominator of the proportion and to use the total number of observations instead.

Geoms can transform data

  • To do so we specify group = 1 inside the aes() call.
p <- ggplot(data = gss_sm,
            mapping = 
              aes(x = bigregion))
p + 
  geom_bar(mapping = 
             aes(y = ..prop.., 
                 group = 1)) 
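
The effect of group = 1 can be checked numerically by inspecting the computed prop values. This is a sketch on the mpg data from ggplot2, written with after_stat(prop), which is equivalent to ..prop.. here.

```r
library(ggplot2)

base <- ggplot(data = mpg, mapping = aes(x = class))

# Each x category is its own group, so every bar's proportion is 1.
per_bar <- base + geom_bar(aes(y = after_stat(prop)))
prop_per_bar <- layer_data(per_bar)$prop

# group = 1 puts all rows in one group, so each bar is a share of the total.
overall <- base + geom_bar(aes(y = after_stat(prop), group = 1))
prop_overall <- layer_data(overall)$prop
```

With a single group the denominator is the total number of rows, so the values in prop_overall sum to 1.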

Geoms can transform data

  • Let’s look at another question from the survey. The gss_sm data contains a religion variable derived from a question asking:

    • “What is your religious preference? Is it Protestant, Catholic, Jewish, some other religion, or no religion?”
gss_sm |> 
  group_by(religion) |> 
  summarize(n = n())  # assumed completion: count respondents in each category

Geoms can transform data

  • If we map religion to color, only the border lines of the bars will be assigned colors, and the insides will remain gray.
p <- ggplot(data = gss_sm,
            mapping = 
              aes(x = religion, 
                  color = religion))
p + 
  geom_bar()

Geoms can transform data

  • If the gray bars look boring and we want to fill them with color instead, we can map the religion variable to fill in addition to mapping it to x.

  • If we set guides(fill = "none"), the legend about the fill mapping is removed.

Geoms can transform data

p <- ggplot(data = gss_sm,
            mapping = 
              aes(x = religion, 
                  fill = religion))
p + 
  geom_bar() + 
  guides( fill = "none" )

Frequency plots the slightly awkward way

  • A more appropriate use of the fill aesthetic with geom_bar() is to cross-classify two categorical variables.

    • The default output of such geom_bar() is a stacked bar chart, with counts on the y-axis.
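
A minimal sketch of the default stacked output, using the mpg data from ggplot2 as a stand-in for gss_sm: class plays the role of the x category and drv the role of the fill variable.

```r
library(ggplot2)

# class is the x category; drv is the fill variable.
stacked <- ggplot(data = mpg, mapping = aes(x = class, fill = drv)) +
  geom_bar()   # position = "stack" is the default

# One row per bar segment; segment counts within a bar
# add up to that category's total count.
seg <- layer_data(stacked)
totals <- tapply(seg$count, as.integer(seg$x), sum)
```

Printing stacked draws the chart, and totals holds the full height of each bar.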

Frequency plots the slightly awkward way

  • An alternative choice is to set the position argument to "fill".
p <- ggplot(data = gss_sm,
            mapping = 
              aes(x = bigregion, 
                  fill = religion))
p + 
  geom_bar(position = "fill")

Frequency plots the slightly awkward way

  • We can use position = "dodge" to make the bars within each region of the country appear side by side.
p <- ggplot(data = gss_sm,
            mapping = 
              aes(x = bigregion, 
                  fill = religion))
p + 
  geom_bar(position = "dodge",
           mapping = 
             aes(y = ..prop..))

Frequency plots the slightly awkward way

  • Because religion is the grouping variable here, we also map religion to the group aesthetic so that the proportions are computed within each religion.
p <- ggplot(data = gss_sm,
            mapping = 
              aes(x = bigregion, 
                  fill = religion))
p + 
  geom_bar(position = "dodge",
           mapping = 
             aes(y = ..prop.., 
                 group = religion))

Frequency plots the slightly awkward way

  • How can we have a proportional bar chart such that the sum of all bars in each bigregion is 1?

    • There are various ways to do so, and faceting is one of them.
    • The proportions are calculated within each panel, which is the breakdown we wanted.
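
Another of the possible ways, shown here only as a sketch and not as the approach used in these slides, is to compute the within-group proportions first and then plot the resulting table with geom_col(). The example uses the mpg data from ggplot2, with drv standing in for bigregion and class for religion.

```r
library(ggplot2)

# Row-wise proportions: within each drv, the shares across class sum to 1.
tab <- prop.table(table(drv = mpg$drv, class = mpg$class), margin = 1)
shares <- as.data.frame(tab, responseName = "prop")

# The proportions are already computed, so geom_col() plots them as they are.
p_shares <- ggplot(data = shares,
                   mapping = aes(x = drv, y = prop, fill = class)) +
  geom_col(position = "dodge")

# Sum of the bar heights within each drv.
within_sums <- tapply(shares$prop, shares$drv, sum)
```

Printing p_shares draws dodged bars whose heights sum to 1 within each x category.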

Frequency plots the slightly awkward way

p <- ggplot(data = gss_sm,
            mapping = 
              aes(x = religion))
p + 
  geom_bar(position = "dodge",
           mapping = 
             aes(y = ..prop.., 
                 group = bigregion)) +
  facet_wrap(~ bigregion, 
             ncol = 1)

Histograms and density plots

  • ggplot2 comes with a dataset, midwest, containing information on counties in several midwestern states of the USA.

Histograms and density plots

  • By default, the geom_histogram() function will choose a bin size for us based on a rule of thumb.
p <- ggplot(data = midwest,
            mapping = 
              aes(x = area))
p + 
  geom_histogram()

Histograms and density plots

  • When drawing histograms it is worth experimenting with bins and also optionally the origin of the x-axis.
p <- ggplot(data = midwest,
            mapping = 
              aes(x = area))
p + 
  geom_histogram(bins = 10)
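
Besides bins, geom_histogram() accepts binwidth and boundary, which control the bin width and where the bin edges fall. The values below are arbitrary choices for illustration, not recommendations.

```r
library(ggplot2)

p_area <- ggplot(data = midwest, mapping = aes(x = area))

# binwidth fixes the width of each bin;
# boundary pins a bin edge at the chosen x value.
h <- p_area + geom_histogram(binwidth = 0.01, boundary = 0)

# Every county falls into exactly one bin, whatever the binning.
total_count <- sum(layer_data(h)$count)
```

Changing binwidth or boundary moves counties between bins but leaves total_count unchanged.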

Histograms and density plots

  • While histograms summarize single variables, it’s also possible to use several at once to compare distributions.
    • We can facet histograms by some variable of interest.
    • We can also compare them in the same plot using the fill mapping.

Histograms and density plots

p <- ggplot(data = 
              subset(midwest, 
                     state %in% 
                       c("OH", "WI")),
            mapping = 
              aes(x = percollege, 
                  fill = state) )
p + 
  geom_histogram(alpha = 0.4, 
                 bins = 20)

Histograms and density plots

  • When working with a continuous variable, an alternative to binning the data and making a histogram is to calculate a kernel density estimate of the underlying distribution with geom_density().

Histograms and density plots

p <- ggplot(data = midwest,
            mapping = 
              aes(x = area))
p + 
  geom_density()

Histograms and density plots

  • Here we can use color (for the lines) and fill (for the body of the density curve) for aesthetic mappings.
p <- ggplot(data = midwest,
            mapping = 
              aes(x = area, 
                  fill = state, 
                  color = state))
p + 
  geom_density(alpha = 0.3)

Histograms and density plots

  • For geom_density(), the underlying stat_density() function returns after_stat(density) by default. It can also return after_stat(scaled), the density estimate rescaled so that each curve’s maximum is 1.

Histograms and density plots

p <- ggplot(data = 
              subset(midwest, 
                     state %in% 
                       c("OH", "WI")),
            mapping = 
              aes(x = area, 
                  fill = state, 
                  color = state))
p + 
  geom_density( alpha = 0.3, 
                mapping = 
                  aes(y = after_stat(scaled) ))
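
To check what the scaled statistic does, the sketch below extracts the computed values with layer_data() and looks at the peak of each state's curve. The object names are illustrative.

```r
library(ggplot2)

oh_wi <- subset(midwest, state %in% c("OH", "WI"))

d <- ggplot(data = oh_wi,
            mapping = aes(x = area, fill = state, color = state)) +
  geom_density(alpha = 0.3,
               mapping = aes(y = after_stat(scaled)))

# Peak height of each state's curve under the scaled statistic.
dd <- layer_data(d)
peaks <- tapply(dd$scaled, dd$group, max)
```

Each curve is divided by its own maximum, so the two curves are comparable in shape but not in area.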

Avoid transformations when necessary

  • When we call geom_bar(), it does its calculations on the fly using stat_count() behind the scenes to produce the counts or proportions it displays.

Avoid transformations when necessary

  • But often, our data is in effect already a summary table.

  • Let’s consider the socviz::titanic data.frame.


Avoid transformations when necessary

  • Should we avoid transforming data if we want to describe the relationship between fate and percent?
p <- ggplot(data = titanic,
            mapping = 
              aes(x = fate, 
                  y = percent, 
                  fill = sex))
p + 
  geom_bar(position = "dodge", 
           stat = "identity") +
  theme(legend.position = "top")

Avoid transformations when necessary

  • geom_col() works exactly like geom_bar() except that it assumes stat = "identity".

  • Let’s consider socviz::oecd_sum data.frame.

    • It contains information on average life expectancy at birth within the United States, and across other OECD countries.
  • Let’s draw a bar chart that describes the difference over time using fill = hi_lo.
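
The equivalence between geom_col() and geom_bar(stat = "identity") can be checked on a small made-up summary table. The name summary_tbl and its columns are hypothetical, chosen only for illustration.

```r
library(ggplot2)

# Hypothetical summary table: one row per category, value already computed.
summary_tbl <- data.frame(
  category = c("a", "b", "c"),
  share    = c(0.5, 0.3, 0.2)
)

base <- ggplot(data = summary_tbl,
               mapping = aes(x = category, y = share))

with_col <- base + geom_col()
with_bar <- base + geom_bar(stat = "identity")

# Both layers draw bars at the heights given in the table.
same_bars <- all(layer_data(with_col)$y == layer_data(with_bar)$y)
```

In both versions ggplot2 plots the share column as given and does no counting of its own.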

Avoid transformations when necessary

p <- ggplot(data = socviz::oecd_sum,
            mapping = 
              aes(x = year, 
                  y = diff, 
                  fill = hi_lo))
p + 
  geom_col() + 
  guides(fill = "none") +
  labs(x = NULL, 
       y = "Difference in Years",
       title = "The US Life Expectancy Gap",
       subtitle = "Difference between US and OECD
                   average life expectancies, 1960-2015",
       caption = "Data: OECD. After a chart by Christopher Ingraham,
                  Washington Post, December 27th 2017.") +