Lecture 6

Graph tables, add labels, make notes

Byeong-Hak Choe

SUNY Geneseo

February 12, 2025

Data Visualization with ggplot()

  • Step 1. Figure out whether the variables of interest are categorical or continuous.

  • Step 2. Think about which geometric objects, aesthetic mappings, and faceting are appropriate to visualize distributions and relationships.

  • Step 3. If needed, transform the given data.frame (e.g., filter observations, create new variables, summarize data) and try new visualizations.

Data Visualization with ggplot()

Geometric objects

  • A distribution of a categorical variable (e.g., geom_bar() and more)
  • A distribution of a continuous variable (e.g., geom_histogram() and more)
  • A relationship between two categorical variables (e.g., geom_bar() and more)
  • A relationship between two continuous variables (e.g., geom_point() with geom_smooth() and more)
  • A relationship between a categorical variable and a continuous variable (e.g., geom_boxplot() and more)
  • A time trend of a categorical variable (e.g., geom_bar() and more)
  • A time trend of a continuous variable (e.g., geom_line() and more)

Graph tables, add labels, make notes

  • We will learn how to transform a data.frame before we send it to ggplot to be turned into a figure.
    • We will learn how to use some of dplyr’s “action verbs” to filter, select, group, mutate, summarize and transform our data.
  • We will expand the number of geoms we know about, and learn more about how to choose between them.
    • Different geoms may require different aesthetic mappings.
  • We discuss the key dplyr functions for solving various data manipulation challenges.

Data Transformation

dplyr basics

  • Filter observations by logical conditions about values of variables (filter()).
  • Arrange/sort rows (arrange()).
  • Select variables by their names (select()).
  • Rename variables by their names (rename()).
  • Create new variables with functions of existing variables (mutate()).
  • Relocate existing variables by their names (relocate()).
  • Collapse a data.frame down to a summarized version of it (summarize()).
  • Group a data.frame by a categorical variable (group_by()).
  • Reference: R for Data Science

Data Transformation

dplyr basics

Tools -> Global Options -> Code -> Check “Use native pipe operator”

Data Transformation

  • Because the first argument of each dplyr verb is a data.frame and the output is also a data.frame, dplyr verbs work well with the pipe, |>.
    • Keyboard shortcut: Ctrl + Shift + M on Windows; Cmd + Shift + M on Mac.
  • The pipe (|>) takes the thing on its left and passes it along to the function on its right so that
    • f(x, y) is equivalent to x |> f(y).
    • e.g., filter(DATA_FRAME, LOGICAL_STATEMENT) is equivalent to DATA_FRAME |> filter(LOGICAL_STATEMENT).
  • The easiest way to pronounce the pipe (|>) is “then”.
    • The pipe (|>) is super useful when we have a chain of data transforming operations to do.
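
  • A minimal sketch of the equivalence above, using only base R (the native pipe needs R 4.1 or later). The vector x is made up for illustration.

```r
# A made-up numeric vector, used only for illustration
x <- c(1.234, 5.678, 9.1011)

# Nested call: read from the inside out
nested <- round(mean(x), 1)

# Piped call: read left to right as "take x, then mean, then round"
piped <- x |> mean() |> round(1)

# Both forms give the same result
identical(nested, piped)
```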

Data Transformation

dplyr basics

  • DATA_FRAME |> filter(LOGICAL_CONDITIONS)

  • DATA_FRAME |> arrange(VARIABLES)

  • DATA_FRAME |> select(VARIABLES)

  • DATA_FRAME |> rename(NEW_VAR = EXISTING_VAR)

  • DATA_FRAME |> mutate(NEW_VARIABLE = ... )

  • DATA_FRAME |> relocate(VARIABLES)

  • DATA_FRAME |> group_by(VARIABLES)

  • DATA_FRAME |> summarize(NEW_VARIABLE = ...)

  • The subsequent arguments describe what to do with the data.frame, mostly using variable names.
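
  • As an illustration only, the templates above can be chained into one pipeline. The sketch below assumes the dplyr package is installed and uses the built-in mtcars data rather than any data set from this lecture. The name cars_summary and the conversion factor are choices made for this example.

```r
library(dplyr)

# mtcars ships with R; the variable names below come from that data set
cars_summary <- mtcars |>
  filter(cyl %in% c(4, 6)) |>        # keep rows with 4 or 6 cylinders
  select(mpg, cyl, wt) |>            # keep three columns
  rename(weight = wt) |>             # NEW_VAR = EXISTING_VAR
  mutate(kpl = mpg * 0.425) |>       # new variable: roughly km per liter
  relocate(kpl) |>                   # move kpl to the first column
  arrange(desc(kpl)) |>              # sort rows from high to low kpl
  group_by(cyl) |>                   # one group per cylinder count
  summarize(mean_kpl = mean(kpl),    # collapse each group to one row
            n = n())

cars_summary
```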

Graph tables, add labels, make notes

Use pipes to summarize data

  • Let’s describe how the distribution of religious preferences varies by region in the US using the socviz::gss_sm data.frame.

Graph tables, add labels, make notes

Use pipes to summarize data

  • Group the data into the nested structure we want for our summary, such as “Religion by Region” or “Authors by Publications by Year”.

  • Filter or select pieces of the data by row, column, or both.

  • Mutate the data by creating new variables at the current level of grouping.

    • This adds new columns to the table without aggregating it.
  • Summarize or aggregate the grouped data.

    • This creates new variables (e.g., means with mean(), sums with sum(), and counts with n()) at a higher level of grouping.

Graph tables, add labels, make notes

Use pipes to summarize data

rel_by_region <- gss_sm |>
    group_by( bigregion, religion ) |>
    summarize( N = n() ) |>
    mutate( freq = N / sum(N),
            pct = round( (freq*100), 0) )
  • We use the dplyr functions, group_by(), filter(), select(), mutate(), and summarize(), to carry out these data transformation tasks within our pipeline (|>, Ctrl/Cmd + Shift + M).

  • Here we create a new data.frame called rel_by_region.
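
  • One detail worth checking: after summarize(), the last grouping variable is dropped, so the mutate() step computes sum(N) within each region. The sketch below shows this on a tiny made-up table (not the GSS data). The names toy, toy_summary, and check are hypothetical.

```r
library(dplyr)

# Toy data: hypothetical respondents, not the GSS
toy <- tibble(
  region   = c("East", "East", "East", "West", "West"),
  religion = c("A", "A", "B", "A", "B")
)

toy_summary <- toy |>
  group_by(region, religion) |>
  summarize(N = n(), .groups = "drop_last") |>  # result is still grouped by region
  mutate(freq = N / sum(N))                     # so sum(N) is computed per region

# Each region's shares should add up to 1
check <- toy_summary |> summarize(total = sum(freq))
check
```

  • Setting .groups = "drop_last" only makes dplyr's default behavior explicit here.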

Graph tables, add labels, make notes

Use pipes to summarize data

p <- ggplot( rel_by_region, 
             aes( x = bigregion, 
                  y = pct, 
                  fill = religion))

p + geom_col( position = "dodge2" ) +
    labs(x = "Region", 
         y = "Percent", 
         fill = "Religion") +
    theme(legend.position 
            = "top") 

  • Now that we are working directly with percentage values in a summary table, we can use geom_col() instead of geom_bar().
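
  • A small sketch of the difference, assuming ggplot2 is installed. Both data frames (shares and raw) are made up for this example.

```r
library(ggplot2)

# Hypothetical summary table: one row per category, value already computed
shares <- data.frame(group = c("a", "b", "c"), pct = c(50, 30, 20))

# geom_col() draws bar heights from the y values as given
p_col <- ggplot(shares, aes(x = group, y = pct)) +
  geom_col()

# geom_bar() counts rows by default, so it expects raw (unsummarized) data
raw <- data.frame(group = rep(c("a", "b", "c"), times = c(5, 3, 2)))
p_bar <- ggplot(raw, aes(x = group)) +
  geom_bar()

p_col
p_bar
```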

Graph tables, add labels, make notes

Continuous variables by group or category

organdata
skimr::skim(organdata)
  • Let’s move to a new dataset, the socviz::organdata data.frame.

Graph tables, add labels, make notes

Continuous variables by group or category

p <- ggplot(data = organdata,
            mapping = 
              aes(x = year, 
                  y = donors))
p + geom_point()

Graph tables, add labels, make notes

Continuous variables by group or category

p <- ggplot(data = organdata,
            mapping = aes(x = year, 
                          y = donors))
p + 
  geom_line(aes(group = country)) + 
  facet_wrap(~ country) +
  theme(axis.text.x = 
          element_text(angle = 45))

  • The above describes a yearly trend of donors for each country.

Graph tables, add labels, make notes

Continuous variables by group or category

  • Let’s focus on the country-level variation of donors using geom_boxplot(), but without paying attention to the time trend.

Graph tables, add labels, make notes

Continuous variables by group or category

levels(organdata$country)
organdata <- organdata |>
  mutate(country = 
           fct_reorder(country, 
                       donors, na.rm = TRUE) ) 
  • We can reorder the levels using fct_reorder(.f, .x, .fun), which takes three main arguments.
    • .f: the factor whose levels we want to modify.
    • .x: a numeric vector that we want to use to reorder the levels.
    • Optionally, .fun: a function that’s used if there are multiple values of .x for each value of .f. The default is median.
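
  • To see the reordering in isolation, here is a sketch on made-up vectors, assuming the forcats package is installed. The toy data has no missing values, so na.rm is not needed here.

```r
library(forcats)

# Made-up data: three groups with two values each
f <- factor(c("a", "a", "b", "b", "c", "c"))
x <- c(10, 12, 1, 3, 5, 7)

# Default level order is alphabetical
levels(f)

# Reorder by the group median of x (a = 11, b = 2, c = 6)
by_median <- fct_reorder(f, x)
levels(by_median)

# Reorder by the group maximum instead, largest first
by_max_desc <- fct_reorder(f, x, .fun = max, .desc = TRUE)
levels(by_max_desc)
```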

Graph tables, add labels, make notes

Continuous variables by group or category

p <- ggplot(data = organdata,
            mapping = 
              aes(x = country, 
                  y = donors))
p + geom_boxplot() + 
  labs(x = NULL) + 
  coord_flip()

Graph tables, add labels, make notes

Continuous variables by group or category

p <- ggplot(data = organdata,
            mapping = 
              aes(x = 
                    fct_reorder(
                      country, 
                      donors, 
                      na.rm=TRUE),
                  y = donors, 
                  fill = world))
p + geom_boxplot() + 
  labs(x=NULL) +
  coord_flip() + 
  theme(legend.position = "top")

Graph tables, add labels, make notes

Continuous variables by group or category

  • Sometimes it is better to sort the categories of a categorical variable when plotting a bar chart or a Cleveland dotplot.

Graph tables, add labels, make notes

Continuous variables by group or category

by_country <- organdata |> 
  group_by(consent_law, country) |>
  summarize(donors_mean= mean(donors, na.rm = TRUE),
            donors_sd = sd(donors, na.rm = TRUE),
            gdp_mean = mean(gdp, na.rm = TRUE),
            health_mean = mean(health, na.rm = TRUE),
            roads_mean = mean(roads, na.rm = TRUE),
            cerebvas_mean = mean(cerebvas, na.rm = TRUE))
  • Summarize the data.frame organdata to calculate the mean of several numeric variables, and the standard deviation of donors, for each consent_law-country pair.

  • Would there be a simpler way to do the task above?

Graph tables, add labels, make notes

Continuous variables by group or category

by_country <- organdata |> 
  group_by(consent_law, country) |>
  summarize_if(is.numeric, lst(mean, sd), na.rm = TRUE) |>
  ungroup()
  • What we would like to do is apply the mean() and sd() functions to every numerical variable in organdata, but only the numerical ones.

    • summarize_if(is.numeric, lst(mean, sd), na.rm = TRUE) works really well.
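
  • In current dplyr the scoped verbs such as summarize_if() are superseded by across(). The sketch below shows one possible across() version on a made-up table standing in for organdata. The names toy_organ and by_country_toy are hypothetical, and this is an alternative rather than the approach used in the slides.

```r
library(dplyr)

# Toy data standing in for organdata; all values are made up
toy_organ <- tibble(
  country = c("A", "A", "B", "B"),
  donors  = c(10, 14, 20, NA),
  gdp     = c(100, 120, 200, 220),
  note    = c("x", "y", "z", "w")   # non-numeric, so across() skips it
)

by_country_toy <- toy_organ |>
  group_by(country) |>
  summarize(
    across(where(is.numeric),
           list(mean = \(v) mean(v, na.rm = TRUE),
                sd   = \(v) sd(v, na.rm = TRUE))),
    .groups = "drop"                 # return an ungrouped table
  )

# Columns are named <variable>_<function>, e.g. donors_mean
names(by_country_toy)
```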

Graph tables, add labels, make notes

Continuous variables by group or category

  • Cleveland dotplots are generally preferred to bar or column charts.
    • When making them, put the categories on the y-axis and order them in the way that is most relevant to the numerical summary you are providing.
    • This sort of plot is also an excellent way to summarize model results or any data with error ranges.
  • Using geom_pointrange(), we can tell ggplot to show us a point estimate and a range around it.
    • With geom_pointrange(), we map our x and y variables as usual, but the function needs a little more information than geom_point(), for example (ymin, ymax) or (xmin, xmax).

Graph tables, add labels, make notes

Dot-and-whisker plot
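
  • A sketch of a dot-and-whisker plot with geom_pointrange(), assuming ggplot2 is installed. The table est holds made-up means and standard deviations in the same shape as by_country. The range of one standard deviation either side of the mean is a choice made for this example.

```r
library(ggplot2)

# Made-up group means and standard deviations, standing in for by_country
est <- data.frame(
  country     = c("A", "B", "C"),
  donors_mean = c(12, 20, 16),
  donors_sd   = c(2, 5, 3)
)

p_range <- ggplot(est,
                  aes(x = reorder(country, donors_mean),  # order categories by the mean
                      y = donors_mean)) +
  geom_pointrange(aes(ymin = donors_mean - donors_sd,     # lower end of the whisker
                      ymax = donors_mean + donors_sd)) +  # upper end of the whisker
  labs(x = NULL, y = "Mean donors (+/- 1 SD)") +
  coord_flip()                                            # put categories on the y-axis

p_range
```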

Understanding Scales, Guides, and Themes

  • We have used scale_x_log10(), scale_x_continuous() and other scale_*_*() functions to adjust axis labels.

  • We used the guides() function to remove the legends for a color mapping and a label mapping.

  • We also used the theme() function to move the position of a legend from the side to the top of a figure.

  • What are the differences between the scale_*_*() functions, the guides() function, and the theme() function?

  • When do we know to use one rather than the other? Why are there so many scale_*_*() functions? How can we tell which one we need?

Understanding Scales, Guides, and Themes

  • Here is a rough and ready starting point:

  • Every aesthetic mapping has a scale.

    • If we want to adjust how that scale is marked or graduated, then we use a scale_*_*() function.

Understanding Scales, Guides, and Themes

  • Many scales come with a legend or key to help the reader interpret the graph. These are called guides.
    • We can make adjustments to them with the guides() function.
    • Perhaps the most common use case is to make the legend disappear.
    • Another is to adjust the arrangement of the key in legends and colorbars.
    • guide is also one of the arguments of the scale_*_*() functions.

Understanding Scales, Guides, and Themes

  • Graphs have other features not strictly connected to the logical structure of the data being displayed.

    • These include things like their background color, the typeface used for labels, or the placement of the legend on the graph.

    • To adjust these, use the theme() function.

Understanding Scales, Guides, and Themes

p <- ggplot(data = organdata,
            mapping = aes(x = roads, y = donors, color = world))
p + geom_point()
  • scale_*_*() and guides() are closely connected.
    • A guide provides information about a scale, such as in a legend or colorbar.
    • So, it is possible to make adjustments to guides from inside the various scale_*_*() functions.
  • The x and y scales are both continuous.
  • The color scale is discrete.
    • A color or fill mapping can also be a continuous quantity (colorbar).
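
  • To make the division of labor concrete, here is a sketch using the mpg data that ships with ggplot2 (not organdata). The object names base and p1 to p4 are made up for this example.

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point()

# 1. Adjust the scale itself: axis breaks, legend title and labels
p1 <- base +
  scale_x_continuous(breaks = 2:7) +
  scale_color_discrete(name = "Drive train",
                       labels = c("4wd", "front", "rear"))

# 2. Adjust the guide: drop the color legend, via guides() ...
p2 <- base + guides(color = "none")

# ... or from inside the scale function, using its guide argument
p3 <- base + scale_color_discrete(guide = "none")

# 3. Adjust non-data appearance: legend placement
p4 <- base + theme(legend.position = "top")
```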

Understanding Scales, Guides, and Themes

scale_<MAPPING>_<KIND>

  • Because we have several potential mappings, and each mapping might be to one of several different scales, we end up with a lot of individual scale_*_*() functions.
    • Each deals with one combination of mapping and scale. There are too many to memorize.
    • They are named according to a consistent logic: scale_<MAPPING>_<KIND>().
  • https://ggplot2tor.com/apps provides a complete guide to scales and themes, as well as aesthetics.
    • This app makes it easy for you to find the right scales, themes, and arguments for your variable types and aesthetics.
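
  • A few instances of the naming pattern, sketched with the mpg data from ggplot2. The choice of functions and the object name p_scales are only for illustration.

```r
library(ggplot2)

p_scales <- ggplot(mpg, aes(x = displ, y = hwy, color = cty)) +
  geom_point() +
  scale_x_log10() +                        # mapping: x,     kind: log10
  scale_y_continuous(limits = c(0, 50)) +  # mapping: y,     kind: continuous
  scale_color_viridis_c()                  # mapping: color, kind: viridis, continuous

p_scales
```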

Design with Colorblind in Mind

Visualization Principle

Design with Colorblind in Mind

Types of Colorblindness

  • Roughly 8% of men and half a percent of women are colorblind.

  • There are several techniques to make visualization more colorblind-friendly:

    1. Use color palettes that are colorblind-friendly
    2. Use shape for scatterplots and linetype for line charts
    3. Have some additional visual cue to set the important numbers apart

Colorblind-Friendly Color Palettes

ggthemes package

install.packages("ggthemes")
library(ggthemes)
  • The ggthemes package provides various themes for ggplot2 visualization:
    • Accessible color palettes, including those optimized for colorblind viewers.
      • E.g., scale_color_colorblind(), scale_color_tableau()
    • Unique, predefined themes for specific styles
      • E.g., theme_economist(), theme_wsj()

Colorblind-Friendly Color Palettes

ggthemes::scale_color_colorblind()

ggplot( data = mpg,
        mapping = 
          aes(x = displ,
              y = hwy, 
              color = class) ) + 
  geom_point(size = 3) +
  scale_color_colorblind()

  • When color is mapped in aes(), we can use a scale_color_*() function to control the palette.

Colorblind-Friendly Color Palettes

ggthemes::scale_color_tableau()

ggplot( data = mpg,
        mapping = 
          aes(x = displ,
              y = hwy, 
              color = class) ) + 
  geom_point(size = 3) +
  scale_color_tableau()

  • scale_color_tableau() provides color palettes used in Tableau.

ggplot Themes

ggthemes::theme_economist()

ggplot( data = mpg,
        mapping = 
          aes(x = displ,
              y = hwy, 
              color = class) ) + 
  geom_point(size = 3) +
  scale_color_colorblind() +
  theme_economist()

  • theme_economist() approximates the style of The Economist.

ggplot Themes

ggthemes::theme_wsj()

ggplot( data = mpg,
        mapping = 
          aes(x = displ,
              y = hwy, 
              color = class) ) + 
  geom_point(size = 3) +
  scale_color_colorblind() +
  theme_wsj()

  • theme_wsj() approximates the style of The Wall Street Journal.

Add Labels and Make Notes

Plot text directly

p <- ggplot(data = by_country,
            mapping = 
              aes(x = roads_mean, 
                  y = donors_mean))
p + geom_point() + 
  geom_text(mapping = 
              aes(label = country))

  • It can sometimes be useful to plot the labels along with the points in a scatterplot, or just plot informative labels directly using geom_text().

Add Labels and Make Notes

Plot text directly

p + geom_point() + 
  geom_text(mapping = 
              aes(label = country), 
            hjust = 0)

  • By default, the text is plotted right on top of the points.
  • hjust = 0 will left justify the label; hjust = 1 will right justify it.

Add Labels and Make Notes

Plot text directly

install.packages("ggrepel")
library(ggrepel)
  • Instead of wrestling any further with geom_text(), we can use ggrepel::geom_text_repel() instead.

Add Labels and Make Notes

Plot text directly

socviz::elections_historic |> select(2:7) 
  • Let’s use some historical U.S. presidential election data provided in the socviz package.

Add Labels and Make Notes

Plot text directly

p <- ggplot(elections_historic, aes(x = popular_pct, y = ec_pct,
                                    label = winner_label))
  • Step 1. Let’s map aesthetics to variables.

Add Labels and Make Notes

Plot text directly

p <- p + 
    geom_hline(yintercept = 0.5, 
               linewidth = 1.4, 
               color = "gray80") +
    geom_vline(xintercept = 0.5, 
               linewidth = 1.4, 
               color = "gray80") +
    geom_point() +
    geom_text_repel()
p

  • Step 2. Then, add geometric objects to ggplot.

Add Labels and Make Notes

Plot text directly

p <- p + 
     scale_x_continuous(
       labels = scales::percent) +
     scale_y_continuous(
       labels = scales::percent) 
p

  • Step 3. Let’s set the scales for x and y.
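
  • To see what the labels argument does, the formatter can be called on its own. A sketch, assuming the scales package is installed. The input numbers are made up.

```r
library(scales)

# percent() turns proportions into percentage labels
lab_default <- percent(c(0.25, 0.5))
lab_default                                    # "25%" "50%"

# label_percent() builds a formatter; accuracy sets the rounding
lab_one_decimal <- label_percent(accuracy = 0.1)(0.4567)
lab_one_decimal                                # "45.7%"
```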

Add Labels and Make Notes

Plot text directly

p_title <- "Presidential Elections: Popular & Electoral College Margins"
p_subtitle <- "1824-2016"
p_caption <- "Data for 2016 are provisional."
x_label <- "Winner's share of Popular Vote"
y_label <- "Winner's share of Electoral College Votes"

p + labs(x = x_label, 
         y = y_label, 
         title = p_title, 
         subtitle = p_subtitle,
         caption = p_caption)

  • Step 4. Let’s set the labels.

Add Labels and Make Notes

Label Outliers

p <- ggplot(data = by_country,
            mapping = 
              aes(x = gdp_mean, 
                  y = health_mean))

p + geom_point() +
    geom_text_repel(
      data = filter(by_country, 
                    gdp_mean > 25000),
      mapping = 
        aes(label = country))

  • Sometimes we want to pick out some points of interest in the data without labeling every single item.
    • We do this using the filter() function.

Add Labels and Make Notes

Label Outliers

p <- ggplot(data = by_country,
            mapping = 
              aes(x = gdp_mean, 
                  y = health_mean))
p + geom_point() +
    geom_text_repel(
      data = 
        filter(by_country,
               gdp_mean > 25000 | 
                 health_mean < 1500 |
                 country %in% "Belgium"),
      mapping = 
        aes(label = country))

Add Labels and Make Notes

Label Outliers

# creating a dummy variable for labels
organdata <- organdata |> 
  mutate(ind = ccode %in% 
                  c("Ita", "Spa") & 
               year > 1998)  

p <- ggplot(data = organdata,
            mapping = 
              aes(x = roads, 
                  y = donors, 
                  color = ind))
p + 
  geom_point() +
  geom_text_repel(
    data = filter(organdata, ind),
    mapping = aes(label = ccode)) +
  guides(label = "none", 
         color = "none")

  • We can also add a logical variable (either TRUE or FALSE) to label specific points using filter().

Add Labels and Make Notes

Write and draw in the plot area with annotate(geom = "text")

p <- ggplot(data = organdata, 
            mapping = 
              aes(x = roads, 
                  y = donors))
p + geom_point() + 
  annotate(geom = "text", 
           x = 91, y = 33,
           label = "A surprisingly high \n recovery rate.",
           hjust = 0)

  • We can use annotate() to annotate the figure directly.

Add Labels and Make Notes

Write and draw in the plot area with annotate(geom = "rect")

p <- ggplot(data = organdata,
            mapping = aes(x = roads, y = donors))
p + geom_point() +
    annotate(geom = "rect", 
             xmin = 125, xmax = 155,
             ymin = 30, ymax = 35, 
             fill = "red", 
             alpha = 0.2) + 
    annotate(geom = "text", 
             x = 157, y = 33,
             label = "A surprisingly high \n recovery rate.", 
             hjust = 0)

Add Labels and Make Notes

Write and draw in the plot area with annotate(geom = "point")

p <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(
    data = 
      filter(mpg, 
             manufacturer == "subaru"), 
    color = "orange", 
    size = 3) +
  geom_point() 
p + 
  annotate(geom = "point", 
           x = 5.5, y = 40, 
           colour = "orange", 
           size = 3) + 
  annotate(geom = "point", 
           x = 5.5, y = 40) + 
  annotate(geom = "text", 
           x = 5.6, y = 40, 
           label = "subaru", 
           hjust = "left")

Add Labels and Make Notes

Write and draw in the plot area with annotate(geom = "curve")

p + 
  annotate(
    geom = "curve", 
    x = 4, y = 35, 
    xend = 2.65, yend = 27, 
    curvature = .3, 
    arrow = 
      arrow(length = unit(2, "mm"))
  ) +
  annotate(geom = "text", 
           x = 4.1, y = 35, 
           label = "subaru", 
           hjust = "left")