Graph tables, add labels, make notes
February 12, 2025
ggplot()
Step 1. Figure out whether the variables of interest are categorical or continuous.
Step 2. Think which geometric objects, aesthetic mappings, and faceting are appropriate to visualize distributions and relationships.
Step 3. If needed, transform the given data.frame (e.g., filter observations, create new variables, summarize data) and try new visualizations.
ggplot()
geom_bar() (and more)
geom_histogram() (and more)
geom_bar() (and more)
geom_point() with geom_smooth() (and more)
geom_boxplot() (and more)
geom_bar() (and more)
geom_line() (and more)
We often need to get a data.frame into the right shape before we send it to ggplot to be turned into a figure.
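We use dplyr’s “action verbs” to filter, select, group, mutate, summarize, and transform our data.
We use dplyr functions to solve various data manipulation challenges.
dplyr basics
Pick observations by their values (filter()).
Reorder the rows (arrange()).
Pick variables by their names (select()).
Rename variables (rename()).
Create new variables as functions of existing variables (mutate()).
Change the order of the columns (relocate()).
Collapse many values down to a single summary (summarize()).
Make the verbs above operate group by group (group_by()).
In RStudio: Tools -> Global Options -> Code -> Check “Use native pipe operator”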
Because the first argument is a data.frame and the output is a data.frame, dplyr verbs work well with the pipe, |>.
The pipe (|>) takes the thing on its left and passes it along to the function on its right, so that f(x, y) is equivalent to x |> f(y).
For example, filter(DATA_FRAME, LOGICAL_STATEMENT) is equivalent to DATA_FRAME |> filter(LOGICAL_STATEMENT).
The easiest way to pronounce the pipe (|>) is “then”.
The pipe (|>) is super useful when we have a chain of data transforming operations to do.
DATA_FRAME |> filter(LOGICAL_CONDITIONS)
DATA_FRAME |> arrange(VARIABLES)
DATA_FRAME |> select(VARIABLES)
DATA_FRAME |> rename(NEW_VAR = EXISTING_VAR)
DATA_FRAME |> mutate(NEW_VARIABLE = ... )
DATA_FRAME |> relocate(VARIABLES)
DATA_FRAME |> group_by(VARIABLES)
DATA_FRAME |> summarize(NEW_VARIABLE = ...)
The subsequent arguments describe what to do with the data.frame, mostly using variable names.
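The pipe and the verb templates above can be sketched on a small example. This is only an illustration: it assumes the mpg data set that ships with ggplot2 as a stand-in data.frame, and the names nested, piped, hwy_by_class, and hwy_kpl are made up for the sketch.

```r
library(dplyr)
library(ggplot2)  # loaded only for its built-in mpg data set (a stand-in)

# f(x, y) and x |> f(y) are the same call written two ways.
nested <- filter(mpg, manufacturer == "subaru")
piped  <- mpg |> filter(manufacturer == "subaru")
identical(nested, piped)

# A chain of verbs: read each pipe as "then".
hwy_by_class <- mpg |>
  filter(year == 2008) |>                  # keep the 2008 models
  select(manufacturer, class, hwy) |>      # keep three columns
  mutate(hwy_kpl = hwy * 0.4251) |>        # miles per gallon -> km per litre
  group_by(class) |>                       # one group per vehicle class
  summarize(n = n(),
            hwy_kpl_mean = mean(hwy_kpl)) |>
  arrange(desc(hwy_kpl_mean))              # highest mean first

hwy_by_class
```

Each verb takes a data.frame as its first argument and returns a data.frame, which is why the steps can be chained without intermediate objects.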
Consider the socviz::gss_sm data.frame.
Group the data into the nested structure we want for our summary, such as “Religion by Region” or “Authors by Publications by Year”.
Filter or select pieces of the data by row, column, or both.
Mutate the data by creating new variables at the current level of grouping.
Summarize or aggregate the grouped data.
Summarizing creates new variables (e.g., means with mean(), sums with sum(), and counts with n()) at a higher level of grouping.
We use the dplyr functions group_by(), filter(), select(), mutate(), and summarize() to carry out these data transformation tasks within our pipeline (|>, Ctrl/Cmd + Shift + M).
Here we create a new data.frame called rel_by_region.
Because rel_by_region is already summarized, we use geom_col() instead of geom_bar().
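One way rel_by_region might be built and plotted is sketched below. It assumes that gss_sm has the columns bigregion and religion; the new column names N, freq, and pct and the dodged layout are choices made for this sketch, not necessarily the ones used in the lecture.

```r
library(dplyr)
library(ggplot2)
library(socviz)   # provides the gss_sm data.frame

rel_by_region <- gss_sm |>
  group_by(bigregion, religion) |>               # "Religion by Region"
  summarize(N = n(), .groups = "drop_last") |>   # result is still grouped by bigregion
  mutate(freq = N / sum(N),                      # share within each region
         pct  = round(freq * 100, 0))

# The heights are already computed, so geom_col() plots them as they are.
p <- ggplot(rel_by_region,
            aes(x = bigregion, y = pct, fill = religion)) +
  geom_col(position = "dodge2")
p
```

Because pct is rounded, the percentages within a region may not add up to exactly 100.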
Consider the socviz::organdata data.frame.
Visualize donors for each country.
Visualize donors using geom_boxplot(), but without paying attention to the time trend.
To reorder the levels of a factor, we can use fct_reorder(f, x, fun), which can take three arguments.
f: the factor whose levels we want to modify.
x: a numeric vector that we want to use to reorder the levels.
fun: a function that’s used if there are multiple values of x for each value of f. The default value is median.
library(dplyr)
library(socviz)   # provides organdata
by_country <- organdata |>
  group_by(consent_law, country) |>
  summarize(donors_mean = mean(donors, na.rm = TRUE),
            donors_sd = sd(donors, na.rm = TRUE),
            gdp_mean = mean(gdp, na.rm = TRUE),
            health_mean = mean(health, na.rm = TRUE),
            roads_mean = mean(roads, na.rm = TRUE),
            cerebvas_mean = mean(cerebvas, na.rm = TRUE))
Summarize the data.frame organdata to calculate the mean and the standard deviation of each numeric variable for each consent_law-country pair.
Would there be a simpler way to do the task above?
What we would like to do is apply the mean() and sd() functions to every numerical variable in organdata, but only the numerical ones.
summarize_if(is.numeric, lst(mean, sd), na.rm = TRUE) works really well.
With geom_pointrange(), we can tell ggplot to show us a point estimate and a range around it.
For geom_pointrange(), we map our x and y variables as usual, but the function needs a little more information than geom_point(), for example (ymin, ymax) or (xmin, xmax).
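The summary step and the point-range plot can be sketched together. This sketch makes two substitutions that should be read as assumptions: it uses the mpg data set from ggplot2 in place of organdata, and it uses across(where(is.numeric), ...), which current dplyr documents as the replacement for the superseded summarize_if(). The names by_class, hwy_mean, and hwy_sd belong to the sketch only.

```r
library(dplyr)
library(ggplot2)   # also supplies the mpg data set used as a stand-in
library(forcats)

# Mean and standard deviation of every numeric column, per class.
# across() names the results <column>_<function>, e.g. hwy_mean, hwy_sd.
by_class <- mpg |>
  group_by(class) |>
  summarize(across(where(is.numeric),
                   list(mean = \(x) mean(x, na.rm = TRUE),
                        sd   = \(x) sd(x, na.rm = TRUE))),
            .groups = "drop")

# Point estimate (mean) with a range of one standard deviation on each side.
# fct_reorder() orders the classes by their mean instead of alphabetically.
p <- ggplot(by_class,
            aes(x = hwy_mean,
                y = fct_reorder(class, hwy_mean))) +
  geom_pointrange(aes(xmin = hwy_mean - hwy_sd,
                      xmax = hwy_mean + hwy_sd)) +
  labs(x = "Highway miles per gallon (mean +/- 1 SD)", y = NULL)
p
```

Mapping xmin and xmax draws horizontal ranges; mapping ymin and ymax instead would draw vertical ones.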
We have used scale_x_log10(), scale_x_continuous(), and other scale_*_*() functions to adjust axis labels.
We used the guides() function to remove the legends for a color mapping and a label mapping.
We also used the theme() function to move the position of a legend from the side to the top of a figure.
What are the differences between the scale_*_*() functions, the guides() function, and the theme() function?
When do we know to use one rather than the other? Why are there so many scale_*_*() functions? How can we tell which one we need?
Here is a rough and ready starting point:
Every aesthetic mapping has a scale. If you want to adjust how that scale is marked or graduated, use a scale_*_*() function.
Many scales come with a legend or key to help the reader interpret the graph. These are called guides. You can make adjustments to them with the guides() function.
guide is also one of the parameters in the scale_*_*() functions.
Graphs have other features not strictly connected to the logical structure of the data being displayed.
These include things like their background color, the typeface used for labels, or the placement of the legend on the graph.
To adjust these, use the theme() function.
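The division of labor described above can be sketched in one plot. The example assumes the mpg data set from ggplot2; the legend labels and title are invented for illustration.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  # scale_*_*(): how a mapped variable is marked or graduated
  scale_x_continuous(breaks = 2:7) +
  scale_color_discrete(labels = c("4" = "Four-wheel",
                                  "f" = "Front-wheel",
                                  "r" = "Rear-wheel")) +
  # guides(): the legend or key that explains a scale
  guides(color = guide_legend(title = "Drive train")) +
  # theme(): features not tied to the data, such as legend placement
  theme(legend.position = "top")
p

# The guide can also be set from inside the scale function:
p_no_legend <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  scale_color_discrete(guide = "none")
```

A rough rule of thumb: if the change concerns how data values are shown, reach for a scale; if it concerns the legend itself, use guides(); if it would still make sense on an empty plot, use theme().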
scale_*_*() and guides() are closely connected.
guides() provides information about the scale, such as in a legend or colorbar.
scale_*_*() functions
The x and y scales are both continuous.
The color scale is discrete.
A color or fill mapping can also be a continuous quantity (colorbar).
The scale_*_*() functions are named following the scheme scale_<MAPPING>_<KIND>().
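The naming scheme can be read off a pair of small plots. This is a sketch using the mpg data set from ggplot2; the particular colors are arbitrary choices.

```r
library(ggplot2)

# MAPPING = color, KIND = manual: a discrete color scale, shown as a legend.
p_discrete <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  scale_x_continuous(breaks = 2:7) +             # MAPPING = x, KIND = continuous
  scale_color_manual(values = c("4" = "darkorange",
                                "f" = "steelblue",
                                "r" = "grey30"))

# MAPPING = color, KIND = gradient: a continuous color scale, shown as a colorbar.
p_continuous <- ggplot(mpg, aes(x = displ, y = hwy, color = cty)) +
  geom_point() +
  scale_y_log10() +                              # MAPPING = y, KIND = log10
  scale_color_gradient(low = "grey80", high = "navy")
```

The first part of the name after scale_ is the aesthetic being adjusted, and the second part says what kind of scale it is.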
Types of Colorblindness
Roughly 8% of men and half a percent of women are colorblind.
There are several techniques to make visualizations more colorblind-friendly:
Use shape for scatterplots and linetype for line charts.
ggthemes package
The ggthemes package provides various scales and themes for ggplot2 visualization:
scale_color_colorblind(), scale_color_tableau()
theme_economist(), theme_wsj()
ggthemes::scale_color_colorblind()
When we map a variable to color in aes(), we can use scale_color_*() to choose the colors.
ggthemes::scale_color_tableau()
scale_color_tableau() provides color palettes used in Tableau.
ggplot Themes: ggthemes::theme_economist()
theme_economist() approximates the style of The Economist.
ggplot Themes: ggthemes::theme_wsj()
theme_wsj() approximates the style of The Wall Street Journal.
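The ggthemes scales and themes named above can be tried on one base plot. The sketch assumes the mpg data set from ggplot2; mapping drv to shape as well as color follows the colorblind-friendly tip about using shape in scatterplots.

```r
library(ggplot2)
library(ggthemes)

# Base plot: drv is mapped to both color and shape, so the groups
# remain distinguishable even without color.
p <- ggplot(mpg, aes(x = displ, y = hwy, color = drv, shape = drv)) +
  geom_point()

p_colorblind <- p + scale_color_colorblind()   # colorblind-friendly palette
p_tableau    <- p + scale_color_tableau()      # Tableau palette
p_economist  <- p + theme_economist()          # style of The Economist
p_wsj        <- p + theme_wsj()                # style of The Wall Street Journal
```

A scale and a theme can be combined in the same plot, for example p + scale_color_colorblind() + theme_economist().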
We can add text labels with geom_text().
hjust = 0 will left justify the label; hjust = 1 will right justify it.
To avoid overlapping labels from geom_text(), we can use ggrepel::geom_text_repel() instead.
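A small sketch of the two labeling functions, assuming the built-in mtcars data with its row names copied into a hypothetical model column:

```r
library(ggplot2)
library(ggrepel)

# mtcars stores the car names as row names; copy them into a column
# so they can be mapped to the label aesthetic.
cars <- mtcars
cars$model <- rownames(mtcars)

p <- ggplot(cars, aes(x = wt, y = mpg, label = model)) +
  geom_point()

# geom_text(): labels are left-justified (hjust = 0) and nudged
# slightly to the right of each point; nearby labels can overlap.
p_text <- p + geom_text(hjust = 0, nudge_x = 0.05, size = 3)

# geom_text_repel(): labels are moved away from each other and from the points.
p_repel <- p + geom_text_repel(size = 3)
```

Comparing p_text and p_repel side by side shows why the repel version is usually easier to read when points are dense.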
The example below uses data from the socviz library and sets labels for x and y.
# p is assumed to be a ggplot object created earlier
p_title <- "Presidential Elections: Popular & Electoral College Margins"
p_subtitle <- "1824-2016"
p_caption <- "Data for 2016 are provisional."
x_label <- "Winner's share of Popular Vote"
y_label <- "Winner's share of Electoral College Votes"
p + labs(x = x_label,
         y = y_label,
         title = p_title,
         subtitle = p_subtitle,
         caption = p_caption)
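To label only some points, we can pass a subset of the data to geom_text_repel() with the filter() function.
library(dplyr)
library(ggplot2)
library(ggrepel)
library(socviz)   # provides organdata
# creating a dummy variable for labels
organdata <- organdata |>
  mutate(ind = ccode %in% c("Ita", "Spa") &
           year > 1998)
p <- ggplot(data = organdata,
            mapping = aes(x = roads,
                          y = donors,
                          color = ind))
p +
  geom_point() +
  # label only the rows where ind is TRUE
  geom_text_repel(
    data = filter(organdata, ind),
    mapping = aes(label = ccode)) +
  guides(label = "none",
         color = "none")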
We create a dummy variable (TRUE or FALSE) to label specific points using filter().
annotate(geom = "text")
We can use annotate() to annotate the figure directly.
annotate(geom = "rect")
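A sketch of text and rectangle annotations, assuming the mpg data set from ggplot2. The coordinates and the label are arbitrary values picked for illustration; annotate() takes them directly rather than from a data.frame.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

p_annotated <- p +
  # a shaded rectangle, positioned in data coordinates
  annotate(geom = "rect",
           xmin = 5, xmax = 7.2,
           ymin = 20, ymax = 30,
           fill = "red", alpha = 0.2) +
  # a text label just above the rectangle, left-justified at x = 5
  annotate(geom = "text",
           x = 5, y = 31,
           label = "A region of interest",
           hjust = 0)
p_annotated
```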
annotate(geom = "point")
library(dplyr)
library(ggplot2)
# highlight the subaru points with a larger orange point underneath
p <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(
    data = filter(mpg, manufacturer == "subaru"),
    color = "orange",
    size = 3) +
  geom_point()
# draw a matching point and a text label as a hand-made legend
p +
  annotate(geom = "point",
           x = 5.5, y = 40,
           color = "orange",
           size = 3) +
  annotate(geom = "point",
           x = 5.5, y = 40) +
  annotate(geom = "text",
           x = 5.6, y = 40,
           label = "subaru",
           hjust = "left")
annotate(geom = "curve")
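A sketch of a curved arrow annotation, assuming the mpg data set from ggplot2. The object top, the label text, and the label position are choices made for this sketch; the arrow target is computed from the data rather than typed in by hand.

```r
library(ggplot2)

# The car with the highest highway mileage (first one, if there are ties).
top <- mpg[which.max(mpg$hwy), ]

p_curve <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  annotate(geom = "text",
           x = 4.5, y = 42,
           label = "highest hwy in the data",
           hjust = "left") +
  # curved arrow from the label to the point
  annotate(geom = "curve",
           x = 4.4, y = 42,
           xend = top$displ + 0.1, yend = top$hwy,
           curvature = 0.2,
           arrow = arrow(length = unit(2, "mm")))
p_curve
```

The sign of curvature controls which way the curve bends, and a value of 0 gives a straight line.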