library(tidyverse)
library(skimr)
library(ggthemes)
library(gapminder)
Homework Assignment 4 - Example Answers
Multiple Choice Questions
Question 1
Which of the following is a core technology behind language models like ChatGPT, as mentioned in the lecture?
Answer: Deep Learning
Explanation: Language models like ChatGPT are based on neural networks and are a core application of deep learning. Decision trees and linear regression are not core technologies for language models.
Question 2
What is the primary purpose of storyboarding in data storytelling?
Answer: To provide a visual outline for content structure
Explanation: Storyboarding is used to visually plan and structure the content of a story, ensuring that the narrative flows logically and effectively.
Question 3
Which of the following statements about data visualizations is true?
Answer: They are useful for showing “what” is happening in the data.
Explanation: Data visualizations are excellent tools for understanding “what” is happening, but they often do not explain “why” trends occur, which requires additional analysis or narratives.
Question 4
In the ggplot2 syntax ggplot(data, aes(x, y)) + geom_point()
, what does aes()
stand for?
Answer: Aesthetic mappings
Explanation: The aes()
function in ggplot2 stands for aesthetic mappings, linking data variables to visual properties like x
, y
, color
, fill
or shape
.
Question 5
The equation \(\Delta\log(x) \approx \Delta x / x_{0}\) demonstrates that a small change in the natural logarithm of \(x\), from an initial value \(\log(x_0)\) to an ending value \(\log(x_{1})\), where the change in \(x\) is given by \(\Delta x = x_{1} - x_{0}\), can be approximated by:
Answer: The percentage change in \(x\)
Explanation: When \(x\) changes by a small amount, the natural logarithm of \(x\) changes approximately by the percentage change in \(x\).
Question 6
Which of the following statements about mapping and setting aesthetics in ggplot2
is FALSE?
Answer: You can set aesthetics manually within aes()
by assigning fixed values.
Explanation: Manually setting fixed aesthetics (e.g., color = "red"
) should be done outside of aes()
. Inside aes()
, values are mapped a variable in a data.frame.
Question 7
Which of the following is NOT a way to explicitly inform ggplot about the grouping structure in a line plot?
Answer: Using the size
aesthetic.
Explanation: The size
aesthetic does not explicitly inform ggplot about grouping in a line plot, whereas group
, color
, and linetype
do.
Question 8
If you have a data frame with a date variable and want to plot a time series for each category in a variable called group_var
, what is the minimal aesthetic mapping required in ggplot2
to correctly plot separate lines for each group?
Answer: Both mapping = aes(x = date_var, y = value_var, color = group_var)
and mapping = aes(x = date_var, y = value_var, group = group_var)
Explanation: Both options correctly separate lines for each group. group
ensures proper grouping, while color
differentiates groups visually.
Question 9
When creating a vertical boxplot in ggplot2
, which of the following mappings is correct?
Answer: Map a categorical variable to x
and a numeric variable to y
.
Explanation: In a vertical boxplot, the x-axis typically represents categories, while the y-axis represents the numeric variable whose distribution is being summarized.
Question 10
Which of the following functions is used to apply a color-blind friendly palette to the fill
aesthetic in ggplot2
?
Answer: scale_fill_tableau()
or scale_fill_manual()
Explanation: - The scale_fill_tableau()
function, part of the ggthemes
package, extends ggplot2 with Tableau-inspired themes and scales, including color-blind friendly palettes.
- The
scale_fill_manual()
function can used to apply a custom color palette, including color-blind friendly palettes, to the fill aesthetic in ggplot2.
Data Visualization with ggplot
The followings are the R packages for this homework assignment:
Questions 11-17
Consider the following titanic
data.frame for Questions 11-17:
<- read_csv("https://bcdanl.github.io/data/titanic_cleaned.csv") titanic
Question 11
How would you create the following data.frame, titanic_class_survival
?
- The
titanic_class_survival
data.frame counts the number of passengers who survived and those who did not survive within eachclass
in thetitanic
data.frame.
Complete the code by filling in the blanks.
<- titanic |>
__BLANK 1__ count(__BLANK 2__)
Answer:
<- titanic |>
titanic_class_survival count(class, survived)
Question 12
How would you describe the variation in the distribution of age
across classes
and genders
?
Complete the code by filling in the blanks.
ggplot(data = __BLANK 1__,
mapping = aes(x = gender,
__ = age,
__BLANK 2__ = gender)) +
__BLANK 3__(show.legend = F) +
__BLANK 4__(~class) +
__BLANK 5scale_fill_tableau()
Answer:
ggplot(data = titanic,
mapping = aes(x = gender,
y = age,
fill = gender)) +
geom_boxplot(show.legend = F) +
facet_wrap(~class) +
scale_fill_tableau()
Question 13
Provide a comment on the variation in the distribution of age
across classes
and genders
.
Answer:
- For both female and male groups, the
ages
of the first class passengers in the Titanic ranges wider than the second class and the third class. - For both female and male groups,the median of the first class passengers’s
ages
is higher than that of the second class and the third class. - The first quartile of female’s
age
is always lower than that of male’s across all classes. Particularly, the such gap is wider for the first class.
Question 14
How would you describe the variation in the distribution of survived
across classes
and genders
?
Complete the code by filling in the blanks.
ggplot(data = __BLANK 1__,
mapping = aes(__BLANK 2__ = class,
__ = survived)) +
__BLANK 3__() +
__BLANK 4__(~gender) +
__BLANK 5labs(x = "Proportion") +
scale_fill_tableau()
Answer:
ggplot(data = titanic,
mapping = aes(y = class,
fill = survived)) +
geom_bar() +
facet_wrap(~gender) +
labs(x = "Proportion") +
scale_fill_tableau()
Question 15
How would you describe the variation in the distribution of survived
across classes
and genders
?
Complete the code by filling in the blanks.
ggplot(data = __BLANK 1__,
mapping = aes(__BLANK 2__ = class,
__ = survived)) +
__BLANK 3__(position = __BLANK 5__) +
__BLANK 4__(~gender) +
__BLANK 6labs(x = "Proportion") +
scale_fill_tableau()
Answer:
ggplot(data = titanic,
mapping = aes(y = class,
fill = survived)) +
geom_bar(position = "fill") +
facet_wrap(~gender) +
labs(x = "Proportion") +
scale_fill_tableau()
Question 16
How would you describe the variation in the distribution of survived
across classes
and genders
?
Complete the code by filling in the blanks.
ggplot(data = __BLANK 1__,
mapping = aes(__BLANK 2__ = class,
__ = survived)) +
__BLANK 3__(position = __BLANK 5__) +
__BLANK 4__(~gender) +
__BLANK 6scale_fill_tableau()
Answer:
ggplot(data = titanic,
mapping = aes(y = class,
fill = survived)) +
geom_bar(position = "dodge") +
facet_wrap(~gender) +
scale_fill_tableau()
Question 17
Provide a comment on the variation in the distribution of survived
across classes
and genders
.
Answer:
- First-class passengers had the highest survival rate.
- Survival rates decline progressively in second and third classes.
- Female passengers generally had a much higher survival rate compared to male passengers.
- Female first-class passengers had the highest survival rates, followed by female second-class and third-class passengers.
- Male survival rates were considerably lower in all classes, with third-class males experiencing the lowest survival likelihood.
- These patterns may be attributed to the influence of both socioeconomic status and gender norms prevalent in the early 1900s.
Questions 18-20
Consider the following nyc_dogs
data.frame for Questions 18-20:
<- read_csv("https://bcdanl.github.io/data/nyc_dogs_cleaned.csv") nyc_dogs
- The
nyc_dogs
data.frame contains data on licensed dogs in New York city.
Question 18
How would you create the following data.frame, nyc_dogs_breeds
?
- The
nyc_dogs_breeds
data.frame counts the number of occurrences for each value in thebreed
variable in thenyc_dogs
data.frame.- The
nyc_dogs_breeds
data.frame keeps observations if- The number of occurrences (
n
) is greater than or equal to 2000; - The value of
breed
is not missing.
- The number of occurrences (
- The observations in the
nyc_dogs_breeds
data.frame is arranged byn
in descending order.
- The
Complete the code by filling in the blanks.
<- nyc_dogs |>
__BLANK 1__ |>
__BLANK 2__ filter(__BLANK 3__(breed)) |>
filter(__BLANK 4__) |>
arrange(__BLANK 5__)
Answer:
<- nyc_dogs |>
nyc_dogs_breeds count(breed) |>
filter(!is.na(breed)) |>
filter(n >= 2000) |>
arrange(-n) # or arrange(desc(n))
Question 19
How would you describe the distribution of breed
using the nyc_dogs_breeds
data.frame?
Complete the code by filling in the blanks.
ggplot(data = __BLANK 1__,
mapping = aes(x = __BLANK 1__,
+
__BLANK 3__)) __() __BLANK 4
Answer:
ggplot(data = nyc_dogs_breeds,
mapping = aes(x = n,
y = breed)) +
geom_col()
Question 20
How would you describe the distribution of breed
using the nyc_dogs_breeds
data.frame?
Complete the code by filling in the blanks.
ggplot(data = __BLANK 1__,
mapping = aes(x = __BLANK 1__,
+
__BLANK 3__)) __() +
__BLANK 4labs(y = "Breed")
Answer:
ggplot(data = nyc_dogs_breeds,
mapping = aes(x = n,
y = fct_reorder(breed, n))) +
geom_col() +
labs(y = "Breed")