Homework Assignment 4 - Example Answers

Author

Byeong-Hak Choe

Published

November 17, 2024

Modified

November 17, 2024

Multiple Choice Questions

Question 1

Which of the following is a core technology behind language models like ChatGPT, as mentioned in the lecture?


Answer: Deep Learning

Explanation: Language models like ChatGPT are based on neural networks and are a core application of deep learning. Decision trees and linear regression are not core technologies for language models.


Question 2

What is the primary purpose of storyboarding in data storytelling?


Answer: To provide a visual outline for content structure

Explanation: Storyboarding is used to visually plan and structure the content of a story, ensuring that the narrative flows logically and effectively.


Question 3

Which of the following statements about data visualizations is true?


Answer: They are useful for showing “what” is happening in the data.

Explanation: Data visualizations are excellent tools for understanding “what” is happening, but they often do not explain “why” trends occur, which requires additional analysis or narratives.

Question 4

In the ggplot2 syntax ggplot(data, aes(x, y)) + geom_point(), what does aes() stand for?


Answer: Aesthetic mappings

Explanation: The aes() function in ggplot2 stands for aesthetic mappings, linking data variables to visual properties like x, y, color, fill or shape.


Question 5

The equation \(\Delta\log(x) \approx \Delta x / x_{0}\) demonstrates that a small change in the natural logarithm of \(x\), from an initial value \(\log(x_0)\) to an ending value \(\log(x_{1})\), where the change in \(x\) is given by \(\Delta x = x_{1} - x_{0}\), can be approximated by:


Answer: The percentage change in \(x\)

Explanation: When \(x\) changes by a small amount, the natural logarithm of \(x\) changes approximately by the percentage change in \(x\).


Question 6

Which of the following statements about mapping and setting aesthetics in ggplot2 is FALSE?


Answer: You can set aesthetics manually within aes() by assigning fixed values.

Explanation: Manually setting fixed aesthetics (e.g., color = "red") should be done outside of aes(). Inside aes(), values are mapped a variable in a data.frame.


Question 7

Which of the following is NOT a way to explicitly inform ggplot about the grouping structure in a line plot?


Answer: Using the size aesthetic.

Explanation: The size aesthetic does not explicitly inform ggplot about grouping in a line plot, whereas group, color, and linetype do.


Question 8

If you have a data frame with a date variable and want to plot a time series for each category in a variable called group_var, what is the minimal aesthetic mapping required in ggplot2 to correctly plot separate lines for each group?


Answer: Both mapping = aes(x = date_var, y = value_var, color = group_var) and mapping = aes(x = date_var, y = value_var, group = group_var)

Explanation: Both options correctly separate lines for each group. group ensures proper grouping, while color differentiates groups visually.


Question 9

When creating a vertical boxplot in ggplot2, which of the following mappings is correct?


Answer: Map a categorical variable to x and a numeric variable to y.

Explanation: In a vertical boxplot, the x-axis typically represents categories, while the y-axis represents the numeric variable whose distribution is being summarized.


Question 10

Which of the following functions is used to apply a color-blind friendly palette to the fill aesthetic in ggplot2?


Answer: scale_fill_tableau() or scale_fill_manual()

Explanation: - The scale_fill_tableau() function, part of the ggthemes package, extends ggplot2 with Tableau-inspired themes and scales, including color-blind friendly palettes.

  • The scale_fill_manual() function can used to apply a custom color palette, including color-blind friendly palettes, to the fill aesthetic in ggplot2.



Data Visualization with ggplot

The followings are the R packages for this homework assignment:

library(tidyverse)
library(skimr)
library(ggthemes)
library(gapminder)

Questions 11-17

Consider the following titanic data.frame for Questions 11-17:

titanic <- read_csv("https://bcdanl.github.io/data/titanic_cleaned.csv")


Question 11

How would you create the following data.frame, titanic_class_survival?

  • The titanic_class_survival data.frame counts the number of passengers who survived and those who did not survive within each class in the titanic data.frame.

Complete the code by filling in the blanks.

__BLANK 1__ <- titanic |> 
  count(__BLANK 2__)


Answer:

titanic_class_survival <- titanic |> 
  count(class, survived)


Question 12

How would you describe the variation in the distribution of age across classes and genders?

Complete the code by filling in the blanks.

ggplot(data = __BLANK 1__,
       mapping = aes(x = gender,
                     __BLANK 2__ = age,
                     __BLANK 3__ = gender)) +
  __BLANK 4__(show.legend = F) +
  __BLANK 5__(~class) +
  scale_fill_tableau()


Answer:

ggplot(data = titanic,
       mapping = aes(x = gender,
                     y = age,
                     fill = gender)) +
  geom_boxplot(show.legend = F) +
  facet_wrap(~class) +
  scale_fill_tableau()


Question 13

Provide a comment on the variation in the distribution of age across classes and genders.


Answer:

  • For both female and male groups, the ages of the first class passengers in the Titanic ranges wider than the second class and the third class.
  • For both female and male groups,the median of the first class passengers’s ages is higher than that of the second class and the third class.
  • The first quartile of female’s age is always lower than that of male’s across all classes. Particularly, the such gap is wider for the first class.


Question 14

How would you describe the variation in the distribution of survived across classes and genders?

Complete the code by filling in the blanks.

ggplot(data = __BLANK 1__,
       mapping = aes(__BLANK 2__ = class,
                     __BLANK 3__ = survived)) +
  __BLANK 4__() +
  __BLANK 5__(~gender) +
  labs(x = "Proportion") +
  scale_fill_tableau()


Answer:

ggplot(data = titanic,
       mapping = aes(y = class,
                     fill = survived)) +
  geom_bar() +
  facet_wrap(~gender) +
  labs(x = "Proportion") +
  scale_fill_tableau()


Question 15

How would you describe the variation in the distribution of survived across classes and genders?

Complete the code by filling in the blanks.

ggplot(data = __BLANK 1__,
       mapping = aes(__BLANK 2__ = class,
                     __BLANK 3__ = survived)) +
  __BLANK 4__(position = __BLANK 5__) +
  __BLANK 6__(~gender) +
  labs(x = "Proportion") +
  scale_fill_tableau()


Answer:

ggplot(data = titanic,
       mapping = aes(y = class,
                     fill = survived)) +
  geom_bar(position = "fill") +
  facet_wrap(~gender) +
  labs(x = "Proportion") +
  scale_fill_tableau()


Question 16

How would you describe the variation in the distribution of survived across classes and genders?

Complete the code by filling in the blanks.

ggplot(data = __BLANK 1__,
       mapping = aes(__BLANK 2__ = class,
                     __BLANK 3__ = survived)) +
  __BLANK 4__(position = __BLANK 5__) +
  __BLANK 6__(~gender) +
  scale_fill_tableau()


Answer:

ggplot(data = titanic,
       mapping = aes(y = class,
                     fill = survived)) +
  geom_bar(position = "dodge") +
  facet_wrap(~gender) +
  scale_fill_tableau()


Question 17

Provide a comment on the variation in the distribution of survived across classes and genders.


Answer:

  • First-class passengers had the highest survival rate.
    • Survival rates decline progressively in second and third classes.
  • Female passengers generally had a much higher survival rate compared to male passengers.
    • Female first-class passengers had the highest survival rates, followed by female second-class and third-class passengers.
    • Male survival rates were considerably lower in all classes, with third-class males experiencing the lowest survival likelihood.
  • These patterns may be attributed to the influence of both socioeconomic status and gender norms prevalent in the early 1900s.


Questions 18-20

Consider the following nyc_dogs data.frame for Questions 18-20:

nyc_dogs <- read_csv("https://bcdanl.github.io/data/nyc_dogs_cleaned.csv")
  • The nyc_dogs data.frame contains data on licensed dogs in New York city.


Question 18

How would you create the following data.frame, nyc_dogs_breeds?

  • The nyc_dogs_breeds data.frame counts the number of occurrences for each value in the breed variable in the nyc_dogs data.frame.
    • The nyc_dogs_breeds data.frame keeps observations if
      1. The number of occurrences (n) is greater than or equal to 2000;
      2. The value of breed is not missing.
    • The observations in the nyc_dogs_breeds data.frame is arranged by n in descending order.

Complete the code by filling in the blanks.

__BLANK 1__ <- nyc_dogs |> 
  __BLANK 2__ |> 
  filter(__BLANK 3__(breed)) |> 
  filter(__BLANK 4__) |> 
  arrange(__BLANK 5__)


Answer:

nyc_dogs_breeds <- nyc_dogs |> 
  count(breed) |> 
  filter(!is.na(breed)) |> 
  filter(n >= 2000) |> 
  arrange(-n)  # or arrange(desc(n))


Question 19

How would you describe the distribution of breed using the nyc_dogs_breeds data.frame?

Complete the code by filling in the blanks.

ggplot(data = __BLANK 1__,
       mapping = aes(x = __BLANK 1__,
                     __BLANK 3__)) +
  __BLANK 4__()


Answer:

ggplot(data = nyc_dogs_breeds,
       mapping = aes(x = n,
                     y = breed)) +
  geom_col()


Question 20

How would you describe the distribution of breed using the nyc_dogs_breeds data.frame?

Complete the code by filling in the blanks.

ggplot(data = __BLANK 1__,
       mapping = aes(x = __BLANK 1__,
                     __BLANK 3__)) +
  __BLANK 4__() +
  labs(y = "Breed")


Answer:

ggplot(data = nyc_dogs_breeds,
       mapping = aes(x = n,
                     y = fct_reorder(breed, n))) +
  geom_col() +
  labs(y = "Breed")


Back to top