Homework Assignment 4 - Example Answers

Author

Byeong-Hak Choe

Published

November 17, 2024

Modified

November 17, 2024

Multiple Choice Questions

Question 1

Which of the following is a core technology behind language models like ChatGPT, as mentioned in the lecture?

- Decision Trees
- Deep Learning
- Linear Regression
- All of the Above

Answer: Deep Learning

Explanation: Language models like ChatGPT are based on neural networks and are a core application of deep learning. Decision trees and linear regression are not core technologies for language models.

Question 2

What is the primary purpose of storyboarding in data storytelling?

- To collect data from stakeholders
- To provide a visual outline for content structure
- To design complex data visualizations
- To perform statistical analysis

Answer: To provide a visual outline for content structure

Explanation: Storyboarding is used to visually plan and structure the content of a story, ensuring that the narrative flows logically and effectively.

Question 3

Which of the following statements about data visualizations is true?

- They always explain the underlying reasons behind data trends.
- They are sufficient on their own to drive data-informed decisions.
- They are useful for showing “what” is happening in the data.
- They eliminate the need for narratives in data storytelling.

Answer: They are useful for showing “what” is happening in the data.

Explanation: Data visualizations are excellent tools for understanding “what” is happening, but they often do not explain “why” trends occur, which requires additional analysis or narratives.

Question 4

In the ggplot2 syntax ggplot(data, aes(x, y)) + geom_point(), what does aes() stand for?

- Aesthetic mappings
- Arithmetic expressions
- Axis equal scaling
- Average error squares

Answer: Aesthetic mappings

Explanation: The aes() function in ggplot2 stands for aesthetic mappings, linking data variables to visual properties like x, y, color, fill or shape.

Question 5

The equation \(\Delta\log(x) \approx \Delta x / x_{0}\) demonstrates that a small change in the natural logarithm of \(x\), from an initial value \(\log(x_0)\) to an ending value \(\log(x_{1})\), where the change in \(x\) is given by \(\Delta x = x_{1} - x_{0}\), can be approximated by:

- The absolute change in \(x\)
- The percentage change in \(x\)
- The square of \(x\)
- The reciprocal of \(x\)

Answer: The percentage change in \(x\)

Explanation: When \(x\) changes by a small amount, the natural logarithm of \(x\) changes by approximately \(\Delta x / x_{0}\), which is the percentage change in \(x\) expressed as a decimal.
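
One way to see where the approximation comes from is the first-order Taylor expansion \(\log(1 + u) \approx u\) for small \(u\). The derivation below is a sketch under that assumption, using only the symbols defined in the question:

\[
\Delta\log(x) = \log(x_{1}) - \log(x_{0}) = \log\left(\frac{x_{1}}{x_{0}}\right) = \log\left(1 + \frac{\Delta x}{x_{0}}\right) \approx \frac{\Delta x}{x_{0}}
\]

As an illustrative check with made-up numbers, take \(x_{0} = 100\) and \(x_{1} = 102\): then \(\Delta x / x_{0} = 0.02\), while \(\log(1.02) \approx 0.0198\). The two values are close because the change is small; the gap widens as \(\Delta x / x_{0}\) grows.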

Question 6

Which of the following statements about mapping and setting aesthetics in ggplot2 is FALSE?

- Mapping aesthetics involves linking data variables to visual properties within aes().
- Setting aesthetics manually is done outside of aes() to set fixed visual properties.
- You can set aesthetics manually within aes() by assigning fixed values.
- Mapping aesthetics allows different categories to be represented by different colors or shapes.

Answer: You can set aesthetics manually within aes() by assigning fixed values.

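Explanation: Manually setting fixed aesthetics (e.g., color = "red") should be done outside of aes(). Inside aes(), a visual property is mapped to a variable in a data.frame, so a constant such as "red" placed inside aes() is treated as data rather than as a literal color.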
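
The difference between mapping and setting can be sketched with a small example. The data.frame df and its columns x, y, and grp are made up for illustration and are not part of the assignment data:

```r
library(ggplot2)

# Toy data, invented for illustration only
df <- data.frame(x   = c(1, 2, 3, 4),
                 y   = c(2, 4, 3, 5),
                 grp = c("a", "a", "b", "b"))

# Mapping: color varies with the grp variable, so it goes inside aes()
p_map <- ggplot(df, aes(x = x, y = y, color = grp)) +
  geom_point()

# Setting: every point gets the same fixed color, so it goes outside aes()
p_set <- ggplot(df, aes(x = x, y = y)) +
  geom_point(color = "red")
```

In p_map, ggplot2 picks one color per value of grp and adds a legend; in p_set, all points are red and no legend is drawn.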

Question 7

Which of the following is NOT a way to explicitly inform ggplot about the grouping structure in a line plot?

- Using the group aesthetic.
- Using the color aesthetic.
- Using the linetype aesthetic.
- Using the size aesthetic.

Answer: Using the size aesthetic.

Explanation: The size aesthetic does not explicitly inform ggplot about grouping in a line plot, whereas group, color, and linetype do.

Question 8

If you have a data frame with a date variable and want to plot a time series for each category in a variable called group_var, what is the minimal aesthetic mapping required in ggplot2 to correctly plot separate lines for each group?

- mapping = aes(x = date_var, y = value_var)
- mapping = aes(x = date_var, y = value_var, color = group_var)
- mapping = aes(x = date_var, y = value_var, group = group_var)
- Both mapping = aes(x = date_var, y = value_var, color = group_var) and mapping = aes(x = date_var, y = value_var, group = group_var)

Answer: Both mapping = aes(x = date_var, y = value_var, color = group_var) and mapping = aes(x = date_var, y = value_var, group = group_var)

Explanation: Both options correctly separate lines for each group. group ensures proper grouping, while color differentiates groups visually.
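
A minimal sketch of the two mappings, assuming a toy data.frame ts_df whose values are invented for illustration (the column names follow the question):

```r
library(ggplot2)

# Toy time series with two groups; values are made up
ts_df <- data.frame(
  date_var  = rep(as.Date(c("2024-01-01", "2024-02-01", "2024-03-01")), times = 2),
  value_var = c(1, 2, 3, 3, 2, 4),
  group_var = rep(c("a", "b"), each = 3)
)

# group: one line per group, all drawn in the same color
p_group <- ggplot(ts_df, aes(x = date_var, y = value_var, group = group_var)) +
  geom_line()

# color: one line per group, each with its own color and a legend
p_color <- ggplot(ts_df, aes(x = date_var, y = value_var, color = group_var)) +
  geom_line()
```

Both plots draw two separate lines; only p_color makes the groups visually distinguishable.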

Question 9

When creating a vertical boxplot in ggplot2, which of the following mappings is correct?

- Map both the x and y aesthetics to categorical variables.
- Map a numeric variable to x and a categorical variable to y.
- Map a categorical variable to x and a numeric variable to y.
- Map both the x and y aesthetics to numeric variables.

Answer: Map a categorical variable to x and a numeric variable to y.

Explanation: In a vertical boxplot, the x-axis typically represents categories, while the y-axis represents the numeric variable whose distribution is being summarized.

Question 10

Which of the following functions is used to apply a color-blind friendly palette to the fill aesthetic in ggplot2?

- scale_fill_manual()
- scale_fill_tableau()
- scale_color_gradient()
- scale_x_continuous()

Answer: scale_fill_tableau() or scale_fill_manual()

Explanation:
- The scale_fill_tableau() function, part of the ggthemes package, extends ggplot2 with Tableau-inspired themes and scales, including color-blind friendly palettes.
- The scale_fill_manual() function can be used to apply a custom color palette, including color-blind friendly palettes, to the fill aesthetic in ggplot2.
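
As a sketch of the scale_fill_manual() route, the example below supplies three hex colors from the Okabe-Ito palette, a commonly used color-blind friendly set. The data.frame df and the choice of palette are assumptions made for illustration, not part of the assignment:

```r
library(ggplot2)

# Three colors from the Okabe-Ito color-blind friendly palette
okabe_ito <- c("#E69F00", "#56B4E9", "#009E73")

# Toy data, invented for illustration only
df <- data.frame(category = c("a", "b", "c"),
                 n        = c(3, 5, 2))

# fill is mapped to category; the manual scale decides which colors are used
p <- ggplot(df, aes(x = category, y = n, fill = category)) +
  geom_col() +
  scale_fill_manual(values = okabe_ito)
```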

Data Visualization with `ggplot`

The following R packages are used in this homework assignment:

library(tidyverse)
library(skimr)
library(ggthemes)
library(gapminder)

Questions 11-17

Consider the following titanic data.frame for Questions 11-17:

titanic <- read_csv("https://bcdanl.github.io/data/titanic_cleaned.csv")

Question 11

How would you create the following data.frame, titanic_class_survival?

The titanic_class_survival data.frame counts the number of passengers who survived and those who did not survive within each class in the titanic data.frame.

Complete the code by filling in the blanks.

__BLANK 1__ <- titanic |> 
  count(__BLANK 2__)

Answer:

titanic_class_survival <- titanic |> 
  count(class, survived)
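
To see what count() does with two variables, here is a small sketch on a hypothetical tibble toy. Its values are invented and may not match how class and survived are coded in the actual titanic data:

```r
library(dplyr)

# Hypothetical five-passenger data set, for illustration only
toy <- tibble(
  class    = c("1st", "1st", "2nd", "2nd", "2nd"),
  survived = c("yes", "no", "no", "no", "yes")
)

# One row per (class, survived) combination, with the number of rows in column n
toy_counts <- toy |>
  count(class, survived)
```

Here toy_counts has four rows, one for each combination that appears in toy, and the n column sums to the five original rows.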

Question 12

How would you describe the variation in the distribution of age across classes and genders?

Complete the code by filling in the blanks.

ggplot(data = __BLANK 1__,
       mapping = aes(x = gender,
                     __BLANK 2__ = age,
                     __BLANK 3__ = gender)) +
  __BLANK 4__(show.legend = FALSE) +
  __BLANK 5__(~class) +
  scale_fill_tableau()

Answer:

ggplot(data = titanic,
       mapping = aes(x = gender,
                     y = age,
                     fill = gender)) +
  geom_boxplot(show.legend = FALSE) +
  facet_wrap(~class) +
  scale_fill_tableau()

Question 13

Provide a comment on the variation in the distribution of age across classes and genders.

Answer:

For both female and male passengers, the ages of first-class passengers span a wider range than those of second-class and third-class passengers.
For both female and male passengers, the median age of first-class passengers is higher than that of second-class and third-class passengers.
The first quartile of age is lower for female passengers than for male passengers in every class. This gap is widest in the first class.

Question 14

How would you describe the variation in the distribution of survived across classes and genders?

Complete the code by filling in the blanks.

ggplot(data = __BLANK 1__,
       mapping = aes(__BLANK 2__ = class,
                     __BLANK 3__ = survived)) +
  __BLANK 4__() +
  __BLANK 5__(~gender) +
  labs(x = "Proportion") +
  scale_fill_tableau()

Answer:

ggplot(data = titanic,
       mapping = aes(y = class,
                     fill = survived)) +
  geom_bar() +
  facet_wrap(~gender) +
  labs(x = "Proportion") +
  scale_fill_tableau()

Question 15

How would you describe the variation in the distribution of survived across classes and genders?

Complete the code by filling in the blanks.

ggplot(data = __BLANK 1__,
       mapping = aes(__BLANK 2__ = class,
                     __BLANK 3__ = survived)) +
  __BLANK 4__(position = __BLANK 5__) +
  __BLANK 6__(~gender) +
  labs(x = "Proportion") +
  scale_fill_tableau()

Answer:

ggplot(data = titanic,
       mapping = aes(y = class,
                     fill = survived)) +
  geom_bar(position = "fill") +
  facet_wrap(~gender) +
  labs(x = "Proportion") +
  scale_fill_tableau()
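
With position = "fill", each bar is rescaled so that its segments sum to 1, which is why the axis shows proportions instead of counts. The same proportions can be computed by hand, as in the sketch below; the tibble toy and its values are hypothetical and used only for illustration:

```r
library(dplyr)

# Hypothetical passenger data, for illustration only
toy <- tibble(
  class    = c("1st", "1st", "1st", "1st", "3rd", "3rd", "3rd", "3rd"),
  survived = c("yes", "yes", "yes", "no", "yes", "no", "no", "no")
)

# Count each (class, survived) pair, then divide by the class total
toy_props <- toy |>
  count(class, survived) |>
  group_by(class) |>
  mutate(prop = n / sum(n)) |>
  ungroup()
```

Within each class the prop values add up to 1, matching the full-width bars that position = "fill" draws.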

Question 16

How would you describe the variation in the distribution of survived across classes and genders?

Complete the code by filling in the blanks.

ggplot(data = __BLANK 1__,
       mapping = aes(__BLANK 2__ = class,
                     __BLANK 3__ = survived)) +
  __BLANK 4__(position = __BLANK 5__) +
  __BLANK 6__(~gender) +
  scale_fill_tableau()

Answer:

ggplot(data = titanic,
       mapping = aes(y = class,
                     fill = survived)) +
  geom_bar(position = "dodge") +
  facet_wrap(~gender) +
  scale_fill_tableau()

Question 17

Provide a comment on the variation in the distribution of survived across classes and genders.

Answer:

First-class passengers had the highest survival rate.
- Survival rates decline progressively in second and third classes.
Female passengers generally had a much higher survival rate compared to male passengers.
- Female first-class passengers had the highest survival rates, followed by female second-class and third-class passengers.
- Male survival rates were considerably lower in all classes, with third-class males experiencing the lowest survival likelihood.
These patterns may be attributed to the influence of both socioeconomic status and gender norms prevalent in the early 1900s.

Questions 18-20

Consider the following nyc_dogs data.frame for Questions 18-20:

nyc_dogs <- read_csv("https://bcdanl.github.io/data/nyc_dogs_cleaned.csv")

The nyc_dogs data.frame contains data on licensed dogs in New York City.

Question 18

How would you create the following data.frame, nyc_dogs_breeds?

The nyc_dogs_breeds data.frame counts the number of occurrences for each value in the breed variable in the nyc_dogs data.frame.
- The nyc_dogs_breeds data.frame keeps observations if
  1. The number of occurrences (n) is greater than or equal to 2000;
  2. The value of breed is not missing.
- The observations in the nyc_dogs_breeds data.frame are arranged by n in descending order.

Complete the code by filling in the blanks.

__BLANK 1__ <- nyc_dogs |> 
  __BLANK 2__ |> 
  filter(__BLANK 3__(breed)) |> 
  filter(__BLANK 4__) |> 
  arrange(__BLANK 5__)

Answer:

nyc_dogs_breeds <- nyc_dogs |> 
  count(breed) |> 
  filter(!is.na(breed)) |> 
  filter(n >= 2000) |> 
  arrange(-n)  # or arrange(desc(n))

Question 19

How would you describe the distribution of breed using the nyc_dogs_breeds data.frame?

Complete the code by filling in the blanks.

ggplot(data = __BLANK 1__,
       mapping = aes(x = __BLANK 2__,
                     __BLANK 3__)) +
  __BLANK 4__()

Answer:

ggplot(data = nyc_dogs_breeds,
       mapping = aes(x = n,
                     y = breed)) +
  geom_col()

Question 20

How would you describe the distribution of breed using the nyc_dogs_breeds data.frame?

Complete the code by filling in the blanks.

ggplot(data = __BLANK 1__,
       mapping = aes(x = __BLANK 2__,
                     __BLANK 3__)) +
  __BLANK 4__() +
  labs(y = "Breed")

Answer:

ggplot(data = nyc_dogs_breeds,
       mapping = aes(x = n,
                     y = fct_reorder(breed, n))) +
  geom_col() +
  labs(y = "Breed")
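
The effect of fct_reorder() can be checked on its own. The sketch below uses made-up breed names and counts, not values from nyc_dogs_breeds:

```r
library(forcats)

# Hypothetical breeds and counts, for illustration only
breed <- c("Beagle", "Poodle", "Pug")
n     <- c(5, 20, 10)

# Default factor levels are alphabetical
alphabetical <- levels(factor(breed))

# fct_reorder() orders the levels by the values of n, smallest first
by_count <- levels(fct_reorder(breed, n))
```

Because ggplot2 places the first level at the bottom of the y-axis, ordering levels from smallest to largest n puts the longest bar at the top of the chart.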
