Homework 1

ggplot2 · dplyr Fundamentals

Author: Byeong-Hak Choe

Published: February 11, 2026

Modified: February 11, 2026

📌 Directions

  • Submit one Quarto document (.qmd) to Brightspace:

    • danl-310-hw1-LASTNAME-FIRSTNAME.qmd
      (e.g., danl-310-hw1-choe-byeonghak.qmd)
  • Due: February 18, 2026, 11:59 P.M. (ET)

  • For visualization questions, you must provide:

    1. the ggplot2 code, and
    2. a written comment (2–4 sentences) interpreting the corresponding figure.
  • Unless a question says otherwise, use dplyr verbs (filter(), distinct(), select(), mutate(), group_by(), summarise(), arrange(), count(), etc.) and ggplot2.


✅ Setup

library(tidyverse)
library(skimr)



Part 1. Data Visualization & Summaries

A. Orange Juice Promotions (oj)

Use the following dataset for Questions 1–6.

oj <- read_csv("https://bcdanl.github.io/data/dominick_oj_na.csv")

Question 1. Quick inspection

  1. Print the first 10 rows of oj.
  2. Use skimr::skim() (or another method) to summarize the variables.

# YOUR CODE HERE


Question 2. Brand-level descriptive statistics (filter + group summary)

Filter to the three brands: tropicana, minute.maid, and dominicks. Then compute mean and standard deviation of:

  • sales
  • price

for each brand.
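
Hint: the filter-then-summarise pattern can be sketched on a toy tibble. The names toy, group, and value below are hypothetical and are not columns of oj. The sketch assumes the columns may contain missing values, in which case mean() and sd() return NA unless na.rm = TRUE is set.

```r
library(dplyr)

# Toy data (hypothetical): not the homework dataset
toy <- tibble(
  group = c("a", "a", "b", "b", "c", "c"),
  value = c(1, 3, 10, NA, 5, 7)
)

toy_summary <- toy |>
  filter(group %in% c("a", "b")) |>          # keep only the groups of interest
  group_by(group) |>
  summarise(
    value_mean = mean(value, na.rm = TRUE),  # na.rm = TRUE skips missing values
    value_sd   = sd(value, na.rm = TRUE)     # sd of a single value is NA
  )

toy_summary
```

Group "b" has only one non-missing value, so its standard deviation is NA even with na.rm = TRUE.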

# YOUR CODE HERE

Write 2–3 sentences comparing the three brands based on your results.


Question 3. Remove missing values

Create a new data frame oj_no_na that removes observations with missing values in either price or sales.

  1. Show the number of rows in oj and oj_no_na.
  2. Report how many rows were removed.
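
Hint: one possible way to drop rows with a missing value in either of two columns, sketched on a toy tibble. The names toy, x, and y are hypothetical. Other approaches (for example tidyr::drop_na()) would also work.

```r
library(dplyr)

# Toy data (hypothetical): two columns with some missing values
toy <- tibble(
  x = c(1, NA, 3, 4),
  y = c(10, 20, NA, 40)
)

# Keep rows where BOTH x and y are present,
# i.e., drop rows where EITHER one is missing
toy_no_na <- toy |>
  filter(!is.na(x), !is.na(y))

# Number of rows removed
n_removed <- nrow(toy) - nrow(toy_no_na)
n_removed
```
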

# YOUR CODE HERE


Question 4. Price distribution by brand (ggplot + comment)

Using oj_no_na, make a figure that compares the distribution of price across brand.

  • Choose ONE main approach (e.g., boxplot, violin plot, density plot, histogram with facets).
  • Include clear axis labels and a readable title.

(a) Provide your ggplot2 code:

# YOUR CODE HERE

(b) Comment (2–3 sentences): What differences do you see across brands (center, spread, skewness, outliers, etc.)?


Question 5. Log–log price–sales relationship by brand (ggplot + comment)

Using oj_no_na, visualize how the relationship between:

  • log10(sales) and log10(price)

varies by brand.

  • Use a scatter plot with transparency (e.g., alpha = 0.3).
  • Add a fitted line (e.g., geom_smooth(method = "lm", se = FALSE)), and
  • Use faceting OR color to distinguish brands.
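
Hint: a minimal sketch of a faceted log–log scatter plot with a fitted line, using simulated data. The tibble toy and its columns group, x, and y are hypothetical stand-ins, and the simulated relationship is an assumption made only so the plot has something to show.

```r
library(dplyr)
library(ggplot2)

# Simulated toy data (hypothetical): y tends to fall as x rises, in two groups
set.seed(1)
toy <- tibble(
  group = rep(c("a", "b"), each = 50),
  x     = runif(100, min = 1, max = 5),
  y     = 100 * x^(-1.5) * exp(rnorm(100, sd = 0.2))
)

p <- ggplot(toy, aes(x = log10(x), y = log10(y))) +
  geom_point(alpha = 0.3) +                  # transparency reduces overplotting
  geom_smooth(method = "lm", se = FALSE) +   # one straight-line fit per facet
  facet_wrap(~ group) +
  labs(
    x = "log10(x)",
    y = "log10(y)",
    title = "Toy log-log relationship by group"
  )

p
```

On a log–log scale, the slope of the fitted line can be read as an elasticity: the approximate percent change in y associated with a one percent change in x.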

(a) Provide your ggplot2 code:

# YOUR CODE HERE

(b) Comment (2–3 sentences): Do you see evidence that higher prices are associated with lower sales? Does the pattern differ by brand?


Question 6. Log–log relationship by brand and ad status (ggplot + comment)

Now extend Question 5 by incorporating ad_status. Visualize how the relationship between:

  • log10(sales) and log10(price)

varies by brand and ad_status.

  • Use faceting (facet_grid() or facet_wrap()), and/or
  • Use color/shape for ad_status.

(a) Provide your ggplot2 code:

# YOUR CODE HERE

(b) Comment (2–3 sentences): How does advertising status appear to shift the relationship (level shift, slope change, or no clear difference)?


C. Titanic Survival (titanic)

Use the following dataset for Questions 9–13.

titanic <- read_csv("https://bcdanl.github.io/data/titanic_cleaned.csv")

Question 9. Two-way count table

Create titanic_class_survival that counts passengers by class and survived.
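
Hint: count() accepts more than one variable and returns one row per observed combination. A sketch on a toy tibble (toy, group, and outcome are hypothetical names, not columns of titanic):

```r
library(dplyr)

# Toy data (hypothetical)
toy <- tibble(
  group   = c("a", "a", "a", "b", "b"),
  outcome = c("yes", "no", "yes", "no", "no")
)

# One row per (group, outcome) combination, with the count in column n
toy_counts <- toy |>
  count(group, outcome)

toy_counts
```

Combinations that never occur in the data (here, group "b" with outcome "yes") do not appear in the result.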

# YOUR CODE HERE


Question 10. Age distribution by class and gender (ggplot + comment)

Visualize how the distribution of age varies across class and gender.

(a) Provide your ggplot2 code:

# YOUR CODE HERE

(b) Comment (2–3 sentences): What differences do you see across class and gender?


Question 11. Survival rate by class and gender (data + ggplot + comment)

  1. Create a summary table with the survival rate (proportion survived) by class and gender.
  2. Visualize the survival rates.
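
Hint: a proportion within each group can be computed as the mean of a logical condition. The sketch below uses a toy tibble with hypothetical names (toy, g1, g2, outcome) and assumes the outcome is coded 0/1. Check how survived is actually coded in titanic before adapting it.

```r
library(dplyr)

# Toy data (hypothetical): outcome is assumed to be coded 0/1
toy <- tibble(
  g1      = c("a", "a", "a", "b", "b", "b"),
  g2      = c("x", "x", "y", "x", "y", "y"),
  outcome = c(1, 0, 1, 0, 0, 1)
)

toy_rates <- toy |>
  group_by(g1, g2) |>
  summarise(
    n    = n(),
    rate = mean(outcome == 1),  # share of rows in the group with outcome == 1
    .groups = "drop"            # return an ungrouped tibble
  )

toy_rates
```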

(a) Provide your dplyr code for the summary table:

# YOUR CODE HERE

(b) Provide your ggplot2 code:

# YOUR CODE HERE

(c) Comment (2–3 sentences): Which groups have the highest/lowest survival rates?


Question 12. Conditional distribution of survived (ggplot + comment)

Create a plot that shows the distribution of survived across class and gender using proportions (not raw counts).
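
Hint: one common way to show proportions rather than counts in a bar chart is geom_bar(position = "fill"), which rescales each bar to a height of 1. This is a sketch on toy data with hypothetical names (toy, g1, g2, outcome). Other designs can work equally well.

```r
library(dplyr)
library(ggplot2)

# Toy data (hypothetical)
toy <- tibble(
  g1      = rep(c("a", "b"), each = 4),
  g2      = rep(c("x", "y"), times = 4),
  outcome = c("yes", "no", "yes", "yes", "no", "no", "yes", "no")
)

p <- ggplot(toy, aes(x = g1, fill = outcome)) +
  geom_bar(position = "fill") +   # each bar is rescaled to sum to 1
  facet_wrap(~ g2) +
  labs(x = "g1", y = "Proportion", fill = "Outcome")

p
```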

(a) Provide your ggplot2 code:

# YOUR CODE HERE

(b) Comment (2–3 sentences): How does using proportions (instead of counts) change your interpretation?


Question 13. Short interpretation

Write 3 bullet-point insights supported by your tables/figures in Questions 9–12.


D. NYC Dog Licenses (nyc_dogs)

Use the following dataset for Questions 14–16.

nyc_dogs <- read_csv("https://bcdanl.github.io/data/nyc_dogs_cleaned.csv")

Question 14. Breed frequency table

Create nyc_dogs_breeds with:

  • non-missing breed,
  • n >= 2000, and
  • sorted by n descending.
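
Hint: the three requirements map onto a filter, a count, a second filter, and a sort. A sketch on a toy tibble, where toy and kind are hypothetical names and the threshold of 2 is chosen only to fit the toy data:

```r
library(dplyr)

# Toy data (hypothetical), including one missing value
toy <- tibble(kind = c("a", "a", "a", "b", "b", "c", NA))

toy_kinds <- toy |>
  filter(!is.na(kind)) |>   # drop rows with a missing category
  count(kind) |>            # one row per category, count in column n
  filter(n >= 2) |>         # keep only the frequent categories
  arrange(desc(n))          # largest count first

toy_kinds
```
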

# YOUR CODE HERE


Question 15. Visualize the distribution of breeds (ggplot + comment)

Using nyc_dogs_breeds, make a figure that shows the distribution of breed counts.

(a) Provide your ggplot2 code:

# YOUR CODE HERE

(b) Comment (2–3 sentences): Is the distribution concentrated in a few breeds or spread out?


Question 16. “Top breeds” focus (ggplot + comment)

Create a plot of the top 10 breeds (by count) and comment on what you observe.

(a) Provide your dplyr + ggplot2 code:

# YOUR CODE HERE

(b) Comment (2–3 sentences): Any surprises? What might explain the pattern?




Part 2. Data Transformation (nyc_payroll_2025)

For Questions 17–27, use nyc_payroll_2025.

For variable descriptions, see: Citywide Payroll Data (Fiscal Year) on NYC Open Data.

nyc_payroll_2025 <- read_csv("https://bcdanl.github.io/data/nyc_payroll_2025.zip")

Question 17. Base salary by borough (filter + summarise)

Compute the mean and standard deviation of Base_Salary for workers whose Work_Location_Borough is:

  • "MANHATTAN"
  • "QUEENS"

Report the two means and two SDs in your write-up.

# YOUR CODE HERE


Question 18. High base salary filter

Filter the data to keep only records where Base_Salary >= 100000. Report how many rows remain.

# YOUR CODE HERE


Question 19. Distinct agency–title pairs

Select only distinct combinations of Agency_Name and Title_Description.
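
Hint: when distinct() is given column names, it keeps only those columns and returns one row per unique combination. A sketch on a toy tibble with hypothetical names (toy, dept, role, pay):

```r
library(dplyr)

# Toy data (hypothetical): the pair ("x", "r1") appears twice
toy <- tibble(
  dept = c("x", "x", "y", "y"),
  role = c("r1", "r1", "r1", "r2"),
  pay  = c(10, 12, 9, 15)
)

# Unique (dept, role) pairs; the pay column is dropped
toy_pairs <- toy |>
  distinct(dept, role)

toy_pairs
```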

# YOUR CODE HERE


Question 20. Top paid by regular gross pay

Arrange employees by Regular_Gross_Paid in descending order and show the top 10 rows with:

  • First_Name, Last_Name
  • Agency_Name
  • Title_Description
  • Regular_Gross_Paid

# YOUR CODE HERE


Question 21. Select + rename

Select Title_Description and rename it to Title. Also select Agency_Name, First_Name, Last_Name, and Base_Salary.
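
Hint: select() can rename a column while selecting it, using new_name = old_name. A sketch on a toy tibble with hypothetical names (toy, long_name, other, extra):

```r
library(dplyr)

# Toy data (hypothetical)
toy <- tibble(
  long_name = c("p", "q"),
  other     = c(1, 2),
  extra     = c(TRUE, FALSE)
)

# Keep two columns, renaming the first one in the same step
toy_selected <- toy |>
  select(short = long_name, other)

names(toy_selected)
```

An alternative is to call select() first and then rename(); both give the same columns.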

# YOUR CODE HERE


Question 22. Create new pay variables (mutate())

Use mutate() to create two new variables:

  • Total_Pay = Regular_Gross_Paid + Total_OT_Paid + Total_Other_Pay
  • OT_Share = Total_OT_Paid / Total_Pay

Then show the first 10 rows of:

  • First_Name, Last_Name
  • Agency_Name
  • Base_Salary
  • Total_Pay, OT_Share

# YOUR CODE HERE


Question 23. Borough pay summary (group_by() + summarise())

Using the variables you created in Question 22, group the data by Work_Location_Borough and compute:

  • number of employees n
  • mean Base_Salary
  • mean Total_Pay
  • mean OT_Share (ignore missing values)

Arrange the results by mean Total_Pay in descending order and show the summary table.
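
Hint: a sketch of a grouped summary over a derived share variable, on a toy tibble with hypothetical names (toy, region, base, extra, total, extra_share). It illustrates why missing values matter here: when a total is 0, the share is 0/0, which R evaluates to NaN, and mean() returns NaN unless na.rm = TRUE is set.

```r
library(dplyr)

# Toy data (hypothetical): the second row has a total of 0
toy <- tibble(
  region = c("n", "n", "s", "s"),
  base   = c(100, 0, 50, 50),
  extra  = c(20, 0, 0, 50)
) |>
  mutate(
    total       = base + extra,
    extra_share = extra / total   # 0 / 0 gives NaN
  )

toy_summary <- toy |>
  group_by(region) |>
  summarise(
    n          = n(),
    mean_base  = mean(base),
    mean_total = mean(total),
    mean_share = mean(extra_share, na.rm = TRUE)  # na.rm drops NA and NaN
  ) |>
  arrange(desc(mean_total))

toy_summary
```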

# YOUR CODE HERE


Question 24. Police Department overtime

Filter to Agency_Name == "POLICE DEPARTMENT" and arrange by Total_OT_Paid.

  1. Show the 10 smallest overtime values.
  2. Show the 10 largest overtime values.

# YOUR CODE HERE


Question 25. Per annum employees

Filter to Pay_Basis == "per Annum" and select only:

  • First_Name, Last_Name, Base_Salary

# YOUR CODE HERE


Question 26. Borough then salary

Arrange by Work_Location_Borough (ascending) and then Base_Salary (descending). Then show the first 15 rows.
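
Hint: arrange() sorts by its first argument and breaks ties with later ones, and desc() reverses the order for a single variable. A sketch on a toy tibble with hypothetical names (toy, region, pay):

```r
library(dplyr)

# Toy data (hypothetical)
toy <- tibble(
  region = c("b", "a", "b", "a"),
  pay    = c(1, 2, 3, 4)
)

# region ascending, then pay descending within each region
toy_sorted <- toy |>
  arrange(region, desc(pay))

head(toy_sorted, 3)   # show the first 3 rows
```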

# YOUR CODE HERE


Question 27. Remove missing last names + count

Remove observations where Last_Name is missing (NA). Then report the remaining number of rows.

# YOUR CODE HERE



Part 3. Quarto Blogging (ice_cream)

Use the following dataset for your blog post.

ice_cream <- read_csv("https://bcdanl.github.io/data/ben-and-jerry-cleaned.csv")

Write and publish a blog post about Ben & Jerry’s ice cream using ice_cream.

Your post must include:

  • At least 4 ggplot2 figures
    • Each figure must be followed by a short interpretation paragraph (2–6 sentences).
  • At least 2 summary tables created with group_by() + summarise() (or count() + related verbs).
  • Evidence of:
    • filtering,
    • sorting,
    • creating at least one new variable (mutate()), and
    • using at least one facet (facet_wrap() or facet_grid()).
  • A clear data story structure:
    1. a motivating question,
    2. what you did (briefly),
    3. what you found (with evidence),
    4. a takeaway / conclusion.

Important: Your plots and text should work together. Avoid disconnected charts.
