Homework 1
ggplot2 · dplyr Fundamentals
📌 Directions
Submit one Quarto document (.qmd) to Brightspace: danl-310-hw1-LASTNAME-FIRSTNAME.qmd
(e.g., danl-310-hw1-choe-byeonghak.qmd)
Due: February 18, 2026, 11:59 P.M. (ET)
For visualization questions, you must provide:
- the ggplot2 code, and
- a written comment (2–4 sentences) interpreting the corresponding figure.
Unless a question says otherwise, use dplyr verbs (filter(), distinct(), select(), mutate(), group_by(), summarise(), arrange(), count(), etc.) and ggplot2.
✅ Setup
library(tidyverse)
library(skimr)
Part 1. Data Visualization & Summaries
A. Orange Juice Promotions (oj)

Use the following dataset for Questions 1–6.
oj <- read_csv("https://bcdanl.github.io/data/dominick_oj_na.csv")
Question 1. Quick inspection
- Print the first 10 rows of oj.
- Use skimr::skim() (or another method) to summarize the variables.
# YOUR CODE HERE
Question 2. Brand-level descriptive statistics (filter + group summary)
Filter to the three brands: tropicana, minute.maid, and dominicks. Then compute mean and standard deviation of:
- sales
- price
for each brand.
# YOUR CODE HERE
Write 2–3 sentences comparing the three brands based on your results.
Question 3. Remove missing values
Create a new data frame oj_no_na that removes observations with missing values in either price or sales.
- Show the number of rows in oj and oj_no_na.
- Report how many rows were removed.
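The row-removal pattern can be sketched on a small made-up tibble. The name `toy` and its columns `x` and `y` are hypothetical stand-ins, not columns of oj; this is one possible approach, not the required one.

```r
library(dplyr)

# Hypothetical toy data; not the homework dataset
toy <- tibble(
  x = c(1.5, NA, 2.0, 3.1, NA),
  y = c(10, 20, NA, 40, NA)
)

# Keep only rows where neither x nor y is missing
toy_no_na <- toy %>%
  filter(!is.na(x), !is.na(y))

# Compare row counts before and after
n_before  <- nrow(toy)
n_after   <- nrow(toy_no_na)
n_removed <- n_before - n_after
```

tidyr::drop_na(x, y) would be an equivalent way to drop the same rows.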
# YOUR CODE HERE
Question 4. Price distribution by brand (ggplot + comment)
Using oj_no_na, make a figure that compares the distribution of price across brand.
- Choose ONE main approach (e.g., boxplot, violin plot, density plot, histogram with facets).
- Include clear axis labels and a readable title.
(a) Provide your ggplot2 code:
# YOUR CODE HERE
(b) Comment (2–3 sentences): What differences do you see across brands (center, spread, skewness, outliers, etc.)?
Question 5. Log–log price–sales relationship by brand (ggplot + comment)
Using oj_no_na, visualize how the relationship between:
log10(sales) and log10(price)
varies by brand.
- Use a scatter plot with transparency (e.g., alpha = 0.3).
- Add a fitted line (e.g., geom_smooth(method = "lm", se = FALSE)), and
- Use faceting OR color to distinguish brands.
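The three ingredients listed above (transparent points, a fitted line, and facets) can be combined as in the following minimal sketch. It uses simulated data: `toy`, `group`, `x`, and `y` are hypothetical names, and the axis assignment is only illustrative.

```r
library(ggplot2)

# Simulated toy data (hypothetical); not the homework dataset
set.seed(1)
toy <- data.frame(
  group = rep(c("a", "b"), each = 100),
  x     = runif(200, min = 1, max = 4)
)
toy$y <- 1000 * toy$x^(-2) * exp(rnorm(200, sd = 0.3))

# Scatter plot on log10 scales, one linear fit per facet
p <- ggplot(toy, aes(x = log10(x), y = log10(y))) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ group) +
  labs(
    x = "log10(x)",
    y = "log10(y)",
    title = "Toy log-log relationship by group"
  )
```

Mapping `color = group` inside aes() instead of faceting would draw both groups, each with its own fitted line, in a single panel.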
(a) Provide your ggplot2 code:
# YOUR CODE HERE
(b) Comment (2–3 sentences): Do you see evidence that higher prices are associated with lower sales? Does the pattern differ by brand?
Question 6. Log–log relationship by brand and ad status (ggplot + comment)
Now extend Question 5 by incorporating ad_status. Visualize how the relationship between:
log10(sales) and log10(price)
varies by brand and ad_status.
- Use faceting (facet_grid() or facet_wrap()), and/or
- Use color/shape for ad_status.
(a) Provide your ggplot2 code:
# YOUR CODE HERE
(b) Comment (2–3 sentences): How does advertising status appear to shift the relationship (level shift, slope change, or no clear difference)?
B. MLB Batting Trends (mlb_bat)

Use the following dataset for Questions 7–8.
mlb_bat <- read_csv("https://bcdanl.github.io/data/MLB_batting.csv")
Question 7. Create yearly hit percentages (data transformation)
Create a data frame mlb_hit_pct that contains yearly hit percentages for each hit_type (Single, Double, Triple, HomeRun).
Your final table should have:
- year
- hit_type
- hit_pct (a percentage or proportion)
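A within-group percentage can be computed by dividing each count by its group total. The sketch below assumes a long-format toy table with hypothetical columns `yr`, `kind`, and `n`; the real mlb_bat data may be shaped differently and may need reshaping or aggregation first.

```r
library(dplyr)

# Hypothetical long-format toy counts; not the homework dataset
toy <- tibble(
  yr   = c(2000, 2000, 2000, 2001, 2001, 2001),
  kind = c("A", "B", "C", "A", "B", "C"),
  n    = c(50, 30, 20, 40, 40, 20)
)

# Share of each kind within its year, expressed as a percentage
toy_pct <- toy %>%
  group_by(yr) %>%
  mutate(pct = 100 * n / sum(n)) %>%
  ungroup() %>%
  select(yr, kind, pct)
```

Because the division happens inside group_by(yr), the percentages sum to 100 within each year rather than across the whole table.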
# YOUR CODE HERE
Question 8. Visualize trends (ggplot + comment)
Make a figure that shows the yearly trends in hit percentages for each hit_type.
(a) Provide your ggplot2 code:
# YOUR CODE HERE
(b) Comment (2–3 sentences): Which hit types are increasing/decreasing over time? Any notable breaks or eras?
C. Titanic Survival (titanic)

Use the following dataset for Questions 9–13.
titanic <- read_csv("https://bcdanl.github.io/data/titanic_cleaned.csv")
Question 9. Two-way count table
Create titanic_class_survival that counts passengers by class and survived.
# YOUR CODE HERE
Question 10. Age distribution by class and gender (ggplot + comment)
Visualize how the distribution of age varies across class and gender.
(a) Provide your ggplot2 code:
# YOUR CODE HERE
(b) Comment (2–3 sentences): What differences do you see across class and gender?
Question 11. Survival rate by class and gender (data + ggplot + comment)
- Create a summary table with the survival rate (proportion survived) by class and gender.
- Visualize the survival rates.
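When an outcome is coded 0/1, its group mean is the proportion of 1s. The sketch below assumes that coding on a made-up tibble; `grp1`, `grp2`, and `outcome` are hypothetical names, and the actual coding of survived in titanic should be checked before relying on mean().

```r
library(dplyr)

# Hypothetical toy data with a 0/1 outcome; not the homework dataset
toy <- tibble(
  grp1    = c("x", "x", "x", "y", "y", "y"),
  grp2    = c("m", "m", "f", "m", "f", "f"),
  outcome = c(1, 0, 1, 0, 1, 1)
)

# Proportion of 1s (and group size) for each grp1 x grp2 combination
toy_rate <- toy %>%
  group_by(grp1, grp2) %>%
  summarise(rate = mean(outcome), n = n(), .groups = "drop")
```

Keeping the group size `n` next to the rate helps when interpreting rates that come from very small groups.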
(a) Provide your dplyr code for the summary table:
# YOUR CODE HERE
(b) Provide your ggplot2 code:
# YOUR CODE HERE
(c) Comment (2–3 sentences): Which groups have the highest/lowest survival rates?
Question 12. Conditional distribution of survived (ggplot + comment)
Create a plot that shows the distribution of survived across class and gender using proportions (not raw counts).
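One common way to show proportions rather than counts is a bar chart with position = "fill", which rescales every bar to height 1. This is a sketch on made-up data with hypothetical names (`toy`, `grp1`, `grp2`, `outcome`), not the only acceptable design.

```r
library(ggplot2)

# Hypothetical toy data; not the homework dataset
toy <- data.frame(
  grp1    = c("x", "x", "x", "y", "y", "y", "y", "y"),
  grp2    = c("m", "m", "f", "m", "f", "f", "m", "f"),
  outcome = factor(c(1, 0, 1, 0, 1, 1, 0, 0))
)

# Each bar is rescaled to 1, so fill segments show within-bar proportions
p <- ggplot(toy, aes(x = grp1, fill = outcome)) +
  geom_bar(position = "fill") +
  facet_wrap(~ grp2) +
  labs(
    x = "grp1",
    y = "Proportion",
    title = "Toy outcome proportions by grp1 and grp2"
  )
```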
(a) Provide your ggplot2 code:
# YOUR CODE HERE
(b) Comment (2–3 sentences): How does using proportions (instead of counts) change your interpretation?
Question 13. Short interpretation
Write 3 bullet-point insights supported by your tables/figures in Questions 9–12.
D. NYC Dog Licenses (nyc_dogs)

Use the following dataset for Questions 14–16.
nyc_dogs <- read_csv("https://bcdanl.github.io/data/nyc_dogs_cleaned.csv")
Question 14. Breed frequency table
Create nyc_dogs_breeds with:
- non-missing breed,
- n >= 2000, and
- sorted by n descending.
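The three requirements above map onto a filter, count, filter, arrange chain. The sketch uses a made-up column `item` and a threshold of 2 purely for illustration; both are assumptions, not values from nyc_dogs.

```r
library(dplyr)

# Hypothetical toy data; not the homework dataset
toy <- tibble(item = c("a", "b", "a", NA, "c", "a", "b", NA))

toy_counts <- toy %>%
  filter(!is.na(item)) %>%  # drop missing categories
  count(item) %>%           # one row per category with its count n
  filter(n >= 2) %>%        # keep categories at or above the threshold
  arrange(desc(n))          # largest counts first
```

count(item, sort = TRUE) would combine the counting and sorting steps into one call.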
# YOUR CODE HERE
Question 15. Visualize the distribution of breeds (ggplot + comment)
Using nyc_dogs_breeds, make a figure that shows the distribution of breed counts.
(a) Provide your ggplot2 code:
# YOUR CODE HERE
(b) Comment (2–3 sentences): Is the distribution concentrated in a few breeds or spread out?
Question 16. “Top breeds” focus (ggplot + comment)
Create a plot of the top 10 breeds (by count) and comment on what you observe.
(a) Provide your dplyr + ggplot2 code:
# YOUR CODE HERE
(b) Comment (2–3 sentences): Any surprises? What might explain the pattern?
Part 2. Data Transformation (nyc_payroll_2025)

For Questions 17–27, use nyc_payroll_2025.
For variable descriptions, see: Citywide Payroll Data (Fiscal Year) on NYC Open Data.
nyc_payroll_2025 <- read_csv("https://bcdanl.github.io/data/nyc_payroll_2025.zip")
Question 17. Base salary by borough (filter + summarise)
Compute the mean and standard deviation of Base_Salary for workers whose Work_Location_Borough is:
- "MANHATTAN"
- "QUEENS"
Report the two means and two SDs in your write-up.
# YOUR CODE HERE
Question 18. High base salary filter
Filter the data to keep only records where Base_Salary >= 100000. Report how many rows remain.
# YOUR CODE HERE
Question 19. Distinct agency–title pairs
Select only distinct combinations of Agency_Name and Title_Description.
# YOUR CODE HERE
Question 20. Top paid by regular gross pay
Arrange employees by Regular_Gross_Paid in descending order and show the top 10 rows with:
- First_Name, Last_Name
- Agency_Name
- Title_Description
- Regular_Gross_Paid
# YOUR CODE HERE
Question 21. Select + rename
Select Title_Description and rename it to Title. Also select Agency_Name, First_Name, Last_Name, and Base_Salary.
# YOUR CODE HERE
Question 22. Create new pay variables (mutate())
Use mutate() to create two new variables:
- Total_Pay = Regular_Gross_Paid + Total_OT_Paid + Total_Other_Pay
- OT_Share = Total_OT_Paid / Total_Pay
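Within a single mutate() call, a later expression can use a variable created earlier in the same call. The sketch below shows this on made-up columns (`base`, `extra`, `total`, `share` are hypothetical names) and also illustrates what happens when the denominator is zero.

```r
library(dplyr)

# Hypothetical toy data; not the homework dataset
toy <- tibble(
  base  = c(100, 200, 0),
  extra = c(50, 0, 0)
)

# share is defined from total, which is created in the same mutate() call
toy2 <- toy %>%
  mutate(
    total = base + extra,
    share = extra / total  # 0 / 0 yields NaN in the third row
  )
```

A ratio with a zero denominator produces NaN (or Inf), and is.na() treats NaN as missing, so such rows are dropped by na.rm = TRUE in later summaries.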
Then show the first 10 rows of:
- First_Name, Last_Name
- Agency_Name
- Base_Salary
- Total_Pay, OT_Share
# YOUR CODE HERE
Question 23. Borough pay summary (group_by() + summarise())
Using the variables you created in Question 22, group the data by Work_Location_Borough and compute:
- number of employees n
- mean Base_Salary
- mean Total_Pay
- mean OT_Share (ignore missing values)
Arrange the results by mean Total_Pay in descending order and show the summary table.
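A grouped summary that counts rows, averages columns, skips missing values in one of them, and then sorts can be sketched as follows. The tibble and its columns `region`, `pay`, and `share` are hypothetical.

```r
library(dplyr)

# Hypothetical toy data; not the homework dataset
toy <- tibble(
  region = c("N", "N", "S", "S", "S"),
  pay    = c(10, 30, 20, 20, 50),
  share  = c(0.1, NA, 0.2, 0.4, NaN)
)

toy_summary <- toy %>%
  group_by(region) %>%
  summarise(
    n          = n(),                       # rows per group
    mean_pay   = mean(pay),
    mean_share = mean(share, na.rm = TRUE)  # skips NA and NaN
  ) %>%
  arrange(desc(mean_pay))
```

Without na.rm = TRUE, a single missing value in a group makes that group's mean NA.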
# YOUR CODE HERE
Question 24. Police Department overtime
Filter to Agency_Name == "POLICE DEPARTMENT" and arrange by Total_OT_Paid.
- Show the 10 smallest overtime values.
- Show the 10 largest overtime values.
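Showing both ends of a sorted column usually takes two sorts, one ascending and one descending. The sketch below uses a made-up tibble with hypothetical columns `id` and `amount`, and takes 2 rows instead of 10 to keep the example small.

```r
library(dplyr)

# Hypothetical toy data; not the homework dataset
toy <- tibble(
  id     = 1:6,
  amount = c(5, 0, 12, 7, 3, 9)
)

# Smallest values: ascending sort, then take the first rows
smallest <- toy %>%
  arrange(amount) %>%
  head(2)

# Largest values: descending sort, then take the first rows
largest <- toy %>%
  arrange(desc(amount)) %>%
  head(2)
```

dplyr's slice_min() and slice_max() are alternatives; by default they keep ties, so they can return more rows than requested.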
# YOUR CODE HERE
Question 25. Per annum employees
Filter to Pay_Basis == "per Annum" and select only:
- First_Name, Last_Name, Base_Salary
# YOUR CODE HERE
Question 26. Borough then salary
Arrange by Work_Location_Borough (ascending) and then Base_Salary (descending). Then show the first 15 rows.
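arrange() sorts by its first argument and uses later arguments only to break ties, so mixing an ascending key with a descending one works as in this sketch. The columns `region` and `pay` are hypothetical.

```r
library(dplyr)

# Hypothetical toy data; not the homework dataset
toy <- tibble(
  region = c("S", "N", "S", "N"),
  pay    = c(20, 10, 50, 30)
)

# Ascending by region, then descending by pay within each region
toy_sorted <- toy %>%
  arrange(region, desc(pay))
```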
# YOUR CODE HERE
Question 27. Remove missing last names + count
Remove observations where Last_Name is missing (NA). Then report the remaining number of rows.
# YOUR CODE HERE
Part 3. Quarto Blogging (ice_cream)

Use the following dataset for your blog post.
ice_cream <- read_csv("https://bcdanl.github.io/data/ben-and-jerry-cleaned.csv")
Write and publish a blog post about Ben & Jerry’s ice cream using ice_cream.
Your post must include:
- At least 4 ggplot2 figures
  - Each figure must be followed by a short interpretation paragraph (2–6 sentences).
- At least 2 summary tables created with group_by() + summarise() (or count() + related verbs).
- Evidence of:
  - filtering,
  - sorting,
  - creating at least one new variable (mutate()), and
  - using at least one facet (facet_wrap() or facet_grid()).
- A clear data story structure:
- a motivating question,
- what you did (briefly),
- what you found (with evidence),
- a takeaway / conclusion.
Important: Your plots and text should work together. Avoid disconnected charts.