Homework 2

ggplot Visualization; Quarto Blogging

Author

Byeong-Hak Choe

Published

March 31, 2025

Modified

March 31, 2025

Direction

  • Please submit your Quarto Document for Part 1 in Homework 2 to Brightspace with the name below:

    • danl-310-hw2-LASTNAME-FIRSTNAME.qmd
      ( e.g., danl-310-hw2-choe-byeonghak.qmd )
  • The due date is March 24, 2025, 2:00 P.M.

    • It is recommended to finish it before Spring Break begins.
  • Please send Byeong-Hak an email (bchoe@geneseo.edu) if you have any questions.




Part 1. ggplot visualization

Question 1

Use the following data.frame for Question 1.

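library(tidyverse)   # read_csv(), dplyr verbs, ggplot2
library(ggrepel)     # geom_text_repel(), geom_label_repel()
hdi_corruption <- read_csv(
  'https://bcdanl.github.io/data/hdi_corruption.csv')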


Click to Check the Answer!
country_highlight <- c("Germany", "Norway", "United States", 
                       "Greece", "Singapore", 
                       "Argentina", "Senegal",
                       "China", "Egypt", "South Africa")

corruption <- hdi_corruption |> 
  mutate(label = ifelse(country %in% country_highlight, country, NA))


ggplot(data = corruption |> filter(year == 2014), 
       aes(cpi, hdi)) + 
  geom_smooth(method = lm, 
              formula = y ~ log(x), 
              se = FALSE) + 
  geom_point(
    aes(color = region, 
        fill = region),
    size = 2.5, 
    alpha = 0.5,
    shape = 21
  ) + 
  geom_text_repel(
    aes(label = label), 
    color = "black", 
    size = 4,
    box.padding = .75
  ) +
  scale_y_continuous(
    limits = c(0.3, 1.05), 
    breaks = c(0.2, 0.4, 0.6, 0.8, 1.0),
    name = "Human Development Index, 2014\n(1.0 = most developed)"
  ) +
  scale_x_continuous(
    limits = c(10, 95),
    breaks = c(20, 40, 60, 80, 100),
    name = "Corruption Perceptions Index, 2014 (100 = least corrupt)"
  ) + 
  guides(color = guide_legend(nrow = 2)) +
  theme_minimal() + 
  theme(
    plot.margin = unit( c(1.75, .75, .75, .5), "cm"),
    legend.position = c(.5, 1.05),
    legend.direction = "horizontal",
    legend.text = element_text(size = 10)
  ) +
  labs( color = "Region", 
        fill = "Region") 
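
For reference, geom_smooth(method = lm, formula = y ~ log(x)) draws the fitted values of a linear model of y on log(x). The sketch below fits the same kind of model directly with lm() on a small made-up data set. The cpi and hdi values here are hypothetical, generated from known coefficients rather than taken from hdi_corruption, so the recovered intercept and slope are easy to check.

```r
# Hypothetical toy data: hdi is generated exactly as 0.1 + 0.2 * log(cpi),
# so the fit should recover those two coefficients.
toy <- data.frame(cpi = c(10, 20, 40, 60, 80, 95))
toy$hdi <- 0.1 + 0.2 * log(toy$cpi)

# Same model form that geom_smooth(method = lm, formula = y ~ log(x)) fits
fit <- lm(hdi ~ log(cpi), data = toy)
coefs <- unname(coef(fit))
round(coefs, 3)   # intercept, then slope on log(cpi)
```

A slope of b on log(cpi) means that a 1% higher CPI goes with roughly b/100 higher HDI, which is why the fitted curve flattens at high CPI values.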



Question 2

  • Download the file, labor_supply.zip, from this link. Then, extract labor_supply.zip, so that you can access the labor_supply.csv file.

  • Variable description in labor_supply.csv

    • SEX: 1 if Male; 2 if Female; 9 if NIU (Not in universe)
    • NCHLT5: Number of own children under age 5 in a household; 9 if 9+
    • LABFORCE: 0 if NIU or members of the armed forces; 1 if not in the labor force; 2 if in the labor force.
    • ASECWT: sample weight
  • The sample weight of an observation indicates how many people in the population that observation represents.

    • Summing ASECWT within a year gives the size of the U.S. population in that year.
  • Observations with LABFORCE == 0 (NIU or members of the armed forces) are excluded from the labor force participation rate calculation.

  • Labor force participation rate can be calculated by:

\[ (\text{Labor Force Participation Rate}) \, = \, \frac{(\text{Size of population in labor force})}{(\text{Size of civilian population, excluding members of the armed forces})} \]
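
The weighted calculation can be sketched on a tiny made-up sample. The four rows and their weights below are hypothetical and only illustrate the arithmetic: each respondent counts as ASECWT people, and rows with LABFORCE == 0 are dropped before taking the ratio.

```r
library(dplyr)

# Hypothetical toy sample: four respondents in one year.
# LABFORCE: 0 = NIU/armed forces, 1 = not in labor force, 2 = in labor force
toy_cps <- tibble(
  YEAR     = c(2022, 2022, 2022, 2022),
  LABFORCE = c(2, 1, 2, 0),
  ASECWT   = c(1500, 1000, 500, 800)
)

toy_rate <- toy_cps |>
  filter(LABFORCE != 0) |>            # drop NIU / armed forces
  mutate(in_lf = LABFORCE == 2) |>    # TRUE if in the labor force
  group_by(YEAR) |>
  summarize(lfpr = sum(ASECWT * in_lf) / sum(ASECWT))

toy_rate$lfpr   # (1500 + 500) / (1500 + 1000 + 500) = 2/3
```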


Click to Check the Answer!
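library(hrbrthemes)   # theme_ipsum()

# Local path to the extracted file; change it to where labor_supply.csv is saved.
path <- '/Users/bchoe/My Drive/suny-geneseo/teaching-materials/lecture-data/labor_supply.csv'

cps_labor <- read_csv(path)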

cps_labor <- cps_labor |> 
  filter(YEAR >= 1982) |> 
  filter(LABFORCE != 0) |> 
  mutate(LABFORCE = LABFORCE - 1) |> 
  mutate(labor_supply = LABFORCE * ASECWT,
         child = ifelse(NCHLT5 == 0, 
                        "No Child Under Age 5 in Household", 
                        "Having Children Under Age 5 in Household")) |> 
  group_by(YEAR, SEX, child) |> 
  summarize(pct = sum(labor_supply) / sum(ASECWT, na.rm= T) ) |> 
  filter(!is.na(child))
  
label_df <- filter(cps_labor, YEAR == 2022) |> 
  mutate(label = ifelse(SEX == 1, "Male", "Female"))

ggplot(data = cps_labor,
       aes(x = YEAR, y = pct, 
           color = factor(SEX))) +
  geom_line(linewidth = 1.5) +   # linewidth replaces size for lines in ggplot2 >= 3.4.0
  geom_label_repel(data = label_df,
                  aes(x = YEAR, y = pct, label = label),
                  color = 'black',
                  nudge_y = .033, box.padding = .5) +
  facet_grid( . ~ factor(child)) +
  scale_x_continuous( breaks = seq(1982, 2022, 4) ) +
  scale_y_continuous( labels = scales::percent) +  
  scale_color_manual( labels = c("Male", "Female"),
                      values = c("#2E74C0", "#CB454A") ) +
  labs(x = "Year",
       y = "Labor Force Participation Rate",
       color = "Gender",
       lty = "Young Children",
       title = "Fertility and Labor Supply in the U.S.",
       subtitle = "1982-2022",
       caption = "Data: IPUMS-CPS, University of Minnesota, www.ipums.org.") +
  guides(color = "none") +
  theme_ipsum() +
  theme(axis.title.y = element_text(size = rel(1.5),
                                    face = 'bold'),
        axis.text.x = element_text(angle = 45))



Question 3

Use the following data.frame for Question 3.

starbucks <- read_csv(
  'https://bcdanl.github.io/data/starbucks.csv')

Variable description

  • Product_Name: Product Name
  • Size: Size of drink (short, tall, grande, venti)
  • Milk: Type of milk used
    • 0 none
    • 1 nonfat
    • 2 2%
    • 3 soy
    • 4 coconut
    • 5 whole
  • Whip: Whip added or not (binary 0/1)
  • Serv_Size_mL: Serving size in ml
  • Calories: KCal
  • Total_Fat_g: Total fat grams
  • Saturated_Fat_g: Saturated fat grams
  • Trans_Fat_g: Trans fat grams
  • Cholesterol_mg: Cholesterol mg
  • Sodium_mg: Sodium milligrams
  • Total_Carbs_g: Total Carbs grams
  • Fiber_g: Fiber grams
  • Sugar_g: Sugar grams
  • Caffeine_mg: Caffeine in milligrams

Q3a.

  • Add the following two variables to starbucks data.frame
    • caffeine_mgml: Caffeine in milligrams per mL
    • calories_kcml: Calories KCal per mL
Click to Check the Answer!
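# install.packages("janitor")
# The column names below are snake_case, so convert the original names
# (e.g., Serv_Size_mL becomes serv_size_m_l) before using them.
starbucks1 <- starbucks |> 
  janitor::clean_names() |> 
  mutate(caffeine_mgml = caffeine_mg / serv_size_m_l,
         calories_kcml = calories / serv_size_m_l,
         .before = 1) |> 
  select(product_name, size, serv_size_m_l, milk, whip, caffeine_mgml, calories_kcml)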



Q3b.

  • Calculate a mean caffeine_mgml and a mean calories_kcml for each product_name.
Click to Check the Answer!
starbucks2 <- starbucks1 |> 
  group_by(product_name) |> 
  summarise(caffeine_mgml = mean(caffeine_mgml, na.rm = T),
            calories_kcml = mean(calories_kcml, na.rm = T)
  ) |> 
  mutate(rank_caffeine = dense_rank(-caffeine_mgml),
         rank_calories = dense_rank(-calories_kcml)) |> 
  filter(rank_caffeine <= 10 |rank_calories <= 10 )
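
The filter above relies on dense_rank(-x), which ranks values from largest to smallest and gives tied values the same rank without leaving gaps. A minimal sketch on a made-up vector (the numbers are illustrative, not from the starbucks data):

```r
library(dplyr)

x <- c(0.5, 0.9, 0.9, 0.1)   # hypothetical per-mL values

ranks <- dense_rank(-x)      # negate so the largest value gets rank 1
ranks                        # 2 1 1 3

# Keeping rank <= 2 can return more than two values when there are ties
top2 <- x[ranks <= 2]
```

Because the answer combines the two rank conditions with |, starbucks2 keeps every product that is in the top 10 on either measure.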



Q3c.


  • For the top 10 product_name in terms of caffeine_mgml and the top 10 product_name in terms of calories_kcml, replicate the above ggplot.

  • Use the following commands for showing texts in the plot:

# install.packages("showtext")
library(showtext)
showtext_auto()
font_add_google("Annie Use Your Telescope", "annie")
  • Use the following annotate() geom to insert the starbucks image in the plot:
# install.packages("ggtext")
library(ggtext)

annotate("richtext", 
           x =  , 
           y =  , 
           label = "<img src='https://bcdanl.github.io/lec_figs/starbucks.png' width='100'/>", 
           fill =  ,
           size =  , 
           color =  )
  • Use the following geom_text_repel() geom to use the annie font
geom_text_repel(max.overlaps = ,
                size =  ,
                min.segment.length =  ,
                point.padding =  ,
                box.padding =  ,
                show.legend =  ,
                family = "annie")
  • Use the color, #00704A, for the title.

Answer:

Click to Check the Answer!
starbucks_caffeine <- starbucks2 |> filter(rank_caffeine <= 10)
starbucks_calories <- starbucks2 |> filter(rank_calories <= 10)


starbucks2 |> 
  ggplot(aes(x = calories_kcml, y = caffeine_mgml, 
             color = product_name, label = product_name)) +
  geom_point(show.legend = FALSE) +
  geom_rect(aes(xmin = min(starbucks_caffeine$calories_kcml), 
                xmax = max(starbucks_caffeine$calories_kcml), 
                ymin = min(starbucks_caffeine$caffeine_mgml), 
                ymax = max(starbucks_caffeine$caffeine_mgml)),
            fill = "#27251F", 
            color = NA,
            alpha = 0.004)+
  geom_rect(aes(xmin = min(starbucks_calories$calories_kcml), 
                xmax = max(starbucks_calories$calories_kcml), 
                ymin = min(starbucks_calories$caffeine_mgml), 
                ymax = max(starbucks_calories$caffeine_mgml)),
            fill = "#27251F", 
            color = NA,
            alpha = 0.004)+
  geom_text_repel(max.overlaps = 6,
                  size = 4,
                  min.segment.length = 0.1,
                  point.padding = 0.4,
                  box.padding = 0.5,
                  show.legend = FALSE,
                  family = "annie") +
  annotate("richtext", 
           x = quantile(starbucks2$calories_kcml, probs = .78), 
           y = quantile(starbucks2$caffeine_mgml, probs = .78), 
           label = "<img src='https://bcdanl.github.io/lec_figs/starbucks.png' width='100'/>", 
           fill = NA,
           size = rel(2.25), 
           color = NA) +
  scale_x_continuous(breaks = seq(0, 1, .2)) +
  scale_y_continuous(breaks = seq(0, 1, .2)) +
  labs(y = expression(paste("Caffeine (mg mL"^"-1",")")), 
       x = expression(paste("Calories (Kcal mL"^"-1",")")),
       caption = "Source: Starbucks Coffee Company Beverage Nutrition Information",
       title = "STARBUCKS DRINKS",
       subtitle = "Caffeine or Calories, which one you would go?") +
  theme_ipsum() +
  theme(plot.title = element_text(color = "#00704A",
                                  face = 'bold',
                                  size = rel(2.5)),
        plot.subtitle = element_text(face = 'bold',
                                  size = rel(1.75)),
        axis.title.x = element_text(face = 'bold',
                                  size = rel(1.25)),
        axis.title.y = element_text(face = 'bold',
                                  size = rel(1.25)))



Part 2. Quarto Blogging

Use the following data.frame for Quarto Blogging:



ice_cream <- read_csv('https://bcdanl.github.io/data/ben-and-jerry-cleaned.csv')


Variable Description

| Variable Name | Type | Description | Categories / Notes |
|---|---|---|---|
| priceper1 | Numeric | The unit price for one product serving (in dollars). | |
| flavor_descr | Categorical | Description of the ice cream flavor. | |
| size1_descr | Categorical | The description of the package or serving size (in fluid ounces or a related measure). | |
| household_id | Categorical/ID | Unique identifier for each household purchasing product(s). | |
| household_income | Categorical/Numeric | The household income bracket (in dollars). | The number (e.g., 60,000) corresponds to a predefined income category (for instance, “60,000” represents the income range from $60,000 to $70,000). Not a categorical variable per se, but can be grouped into categories if desired. |
| household_size | Categorical | The number of persons in the household. | Discrete numeric count (e.g., 1, 2, 3…). Not a categorical variable per se, but can be grouped into categories if desired. |
| usecoup | Categorical | Indicates whether a coupon was used in the purchase. | Boolean category: True (coupon used) or False (coupon not used). |
| couponper1 | Numeric | The discount amount per unit applied when a coupon is used. | Continuous numeric value; often zero if no coupon is used, and a positive number when a discount is applied. |
| region | Categorical | Geographic region where the household is located. | Categories include “East”, “Central”, “West”, and “South”. |
| married | Categorical | Marital status of the household head (or the household overall status). | Boolean category: True (married) or False (not married). |
| race | Categorical | Race of the household head or primary respondent. | Categories include “white”, “black”, “asian”, and “other”. |
| hispanic_origin | Categorical | Indicates whether the household identifies as of Hispanic origin. | Boolean category: True (Hispanic origin) or False (not Hispanic). |
| microwave | Categorical | Whether the household owns a microwave. | Boolean category: True (owns microwave) or False (does not own one). |
| dishwasher | Categorical | Whether the household owns a dishwasher. | Boolean category: True (owns dishwasher) or False (does not own one). |
| sfh | Categorical | Whether the household resides in a single-family home. | Boolean category: True (single-family home) or False (does not live in a single-family home). |
| internet | Categorical | Whether the household has internet service. | Boolean category: True (has internet) or False (does not). |
| tvcable | Categorical | Indicates whether the household subscribes to cable television service. | Boolean category: True (has cable TV) or False (does not). |


  • Write a blog post about Ben and Jerry’s ice cream in the ice_cream data.frame using Quarto Document, and add it to your online blog.
    • In your blog post, utilize at least two ggplot figures with the use of theme, scale, and guides, as well as descriptive statistics, counting, filtering, and various group operations.
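
One possible starting point for the descriptive statistics and group operations is sketched below. It runs on a tiny made-up data frame that borrows three column names from the variable description (priceper1, region, usecoup). The values are hypothetical, and the logical coding of usecoup is an assumption, so check how the column is actually stored after reading the CSV.

```r
library(dplyr)

# Hypothetical stand-in for ice_cream (values are made up)
toy_ice <- tibble(
  priceper1 = c(3.5, 4.0, 3.0, 4.5, 3.2),
  region    = c("East", "East", "West", "West", "South"),
  usecoup   = c(TRUE, FALSE, FALSE, TRUE, FALSE)
)

# Counting: number of purchases per region
toy_ice |> count(region)

# Group operation: mean price and share of coupon purchases by region
by_region <- toy_ice |>
  group_by(region) |>
  summarize(mean_price  = mean(priceper1),
            coupon_rate = mean(usecoup),
            n = n())

# Filtering: purchases made with a coupon
with_coupon <- toy_ice |> filter(usecoup)
```

Summary tables like by_region can then feed directly into ggplot(), for example as bars of mean_price by region.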

