Homework 2

ggplot Visualization; Quarto Blogging

Author

Byeong-Hak Choe

Published

April 5, 2025

Modified

April 5, 2025

Direction

Please submit your Quarto Document for Part 1 in Homework 2 to Brightspace with the name below:
- danl-310-hw2-LASTNAME-FIRSTNAME.qmd
  ( e.g., danl-310-hw2-choe-byeonghak.qmd )
The due date is March 24, 2025, at 2:00 P.M.
- It is recommended to finish it before the Spring Break begins.
Please send Byeong-Hak an email (bchoe@geneseo.edu) if you have any questions.

Descriptive Statistics

The descriptive statistics of scores for Homework 2 are shown below:

Statistic	Value
Mean	89.61
SD	13.41
Q1	87.25
Median	95.76
Q3	97.88

The following provides the descriptive statistics for each part of Homework 2:

Part 1. `ggplot` visualization

Question 1

Use the following data.frame for Question 1.

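library(tidyverse)   # read_csv(), dplyr verbs, and ggplot2
library(ggrepel)     # geom_text_repel() and geom_label_repel()
hdi_corruption <- read_csv(
  'https://bcdanl.github.io/data/hdi_corruption.csv')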

Answer:

country_highlight <- c("Germany", "Norway", "United States", 
                       "Greece", "Singapore", 
                       "Argentina", "Senegal",
                       "China", "Egypt", "South Africa")

corruption <- hdi_corruption |> 
  mutate(label = ifelse(country %in% country_highlight, country, NA))


ggplot(data = corruption |> filter(year == 2014), 
       aes(x = cpi, y = hdi)) + 
  geom_smooth(method = "lm", 
              formula = y ~ log(x), 
              se = FALSE) + 
  geom_point(
    aes(color = region, 
        fill = region),
    size = 2.5, 
    alpha = 0.5,
    shape = 21
  ) + 
  geom_text_repel(
    aes(label = label), 
    color = "black", 
    size = 4,
    box.padding = .75
  ) +
  scale_y_continuous(
    limits = c(0.3, 1.05), 
    breaks = c(0.2, 0.4, 0.6, 0.8, 1.0),
    name = "Human Development Index, 2014\n(1.0 = most developed)"
  ) +
  scale_x_continuous(
    limits = c(10, 95),
    breaks = c(20, 40, 60, 80, 100),
    name = "Corruption Perceptions Index, 2014 (100 = least corrupt)"
  ) + 
  guides(color = guide_legend(nrow = 2)) +
  theme_minimal() + 
  theme(
    plot.margin = unit( c(1.75, .75, .75, .5), "cm"),
    legend.position = c(.5, 1.05),
    legend.direction = "horizontal",
    legend.text = element_text(size = 10)
  ) +
  labs( color = "Region", 
        fill = "Region")

Question 2

Download the file, labor_supply.zip, from this link. Then, extract labor_supply.zip, so that you can access the labor_supply.csv file.
Variable description in labor_supply.csv
- SEX: 1 if Male; 2 if Female; 9 if NIU (Not in universe)
- NCHLT5: Number of own children under age 5 in a household; 9 if 9+
- LABFORCE: 0 if NIU or members of the armed forces; 1 if not in the labor force; 2 if in the labor force.
- ASECWT: sample weight
The sample weight of an observation tells how many people in the population that observation represents.
- If you sum ASECWT for each year, you get the size of the U.S. population in that year.
Observations with LABFORCE == 0 (NIU or members of the armed forces) are excluded from the calculation.
The labor force participation rate can be calculated by:

\[ (\text{Labor Force Participation Rate}) \, = \, \frac{(\text{Size of population in labor force})}{(\text{Size of civilian population, excluding members of the armed forces})} \]
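As a rough illustration of the weighted formula above, the sketch below computes the rate on a small made-up data frame. The values in `toy` are hypothetical and are not taken from labor_supply.csv; only the column names YEAR, LABFORCE, and ASECWT follow the variable description.

```r
library(dplyr)

# Hypothetical toy records (assumption: not real CPS data)
toy <- data.frame(
  YEAR     = c(2021, 2021, 2021, 2022, 2022, 2022),
  LABFORCE = c(0, 1, 2, 2, 2, 1),
  ASECWT   = c(500, 1000, 3000, 2000, 2000, 1000)
)

toy_rate <- toy |>
  filter(LABFORCE != 0) |>                       # drop NIU / armed forces
  mutate(in_lf = as.integer(LABFORCE == 2)) |>   # 1 if in the labor force, else 0
  group_by(YEAR) |>
  summarize(
    civilian_pop = sum(ASECWT),                  # weighted size of the denominator
    rate         = sum(in_lf * ASECWT) / sum(ASECWT),
    .groups = "drop"
  )

toy_rate
```

Each record counts ASECWT times, so in this toy data a single 2021 record with weight 3000 contributes three times as much to the rate as the record with weight 1000.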

Answer:

library(hrbrthemes)  # theme_ipsum()

path <- '/Users/bchoe/My Drive/suny-geneseo/teaching-materials/lecture-data/labor_supply.csv'

cps_labor <- read_csv(path)

cps_labor <- cps_labor |> 
  filter(YEAR >= 1982) |> 
  filter(LABFORCE != 0) |> 
  mutate(LABFORCE = LABFORCE - 1) |> 
  mutate(labor_supply = LABFORCE * ASECWT,
         child = ifelse(NCHLT5 == 0, 
                        "No Child Under Age 5 in Household", 
                        "Having Children Under Age 5 in Household")) |> 
  group_by(YEAR, SEX, child) |> 
  summarize(pct = sum(labor_supply, na.rm = TRUE) / sum(ASECWT, na.rm = TRUE),
            .groups = "drop") |> 
  filter(!is.na(child))
  
label_df <- filter(cps_labor, YEAR == 2022) |> 
  mutate(label = ifelse(SEX == 1, "Male", "Female"))

ggplot(data = cps_labor,
       aes(x = YEAR, y = pct, 
           color = factor(SEX))) +
  geom_line(linewidth = 1.5) +  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  geom_label_repel(data = label_df,
                  aes(x = YEAR, y = pct, label = label),
                  color = 'black',
                  nudge_y = .033, box.padding = .5) +
  facet_grid( . ~ factor(child)) +
  scale_x_continuous( breaks = seq(1982, 2022, 4) ) +
  scale_y_continuous( labels = scales::percent) +  
  scale_color_manual( labels = c("Male", "Female"),
                      values = c("#2E74C0", "#CB454A") ) +
  labs(x = "Year",
       y = "Labor Force Participation Rate",
       color = "Gender",
       lty = "Young Children",
       title = "Fertility and Labor Supply in the U.S.",
       subtitle = "1982-2022",
       caption = "Data: IPUMS-CPS, University of Minnesota, www.ipums.org.") +
  guides(color = "none") +
  theme_ipsum() +
  theme(axis.title.y = element_text(size = rel(1.5),
                                    face = 'bold'),
        axis.text.x = element_text(angle = 45))

Question 3

Use the following data.frame for Question 3.

starbucks <- read_csv(
  'https://bcdanl.github.io/data/starbucks.csv')

Variable description

Product_Name: Product Name
Size: Size of drink (short, tall, grande, venti)
Milk: Type of milk used
- 0 none
- 1 nonfat
- 2 2%
- 3 soy
- 4 coconut
- 5 whole
Whip: Whip added or not (binary 0/1)
Serv_Size_mL: Serving size in ml
Calories: KCal
Total_Fat_g: Total fat grams
Saturated_Fat_g: Saturated fat grams
Trans_Fat_g: Trans fat grams
Cholesterol_mg: Cholesterol mg
Sodium_mg: Sodium milligrams
Total_Carbs_g: Total Carbs grams
Fiber_g: Fiber grams
Sugar_g: Sugar grams
Caffeine_mg: Caffeine in milligrams

Q3a.

Add the following two variables to the starbucks data.frame:
- caffeine_mgml: Caffeine in milligrams per mL
- calories_kcml: Calories in kcal per mL

Answer:

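# The answer uses snake_case column names (e.g., serv_size_m_l), which is what
# janitor::clean_names() produces from names such as Serv_Size_mL.
# Assumption: the janitor package is installed.
starbucks1 <- starbucks |> 
  janitor::clean_names() |> 
  mutate(caffeine_mgml = caffeine_mg / serv_size_m_l,
         calories_kcml = calories / serv_size_m_l,
         .before = 1) |> 
  select(product_name, size, serv_size_m_l, milk, whip, caffeine_mgml, calories_kcml)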

Q3b.

Calculate a mean caffeine_mgml and a mean calories_kcml for each product_name.

Answer:

starbucks2 <- starbucks1 |> 
  group_by(product_name) |> 
  summarise(caffeine_mgml = mean(caffeine_mgml, na.rm = TRUE),
            calories_kcml = mean(calories_kcml, na.rm = TRUE)
  ) |> 
  # Rank in descending order and keep products in the top 10 of either measure
  mutate(rank_caffeine = dense_rank(-caffeine_mgml),
         rank_calories = dense_rank(-calories_kcml)) |> 
  filter(rank_caffeine <= 10 | rank_calories <= 10)
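The ranking step in the answer relies on `dense_rank()`, which gives tied values the same rank and leaves no gaps. A minimal sketch of that behavior follows; the `toy` data frame and its product names are hypothetical and are not taken from starbucks.csv.

```r
library(dplyr)

# Hypothetical per-mL values for four made-up products
toy <- data.frame(
  product_name  = c("A", "B", "C", "D"),
  caffeine_mgml = c(0.5, 0.9, 0.9, 0.1)
)

toy_ranked <- toy |>
  mutate(rank_caffeine = dense_rank(-caffeine_mgml))  # negate so the largest value gets rank 1

# Keep the "top 2" ranks; ties share a rank, so more than two rows can survive
top2 <- toy_ranked |> filter(rank_caffeine <= 2)
```

Here products B and C tie for rank 1, so the filter keeps three rows. By the same logic, a filter on rank 10 or better may return more than ten products when means are tied.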

Q3c.

For the top 10 product_name in terms of caffeine_mgml and the top 10 product_name in terms of calories_kcml, replicate the above ggplot.
Use the following commands for showing texts in the plot:

# install.packages("showtext")
library(showtext)
showtext_auto()
font_add_google("Annie Use Your Telescope", "annie")

Use the following annotate() geom to insert the starbucks image in the plot:

# install.packages("ggtext")
library(ggtext)

annotate("richtext", 
           x =  , 
           y =  , 
           label = "<img src='https://bcdanl.github.io/lec_figs/starbucks.png' width='100'/>", 
           fill =  ,
           size =  , 
           color =  )

Use the following geom_text_repel() geom to apply the annie font:

geom_text_repel(max.overlaps = ,
                size =  ,
                min.segment.length =  ,
                point.padding =  ,
                box.padding =  ,
                show.legend =  ,
                family = "annie")

Use the color, #00704A, for the title.

Answer:

starbucks_caffeine <- starbucks2 |> filter(rank_caffeine <= 10)
starbucks_calories <- starbucks2 |> filter(rank_calories <= 10)


starbucks2 |> 
  ggplot(aes(x = calories_kcml, y = caffeine_mgml, 
             color = product_name, label = product_name)) +
  geom_point(show.legend = FALSE) +
  geom_rect(aes(xmin = min(starbucks_caffeine$calories_kcml), 
                xmax = max(starbucks_caffeine$calories_kcml), 
                ymin = min(starbucks_caffeine$caffeine_mgml), 
                ymax = max(starbucks_caffeine$caffeine_mgml)),
            fill = "#27251F", 
            color = NA,
            alpha = 0.004)+
  geom_rect(aes(xmin = min(starbucks_calories$calories_kcml), 
                xmax = max(starbucks_calories$calories_kcml), 
                ymin = min(starbucks_calories$caffeine_mgml), 
                ymax = max(starbucks_calories$caffeine_mgml)),
            fill = "#27251F", 
            color = NA,
            alpha = 0.004)+
  geom_text_repel(max.overlaps = 6,
                  size = 4,
                  min.segment.length = 0.1,
                  point.padding = 0.4,
                  box.padding = 0.5,
                  show.legend = FALSE,
                  family = "annie") +
  annotate("richtext", 
           x = quantile(starbucks2$calories_kcml, probs = .78), 
           y = quantile(starbucks2$caffeine_mgml, probs = .78), 
           label = "<img src='https://bcdanl.github.io/lec_figs/starbucks.png' width='100'/>", 
           fill = NA,
           size = rel(2.25), 
           color = NA) +
  scale_x_continuous(breaks = seq(0, 1, .2)) +
  scale_y_continuous(breaks = seq(0, 1, .2)) +
  labs(y = expression(paste("Caffeine (mg mL"^"-1",")")), 
       x = expression(paste("Calories (Kcal mL"^"-1",")")),
       caption = "Source: Starbucks Coffee Company Beverage Nutrition Information",
       title = "STARBUCKS DRINKS",
       subtitle = "Caffeine or Calories, which one you would go?") +
  theme_ipsum() +
  theme(plot.title = element_text(color = "#00704A",
                                  face = 'bold',
                                  size = rel(2.5)),
        plot.subtitle = element_text(face = 'bold',
                                  size = rel(1.75)),
        axis.title.x = element_text(face = 'bold',
                                  size = rel(1.25)),
        axis.title.y = element_text(face = 'bold',
                                  size = rel(1.25)))

Part 2. Quarto Blogging

Use the following data.frame for Quarto Blogging:

ice_cream <- read_csv('https://bcdanl.github.io/data/ben-and-jerry-cleaned.csv')

Variable Description

Variable Name	Type	Description	Categories / Notes
`priceper1`	Numeric	The unit price for one product serving (in dollars).
`flavor_descr`	Categorical	Description of the ice cream flavor.
`size1_descr`	Categorical	The description of the package or serving size (in fluid ounces or a related measure).
`household_id`	Categorical/ID	Unique identifier for each household purchasing product(s).
`household_income`	Categorical/Numeric	The household income bracket (in dollars).	The number (e.g., 60,000) corresponds to a predefined income category; for instance, “60,000” represents the income range from $60,000 to $70,000. Not a categorical variable per se, but can be grouped into categories if desired.
`household_size`	Categorical	The number of persons in the household.	Discrete numeric count (e.g., 1, 2, 3…). Not a categorical variable per se, but can be grouped into categories if desired.
`usecoup`	Categorical	Indicates whether a coupon was used in the purchase.	Boolean category: `True` (coupon used) or `False` (coupon not used).
`couponper1`	Numeric	The discount amount per unit applied when a coupon is used.	Continuous numeric value; often zero if no coupon is used, and a positive number when a discount is applied.
`region`	Categorical	Geographic region where the household is located.	Categories include “East”, “Central”, “West”, and “South”.
`married`	Categorical	Marital status of the household head (or the household overall status).	Boolean category: `True` (married) or `False` (not married).
`race`	Categorical	Race of the household head or primary respondent.	Categories include “white”, “black”, “asian”, and “other”.
`hispanic_origin`	Categorical	Indicates whether the household identifies as of Hispanic origin.	Boolean category: `True` (Hispanic origin) or `False` (not Hispanic).
`microwave`	Categorical	Whether the household owns a microwave.	Boolean category: `True` (owns microwave) or `False` (does not own one).
`dishwasher`	Categorical	Whether the household owns a dishwasher.	Boolean category: `True` (owns dishwasher) or `False` (does not own one).
`sfh`	Categorical	Whether the household resides in a single-family home.	Boolean category: `True` (single-family home) or `False` (does not live in a single-family home).
`internet`	Categorical	Whether the household has internet service.	Boolean category: `True` (has internet) or `False` (does not).
`tvcable`	Categorical	Indicates whether the household subscribes to cable television service.	Boolean category: `True` (has cable TV) or `False` (does not).

Write a blog post about Ben and Jerry’s ice cream using the ice_cream data.frame in a Quarto document, and add it to your online blog.
- In your blog post, include at least two ggplot figures that use theme, scale, and guides, along with descriptive statistics, counting, filtering, and various group operations.
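One possible shape for such an analysis is sketched below, combining counting, a group operation, and a ggplot figure with a scale, guides, and a theme. The `toy_ice` rows are hypothetical and only mimic three of the ice_cream columns; in particular, treating `usecoup` as a logical is an assumption about how the CSV is read.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy rows (assumption: not real Ben and Jerry's purchases)
toy_ice <- data.frame(
  region    = c("East", "East", "West", "West", "South", "South"),
  usecoup   = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE),
  priceper1 = c(3.0, 3.5, 4.0, 3.0, 2.5, 3.5)
)

# Counting and group operations: purchases, mean price, and coupon share by region
region_summary <- toy_ice |>
  group_by(region) |>
  summarize(
    n_purchases  = n(),
    mean_price   = mean(priceper1),
    coupon_share = mean(usecoup),
    .groups = "drop"
  )

# A figure that uses a scale, guides, and a theme
p <- ggplot(region_summary, aes(x = region, y = mean_price, fill = region)) +
  geom_col() +
  scale_y_continuous(labels = scales::dollar) +
  guides(fill = "none") +
  labs(x = "Region", y = "Mean unit price") +
  theme_minimal()
```

Printing `p` draws the bar chart. With the real data, the same pattern applies after replacing `toy_ice` with `ice_cream`.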
