library(tidyverse)  # provides read_csv(), dplyr verbs, and ggplot2

hdi_corruption <- read_csv(
  'https://bcdanl.github.io/data/hdi_corruption.csv')
Homework 2
ggplot Visualization; Quarto Blogging
Directions
Please submit your Quarto Document for Part 1 in Homework 2 to Brightspace with the name below:
danl-310-hw2-LASTNAME-FIRSTNAME.qmd
(e.g., danl-310-hw2-choe-byeonghak.qmd)
The due date is March 24, 2025, at 2:00 P.M.
- It is recommended that you finish it before Spring Break begins.
Please send Byeong-Hak an email (bchoe@geneseo.edu) if you have any questions.
Part 1. ggplot visualization
Question 1
Use the following data.frame for Question 1.
Click to Check the Answer!
library(tidyverse)
library(ggrepel)  # geom_text_repel()

# Countries to label in the plot
country_highlight <- c("Germany", "Norway", "United States",
                       "Greece", "Singapore",
                       "Argentina", "Senegal",
                       "China", "Egypt", "South Africa")

# Keep a label only for the highlighted countries
corruption <- hdi_corruption |>
  mutate(label = ifelse(country %in% country_highlight, country, NA))

ggplot(data = corruption |> filter(year == 2014),
       aes(cpi, hdi)) +
  # Fit hdi on log(cpi), without the confidence band
  geom_smooth(method = "lm",
              formula = y ~ log(x),
              se = FALSE) +
  geom_point(
    aes(color = region,
        fill = region),
    size = 2.5,
    alpha = 0.5,
    shape = 21
  ) +
  geom_text_repel(
    aes(label = label),
    color = "black",
    size = 4,
    box.padding = .75
  ) +
  scale_y_continuous(
    limits = c(0.3, 1.05),
    breaks = c(0.2, 0.4, 0.6, 0.8, 1.0),
    name = "Human Development Index, 2014\n(1.0 = most developed)"
  ) +
  scale_x_continuous(
    limits = c(10, 95),
    breaks = c(20, 40, 60, 80, 100),
    name = "Corruption Perceptions Index, 2014 (100 = least corrupt)"
  ) +
  guides(color = guide_legend(nrow = 2)) +
  theme_minimal() +
  theme(
    plot.margin = unit(c(1.75, .75, .75, .5), "cm"),
    legend.position = c(.5, 1.05),
    legend.direction = "horizontal",
    legend.text = element_text(size = 10)
  ) +
  labs(color = "Region",
       fill = "Region")
Question 2
Download the file, labor_supply.zip, from this link. Then, extract labor_supply.zip, so that you can access the labor_supply.csv file.
Variable description in labor_supply.csv
- SEX: 1 if Male; 2 if Female; 9 if NIU (Not in universe)
- NCHLT5: Number of own children under age 5 in a household; 9 if 9+
- LABFORCE: 0 if NIU or members of the armed forces; 1 if not in the labor force; 2 if in the labor force.
- ASECWT: sample weight
The sample weight of each observation indicates how much of the population that observation represents.
- If you sum ASECWT for each year, you get the size of the yearly population in the US.
Households with LABFORCE == 0 (NIU or members of the armed forces) are not counted in the labor force and are excluded from the denominator below.
Labor force participation rate can be calculated by:
\[ (\text{Labor Force Participation Rate}) \, = \, \frac{(\text{Size of population in labor force})}{(\text{Size of civilian population that is not in the armed forces})} \]
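As a minimal sketch of the formula above, the following computes a weighted participation rate per year on a small made-up data frame. The rows in toy_cps are hypothetical and are not taken from labor_supply.csv; only the column names YEAR, LABFORCE, and ASECWT follow the variable description.

```r
library(dplyr)

# Hypothetical toy rows, not taken from labor_supply.csv
toy_cps <- data.frame(
  YEAR     = c(2021, 2021, 2021, 2022, 2022, 2022),
  LABFORCE = c(0, 1, 2, 1, 2, 2),
  ASECWT   = c(100, 200, 300, 150, 250, 100)
)

toy_lfpr <- toy_cps |>
  filter(LABFORCE != 0) |>   # drop NIU / armed forces
  group_by(YEAR) |>
  summarise(
    population     = sum(ASECWT),                 # weighted civilian population
    in_labor_force = sum(ASECWT[LABFORCE == 2]),  # weighted population in the labor force
    lfpr           = in_labor_force / population,
    .groups = "drop"
  )

toy_lfpr
```

Each row contributes its ASECWT rather than a count of 1, so the ratio is a share of the represented population, not a share of survey rows.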
Click to Check the Answer!
library(tidyverse)
library(ggrepel)     # geom_label_repel()
library(hrbrthemes)  # theme_ipsum()

# Replace with the location of labor_supply.csv on your machine
path <- '/Users/bchoe/My Drive/suny-geneseo/teaching-materials/lecture-data/labor_supply.csv'

cps_labor <- read_csv(path)

cps_labor <- cps_labor |>
  filter(YEAR >= 1982) |>
  filter(LABFORCE != 0) |>            # drop NIU and armed forces
  mutate(LABFORCE = LABFORCE - 1) |>  # 0 = not in labor force; 1 = in labor force
  mutate(labor_supply = LABFORCE * ASECWT,
         child = ifelse(NCHLT5 == 0,
                        "No Child Under Age 5 in Household",
                        "Having Children Under Age 5 in Household")) |>
  group_by(YEAR, SEX, child) |>
  # Weighted share of the population that is in the labor force
  summarize(pct = sum(labor_supply, na.rm = TRUE) / sum(ASECWT, na.rm = TRUE)) |>
  filter(!is.na(child))

# Labels placed at the last year of each line
label_df <- filter(cps_labor, YEAR == 2022) |>
  mutate(label = ifelse(SEX == 1, "Male", "Female"))

ggplot(data = cps_labor,
       aes(x = YEAR, y = pct,
           color = factor(SEX))) +
  geom_line(linewidth = 1.5) +  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  geom_label_repel(data = label_df,
                   aes(x = YEAR, y = pct, label = label),
                   color = 'black',
                   nudge_y = .033, box.padding = .5) +
  facet_grid(. ~ factor(child)) +
  scale_x_continuous(breaks = seq(1982, 2022, 4)) +
  scale_y_continuous(labels = scales::percent) +
  scale_color_manual(labels = c("Male", "Female"),
                     values = c("#2E74C0", "#CB454A")) +
  labs(x = "Year",
       y = "Labor Force Participation Rate",
       color = "Gender",
       lty = "Young Children",
       title = "Fertility and Labor Supply in the U.S.",
       subtitle = "1982-2022",
       caption = "Data: IPUMS-CPS, University of Minnesota, www.ipums.org.") +
  guides(color = "none") +
  theme_ipsum() +
  theme(axis.title.y = element_text(size = rel(1.5),
                                    face = 'bold'),
        axis.text.x = element_text(angle = 45))
Question 3
Use the following data.frame for Question 3.
starbucks <- read_csv(
  'https://bcdanl.github.io/data/starbucks.csv')
Variable description
- Product_Name: Product Name
- Size: Size of drink (short, tall, grande, venti)
- Milk: Type of milk used
  - 0: none
  - 1: nonfat
  - 2: 2%
  - 3: soy
  - 4: coconut
  - 5: whole
- Whip: Whip added or not (binary 0/1)
- Serv_Size_mL: Serving size in mL
- Calories: KCal
- Total_Fat_g: Total fat grams
- Saturated_Fat_g: Saturated fat grams
- Trans_Fat_g: Trans fat grams
- Cholesterol_mg: Cholesterol mg
- Sodium_mg: Sodium milligrams
- Total_Carbs_g: Total Carbs grams
- Fiber_g: Fiber grams
- Sugar_g: Sugar grams
- Caffeine_mg: Caffeine in milligrams
Q3a.
- Add the following two variables to the starbucks data.frame:
  - caffeine_mgml: Caffeine in milligrams per mL
  - calories_kcml: Calories KCal per mL
Click to Check the Answer!
starbucks1 <- starbucks |>
  mutate(caffeine_mgml = caffeine_mg / serv_size_m_l,
         calories_kcml = calories / serv_size_m_l,
         .before = 1) |>
  select(product_name, size, serv_size_m_l, milk, whip, caffeine_mgml, calories_kcml)
Q3b.
- Calculate a mean caffeine_mgml and a mean calories_kcml for each product_name.
Click to Check the Answer!
starbucks2 <- starbucks1 |>
  group_by(product_name) |>
  summarise(caffeine_mgml = mean(caffeine_mgml, na.rm = T),
            calories_kcml = mean(calories_kcml, na.rm = T)) |>
  mutate(rank_caffeine = dense_rank(-caffeine_mgml),
         rank_calories = dense_rank(-calories_kcml)) |>
  filter(rank_caffeine <= 10 | rank_calories <= 10)
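A small illustration of the ranking step, assuming dplyr is installed: dense_rank() on the negated values ranks the largest value as 1 and gives tied values the same rank without gaps, so filtering on a rank of 10 or less can return more than ten rows when there are ties. The toy vector below is hypothetical and is not taken from starbucks.csv.

```r
library(dplyr)

# Hypothetical per-mL caffeine values for four drinks (not from starbucks.csv)
toy_caffeine <- c(0.5, 0.9, 0.9, 0.1)

# Negating the values makes the largest value rank first
toy_rank <- dense_rank(-toy_caffeine)
toy_rank
```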
Q3c.
For the top 10 product_name in terms of caffeine_mgml and the top 10 product_name in terms of calories_kcml, replicate the above ggplot.
- Use the following commands for showing text in the plot:
# install.packages("showtext")
library(showtext)
showtext_auto()
font_add_google("Annie Use Your Telescope", "annie")
- Use the following annotate() geom to insert the Starbucks image in the plot:
# install.packages("ggtext")
library(ggtext)
annotate("richtext",
x = ,
y = ,
label = "<img src='https://bcdanl.github.io/lec_figs/starbucks.png' width='100'/>",
fill = ,
size = ,
color = )
- Use the following geom_text_repel() geom to use the annie font:
geom_text_repel(max.overlaps = ,
size = ,
min.segment.length = ,
point.padding = ,
box.padding = ,
show.legend = ,
family = "annie")
- Use the color, #00704A, for the title.
Answer:
Click to Check the Answer!
library(ggrepel)     # geom_text_repel()
library(hrbrthemes)  # theme_ipsum()
# ggtext and showtext are loaded in the snippets above

# Top 10 products by caffeine and by calories (per mL)
starbucks_caffeine <- starbucks2 |> filter(rank_caffeine <= 10)
starbucks_calories <- starbucks2 |> filter(rank_calories <= 10)

starbucks2 |>
  ggplot(aes(x = calories_kcml, y = caffeine_mgml,
             color = product_name, label = product_name)) +
  geom_point(show.legend = FALSE) +
  # Shaded box around the top-10 caffeine products
  geom_rect(aes(xmin = min(starbucks_caffeine$calories_kcml),
                xmax = max(starbucks_caffeine$calories_kcml),
                ymin = min(starbucks_caffeine$caffeine_mgml),
                ymax = max(starbucks_caffeine$caffeine_mgml)),
            fill = "#27251F",
            color = NA,
            alpha = 0.004) +
  # Shaded box around the top-10 calorie products
  geom_rect(aes(xmin = min(starbucks_calories$calories_kcml),
                xmax = max(starbucks_calories$calories_kcml),
                ymin = min(starbucks_calories$caffeine_mgml),
                ymax = max(starbucks_calories$caffeine_mgml)),
            fill = "#27251F",
            color = NA,
            alpha = 0.004) +
  geom_text_repel(max.overlaps = 6,
                  size = 4,
                  min.segment.length = 0.1,
                  point.padding = 0.4,
                  box.padding = 0.5,
                  show.legend = FALSE,
                  family = "annie") +
  annotate("richtext",
           x = quantile(starbucks2$calories_kcml, probs = .78),
           y = quantile(starbucks2$caffeine_mgml, probs = .78),
           label = "<img src='https://bcdanl.github.io/lec_figs/starbucks.png' width='100'/>",
           fill = NA,
           size = rel(2.25),
           color = NA) +
  scale_x_continuous(breaks = seq(0, 1, .2)) +
  scale_y_continuous(breaks = seq(0, 1, .2)) +
  labs(y = expression(paste("Caffeine (mg mL"^"-1",")")),
       x = expression(paste("Calories (Kcal mL"^"-1",")")),
       caption = "Source: Starbucks Coffee Company Beverage Nutrition Information",
       title = "STARBUCKS DRINKS",
       subtitle = "Caffeine or Calories, which one you would go?") +
  theme_ipsum() +
  theme(plot.title = element_text(color = "#00704A",
                                  face = 'bold',
                                  size = rel(2.5)),
        plot.subtitle = element_text(face = 'bold',
                                     size = rel(1.75)),
        axis.title.x = element_text(face = 'bold',
                                    size = rel(1.25)),
        axis.title.y = element_text(face = 'bold',
                                    size = rel(1.25)))
Part 2. Quarto Blogging
Use the following data.frame for Quarto Blogging:
ice_cream <- read_csv('https://bcdanl.github.io/data/ben-and-jerry-cleaned.csv')
Variable Description
Variable Name | Type | Description | Categories / Notes |
---|---|---|---|
priceper1 | Numeric | The unit price for one product serving (in dollars). | |
flavor_descr | Categorical | Description of the ice cream flavor. | |
size1_descr | Categorical | The description of the package or serving size (in fluid ounces or a related measure). | |
household_id | Categorical/ID | Unique identifier for each household purchasing product(s). | |
household_income | Categorical/Numeric | The household income bracket (in dollars). | The number (e.g., 60,000) corresponds to predefined income categories (for instance, “60,000” represents a particular income range from $60,000 to $70,000). Not a categorical variable per se, but can be grouped into categories if desired. |
household_size | Categorical | The number of persons in the household. | Discrete numeric count (e.g., 1, 2, 3…). Not a categorical variable per se, but can be grouped into categories if desired. |
usecoup | Categorical | Indicates whether a coupon was used in the purchase. | Boolean category: True (coupon used) or False (coupon not used). |
couponper1 | Numeric | The discount amount per unit applied when a coupon is used. | Continuous numeric value; often zero if no coupon is used, and a positive number when a discount is applied. |
region | Categorical | Geographic region where the household is located. | Categories include “East”, “Central”, “West”, and “South”. |
married | Categorical | Marital status of the household head (or the household overall status). | Boolean category: True (married) or False (not married). |
race | Categorical | Race of the household head or primary respondent. | Categories include “white”, “black”, “asian”, and “other”. |
hispanic_origin | Categorical | Indicates whether the household identifies as of Hispanic origin. | Boolean category: True (Hispanic origin) or False (not Hispanic). |
microwave | Categorical | Whether the household owns a microwave. | Boolean category: True (owns microwave) or False (does not own one). |
dishwasher | Categorical | Whether the household owns a dishwasher. | Boolean category: True (owns dishwasher) or False (does not own one). |
sfh | Categorical | Whether the household resides in a single-family home. | Boolean category: True (single-family home) or False (does not live in a single-family home). |
internet | Categorical | Whether the household has internet service. | Boolean category: True (has internet) or False (does not). |
tvcable | Categorical | Indicates whether the household subscribes to cable television service. | Boolean category: True (has cable TV) or False (does not). |
- Write a blog post about Ben and Jerry’s ice cream in the ice_cream data.frame using Quarto Document, and add it to your online blog.
  - In your blog post, utilize at least two ggplot figures with the use of theme, scale, and guides, as well as descriptive statistics, counting, filtering, and various group operations.
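One possible starting point for such a post is sketched below, assuming dplyr and ggplot2 are installed. The rows in toy_ice_cream are hypothetical stand-ins that reuse only the column names region and priceper1 from the variable table; the same pipeline could be pointed at ice_cream instead.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in rows; the real ice_cream data.frame has many more
toy_ice_cream <- data.frame(
  region    = c("East", "East", "West", "South", "South", "South"),
  priceper1 = c(3.0, 4.0, 3.5, 2.5, 3.5, 3.0)
)

# Counting and a group operation: purchases and mean unit price per region
price_by_region <- toy_ice_cream |>
  group_by(region) |>
  summarise(n = n(),
            mean_price = mean(priceper1),
            .groups = "drop")

# One figure that touches scale, guides, and theme
p <- ggplot(price_by_region,
            aes(x = region, y = mean_price, fill = region)) +
  geom_col() +
  scale_y_continuous(labels = scales::label_dollar()) +
  guides(fill = "none") +
  labs(x = "Region", y = "Mean price per serving") +
  theme_minimal()

p
```

The summary table doubles as the descriptive statistics for the post, and the fill legend is dropped because the x-axis already names each region.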