Homework 1

Linear Regression · Quarto Blogging

Author

Byeong-Hak Choe

Published

February 27, 2026

Modified

February 27, 2026

Directions

  • Submit your Quarto document for Part 1 to Brightspace using this filename format:

    • danl-320-hw1-LASTNAME-FIRSTNAME.qmd
      (e.g., danl-320-hw1-choe-byeonghak.qmd)
  • Due: February 11, 2026, 3:15 PM (ET)

  • Questions? Email Byeong-Hak .



Required R Packages

library(tidyverse)
library(stargazer)
library(broom)


Part 1. Linear Regression

Consider the beer_markets data:

library(tidyverse)

beer_markets <- read_csv(
  "https://bcdanl.github.io/data/beer_markets_all_cleaned.csv"
)


Variable Description

Variable Name Description
household Unique identifier for household
X_purchase_desc Description of beer purchase
quantity Number of beer packages purchased
brand Brand of beer purchased
dollar_spent Total amount spent on the purchase
beer_floz Total volume of beer purchased (in fluid ounces)
price_floz Price per fluid ounce of beer
container Type of beer container (e.g., CAN, BOTTLE)
promo Indicates if the purchase was part of a promotion (TRUE/FALSE)
region Geographic region of purchase
marital Marital status of household head
income Income level of the household
age Age group of household head
employment Employment status of household head
degree Education level of household head
occupation Occupation category of household head
ethnic Ethnicity of household head
microwave Indicates if the household owns a microwave (TRUE/FALSE)
dishwasher Indicates if the household owns a dishwasher (TRUE/FALSE)
tvcable Type of television subscription (e.g., basic, premium)
singlefamilyhome Indicates if the household is a single-family home (TRUE/FALSE)
npeople Number of people in the household



Question 1 — Filter the data (containers)

Create a data frame that keeps only observations where container is:

  • "CAN" or
  • "NON_REFILLABLE_BOTTLE"

Requirements:

  • Name your filtered data frame beer_sub.
  • Report the number of observations in beer_sub.
  • Report a frequency table of container in beer_sub.

With %in% operator

beer_sub <- beer_markets |> 
  filter(container %in% c("CAN", "NON_REFILLABLE_BOTTLE"))

With OR (|) operator

beer_sub <- beer_markets |> 
  filter(container == "CAN" | container == "NON_REFILLABLE_BOTTLE")
nrow(beer_sub)
[1] 72110
beer_sub |> 
  count(container)
# A tibble: 2 × 2
  container                 n
  <chr>                 <int>
1 CAN                   53015
2 NON_REFILLABLE_BOTTLE 19095


Question 2 — Train/test split

Split beer_sub into two data frames:

  • Training set (beer_train): about 67% of observations
  • Test set (beer_test): the remaining observations

Requirements:

  • Use a random seed for reproducibility (you choose the seed).
  • Report nrow(beer_train) and nrow(beer_test).
  • Briefly explain why we keep a test set.

With runif(n())

# 1) reproducible split
set.seed(1)  # you can choose any seed

# 2) sample row indices
beer_sub <- beer_sub |> 
  mutate(rnd = runif(n())) # n() number of random numbers from Unif(0,1) are drawn

# Setting reference levels
beer_sub <- beer_sub |> 
  mutate(market = factor(market),
         market = fct_relevel(market, "BUFFALO-ROCHESTER"),
         brand = factor(brand),
         brand = fct_relevel(brand, "BUD_LIGHT")
         )

# 2-1) sample row indices for training (about 67%)
beer_train <- beer_sub |> 
  filter(rnd >= 0.33)

# 2-1) sample row indices for test (about 33%)
beer_test <- beer_sub |> 
  filter(rnd < 0.33)


# 3) report sizes
nrow(beer_train)
[1] 48322
nrow(beer_test)
[1] 23788

With sample.int(n)

#| eval: false

# 1) reproducible split
set.seed(1)  # you can choose any seed

# 2) sample row indices for training (about 67%)
n <- nrow(beer_sub)
train_idx <- sample.int(n = n, size = round(0.67 * n), replace = FALSE)

beer_train <- beer_sub[train_idx, ]
beer_test  <- beer_sub[-train_idx, ]

# 3) report sizes
nrow(beer_train)
[1] 48314
nrow(beer_test)
[1] 23796



Questions 3–8 (Regression models)

You will estimate three linear regression models for:

  • Outcome: \(\log(\text{price\_floz})\)

Reference levels (required):

  • Set "BUD_LIGHT" as the reference level for brand.
  • Set "BUFFALO-ROCHESTER" as the reference level for market.
    • There are \(N+1\) distinct categories in market:


Model 1

\[ \begin{aligned} \log(\text{price\_floz}) = &\ \beta_{0} + \sum_{i=1}^{N} \beta_{i} \,\text{market}_{i} + \sum_{j=N+1}^{N+4} \beta_{j} \,\text{brand}_{j} + \beta_{N+5} \,\text{container\_CAN} \\ &\,+\, \beta_{N+6} \log(\text{beer\_floz}) + \epsilon \end{aligned} \]

Model 2 (brand-specific volume sensitivity)

\[ \begin{aligned} \log(\text{price\_floz}) \,=\, & \beta_{0} \,+\, \sum_{i=1}^{N}\beta_{i}\,\text{market}_{i} \,+\, \sum_{j=N+1}^{N+4}\beta_{j}\,\text{brand}_{j} \,+\, \beta_{N+5}\,\text{container\_CAN} \\ &\,+\, \beta_{N+6}\log(\text{beer\_floz})\\ &\,+\, \sum_{j=N+1}^{N+4}\beta_{j\times\text{beer\_floz}}\,\text{brand}_{j}\times \log(\text{beer\_floz})\\ &\,+\, \epsilon \end{aligned} \]

Model 3 (promo + brand-specific promo effects)

\[ \begin{aligned} \log(\text{price\_floz}) \,=\, & \beta_{0} \,+\, \sum_{i=1}^{N}\beta_{i}\,\text{market}_{i} \,+\, \sum_{j=N+1}^{N+4}\beta_{j}\,\text{brand}_{j} \,+\, \beta_{N+5}\,\text{container\_CAN} \\ &\,+\, \beta_{N+6}\log(\text{beer\_floz})\\ &\,+\, \beta_{N+7}\,\text{promo} \times\log(\text{beer\_floz}) \\ &\,+\, \sum_{j=N+1}^{N+4}\beta_{j\times\text{beer\_floz}}\,\text{brand}_{j}\times \log(\text{beer\_floz})\\ &\,+\, \sum_{j=N+1}^{N+4}\beta_{j\times\text{promo}}\,\text{brand}_{j}\times \text{promo}\\ &\,+\, \sum_{j=N+1}^{N+4}\beta_{j\times\text{promo}\times\text{beer\_floz}}\,\text{brand}_{j}\times \text{promo}\times \log(\text{beer\_floz})\\ &\,+\, \epsilon \end{aligned} \]




How to Interpret \(\beta\) Coefficients in Linear Regression

Assume \(d\) is a dummy variable (0/1). Interpretations are ceteris paribus (holding other variables constant).

1) Level–Level Model

\[y = \beta_1 x_1 + \beta_2 d\]

Interpretation
  • \(\beta_1\): a one-unit increase in \(x_1\) changes \(y\) by \(\beta_1\) units.
    • “If \(x_1\) increases by 1, expected \(y\) changes by \(\beta_1\).”
  • \(\beta_2\) (dummy): changing \(d\) from 0 to 1 changes \(y\) by \(\beta_2\) units.
    • “The \(d=1\) group has \(y\) that is \(\beta_2\) higher/lower than the \(d=0\) group.”

2) Log–Level Model (Semi-log)

\[\log(y) = \beta_1 x_1 + \beta_2 d\]

Interpretation (for \(x_1\))
  • \(\beta_1\): a one-unit increase in \(x_1\) changes \(y\) by approximately \(100\beta_1\%\).
    • More accurate:
    • \(\%\Delta y = 100\,(e^{\beta_1}-1)\)
Interpretation (for dummy \(d\))
  • \(\beta_2\): switching from \(d=0\) to \(d=1\) changes \(y\) by:
  • \(\%\Delta y = 100\,(e^{\beta_2}-1)\)
    • “The \(d=1\) group has \(y\) that is \(100(e^{\beta_2}-1)\%\) higher/lower than the \(d=0\) group.”

3) Log–Log Model (Elasticity Model)

\[\log(y) = \beta_1 \log(x_1) + \beta_2 d\]

Interpretation (for \(\log(x_1)\))
  • \(\beta_1\) is an elasticity:
    • a 1% increase in \(x_1\) is associated with a \(\beta_1\%\) change in \(y\).
    • “If \(x_1\) rises by 1%, expected \(y\) changes by \(\beta_1\%\).”
Interpretation (for dummy \(d\))
  • \(\beta_2\): switching from \(d=0\) to \(d=1\) changes \(y\) by:
  • \(\%\Delta y = 100\,(e^{\beta_2}-1)\)

Quick Summary

  • Level–Level: \(\beta\) = units of \(y\) per 1 unit of \(x\)
  • Log–Level: \(\beta \approx\) % change in \(y\) per 1 unit of \(x\)
  • Log–Log: \(\beta\) = % change in \(y\) per 1% change in \(x\) (elasticity)
  • Dummy in \(\log(y)\): percent difference is \(100(e^{\beta}-1)\%\)

Question 3 — Model intuition (conceptual)

For each model, explain in plain English:

  • What is the model trying to capture?
  • What kinds of differences across markets and brands are allowed?
  • How does the model treat the relationship between volume (beer_floz) and price (price_floz)?
  • What new flexibility is added when moving from Model 1 → 2 → 3?

Pricing Models: How Volume and Promotions Affect Price per fl oz

All three models share the same outcome variable — \(\log(\text{price\_per\_floz})\) — and the same core inputs: market fixed effects, brand fixed effects, a container type indicator, and package size \(\log(\text{beer\_floz})\). Because both price and volume are log-transformed, the coefficient on \(\log(\text{beer\_floz})\) is interpretable as an elasticity: a 1% increase in package size is associated with an \(\beta_{N+6}\)% change in price per fl oz. The models differ in how much that elasticity is allowed to vary.


Model 1: One common volume elasticity

\[ \log(\text{price\_per\_floz}) = \beta_0 + \sum_m \beta_m\,\text{market}_m + \sum_b \gamma_b\,\text{brand}_b + \delta\,\text{container}_{\text{CAN}} + \eta\,\log(\text{beer\_floz}) + \epsilon \]

Model 1 is the baseline. Markets, brands, and container type are each allowed to shift the level of price per fl oz up or down, but every brand shares the same single elasticity \(\eta\) with respect to package size. If \(\eta < 0\), larger packs are cheaper per fl oz on average — a standard quantity-discount pattern — but that pattern is assumed identical across all brands. This is the most restrictive specification: it treats beer pricing as essentially homogeneous in how per-unit price scales with size, which can mask real differences in pricing power between premium and value brands.


Model 2: Brand-specific volume elasticity

\[ \log(\text{price\_per\_floz}) = \beta_0 + \sum_m \beta_m\,\text{market}_m + \sum_b \gamma_b\,\text{brand}_b + \delta\,\text{container}_{\text{CAN}} + \eta\,\log(\text{beer\_floz}) + \sum_b \theta_b\!\left(\text{brand}_b \times \log(\text{beer\_floz})\right) + \epsilon \]

Model 2 adds brand-by-volume interactions, so each brand now has its own elasticity. For the reference brand the elasticity is \(\eta\); for any other brand \(b\) it is \(\eta + \theta_b\). This captures the intuition that a premium brand may hold its per-unit price nearly constant as pack size grows, while a value brand might offer steeper quantity discounts. The remaining restriction is that promotional status does not affect these elasticities — on-promotion and off-promotion observations are pooled within each brand.


Model 3: Brand × promotion-specific volume elasticity

\[ \begin{aligned} \log(\text{price\_per\_floz}) = &\ \beta_0 + \sum_m \beta_m\,\text{market}_m + \sum_b \gamma_b\,\text{brand}_b + \delta\,\text{container}_{\text{CAN}} + \eta\,\log(\text{beer\_floz}) \\ &+ \kappa\!\left(\text{promo}\times\log(\text{beer\_floz})\right) + \sum_b \theta_b\!\left(\text{brand}_b \times \log(\text{beer\_floz})\right) \\ &+ \sum_b \pi_b\!\left(\text{brand}_b \times \text{promo}\right) + \sum_b \lambda_b\!\left(\text{brand}_b \times \text{promo} \times \log(\text{beer\_floz})\right) + \epsilon \end{aligned} \]

Model 3 is the most flexible. It allows the volume elasticity to differ not only by brand but also by promotional status, and it lets that promotional shift vary by brand. Concretely:

Reference brand Brand \(b\)
No promo \(\eta\) \(\eta + \theta_b\)
On promo \(\eta + \kappa\) \(\eta + \theta_b + \kappa + \lambda_b\)

This matters because promotions often change the composition of pack sizes sold. A deal-seeking segment may trade up to larger packs when a value brand is on promotion, steepening its quantity-discount curve, while a premium brand’s elasticity may be largely unaffected by promo status. Model 3 can detect and quantify those differences.


Summary

Model 1 provides a clean, parsimonious benchmark with one shared volume response. Model 2 adds nuance by letting each brand have its own packaging sensitivity. Model 3 goes furthest by allowing that sensitivity to shift under promotions, and by letting the size of that shift vary by brand, making it the richest specification for understanding pricing and promotional dynamics.



Question 4 — Estimate, predict, and evaluate

Using beer_train:

  1. Fit Model 1, Model 2, and Model 3 using lm().
  2. Provide a regression table or summary() output for each model.
  3. Using beer_test, generate out-of-sample predictions:
    • predict \(\log(\text{price\_floz})\)
    • convert predictions back to price_floz using exp()
  4. Evaluate predictive performance on beer_test:
    • report RMSE (required)

Requirements:

  • Present results in a small table comparing Model 1–3 test RMSE.
  • Briefly comment on which model predicts best and by how much.
# Fit the models
m1 <- lm(data = beer_train,
         log(price_floz) ~ log(beer_floz) + brand + container + market)
m2 <- lm(data = beer_train,
         log(price_floz) ~ log(beer_floz) * brand + container + market)
m3 <- lm(data = beer_train,
         log(price_floz) ~ log(beer_floz) * brand * promo + container + market)


# Regression table
stargazer(m1, m2, m3, 
          type = "html", # which requires the code chunk option: `#| results: asis`
          digits = 4)  
Dependent variable:
log(price_floz)
(1) (2) (3)
log(beer_floz) -0.1412*** -0.1481*** -0.1442***
(0.0012) (0.0021) (0.0023)
brandBUSCH_LIGHT -0.2602*** -0.2163*** -0.1896***
(0.0028) (0.0219) (0.0231)
brandCOORS_LIGHT -0.0028 -0.0067 0.0018
(0.0024) (0.0182) (0.0193)
brandMILLER_LITE -0.0103*** 0.0710*** 0.1035***
(0.0022) (0.0166) (0.0178)
brandNATURAL_LIGHT -0.3184*** -0.6106*** -0.5438***
(0.0025) (0.0174) (0.0186)
promo -0.0786**
(0.0368)
containerNON_REFILLABLE_BOTTLE 0.0522*** 0.0509*** 0.0522***
(0.0019) (0.0019) (0.0019)
marketALBANY 0.0302** 0.0322*** 0.0235*
(0.0125) (0.0125) (0.0124)
marketATLANTA 0.0868*** 0.0855*** 0.0805***
(0.0101) (0.0100) (0.0099)
marketBALTIMORE 0.0983*** 0.1015*** 0.0901***
(0.0133) (0.0132) (0.0131)
marketBIRMINGHAM 0.1274*** 0.1340*** 0.1290***
(0.0103) (0.0103) (0.0102)
marketBOSTON 0.1307*** 0.1308*** 0.1265***
(0.0109) (0.0108) (0.0107)
marketCHARLOTTE 0.0284*** 0.0249** 0.0328***
(0.0101) (0.0100) (0.0100)
marketCHICAGO -0.0044 -0.0096 -0.0047
(0.0095) (0.0095) (0.0094)
marketCINCINNATI 0.0849*** 0.0797*** 0.0768***
(0.0101) (0.0100) (0.0100)
marketCLEVELAND 0.0663*** 0.0618*** 0.0567***
(0.0102) (0.0101) (0.0100)
marketCOLUMBUS 0.0750*** 0.0720*** 0.0719***
(0.0095) (0.0095) (0.0094)
marketDALLAS 0.2110*** 0.2208*** 0.2241***
(0.0094) (0.0094) (0.0093)
marketDENVER 0.1328*** 0.1307*** 0.1413***
(0.0109) (0.0109) (0.0108)
marketDES_MOINES 0.1380*** 0.1353*** 0.1281***
(0.0112) (0.0111) (0.0110)
marketDETROIT 0.0904*** 0.0870*** 0.0877***
(0.0096) (0.0096) (0.0095)
marketEXURBAN_NJ 0.2309*** 0.2273*** 0.2175***
(0.0161) (0.0161) (0.0160)
marketEXURBAN_NY 0.1146*** 0.1118*** 0.1024***
(0.0220) (0.0218) (0.0217)
marketGRAND_RAPIDS 0.0906*** 0.0862*** 0.0833***
(0.0113) (0.0112) (0.0111)
marketHARTFORD-NEW_HAVEN 0.1483*** 0.1463*** 0.1432***
(0.0138) (0.0137) (0.0136)
marketHOUSTON 0.1217*** 0.1189*** 0.1216***
(0.0097) (0.0096) (0.0095)
marketINDIANAPOLIS 0.0491*** 0.0477*** 0.0472***
(0.0101) (0.0101) (0.0100)
marketJACKSONVILLE 0.1287*** 0.1228*** 0.1226***
(0.0123) (0.0123) (0.0122)
marketKANSAS_CITY 0.0762*** 0.0713*** 0.0639***
(0.0115) (0.0115) (0.0114)
marketLITTLE_ROCK 0.0976*** 0.0935*** 0.0883***
(0.0127) (0.0127) (0.0126)
marketLOS_ANGELES 0.0389*** 0.0328*** 0.0407***
(0.0097) (0.0097) (0.0096)
marketLOUISVILLE 0.0652*** 0.0610*** 0.0641***
(0.0110) (0.0109) (0.0108)
marketMEMPHIS 0.1320*** 0.1304*** 0.1226***
(0.0122) (0.0121) (0.0120)
marketMIAMI 0.1106*** 0.1097*** 0.1098***
(0.0092) (0.0091) (0.0090)
marketMILWAUKEE 0.0277** 0.0262** 0.0277**
(0.0113) (0.0113) (0.0112)
marketMINNEAPOLIS 0.1470*** 0.1480*** 0.1422***
(0.0111) (0.0110) (0.0109)
marketNASHVILLE 0.1495*** 0.1486*** 0.1466***
(0.0105) (0.0105) (0.0104)
marketNEW_ORLEANS-MOBILE 0.1362*** 0.1269*** 0.1198***
(0.0108) (0.0107) (0.0106)
marketOKLAHOMA_CITY-TULSA 0.1501*** 0.1467*** 0.1385***
(0.0110) (0.0109) (0.0109)
marketOMAHA 0.1321*** 0.1292*** 0.1292***
(0.0104) (0.0104) (0.0103)
marketORLANDO 0.1047*** 0.1032*** 0.1043***
(0.0103) (0.0102) (0.0101)
marketPHILADELPHIA 0.1089*** 0.1080*** 0.0935***
(0.0128) (0.0128) (0.0127)
marketPHOENIX 0.1482*** 0.1514*** 0.1581***
(0.0093) (0.0093) (0.0092)
marketPITTSBURGH 0.1027*** 0.1008*** 0.0932***
(0.0136) (0.0135) (0.0134)
marketPORTLAND 0.1124*** 0.1108*** 0.1117***
(0.0120) (0.0119) (0.0118)
marketRALEIGH-DURHAM 0.0874*** 0.0881*** 0.0814***
(0.0103) (0.0102) (0.0101)
marketRICHMOND 0.0413*** 0.0387*** 0.0316***
(0.0104) (0.0103) (0.0102)
marketRURAL_ALABAMA 0.1695*** 0.1689*** 0.1631***
(0.0143) (0.0142) (0.0141)
marketRURAL_ARKANSAS 0.1722*** 0.1742*** 0.1646***
(0.0185) (0.0184) (0.0183)
marketRURAL_CALIFORNIA 0.0539*** 0.0501*** 0.0516***
(0.0109) (0.0108) (0.0107)
marketRURAL_COLORADO 0.2014*** 0.2008*** 0.2040***
(0.0428) (0.0425) (0.0422)
marketRURAL_FLORIDA 0.0572*** 0.0478*** 0.0451***
(0.0121) (0.0120) (0.0119)
marketRURAL_GEORGIA 0.1375*** 0.1332*** 0.1268***
(0.0128) (0.0127) (0.0126)
marketRURAL_IDAHO 0.1546*** 0.1488*** 0.1486***
(0.0181) (0.0180) (0.0179)
marketRURAL_ILLINOIS 0.0179* 0.0166 0.0130
(0.0102) (0.0102) (0.0101)
marketRURAL_INDIANA 0.0711*** 0.0729*** 0.0711***
(0.0124) (0.0123) (0.0122)
marketRURAL_IOWA 0.0617*** 0.0585*** 0.0538***
(0.0103) (0.0103) (0.0102)
marketRURAL_KANSAS 0.1308*** 0.1315*** 0.1214***
(0.0170) (0.0169) (0.0168)
marketRURAL_KENTUCKY 0.1529*** 0.1521*** 0.1485***
(0.0163) (0.0163) (0.0161)
marketRURAL_LOUISIANA 0.0827*** 0.0780*** 0.0665***
(0.0131) (0.0131) (0.0130)
marketRURAL_MAINE 0.0937*** 0.0914*** 0.0887***
(0.0134) (0.0134) (0.0133)
marketRURAL_MICHIGAN 0.0832*** 0.0801*** 0.0750***
(0.0111) (0.0111) (0.0110)
marketRURAL_MINNESOTA 0.1723*** 0.1767*** 0.1688***
(0.0195) (0.0194) (0.0192)
marketRURAL_MISSISSIPPI 0.0513*** 0.0459*** 0.0458***
(0.0140) (0.0139) (0.0138)
marketRURAL_MISSOURI 0.1059*** 0.1048*** 0.0970***
(0.0115) (0.0115) (0.0114)
marketRURAL_MONTANA 0.1199*** 0.1169*** 0.1239***
(0.0137) (0.0137) (0.0136)
marketRURAL_NEBRASKA 0.1476*** 0.1474*** 0.1418***
(0.0210) (0.0208) (0.0207)
marketRURAL_NEVADA 0.0451*** 0.0450*** 0.0425***
(0.0119) (0.0118) (0.0117)
marketRURAL_NEW_HAMPSHIRE 0.0654 0.0489 0.0435
(0.0441) (0.0439) (0.0435)
marketRURAL_NEW_MEXICO 0.1697*** 0.1657*** 0.1618***
(0.0129) (0.0129) (0.0128)
marketRURAL_NEW_YORK -0.0264 -0.0242 -0.0398
(0.0566) (0.0563) (0.0558)
marketRURAL_NORTH_CAROLINA 0.0221** 0.0505*** 0.0372***
(0.0108) (0.0108) (0.0107)
marketRURAL_NORTH_DAKOTA 0.2335*** 0.2309*** 0.2307***
(0.0198) (0.0197) (0.0196)
marketRURAL_OHIO 0.1062*** 0.1030*** 0.0998***
(0.0158) (0.0157) (0.0156)
marketRURAL_OKLAHOMA 0.1318*** 0.1330*** 0.1210***
(0.0278) (0.0276) (0.0274)
marketRURAL_OREGON -0.0057 -0.0077 -0.0097
(0.0346) (0.0344) (0.0341)
marketRURAL_PENNSYLVANIA 0.1265*** 0.1262*** 0.1147***
(0.0143) (0.0143) (0.0141)
marketRURAL_SOUTH_CAROLINA 0.0582*** 0.0568*** 0.0582***
(0.0100) (0.0100) (0.0099)
marketRURAL_SOUTH_DAKOTA 0.0842*** 0.0823*** 0.0763***
(0.0178) (0.0177) (0.0176)
marketRURAL_TENNESSEE 0.1766*** 0.1773*** 0.1795***
(0.0131) (0.0130) (0.0130)
marketRURAL_TEXAS 0.1722*** 0.1710*** 0.1667***
(0.0096) (0.0095) (0.0094)
marketRURAL_VERMONT 0.0845*** 0.0732*** 0.0712***
(0.0195) (0.0195) (0.0193)
marketRURAL_VIRGINIA 0.0237 0.0218 0.0164
(0.0170) (0.0170) (0.0168)
marketRURAL_WASHINGTON 0.1003*** 0.0990*** 0.1112***
(0.0139) (0.0138) (0.0138)
marketRURAL_WEST_VIRGINIA -0.0399*** -0.0423*** -0.0527***
(0.0149) (0.0148) (0.0147)
marketRURAL_WISCONSIN 0.0446*** 0.0423*** 0.0403***
(0.0100) (0.0100) (0.0099)
marketRURAL_WYOMING 0.1915*** 0.1900*** 0.1833***
(0.0317) (0.0316) (0.0313)
marketSACRAMENTO 0.0360*** 0.0349*** 0.0395***
(0.0105) (0.0105) (0.0104)
marketSALT_LAKE_CITY 0.1050*** 0.0986*** 0.0954***
(0.0142) (0.0142) (0.0141)
marketSAN_ANTONIO 0.1390*** 0.1342*** 0.1295***
(0.0092) (0.0091) (0.0091)
marketSAN_DIEGO 0.0155 0.0146 0.0181
(0.0117) (0.0116) (0.0115)
marketSAN_FRANCISCO 0.0713*** 0.0685*** 0.0741***
(0.0107) (0.0107) (0.0106)
marketSEATTLE 0.1190*** 0.1103*** 0.1206***
(0.0107) (0.0106) (0.0105)
marketST_LOUIS 0.0486*** 0.0449*** 0.0460***
(0.0099) (0.0099) (0.0098)
marketSURBURBAN_NJ -0.0162 -0.0186 -0.0315**
(0.0131) (0.0131) (0.0130)
marketSURBURBAN_NY 0.1060*** 0.1031*** 0.1011***
(0.0126) (0.0126) (0.0124)
marketSYRACUSE -0.0340** -0.0386*** -0.0491***
(0.0146) (0.0146) (0.0145)
marketTAMPA 0.1103*** 0.1071*** 0.1069***
(0.0090) (0.0090) (0.0089)
marketURBAN_NY 0.1615*** 0.1616*** 0.1597***
(0.0113) (0.0112) (0.0111)
marketWASHINGTON_DC 0.1032*** 0.0974*** 0.0905***
(0.0108) (0.0107) (0.0106)
log(beer_floz):brandBUSCH_LIGHT -0.0077* -0.0132***
(0.0040) (0.0042)
log(beer_floz):brandCOORS_LIGHT 0.0006 0.0001
(0.0034) (0.0036)
log(beer_floz):brandMILLER_LITE -0.0151*** -0.0211***
(0.0031) (0.0033)
log(beer_floz):brandNATURAL_LIGHT 0.0542*** 0.0416***
(0.0032) (0.0035)
log(beer_floz):promo 0.0061
(0.0067)
brandBUSCH_LIGHT:promo -0.1490**
(0.0696)
brandCOORS_LIGHT:promo -0.1289**
(0.0571)
brandMILLER_LITE:promo -0.2613***
(0.0513)
brandNATURAL_LIGHT:promo -0.3153***
(0.0507)
log(beer_floz):brandBUSCH_LIGHT:promo 0.0280**
(0.0123)
log(beer_floz):brandCOORS_LIGHT:promo 0.0180*
(0.0103)
log(beer_floz):brandMILLER_LITE:promo 0.0472***
(0.0093)
log(beer_floz):brandNATURAL_LIGHT:promo 0.0553***
(0.0092)
Constant -2.1776*** -2.1388*** -2.1491***
(0.0106) (0.0141) (0.0145)
Observations 48,314 48,314 48,314
R2 0.5529 0.5575 0.5655
Adjusted R2 0.5520 0.5566 0.5645
Residual Std. Error 0.1679 (df = 48214) 0.1670 (df = 48210) 0.1655 (df = 48200)
F Statistic 602.2622*** (df = 99; 48214) 589.7865*** (df = 103; 48210) 555.2074*** (df = 113; 48200)
Note: p<0.1; p<0.05; p<0.01
# stargazer(m1, m2, m3, 
          # type = "text")  # to see a regression table just like in R Console
# Prediction and conversion
beer_test <- beer_test |> 
  mutate(pred_1 = predict(m1, newdata = beer_test),
         pred_2 = predict(m2, newdata = beer_test),
         pred_3 = predict(m3, newdata = beer_test)
         ) |> 
  mutate(price_floz_pred_1 = exp(pred_1),
         price_floz_pred_2 = exp(pred_2),
         price_floz_pred_3 = exp(pred_3)
         )
# RMSE

RMSEs <- beer_test |> 
  mutate(
    resid_1 = log(price_floz) - pred_1,
    resid_2 = log(price_floz) - pred_2,
    resid_3 = log(price_floz) - pred_3
  ) |> 
  mutate(
    resid_1_sq = resid_1^2,
    resid_2_sq = resid_2^2,
    resid_3_sq = resid_3^2
  ) |> 
  summarize(
    rmse_1 = sqrt( mean(resid_1_sq) ),
    rmse_2 = sqrt( mean(resid_2_sq) ),
    rmse_3 = sqrt( mean(resid_3_sq) )
  )

RMSEs |> 
  rmarkdown::paged_table()
# RMSE with the same unit of price_floz

RMSEs_converted <- beer_test |> 
  mutate(
    resid_1 = price_floz - exp(pred_1),
    resid_2 = price_floz - exp(pred_2),
    resid_3 = price_floz - exp(pred_3),
  ) |> 
  mutate(
    resid_1_sq = resid_1^2,
    resid_2_sq = resid_2^2,
    resid_3_sq = resid_3^2
  ) |> 
  summarize(
    rmse_1_converted = sqrt( mean(resid_1_sq) ),
    rmse_2_converted = sqrt( mean(resid_2_sq) ),
    rmse_3_converted = sqrt( mean(resid_3_sq) )
  ) 

RMSEs_converted |> 
  rmarkdown::paged_table()



Question 5 — Interpret market effects (Model 3)

Using Model 3, interpret the estimated coefficients for the following market indicators:

  1. market_ALBANY
  2. market_EXURBAN_NY
  3. market_RURAL_NEW_YORK
  4. market_SURBURBAN_NY
  5. market_SYRACUSE
  6. market_URBAN_NY

Requirements:

  • Interpret each coefficient relative to the reference market (BUFFALO-ROCHESTER), holding other variables fixed.
  • Since the outcome is in logs, convert each coefficient to an approximate percent difference using:
    • Approximation: \(100\times \beta\) (when \(|\beta|\) is small)
    • Exact: \(100\times (\exp(\beta)-1)\)

Use whichever you prefer, but be consistent.

m3_betas <- tidy(m3, conf.int = T) |> 
  filter(
    term %in% c("marketALBANY", "marketEXURBAN_NY", 
                "marketRURAL_NEW_YORK", "marketSURBURBAN_NY",
                "marketSYRACUSE", "marketURBAN_NY")
  ) |> 
  mutate(
    stars = case_when(
      p.value < .01 ~ "***",  # smallest p-values first
      p.value < .05 ~ "**",
      p.value < .1 ~ "*",
      TRUE ~ ""
    ),
    estimate_exp_1 = exp(estimate) - 1,
    .after = estimate
  ) |> 
  mutate(estimate = round(estimate, 3),
         estimate_exp_1 = round(estimate_exp_1, 3)
         )

m3_betas |> 
  relocate(stars, .after = estimate_exp_1) |> 
  rmarkdown::paged_table()
  • Below provides a list of beta esimates and exponential function of those beta estimates from Model 3:
    • market_ALBANY (\(\hat{\beta}=0.024^*\), \(e^{\hat{\beta}}=1.024\))
      Ceteris paribus, beer prices in Albany are 2.4% higher than in the Buffalo–Rochester market.

    • market_EXURBAN_NY (\(\hat{\beta}=0.102^{***}\), \(e^{\hat{\beta}}=1.108\))
      Ceteris paribus, beer prices in ExurbanNY are 10.8% higher than in the Buffalo–Rochester market.

    • market_RURAL_NEW_YORK (\(\hat{\beta}=-0.040\), \(e^{\hat{\beta}}=0.961\))
      Ceteris paribus, beer prices in RuralNY are not statistically different than in the Buffalo–Rochester market.

    • market_SUBURBAN_NY (\(\hat{\beta}=0.101^{***}\), \(e^{\hat{\beta}}=1.106\))
      Ceteris paribus, beer prices in SuburbanNY are 10.6% higher than in the Buffalo–Rochester market.

    • market_SYRACUSE (\(\hat{\beta}=-0.049^{***}\), \(e^{\hat{\beta}}=0.952\))
      Ceteris paribus, beer prices in Syracuse are 4.8% lower than in the Buffalo–Rochester market.

    • market_URBAN_NY (\(\hat{\beta}=0.160^{***}\), \(e^{\hat{\beta}}=1.173\))
      Ceteris paribus, beer prices in UrbanNY are 17.3% higher than in the Buffalo–Rochester market.

All CIs
m3_betas_ci95 <- tidy(m3, conf.int = T) |> 
  mutate(ci = "95%")
m3_betas_ci90 <- tidy(m3, conf.int = T, conf.level = .9) |> 
  mutate(ci = "90%")
m3_betas_ci99 <- tidy(m3, conf.int = T, conf.level = .99) |> 
  mutate(ci = "99%")

m3_betas <- bind_rows(
  m3_betas_ci95,
  m3_betas_ci90,
  m3_betas_ci99
) |> 
  filter(
    term %in% c("marketALBANY", "marketEXURBAN_NY", 
                "marketRURAL_NEW_YORK", "marketSURBURBAN_NY",
                "marketSYRACUSE", "marketURBAN_NY")
  ) |> 
  mutate(
    stars = case_when(
      p.value < .01 ~ "***",  # smallest p-values first
      p.value < .05 ~ "**",
      p.value < .1 ~ "*",
      TRUE ~ ""
    ),
    estimate_exp_1 = exp(estimate)-1,
    .after = estimate
  ) |> 
  mutate(estimate = round(estimate, 3),
         estimate_exp_1 = round(estimate_exp_1, 3)
         )

m3_betas |> 
  relocate(stars, .after = estimate_exp_1) |> 
  rmarkdown::paged_table()
Coefficient plot
ggplot(
  m3_betas,
  aes(
    y = term,
    x = exp(estimate) - 1,
    xmin = exp(conf.low) - 1,
    xmax = exp(conf.high) - 1,
    color = ci
  )
) +
  geom_vline(xintercept = 0, color = "maroon", linetype = 2) +
  geom_pointrange(
    position = position_dodge(width = 0.6)
  ) +
  labs(
    title = "Market effects with multiple confidence intervals",
    color = "CI level",
    y = ""
  ) 

Confidence Interval and True Beta

  • The true value of \(\beta\) is 90% likely to be in the 90% confidence interval.
  • The true value of \(\beta\) is 95% likely to be in the 95% confidence interval.
  • The true value of \(\beta\) is 99% likely to be in the 99% confidence interval.



Question 6 — Volume sensitivity and promo (elasticities)

Across the three models, analyze how sensitive price is to volume, and how that differs by brand and promo:

  1. In Model 1, explain what the coefficient on \(\log(\text{beer\_floz})\) means.
  2. In Model 2, show how the volume sensitivity differs by brand.
  3. In Model 3, explain how promo changes the sensitivity, and how the effect can differ by brand.

Requirements:

  • Write the expression for the slope (marginal effect) of \(\log(\text{beer\_floz})\) for:
    • a particular brand when promo = FALSE
    • the same brand when promo = TRUE
  • Provide at least one numeric example using estimated coefficients from your Model 3 output (pick any one brand).
  • We should focus on the beta esimates for predictors with \(\log(beer floz)\) to calculate the volume elasticity of beer price across brands.
Predictor Model 1 Model 2 Model 3
log_beer_floz -0.1412*** -0.1481*** -0.1442***
brand_BUSCH_LIGHT_*_log_beer_floz -0.0077* -0.0132***
brand_COORS_LIGHT_*_log_beer_floz -0.0006 -0.0001
brand_MILLER_LITE_*_log_beer_floz -0.0151*** -0.0211***
brand_NATURAL_LIGHT_*_log_beer_floz 0.0542*** 0.0416***
promo_True_*_log_beer_floz 0.0061
brand_BUSCH_LIGHT_promo_True_log_beer_floz 0.0280**
brand_COORS_LIGHT_promo_True_log_beer_floz 0.0180*
brand_MILLER_LITE_promo_True_log_beer_floz 0.0472***
brand_NATURAL_LIGHT_promo_True_log_beer_floz 0.0553***
Model 1 Model 2 Model 3 (no Promo) Model 3 (with Promo)
BUD -0.1412 -0.1481 -0.1442 -0.1442
= -0.1442 − 0
BUSCH -0.1412 -0.1558
= -0.1481 − 0.0077
-0.1574
= -0.1442 - 0.0132
-0.1294
= -0.1442 − 0.0132 − 0 + 0.0280
COORS -0.1412 -0.1481
= -0.1481 − 0
-0.1442
= -0.1442 - 0
-0.1262
= -0.1442 - 0 - 0 + 0.0180
MILLER -0.1412 -0.1632
= -0.1481 − 0.0151
-0.1653
= -0.1442 - 0.0211
-0.1181
= -0.1442 - 0.0211 - 0 + 0.0472
NATURAL -0.1412 -0.0939
= -0.1481 + 0.0542
-0.1026
= -0.1442 + 0.0416
-0.0473
= -0.1442 + 0.0416 - 0 + 0.0553
  • We can extract a beta estimate using coef(). For example:
coef(m2)['log(beer_floz)'] + coef(m2)['log(beer_floz):COORS LIGHT']

All else being equal:

  • In Model 1
    • A 1% increase in sales volume (across any of the five brands) is associated with a 0.1412% decrease in price.
  • In Model 2
    • A 1% increase in BUD sales volume is associated with a 0.1481% decrease in its price.
    • A 1% increase in BUSCH sales volume is associated with a 0.1558% decrease in its price.
    • A 1% increase in COORS sales volume is associated with a 0.1481% decrease in its price.
    • A 1% increase in MILLER sales volume is associated with a 0.1632% decrease in its price.
    • A 1% increase in NATURAL sales volume is associated with a 0.0939% decrease in its price.
  • In Model 3 (no Promo)
    • A 1% increase in BUD sales volume is associated with a 0.1442% decrease in its price.
    • A 1% increase in BUSCH sales volume is associated with a 0.1574% decrease in its price.
    • A 1% increase in COORS sales volume is associated with a 0.1442% decrease in its price.
    • A 1% increase in MILLER sales volume is associated with a 0.1653% decrease in its price.
    • A 1% increase in NATURAL sales volume is associated with a 0.1026% decrease in its price.
  • In Model 3 (with Promo)
    • A 1% increase in BUD sales volume is associated with a 0.1442% decrease in its price.
    • A 1% increase in BUSCH sales volume is associated with a 0.1294% decrease in its price.
    • A 1% increase in COORS sales volume is associated with a 0.1262% decrease in its price.
    • A 1% increase in MILLER sales volume is associated with a 0.1181% decrease in its price.
    • A 1% increase in NATURAL sales volume is associated with a 0.0473% decrease in its price.
Code
# --- slopes ---
model1_slope <- -0.1412

model2_slopes <- c(
  Bud = -0.1481,
  Busch = -0.1558,
  Coors = -0.1481,
  Miller = -0.1632,
  Natural = -0.0939
)

model3_full_slopes <- c(
  Bud = -0.1442,
  Busch = -0.1574,
  Coors = -0.1442,
  Miller = -0.1653,
  Natural = -0.1026
)

model3_promo_slopes <- c(
  Bud = -0.1442,
  Busch = -0.1294,
  Coors = -0.1262,
  Miller = -0.1181,
  Natural = -0.0473
)

# --- x range: like np.linspace(0, 10, 100) ---
x_df <- tibble(x = seq(0, 10, length.out = 100))

# --- build plotting data ---
df_model1 <- x_df  |>
  mutate(
    model = "Model 1: Common Elasticity",
    brand = "All Brands",
    y = model1_slope * x
  )

df_model2 <- x_df  |>
  crossing(tibble(brand = names(model2_slopes),
                  slope = as.numeric(model2_slopes)))  |>
  mutate(
    model = "Model 2: Brand-Specific",
    y = slope * x
  )

df_model3_full <- x_df  |>
  crossing(tibble(brand = names(model3_full_slopes),
                  slope = as.numeric(model3_full_slopes)))  |>
  mutate(
    model = "Model 3: Full-Price",
    y = slope * x
  )

df_model3_promo <- x_df  |>
  crossing(tibble(brand = names(model3_promo_slopes),
                  slope = as.numeric(model3_promo_slopes)))  |>
  mutate(
    model = "Model 3: Promotional Price",
    y = slope * x
  )

plot_df <- bind_rows(df_model1, df_model2, df_model3_full, df_model3_promo)

# --- plot ---
ggplot(plot_df, aes(x = x, y = y, color = brand)) +
  geom_line(linewidth = 1) +
  facet_wrap(~ model, nrow = 2) +
  labs(
    x = "log(sales)",
    y = "log(price)",
    color = NULL
  ) +
  theme_minimal() +
  theme(legend.position = "top")

Across all three models, larger package sizes are consistently associated with lower price per fl oz, confirming a standard quantity-discount pattern. The estimated elasticities are all small and negative, ranging roughly from −0.05% to −0.17% per 1% increase in volume.

  • In Model 1, a single elasticity of −0.1412% applies to all brands.

  • Model 2 reveals meaningful brand-level differences: NATURAL has the smallest discount (−0.0939%), while MILLER has the largest (−0.1632%), with BUD and COORS tied in the middle (−0.1481%).

  • Model 3 (no promo) tells a broadly similar story, though the estimates shift slightly.

  • The more striking finding appears in Model 3 under promotion: for most brands, the quantity discount becomes noticeably smaller when an item is on promotion. NATURAL drops from −0.1026% to −0.0473%, MILLER from −0.1653% to −0.1181%, and COORS from −0.1442% to −0.1262%.

    • BUD is the exception, with its elasticity unchanged at −0.1442% regardless of promotional status.
    • This suggests that promotions compress the size-price relationship for most brands, but BUD’s per-unit pricing scales with volume in the same way whether or not it is on promotion.



Question 7 — Residual diagnostics

For each model (1–3), create a residual plot:

  • x-axis: fitted values
  • y-axis: residuals
  • include a horizontal line at 0

Then answer:

  • Are residuals centered around 0 on average?
  • Do you see systematic patterns (curvature, fan shape / heteroskedasticity, clusters, outliers)?
  • Which model appears to have the best residual behavior?
m1 |> 
  augment(newdata = beer_test) |> 
  ggplot(aes(x = .fitted,
             y = .resid)) +
  geom_point(alpha = 0.25) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_smooth(se = FALSE, method = "loess") +
  labs(
    x = "Fitted values",
    y = "Residuals",
    title = "Model 1: Residual Plot for Test Data"
  )

m2 |> 
  augment(newdata = beer_test) |> 
  ggplot(aes(x = .fitted,
             y = .resid)) +
  geom_point(alpha = 0.25) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_smooth(se = FALSE, method = "loess") +
  labs(
    x = "Fitted values",
    y = "Residuals",
    title = "Model 2: Residual Plot for Test Data"
  )

m3 |> 
  augment(newdata = beer_test) |> 
  ggplot(aes(x = .fitted,
             y = .resid)) +
  geom_point(alpha = 0.25) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_smooth(se = FALSE, method = "loess") +
  labs(
    x = "Fitted values",
    y = "Residuals",
    title = "Model 3: Residual Plot for Test Data"
  )

  • Are the residuals centered around 0 (on average)?
    • Yes. The residuals appear to be roughly balanced above and below 0, with no clear shift away from zero.
  • Do you see systematic patterns (curvature, fan shape / heteroskedasticity, clusters, outliers)?
    • No. I don’t see obvious curvature or a fan-shaped spread, and there are no clear clusters or extreme outliers that stand out.
      • Model 1 has a wider range of fitted values than Models 2 and 3, but this is minor.
  • Which model appears to have the best residual behavior?
    • The residual plots look very similar across models.
    • None clearly dominates based on residual behavior alone, so they appear comparably good on this diagnostic.



Question 8 — Choose a preferred model

Which model do you prefer most and why? Your answer must reference at least two of the following:

  • out-of-sample performance (test RMSE)
  • interpretability / simplicity
  • whether interactions are substantively meaningful
  • residual plot patterns

I prefer Model 3 for three simple reasons:

  1. Realistic Market Setting
  • It captures both the unique sensitivity of each brand and how that sensitivity changes when a beer is on promotion.
  1. Practical Pricing Strategies
  • By distinguishing full-price from promotional periods, it tells you exactly how much to adjust each brand’s price under each scenario.
  1. Better Fit with the lowest MSE on test data
  • Allowing elasticities to vary by brand and promotion status typically explains sales patterns more accurately than the cruder alternatives.

In short, Model 3 reflects realistic market settings with brand heterogeneity and promotion effect, along with the best prediction quality.



Part 2. Quarto Blogging


Use the following data frame for Quarto Blogging:

#| warning: false

library(tidyverse)
ice_cream <- read_csv("https://bcdanl.github.io/data/ben-and-jerry-cleaned.csv")


Variable Description

Variable Name Type Description Categories / Notes
priceper1 Numeric The unit price for one product serving (in dollars).
flavor_descr Categorical Description of the ice cream flavor.
size1_descr Categorical Package/serving size description.
household_id ID Household identifier.
household_income Numeric-coded Household income bracket code. e.g., 60000 indicates an income range
household_size Integer Number of persons in the household.
usecoup Logical Coupon used (TRUE/FALSE).
couponper1 Numeric Coupon discount per unit. often 0 if no coupon
region Categorical Region (East/Central/West/South).
married Logical Married (TRUE/FALSE).
race Categorical Race category.
hispanic_origin Logical Hispanic origin (TRUE/FALSE).
microwave Logical Owns microwave (TRUE/FALSE).
dishwasher Logical Owns dishwasher (TRUE/FALSE).
sfh Logical Single-family home (TRUE/FALSE).
internet Logical Has internet (TRUE/FALSE).
tvcable Logical Subscribes to cable TV (TRUE/FALSE).


Part 2 Task — Write a blog post (Quarto Document)

Write a blog post about Ben & Jerry’s ice cream using the ice_cream data and publish it on your course website blog.

Requirements (minimum):

  1. Create a post using a Quarto Document (.qmd) and ensure it appears on your Quarto website blog.

  2. Include:

    • a short introduction (what the dataset is and what you’ll explore)
    • descriptive statistics (at least 3 summary numbers)
    • counting (e.g., top flavors)
    • filtering (at least one meaningful filter)
    • grouped summaries (at least one group_by() + summarise())
    • one regression model (your choice; explain the outcome and predictors)
  3. Optional but encouraged:

    • at least one plot (e.g., ggplot boxplot by region, price distribution, etc.)

Deliverables:

  • The blog post (in your website repo) renders successfully
  • The post is accessible from your website’s blog listing page
  • In your Brightspace submission (Part 1 QMD), include the URL to your published blog post
Back to top