Homework 2

Linear Regression; Jupyter Notebook Blogging

Author

Byeong-Hak Choe

Published

May 4, 2025

Modified

May 4, 2025

Direction

Please submit your Jupyter Notebook for Part 1 in Homework 2 to Brightspace with the name below:
- danl-320-hw2-LASTNAME-FIRSTNAME.ipynb
  ( e.g., danl-320-hw2-choe-byeonghak.ipynb )
The due is March 24, 2025, 3:15 P.M.
- It is recommended to finish it before the Spring Break begins.
Please send Byeong-Hak an email (bchoe@geneseo.edu) if you have any questions.

Descriptive Statistics

The following provides the descriptive statistics for each part of Homework 2:

Part 1. Linear Regression

Consider the beer_markets

beer_markets = pd.read_csv(
  'https://bcdanl.github.io/data/beer_markets_all_cleaned.csv'
)

Variable Description

Variable Name	Description
`household`	Unique identifier for household
`X_purchase_desc`	Description of beer purchase
`quantity`	Number of beer packages purchased
`brand`	Brand of beer purchased
`dollar_spent`	Total amount spent on the purchase
`beer_floz`	Total volume of beer purchased (in fluid ounces)
`price_floz`	Price per fluid ounce of beer
`container`	Type of beer container (e.g., CAN, BOTTLE)
`promo`	Indicates if the purchase was part of a promotion (True/False)
`region`	Geographic region of purchase
`marital`	Marital status of household head
`income`	Income level of the household
`age`	Age group of household head
`employment`	Employment status of household head
`degree`	Education level of household head
`occupation`	Occupation category of household head
`ethnic`	Ethnicity of household head
`microwave`	Indicates if the household owns a microwave (True/False)
`dishwasher`	Indicates if the household owns a dishwasher (True/False)
`tvcable`	Type of television subscription (e.g., basic, premium)
`singlefamilyhome`	Indicates if the household is a single-family home (True/False)
`npeople`	Number of people in the household

Question 1

Create the DataFrame that keeps all the observations whose value of container is either ‘CAN’ or ‘NON REFILLABLE BOTTLE’ in the beer_markets DataFrame.

Question 2

Split the resulting DataFrame of Question 1 into training and test DataFrames such that approximately 67% of observations in the resulting DataFrame of Question 1 belong to the training DataFrame and the rest observations belong to the test DataFrame.

Questions 3-8

Consider the following three linear regression models:

Model 1

\[ \begin{aligned} \log(\text{price\_per\_floz}) = &\ \beta_{0} + \sum_{i=1}^{N} \beta_{i} \,\text{market}_{i} + \sum_{j=N+1}^{N+4} \beta_{j} \,\text{brand}_{j} + \beta_{N+5} \,\text{container\_CAN} \\ &\,+\, \beta_{N+6} \log(\text{beer\_floz}) + \epsilon \end{aligned} \]

Model 2

Model 3

\[ \begin{aligned} \log(\text{price\_per\_floz}) \,=\, & \beta_{0} \,+\, \sum_{i=1}^{N}\beta_{i}\,\text{market}_{i} \,+\, \sum_{j=N+1}^{N+4}\beta_{j}\,\text{brand}_{j} \,+\, \beta_{N+5}\,\text{container\_CAN} \\ &\,+\, \beta_{N+6}\log(\text{beer\_floz})\\ &\,+\, \beta_{N+7}\,\text{promo} \times\log(\text{beer\_floz}) \\ &\,+\, \sum_{j=N+1}^{N+4}\beta_{j\times\text{beer\_floz}}\,\text{brand}_{j}\times \log(\text{beer\_floz})\\ &\,+\, \sum_{j=N+1}^{N+4}\beta_{j\times\text{promo}}\,\text{brand}_{j}\times \text{promo}\\ &\,+\, \sum_{j=N+1}^{N+4}\beta_{j\times\text{promo}\times\text{beer\_floz}}\,\text{brand}_{j}\times \text{promo}\times \log(\text{beer\_floz})\\ &\,+\, \epsilon \end{aligned} \]

Set “BUD_LIGHT” as the reference level for the $\text{brand}$ variable.
Set “BUFFALO-ROCHESTER” as the reference level for the $\text{market}$ variable..
- There are $N+1$ distinct categories in the $\text{market}$ variable.

Question 3

Provide intuition behind each of the model.

Question 4

Train each of the three linear regression models using the training DataFrame from Question 2.
- Provide the summary of the result for each linear regression model.
- What are the predicted beer prices for unseen data from each model?

Question 5

Interpret the beta estimates of the following variables from the Model 3:
1. market_ALBANY
2. market_EXURBAN_NY
3. market_RURAL_NEW_YORK
4. market_SURBURBAN_NY
5. market_SYRACUSE
6. market_URBAN_NY

Question 6

Across the three models, how is the percentage change in the price of beer sensitive to the percentage change in the volume of beer purchases for each brand?
How does promo affect such sensitivity in the Model 3?

Question 7

Draw a residual plot from each of the three models.
- On average, are the prediction correct? Are there systematic errors?

Question 8

Which model do you prefer most and why?

Part 2. Jupyter Notebook Blogging

Use the following data.frame for Jupyter Notebook Blogging:

ice_cream <- read_csv('https://bcdanl.github.io/data/ben-and-jerry-cleaned.csv')

Variable Description

Variable Name	Type	Description	Categories / Notes
`priceper1`	Numeric	The unit price for one product serving (in dollars).
`flavor_descr`	Categorical	Description of the ice cream flavor.
`size1_descr`	Categorical	The description of the package or serving size (in fluid ounces or a related measure).
`household_id`	Categorical/ID	Unique identifier for each household purchasing product(s).
`household_income`	Categorical/Numeric	The household income bracket (in dollars).	The number (e.g., 60,000) corresponds to predefined income categories. (for instance, “60,000” represents a particular income range from $60,000 to $70,000). Not a categorical variable per se, but can be grouped into categories if desired.
`household_size`	Categorical	The number of persons in the household.	Discrete numeric count (e.g., 1, 2, 3…). Not a categorical variable per se, but can be grouped into categories if desired.
`usecoup`	Categorical	Indicates whether a coupon was used in the purchase.	Boolean category: `True` (coupon used) or `False` (coupon not used).
`couponper1`	Numeric	The discount amount per unit applied when a coupon is used.	Continuous numeric value; often zero if no coupon is used, and a positive number when a discount is applied.
`region`	Categorical	Geographic region where the household is located.	Categories include “East”, “Central”, “West”, and “South”.
`married`	Categorical	Marital status of the household head (or the household overall status).	Boolean category: `True` (married) or `False` (not married).
`race`	Categorical	Race of the household head or primary respondent.	Categories include “white”, “black”, “asian”, and “other”.
`hispanic_origin`	Categorical	Indicates whether the household identifies as of Hispanic origin.	Boolean category: `True` (Hispanic origin) or `False` (not Hispanic).
`microwave`	Categorical	Whether the household owns a microwave.	Boolean category: `True` (owns microwave) or `False` (does not own one).
`dishwasher`	Categorical	Whether the household owns a dishwasher.	Boolean category: `True` (owns dishwasher) or `False` (does not own one).
`sfh`	Categorical	Whether the household resides in a single-family home.	Boolean category: `True` (single-family home) or `False` (does not live in a single-family home).
`internet`	Categorical	Whether the household has internet service.	Boolean category: `True` (has internet) or `False` (does not).
`tvcable`	Categorical	Indicates whether the household subscribes to cable television service.	Boolean category: `True` (has cable TV) or `False` (does not).

Write a blog post about Ben and Jerry’s ice cream in the ice_cream DataFrame using Jupyter Notebook, and add it to your online blog.
- In your blog post, provide your linear regression model, as well as descriptive statistics, counting, filtering, and various group operations.
- Optionally, provide seaborn visualizations.