Homework 2

Linear Regression; Jupyter Notebook Blogging

Author

Byeong-Hak Choe

Published

March 13, 2025

Modified

March 13, 2025

Direction

  • Please submit your Jupyter Notebook for Part 1 in Homework 2 to Brightspace with the name below:

    • danl-320-hw2-LASTNAME-FIRSTNAME.ipynb
      ( e.g., danl-320-hw2-choe-byeonghak.ipynb )
  • The due is March 24, 2025, 3:15 P.M.

    • It is recommended to finish it before the Spring Break begins.
  • Please send Byeong-Hak an email (bchoe@geneseo.edu) if you have any questions.




Part 1. Linear Regression

Consider the beer_markets

beer_markets = pd.read_csv(
  'https://bcdanl.github.io/data/beer_markets_all_cleaned.csv'
)


Variable Description

Variable Name Description
household Unique identifier for household
X_purchase_desc Description of beer purchase
quantity Number of beer packages purchased
brand Brand of beer purchased
dollar_spent Total amount spent on the purchase
beer_floz Total volume of beer purchased (in fluid ounces)
price_floz Price per fluid ounce of beer
container Type of beer container (e.g., CAN, BOTTLE)
promo Indicates if the purchase was part of a promotion (True/False)
region Geographic region of purchase
marital Marital status of household head
income Income level of the household
age Age group of household head
employment Employment status of household head
degree Education level of household head
occupation Occupation category of household head
ethnic Ethnicity of household head
microwave Indicates if the household owns a microwave (True/False)
dishwasher Indicates if the household owns a dishwasher (True/False)
tvcable Type of television subscription (e.g., basic, premium)
singlefamilyhome Indicates if the household is a single-family home (True/False)
npeople Number of people in the household




Question 1

Create the DataFrame that keeps all the observations whose value of container is either ‘CAN’ or ‘NON REFILLABLE BOTTLE’ in the beer_markets DataFrame.


Question 2

Split the resulting DataFrame of Question 1 into training and test DataFrames such that approximately 67% of observations in the resulting DataFrame of Question 1 belong to the training DataFrame and the rest observations belong to the test DataFrame.


Questions 3-8

Consider the following three linear regression models:

Model 1

\[ \begin{aligned} \log(\text{price\_per\_floz}) = &\ \beta_{0} + \sum_{i=1}^{N} \beta_{i} \,\text{market}_{i} + \sum_{j=N+1}^{N+4} \beta_{j} \,\text{brand}_{j} + \beta_{N+5} \,\text{container\_CAN} \\ &\,+\, \beta_{N+6} \log(\text{beer\_floz}) + \epsilon \end{aligned} \]

Model 2

\[ \begin{aligned} \log(\text{price\_per\_floz}) \,=\, & \beta_{0} \,+\, \sum_{i=1}^{N}\beta_{i}\,\text{market}_{i} \,+\, \sum_{j=N+1}^{N+4}\beta_{j}\,\text{brand}_{j} \,+\, \beta_{N+5}\,\text{container\_CAN} \\ &\,+\, \beta_{N+6}\log(\text{beer\_floz})\\ &\,+\, \sum_{j=N+1}^{N+4}\beta_{j\times\text{beer\_floz}}\,\text{brand}_{j}\times \log(\text{beer\_floz})\\ &\,+\, \epsilon \end{aligned} \]

Model 3

\[ \begin{aligned} \log(\text{price\_per\_floz}) \,=\, & \beta_{0} \,+\, \sum_{i=1}^{N}\beta_{i}\,\text{market}_{i} \,+\, \sum_{j=N+1}^{N+4}\beta_{j}\,\text{brand}_{j} \,+\, \beta_{N+5}\,\text{container\_CAN} \\ &\,+\, \beta_{N+6}\log(\text{beer\_floz})\\ &\,+\, \beta_{N+7}\,\text{promo} \times\log(\text{beer\_floz}) \\ &\,+\, \sum_{j=N+1}^{N+4}\beta_{j\times\text{beer\_floz}}\,\text{brand}_{j}\times \log(\text{beer\_floz})\\ &\,+\, \sum_{j=N+1}^{N+4}\beta_{j\times\text{promo}}\,\text{brand}_{j}\times \text{promo}\\ &\,+\, \sum_{j=N+1}^{N+4}\beta_{j\times\text{promo}\times\text{beer\_floz}}\,\text{brand}_{j}\times \text{promo}\times \log(\text{beer\_floz})\\ &\,+\, \epsilon \end{aligned} \]

  • Set “BUD_LIGHT” as the reference level for the \(\text{brand}\) variable.
  • Set “BUFFALO-ROCHESTER” as the reference level for the \(\text{market}\) variable..
    • There are \(N+1\) distinct categories in the \(\text{market}\) variable.



Question 3

Provide intuition behind each of the model.



Question 4

  • Train each of the three linear regression models using the training DataFrame from Question 2.
    • Provide the summary of the result for each linear regression model.
    • What are the predicted beer prices for unseen data from each model?



Question 5

  • Interpret the beta estimates of the following variables from the Model 3:
    1. market_ALBANY
    2. market_EXURBAN_NY
    3. market_RURAL_NEW_YORK
    4. market_SURBURBAN_NY
    5. market_SYRACUSE
    6. market_URBAN_NY



Question 6

  • Across the three models, how is the percentage change in the price of beer sensitive to the percentage change in the volume of beer purchases for each brand?

  • How does promo affect such sensitivity in the Model 3?



Question 7

  • Draw a residual plot from each of the three models.
    • On average, are the prediction correct? Are there systematic errors?



Question 8

Which model do you prefer most and why?



Part 2. Jupyter Notebook Blogging

Use the following data.frame for Jupyter Notebook Blogging:



ice_cream <- read_csv('https://bcdanl.github.io/data/ben-and-jerry-cleaned.csv')


Variable Description

Variable Name Type Description Categories / Notes
priceper1 Numeric The unit price for one product serving (in dollars).
flavor_descr Categorical Description of the ice cream flavor.
size1_descr Categorical The description of the package or serving size (in fluid ounces or a related measure).
household_id Categorical/ID Unique identifier for each household purchasing product(s).
household_income Categorical/Numeric The household income bracket (in dollars). The number (e.g., 60,000) corresponds to predefined income categories. (for instance, “60,000” represents a particular income range from $60,000 to $70,000). Not a categorical variable per se, but can be grouped into categories if desired.
household_size Categorical The number of persons in the household. Discrete numeric count (e.g., 1, 2, 3…). Not a categorical variable per se, but can be grouped into categories if desired.
usecoup Categorical Indicates whether a coupon was used in the purchase. Boolean category: True (coupon used) or False (coupon not used).
couponper1 Numeric The discount amount per unit applied when a coupon is used. Continuous numeric value; often zero if no coupon is used, and a positive number when a discount is applied.
region Categorical Geographic region where the household is located. Categories include “East”, “Central”, “West”, and “South”.
married Categorical Marital status of the household head (or the household overall status). Boolean category: True (married) or False (not married).
race Categorical Race of the household head or primary respondent. Categories include “white”, “black”, “asian”, and “other”.
hispanic_origin Categorical Indicates whether the household identifies as of Hispanic origin. Boolean category: True (Hispanic origin) or False (not Hispanic).
microwave Categorical Whether the household owns a microwave. Boolean category: True (owns microwave) or False (does not own one).
dishwasher Categorical Whether the household owns a dishwasher. Boolean category: True (owns dishwasher) or False (does not own one).
sfh Categorical Whether the household resides in a single-family home. Boolean category: True (single-family home) or False (does not live in a single-family home).
internet Categorical Whether the household has internet service. Boolean category: True (has internet) or False (does not).
tvcable Categorical Indicates whether the household subscribes to cable television service. Boolean category: True (has cable TV) or False (does not).


  • Write a blog post about Ben and Jerry’s ice cream in the ice_cream DataFrame using Jupyter Notebook, and add it to your online blog.
    • In your blog post, provide your linear regression model, as well as descriptive statistics, counting, filtering, and various group operations.
    • Optionally, provide seaborn visualizations.


Back to top