library(tidyverse)
library(scales)
library(broom)
library(stargazer)
library(glmnet)
library(janitor)
library(rpart)
library(rpart.plot)
library(ranger)
library(vip)
library(pdp)
library(DT)
library(rmarkdown)
library(hrbrthemes)
library(ggthemes)
library(xgboost)
theme_set(theme_ipsum())
scale_colour_discrete <- function(...) scale_color_colorblind(...)
scale_fill_discrete <- function(...) scale_fill_colorblind(...)
Homework 4
Predicting Housing Price in California
Direction
- Please submit one Quarto Document of Homework 4 to Brightspace using the following file naming convention:
  - Example: danl-320-hw4-choe-byeonghak.qmd
- Due: April 20, 2026, 11:59 P.M. (ET)
- Please send Byeong-Hak an email (bchoe@geneseo.edu) if you have any questions.
Data
df <- read_csv('https://bcdanl.github.io/data/california_housing_cleaned.csv')
- The census tract-level housing data for California include:
- Latitude/longitude of tract centers
- Median home age
- Median income
- Average room/bedroom numbers
- Average occupancy
- Median home values
- The goal is to predict the log of median housing value for census tracts.
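Before any modeling, an optional quick look at the columns and their types can confirm the variables listed above; glimpse() comes with dplyr, which tidyverse already loads.
# Optional: inspect column names and types before modeling
glimpse(df)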
Question 1
- Divide the df DataFrame into training and test DataFrames.
  - Use dtrain and dtest for the training and test DataFrames, respectively.
  - 70% of observations in df are assigned to dtrain; the rest is assigned to dtest.
- Use set.seed(42120532).
set.seed(42120532)
df <- df |>
mutate(rnd = runif(n()))
dtrain <- df |>
filter(rnd >= .3)
dtest <- df |>
filter(rnd < .3)
nrow(dtrain)
[1] 14366
nrow(dtest)
[1] 6274
The training data contain roughly 70% of the observations (14,366), and the test data contain the remaining 30% (6,274). I use set.seed() so that the random split is reproducible.
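As an optional sanity check, the realized training share can be computed from the two row counts reported above; it should be close to the intended 70% (14,366 / 20,640 is roughly 0.696).
# Optional check: share of observations assigned to the training set
nrow(dtrain) / (nrow(dtrain) + nrow(dtest))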
Question 2
- Consider the following model:
\[ \begin{align} \log(\text{medianHouseValue})_{i} = &\beta_0 + \beta_1 \text{housingMedianAge}_{i} + \beta_2 \text{medianIncome}_{i}\\ &+ \beta_3 \text{AveBedrms}_{i} + \beta_4 \text{AveRooms}_{i} + \beta_5 \text{AveOccupancy}_{i}\\ &+ \beta_{6}\text{longitude}_{i} + \beta_{7}\text{latitude}_{i} + \epsilon_{i} \end{align} \]
- Provide a rationale behind the model why it includes \(\text{longitude}\) and \(\text{latitude}\) as predictors for \(\log(\text{medianHouseValue})\).
The model includes longitude and latitude because housing prices are strongly related to location. In California, census tracts near the coast, major cities, job centers, and high-amenity areas often have higher housing values than inland or less urban areas. Longitude and latitude provide a simple way to allow the model to account for broad spatial patterns in housing prices.
In other words, two census tracts with similar income, housing age, rooms, bedrooms, and occupancy may still have very different housing values because they are located in different parts of California. Including longitude and latitude helps reduce omitted-variable bias from location-related factors that are not directly measured in the dataset.
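A quick, optional way to see these spatial patterns before fitting the model is to plot the tract centers colored by log median house value. This is a sketch using the training data and ggplot2's built-in viridis color scale.
# Optional: spatial pattern of log median house value across California tracts
dtrain |>
  ggplot(aes(x = longitude, y = latitude, color = log(medianHouseValue))) +
  geom_point(size = 0.5, alpha = 0.3) +
  scale_color_viridis_c(name = "log(value)") +
  labs(title = "Log Median House Value by Tract Location")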
Question 3
- Fit the linear regression model in Question 2.
lm_fit <- lm(
log(medianHouseValue) ~ housingMedianAge + medianIncome +
AveBedrms + AveRooms + AveOccupancy + longitude + latitude,
data = dtrain
)
summary(lm_fit)
Call:
lm(formula = log(medianHouseValue) ~ housingMedianAge + medianIncome +
AveBedrms + AveRooms + AveOccupancy + longitude + latitude,
data = dtrain)
Residuals:
Min 1Q Median 3Q Max
-2.58537 -0.21369 -0.00414 0.20567 1.98626
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.351e+01 3.862e-01 -34.988 < 2e-16 ***
housingMedianAge 2.104e-03 2.489e-04 8.454 < 2e-16 ***
medianIncome 1.881e-01 2.460e-03 76.465 < 2e-16 ***
AveBedrms 2.996e-01 1.781e-02 16.821 < 2e-16 ***
AveRooms -3.932e-02 3.458e-03 -11.371 < 2e-16 ***
AveOccupancy -1.378e-03 2.387e-04 -5.770 8.07e-09 ***
longitude -2.935e-01 4.408e-03 -66.597 < 2e-16 ***
latitude -2.918e-01 4.203e-03 -69.428 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3544 on 14358 degrees of freedom
Multiple R-squared: 0.6131, Adjusted R-squared: 0.613
F-statistic: 3251 on 7 and 14358 DF, p-value: < 2.2e-16
broom::tidy(lm_fit) |>
mutate(across(where(is.numeric), ~ round(.x, 4))) |>
  paged_table()
Question 4
- Interpret \(\hat{\beta_{2}}\), \(\hat{\beta_{4}}\), and \(\hat{\beta_{5}}\), obtained from the linear regression model in Question 3.
lm_coef <- broom::tidy(lm_fit)
lm_coef |>
filter(term %in% c("medianIncome", "AveRooms", "AveOccupancy")) |>
mutate(
percent_effect = 100 * (exp(estimate) - 1),
across(where(is.numeric), ~ round(.x, 4))
) |>
  paged_table()
Because the dependent variable is \(\log(\text{medianHouseValue})\), the coefficients can be interpreted approximately as percentage changes in median housing value for a one-unit increase in the predictor, holding the other variables constant.
- \(\hat{\beta}_2\) on medianIncome: A one-unit increase in median income is associated with an approximate 20.7% increase in median housing value, holding other variables constant. Since medianIncome is measured in tens of thousands of dollars, this is a meaningful income-related price effect.
- \(\hat{\beta}_4\) on AveRooms: A one-unit increase in the average number of rooms per household is associated with an approximate 3.86% decrease in median housing value, holding other variables constant.
- \(\hat{\beta}_5\) on AveOccupancy: A one-unit increase in average occupancy is associated with an approximate 0.14% decrease in median housing value, holding other variables constant.
The exact percentage interpretation is \(100 \times [\exp(\hat\beta)-1]\) percent.
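For example, plugging the medianIncome coefficient from Question 3 into this formula gives
\[ 100 \times [\exp(0.1881) - 1] \approx 20.7\% \]
which matches the interpretation above.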
Questions 5-6
- Suppose the model allows for the relationship between the outcome and each of these predictors—\(\text{housingMedianAge}_{i}\), \(\text{medianIncome}_{i}\), \(\text{AveBedrms}_{i}\), \(\text{AveRooms}_{i}\), and \(\text{AveOccupancy}_{i}\)—to vary by location of Census tract.
\[ \begin{align} \log(\text{medianHouseValue})_{i} = &\beta_0 + \beta_1 \text{housingMedianAge}_{i} + \beta_2 \text{medianIncome}_{i}\\ &+ \beta_3 \text{AveBedrms}_{i} + \beta_4 \text{AveRooms}_{i} + \beta_5 \text{AveOccupancy}_{i}\\ &+ \beta_6 \text{longitude}_{i} + \beta_7 \text{latitude}_{i}\\ &+ (\beta_8 \text{housingMedianAge}_{i} + \beta_9 \text{medianIncome}_{i}\\ &\qquad+ \beta_{10} \text{AveBedrms}_{i} + \beta_{11} \text{AveRooms}_{i} + \beta_{12} \text{AveOccupancy}_{i} ) \times \text{longitude}_{i}\\ &+ (\beta_{13} \text{housingMedianAge}_{i} + \beta_{14} \text{medianIncome}_{i}\\ &\qquad+ \beta_{15} \text{AveBedrms}_{i} + \beta_{16} \text{AveRooms}_{i} + \beta_{17} \text{AveOccupancy}_{i} ) \times \text{latitude}_{i}\\ &+ (\beta_{18} \text{housingMedianAge}_{i} + \beta_{19} \text{medianIncome}_{i} \\ &\qquad+ \beta_{20} \text{AveBedrms}_{i} + \beta_{21} \text{AveRooms}_{i} + \beta_{22} \text{AveOccupancy}_{i} )\times \text{longitude}_{i} \times \text{latitude}_{i}\\ &+ \epsilon_{i} \end{align} \]
The following code creates the outcome vector and predictor matrices for the regularized regression models. The predictors are standardized internally by cv.glmnet (via standardize = TRUE) so that the Lasso and Ridge penalties treat variables on comparable scales.
y_train <- log(dtrain$medianHouseValue)
y_test <- log(dtest$medianHouseValue)
x_train <- model.matrix(
log(medianHouseValue) ~
(housingMedianAge + medianIncome + AveBedrms + AveRooms + AveOccupancy) *
longitude * latitude,
data = dtrain
)[, -1]
x_test <- model.matrix(
log(medianHouseValue) ~
(housingMedianAge + medianIncome + AveBedrms + AveRooms + AveOccupancy) *
longitude * latitude,
data = dtest
)[, -1]
Question 5
Fit a Lasso linear regression model.
set.seed(42120532)
lasso_fit <- cv.glmnet(
x = x_train,
y = y_train,
alpha = 1,
standardize = TRUE
)
plot(lasso_fit)
lasso_fit$lambda.min
[1] 0.0004182971
lasso_fit$lambda.1se
[1] 0.006211579
lasso_coefs <- coef(lasso_fit, s = "lambda.min")
lasso_coef_df <- tibble(
term = rownames(lasso_coefs),
estimate = as.numeric(lasso_coefs)
) |>
filter(estimate != 0,
!str_detect(term, "Intercept"))
lasso_coef_df |>
mutate(estimate = round(estimate, 4)) |>
arrange(-estimate) |>
  paged_table()
lasso_coef_df |>
mutate(estimate = round(estimate, 4)) |>
arrange(estimate) |>
  paged_table()
The Lasso model uses an \(L_1\) penalty, so some coefficients are shrunk exactly to zero. This is useful here because the interaction model contains many terms, and Lasso can select a smaller set of useful predictors.
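A quick, optional way to see how much selection the \(L_1\) penalty performs is to count the nonzero coefficients at the two candidate penalty values; this is a sketch using the fitted lasso_fit object from above.
# Optional: number of selected (nonzero) coefficients, excluding the intercept
sum(coef(lasso_fit, s = "lambda.min")[-1, ] != 0)
sum(coef(lasso_fit, s = "lambda.1se")[-1, ] != 0)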
Question 6
Fit a Ridge linear regression model.
set.seed(42120532)
ridge_fit <- cv.glmnet(
x = x_train,
y = y_train,
alpha = 0,
standardize = TRUE
)
plot(ridge_fit)
ridge_fit$lambda.min
[1] 0.03723744
ridge_fit$lambda.1se
[1] 0.04485262
ridge_coefs <- coef(ridge_fit, s = "lambda.min")
ridge_coef_df <- tibble(
term = rownames(ridge_coefs),
estimate = as.numeric(ridge_coefs)
) |>
filter(!str_detect(term, "Intercept"))
ridge_coef_df |>
mutate(estimate = round(estimate, 4)) |>
arrange(-estimate) |>
  paged_table()
ridge_coef_df |>
mutate(estimate = round(estimate, 4)) |>
arrange(estimate) |>
  paged_table()
The Ridge model uses an \(L_2\) penalty. Unlike Lasso, Ridge usually does not set coefficients exactly equal to zero. Instead, it shrinks all coefficients toward zero. This can improve prediction when many predictors and interaction terms are correlated.
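As an optional contrast with the Lasso results, the following sketch (using the ridge_fit and lasso_fit objects above) checks how many Ridge coefficients are exactly zero and compares the total coefficient magnitude under the two penalties.
# Optional: Ridge typically keeps (almost) all coefficients nonzero
sum(coef(ridge_fit, s = "lambda.min")[-1, ] == 0)
# Optional: total absolute coefficient size under each penalty
sum(abs(coef(ridge_fit, s = "lambda.min")[-1, ]))
sum(abs(coef(lasso_fit, s = "lambda.min")[-1, ]))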
Questions 7-10
Consider the tree-based model described below:
\[ \begin{align} \log(\text{medianHouseValue})_{i} = f(&\text{housingMedianAge}_{i}, \text{medianIncome}_{i},\\ &\text{AveBedrms}_{i}, \text{AveRooms}_{i}, \text{AveOccupancy}_{i},\\ &\text{latitude}_{i}, \text{longitude}_{i}) \end{align} \]
For tree-based models, I do not manually include interaction terms because trees can capture nonlinearities and interactions through recursive splitting.
tree_formula <- log(medianHouseValue) ~ housingMedianAge + medianIncome +
  AveBedrms + AveRooms + AveOccupancy + latitude + longitude
Question 7
Fit a regression tree model.
set.seed(42120532)
tree_fit <- rpart(
formula = tree_formula,
data = dtrain,
method = "anova",
control = rpart.control(cp = 0, xval = 10, minsplit = 500)
)
tree_fit
n= 14366
node), split, n, deviance, yval
* denotes terminal node
1) root 14366 4660.419000 12.08645
2) medianIncome< 3.65675 7667 1989.847000 11.79333
4) medianIncome< 2.4184 3075 775.950600 11.56558
8) AveRooms>=3.906236 2088 408.183700 11.42515
16) latitude>=34.445 1349 232.700900 11.31813
32) longitude>=-119.905 373 31.751650 11.04943 *
33) longitude< -119.905 976 163.726600 11.42082
66) latitude>=39.355 236 17.827210 11.20071 *
67) latitude< 39.355 740 130.819600 11.49101
134) longitude>=-121.695 473 69.516660 11.38292 *
135) longitude< -121.695 267 45.986270 11.68250 *
17) latitude< 34.445 739 131.825500 11.62052
34) longitude>=-117.355 348 54.115870 11.43897 *
35) longitude< -117.355 391 56.031500 11.78210 *
9) AveRooms< 3.906236 987 239.489600 11.86265
18) AveRooms>=3.290162 577 112.741500 11.74517
36) medianIncome< 1.829 234 44.714740 11.58079 *
37) medianIncome>=1.829 343 57.390170 11.85732 *
19) AveRooms< 3.290162 410 107.579400 12.02797 *
5) medianIncome>=2.4184 4592 947.581100 11.94584
10) AveOccupancy>=2.294861 3578 590.727100 11.85160
20) AveRooms>=4.72801 2183 361.389400 11.73954
40) latitude>=34.47 1396 224.161600 11.64660
80) longitude>=-120.095 356 27.677380 11.36138 *
81) longitude< -120.095 1040 157.609500 11.74424
162) latitude>=38.465 431 41.241040 11.56033 *
163) latitude< 38.465 609 91.475700 11.87439
326) longitude>=-121.375 297 31.106140 11.69850 *
327) longitude< -121.375 312 42.435640 12.04182 *
41) latitude< 34.47 787 103.783800 11.90439
82) longitude>=-117.765 424 38.118600 11.71553 *
83) longitude< -117.765 363 32.877240 12.12498 *
21) AveRooms< 4.72801 1395 159.024300 12.02696
42) AveOccupancy>=3.255958 630 48.280980 11.92965
84) AveBedrms< 1.015509 255 21.867330 11.86604 *
85) AveBedrms>=1.015509 375 24.680590 11.97290 *
43) AveOccupancy< 3.255958 765 99.863530 12.10710
86) medianIncome< 3.09895 464 67.356880 12.03789 *
87) medianIncome>=3.09895 301 26.856500 12.21381 *
11) AveOccupancy< 2.294861 1014 212.936700 12.27839
22) latitude>=37.905 167 18.712250 11.92323 *
23) latitude< 37.905 847 169.005100 12.34842
46) housingMedianAge< 21.5 236 44.968910 12.09087 *
47) housingMedianAge>=21.5 611 102.335200 12.44790
94) AveOccupancy>=2.042897 291 47.897320 12.32546 *
95) AveOccupancy< 2.042897 320 46.108400 12.55924 *
3) medianIncome>=3.65675 6699 1257.886000 12.42193
6) medianIncome< 5.7093 4789 743.776700 12.28167
12) AveOccupancy>=2.684096 2863 320.258200 12.15977
24) medianIncome< 4.54095 1516 167.695200 12.05407
48) housingMedianAge< 41.5 1349 135.535600 12.02490
96) latitude>=34.475 546 75.450900 11.94729
192) longitude>=-121.685 313 37.867350 11.80078 *
193) longitude< -121.685 233 21.840550 12.14410 *
97) latitude< 34.475 803 54.559710 12.07767
194) longitude>=-117.795 289 20.755880 11.91718 *
195) longitude< -117.795 514 22.174250 12.16791
390) AveOccupancy>=3.251254 246 8.207118 12.07371 *
391) AveOccupancy< 3.251254 268 9.780243 12.25438 *
49) housingMedianAge>=41.5 167 21.737890 12.28972 *
25) medianIncome>=4.54095 1347 116.563600 12.27873
50) housingMedianAge< 18.5 462 40.688760 12.17451 *
51) housingMedianAge>=18.5 885 68.236390 12.33314
102) AveOccupancy>=2.974856 518 34.209510 12.27787
204) medianIncome< 5.1951 321 21.957820 12.22987 *
205) medianIncome>=5.1951 197 10.306530 12.35610 *
103) AveOccupancy< 2.974856 367 30.211750 12.41114 *
13) AveOccupancy< 2.684096 1926 317.729100 12.46288
26) latitude>=37.935 282 31.085760 12.12320 *
27) latitude< 37.935 1644 248.524200 12.52115
54) housingMedianAge< 19.5 431 61.891610 12.29196 *
55) housingMedianAge>=19.5 1213 155.949000 12.60259
110) AveOccupancy>=2.207103 846 95.234920 12.50752
220) medianIncome< 4.61895 494 59.078440 12.42365 *
221) medianIncome>=4.61895 352 27.805470 12.62522 *
111) AveOccupancy< 2.207103 367 35.443100 12.82173 *
7) medianIncome>=5.7093 1910 183.702900 12.77359
14) medianIncome< 7.60715 1304 110.215100 12.66807
28) AveOccupancy>=2.655956 873 61.672220 12.58860
56) medianIncome< 6.51565 527 34.401160 12.51564
112) housingMedianAge< 25.5 348 18.779550 12.45863 *
113) housingMedianAge>=25.5 179 12.291150 12.62649 *
57) medianIncome>=6.51565 346 20.193290 12.69973 *
29) AveOccupancy< 2.655956 431 31.862200 12.82904 *
15) medianIncome>=7.60715 606 27.726530 13.00065
30) housingMedianAge< 18.5 181 9.822655 12.89256 *
31) housingMedianAge>=18.5 425 14.888910 13.04668 *
printcp(tree_fit)
Regression tree:
rpart(formula = tree_formula, data = dtrain, method = "anova",
control = rpart.control(cp = 0, xval = 10, minsplit = 500))
Variables actually used in tree construction:
[1] AveBedrms AveOccupancy AveRooms housingMedianAge
[5] latitude longitude medianIncome
Root node error: 4660.4/14366 = 0.32441
n= 14366
CP nsplit rel error xerror xstd
1 0.30312421 0 1.00000 1.00010 0.0104105
2 0.07089626 1 0.69688 0.70200 0.0080118
3 0.05714406 2 0.62598 0.62879 0.0077844
4 0.03088077 3 0.56884 0.57691 0.0074129
5 0.02752483 4 0.53795 0.55629 0.0073371
6 0.02269955 5 0.51043 0.52418 0.0071493
7 0.01508736 6 0.48773 0.50585 0.0070944
8 0.00981913 7 0.47264 0.49650 0.0070257
9 0.00936770 8 0.46282 0.46735 0.0069895
10 0.00817934 9 0.45346 0.45467 0.0068650
11 0.00798697 10 0.44528 0.44322 0.0067568
12 0.00775882 11 0.43729 0.44005 0.0067121
13 0.00772450 13 0.42177 0.44005 0.0067121
14 0.00703542 14 0.41405 0.43432 0.0066561
15 0.00658388 15 0.40701 0.42923 0.0066215
16 0.00542246 16 0.40043 0.42155 0.0065850
17 0.00541140 17 0.39501 0.41540 0.0065250
18 0.00534132 18 0.38959 0.41520 0.0065356
19 0.00465644 19 0.38425 0.41164 0.0065162
20 0.00465153 20 0.37960 0.40537 0.0064377
21 0.00411309 21 0.37495 0.40368 0.0064614
22 0.00384814 22 0.37083 0.39952 0.0064663
23 0.00357923 23 0.36698 0.39744 0.0064626
24 0.00326113 24 0.36340 0.39073 0.0064159
25 0.00233451 26 0.35688 0.38260 0.0063855
26 0.00232379 27 0.35455 0.37521 0.0063678
27 0.00228233 31 0.34525 0.37522 0.0063663
28 0.00179190 32 0.34297 0.37431 0.0063582
29 0.00178727 33 0.34118 0.37326 0.0063485
30 0.00163901 34 0.33939 0.37083 0.0063437
31 0.00151870 35 0.33775 0.36879 0.0063428
32 0.00121237 36 0.33623 0.36595 0.0063348
33 0.00089839 37 0.33502 0.36365 0.0063188
34 0.00081862 38 0.33412 0.36238 0.0063013
35 0.00071462 39 0.33330 0.36183 0.0063026
36 0.00064693 40 0.33259 0.36162 0.0063030
37 0.00041738 41 0.33194 0.36158 0.0063101
38 0.00037187 42 0.33153 0.36166 0.0063115
39 0.00000000 43 0.33115 0.36161 0.0063103
plotcp(tree_fit)
rpart.plot(
tree_fit,
type = 2,
extra = 101,
fallen.leaves = TRUE,
main = "Regression Tree for Log Median Housing Value"
)
Pruning
cp_table <- tree_fit$cptable |>
as_tibble()
min_row <- cp_table |>
filter(xerror == min(xerror)) |>
slice(1)
threshold <- min_row$xerror + min_row$xstd
best_cp_1se <- cp_table |>
filter(xerror <= threshold) |>
slice(1) |>
pull(CP)
tree_pruned <- prune(tree_fit, cp = best_cp_1se)
tree_pruned
n= 14366
node), split, n, deviance, yval
* denotes terminal node
1) root 14366 4660.41900 12.08645
2) medianIncome< 3.65675 7667 1989.84700 11.79333
4) medianIncome< 2.4184 3075 775.95060 11.56558
8) AveRooms>=3.906236 2088 408.18370 11.42515
16) latitude>=34.445 1349 232.70090 11.31813
32) longitude>=-119.905 373 31.75165 11.04943 *
33) longitude< -119.905 976 163.72660 11.42082
66) latitude>=39.355 236 17.82721 11.20071 *
67) latitude< 39.355 740 130.81960 11.49101
134) longitude>=-121.695 473 69.51666 11.38292 *
135) longitude< -121.695 267 45.98627 11.68250 *
17) latitude< 34.445 739 131.82550 11.62052
34) longitude>=-117.355 348 54.11587 11.43897 *
35) longitude< -117.355 391 56.03150 11.78210 *
9) AveRooms< 3.906236 987 239.48960 11.86265
18) AveRooms>=3.290162 577 112.74150 11.74517
36) medianIncome< 1.829 234 44.71474 11.58079 *
37) medianIncome>=1.829 343 57.39017 11.85732 *
19) AveRooms< 3.290162 410 107.57940 12.02797 *
5) medianIncome>=2.4184 4592 947.58110 11.94584
10) AveOccupancy>=2.294861 3578 590.72710 11.85160
20) AveRooms>=4.72801 2183 361.38940 11.73954
40) latitude>=34.47 1396 224.16160 11.64660
80) longitude>=-120.095 356 27.67738 11.36138 *
81) longitude< -120.095 1040 157.60950 11.74424
162) latitude>=38.465 431 41.24104 11.56033 *
163) latitude< 38.465 609 91.47570 11.87439
326) longitude>=-121.375 297 31.10614 11.69850 *
327) longitude< -121.375 312 42.43564 12.04182 *
41) latitude< 34.47 787 103.78380 11.90439
82) longitude>=-117.765 424 38.11860 11.71553 *
83) longitude< -117.765 363 32.87724 12.12498 *
21) AveRooms< 4.72801 1395 159.02430 12.02696
42) AveOccupancy>=3.255958 630 48.28098 11.92965 *
43) AveOccupancy< 3.255958 765 99.86353 12.10710 *
11) AveOccupancy< 2.294861 1014 212.93670 12.27839
22) latitude>=37.905 167 18.71225 11.92323 *
23) latitude< 37.905 847 169.00510 12.34842
46) housingMedianAge< 21.5 236 44.96891 12.09087 *
47) housingMedianAge>=21.5 611 102.33520 12.44790
94) AveOccupancy>=2.042897 291 47.89732 12.32546 *
95) AveOccupancy< 2.042897 320 46.10840 12.55924 *
3) medianIncome>=3.65675 6699 1257.88600 12.42193
6) medianIncome< 5.7093 4789 743.77670 12.28167
12) AveOccupancy>=2.684096 2863 320.25820 12.15977
24) medianIncome< 4.54095 1516 167.69520 12.05407
48) housingMedianAge< 41.5 1349 135.53560 12.02490
96) latitude>=34.475 546 75.45090 11.94729
192) longitude>=-121.685 313 37.86735 11.80078 *
193) longitude< -121.685 233 21.84055 12.14410 *
97) latitude< 34.475 803 54.55971 12.07767
194) longitude>=-117.795 289 20.75588 11.91718 *
195) longitude< -117.795 514 22.17425 12.16791 *
49) housingMedianAge>=41.5 167 21.73789 12.28972 *
25) medianIncome>=4.54095 1347 116.56360 12.27873
50) housingMedianAge< 18.5 462 40.68876 12.17451 *
51) housingMedianAge>=18.5 885 68.23639 12.33314 *
13) AveOccupancy< 2.684096 1926 317.72910 12.46288
26) latitude>=37.935 282 31.08576 12.12320 *
27) latitude< 37.935 1644 248.52420 12.52115
54) housingMedianAge< 19.5 431 61.89161 12.29196 *
55) housingMedianAge>=19.5 1213 155.94900 12.60259
110) AveOccupancy>=2.207103 846 95.23492 12.50752
220) medianIncome< 4.61895 494 59.07844 12.42365 *
221) medianIncome>=4.61895 352 27.80547 12.62522 *
111) AveOccupancy< 2.207103 367 35.44310 12.82173 *
7) medianIncome>=5.7093 1910 183.70290 12.77359
14) medianIncome< 7.60715 1304 110.21510 12.66807
28) AveOccupancy>=2.655956 873 61.67222 12.58860
56) medianIncome< 6.51565 527 34.40116 12.51564 *
57) medianIncome>=6.51565 346 20.19329 12.69973 *
29) AveOccupancy< 2.655956 431 31.86220 12.82904 *
15) medianIncome>=7.60715 606 27.72653 13.00065 *
rpart.plot(
tree_pruned,
type = 2,
extra = 101,
fallen.leaves = TRUE,
main = "Pruned Regression Tree for Log Median Housing Value"
)
The initial regression tree is grown flexibly by setting the complexity parameter (cp) to 0, which allows the tree to search for many possible splits in the training data. However, such a large tree may overfit the training sample.
To reduce overfitting, I use the cost-complexity pruning table from rpart. Following the one-standard-error rule, I choose the simplest tree whose cross-validated error (xerror) is within one standard error of the minimum. The pruned tree keeps only the splits that improve predictive performance enough to justify the added complexity.
The pruned regression tree divides census tracts into groups using predictor values such as median income, housing characteristics, and location. Each terminal node predicts the average log median housing value for observations in that node. Compared with the full tree, the pruned tree is simpler and easier to interpret while still capturing important nonlinear patterns in the data.
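As an optional size comparison, the number of terminal nodes in each tree can be read off the rpart frame component; this is a sketch using the tree_fit and tree_pruned objects above.
# Optional: count terminal (leaf) nodes before and after pruning
sum(tree_fit$frame$var == "<leaf>")
sum(tree_pruned$frame$var == "<leaf>")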
Question 8
Fit a random forest model with num.trees = 500.
set.seed(42120532)
rf_fit <- ranger(
formula = tree_formula,
data = dtrain,
num.trees = 500,
importance = "permutation"
)
rf_fit
Ranger result
Call:
ranger(formula = tree_formula, data = dtrain, num.trees = 500, importance = "permutation")
Type: Regression
Number of trees: 500
Sample size: 14366
Number of independent variables: 7
Mtry: 2
Target node size: 5
Variable importance mode: permutation
Splitrule: variance
OOB prediction error (MSE): 0.05493613
R squared (OOB): 0.8306681
vip(rf_fit, num_features = 7) +
labs(title = "Variable Importance from Random Forest")
The random forest model averages predictions across many regression trees, which usually improves predictive accuracy relative to a single tree by reducing variance.
The variable importance plot shows that latitude, median income, and longitude are the most important predictors. This suggests that both location and income level play major roles in predicting median housing value. Other housing characteristics, such as average rooms, average occupancy, housing median age, and average bedrooms, are less important in this model.
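The permutation importance scores behind the plot can also be printed directly; this optional sketch uses ranger's importance() accessor on the fitted object.
# Optional: permutation importance values, largest first
sort(ranger::importance(rf_fit), decreasing = TRUE)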
partial(rf_fit, pred.var = "medianIncome", train = dtrain) |>
autoplot() +
labs(title = "Partial Dependence of Housing Price on medianIncome")
Question 9
Fit a gradient-boosted tree model using the following hyperparameter grid:
xgb_grid <- tidyr::crossing(
nrounds = seq(20, 200, by = 20),
eta = c(0.025, 0.05, 0.1, 0.3),
max_depth = c(2, 3),
gamma = c(0),
colsample_bytree = c(1),
min_child_weight = c(1),
subsample = c(1)
)
xgb_grid |>
  paged_table()
set.seed(42120532)
x_train_mat <- dtrain |>
select(housingMedianAge, medianIncome, AveBedrms, AveRooms, AveOccupancy,
latitude, longitude) |>
as.matrix()
x_test_mat <- dtest |>
select(housingMedianAge, medianIncome, AveBedrms, AveRooms, AveOccupancy,
latitude, longitude) |>
as.matrix()
dMat_train <- xgb.DMatrix(data = x_train_mat, label = y_train)
xgb_results_lst <- vector("list", nrow(xgb_grid))
for (i in seq_len(nrow(xgb_grid))) {
nrounds <- xgb_grid$nrounds[i]
eta <- xgb_grid$eta[i]
max_depth <- xgb_grid$max_depth[i]
gamma <- xgb_grid$gamma[i]
colsample_bytree <- xgb_grid$colsample_bytree[i]
min_child_weight <- xgb_grid$min_child_weight[i]
subsample <- xgb_grid$subsample[i]
cv_fit <- xgb.cv(
data = dMat_train,
nrounds = nrounds,
nfold = 5,
objective = "reg:squarederror",
eta = eta,
max_depth = max_depth,
gamma = gamma,
colsample_bytree = colsample_bytree,
min_child_weight = min_child_weight,
subsample = subsample,
verbose = 0
)
xgb_results_lst[[i]] <- tibble(
nrounds = nrounds,
eta = eta,
max_depth = max_depth,
gamma = gamma,
colsample_bytree = colsample_bytree,
min_child_weight = min_child_weight,
subsample = subsample,
cv_rmse = min(cv_fit$evaluation_log$test_rmse_mean)
)
}
xgb_results <- bind_rows(xgb_results_lst) |>
arrange(cv_rmse)
xgb_results |>
  paged_table()
best_xgb <- xgb_results |>
slice_min(cv_rmse, n = 1)
best_xgb |>
  paged_table()
set.seed(42120532)
xgb_fit_final <- xgboost(
data = x_train_mat,
label = y_train,
objective = "reg:squarederror",
nrounds = best_xgb$nrounds,
eta = best_xgb$eta,
max_depth = best_xgb$max_depth,
gamma = best_xgb$gamma,
colsample_bytree = best_xgb$colsample_bytree,
min_child_weight = best_xgb$min_child_weight,
subsample = best_xgb$subsample,
verbose = 0
)
xgb_fit_final
XGBoost model object
Call:
xgboost(objective = "reg:squarederror", nrounds = best_xgb$nrounds,
max_depth = best_xgb$max_depth, min_child_weight = best_xgb$min_child_weight,
subsample = best_xgb$subsample, colsample_bytree = best_xgb$colsample_bytree,
data = x_train_mat, label = y_train, eta = best_xgb$eta,
gamma = best_xgb$gamma, verbose = 0)
Objective: reg:squarederror
Number of iterations: 200
Number of features: 7
The gradient-boosted tree model builds trees sequentially. Each new tree attempts to improve on the errors of the previous trees. The cross-validation process chooses the combination of tuning parameters with the best validation performance.
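An alternative to looping over nrounds values is to let xgb.cv stop adding trees once the cross-validated RMSE stops improving. The sketch below is optional and uses illustrative hyperparameters (eta = 0.1 and max_depth = 3, both values from the grid) together with the dMat_train object created above.
# Optional sketch: pick the number of boosting rounds via early stopping
set.seed(42120532)
cv_early <- xgb.cv(
  data = dMat_train,
  nrounds = 500,                  # generous upper bound on boosting rounds
  nfold = 5,
  objective = "reg:squarederror",
  eta = 0.1,
  max_depth = 3,
  early_stopping_rounds = 20,     # stop if no improvement for 20 rounds
  verbose = 0
)
cv_early$best_iteration           # round count chosen by early stopping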
vip(xgb_fit_final) +
labs(title = "XGBoost Variable Importance")
partial(
xgb_fit_final,
pred.var = "medianIncome",
train = x_train_mat,
plot = TRUE,
plot.engine = "ggplot2",
type = "regression"
) +
labs(title = "XGBoost Partial Dependence of Housing Price on medianIncome")
Question 10
Compare the prediction performance across the models:
- Linear regression
- Lasso regression
- Ridge regression
- Regression tree
- Random forest
- Gradient-boosted tree
pred_lm_test <- predict(lm_fit, newdata = dtest)
pred_lasso_test <- as.numeric(predict(lasso_fit, newx = x_test, s = "lambda.min"))
pred_ridge_test <- as.numeric(predict(ridge_fit, newx = x_test, s = "lambda.min"))
pred_tree_test <- predict(tree_fit, newdata = dtest)
pred_rf_test <- predict(rf_fit, data = dtest)$predictions
pred_xgb_test <- predict(xgb_fit_final, newdata = x_test_mat)  # xgboost's predict() expects a numeric matrix, not a data frame
model_performance <- tibble(
model = c(
"Linear regression",
"Lasso regression",
"Ridge regression",
"Regression tree",
"Random forest",
"Gradient-boosted tree"
),
rmse = c(
sqrt(mean((y_test - pred_lm_test)^2)),
sqrt(mean((y_test - pred_lasso_test)^2)),
sqrt(mean((y_test - pred_ridge_test)^2)),
sqrt(mean((y_test - pred_tree_test)^2)),
sqrt(mean((y_test - pred_rf_test)^2)),
sqrt(mean((y_test - pred_xgb_test)^2))
),
mae = c(
mean(abs(y_test - pred_lm_test)),
mean(abs(y_test - pred_lasso_test)),
mean(abs(y_test - pred_ridge_test)),
mean(abs(y_test - pred_tree_test)),
mean(abs(y_test - pred_rf_test)),
mean(abs(y_test - pred_xgb_test))
)
) |>
arrange(rmse)
model_performance |>
mutate(
rmse = round(rmse, 4),
mae = round(mae, 4)
) |>
  paged_table()
model_performance |>
ggplot(aes(x = reorder(model, rmse), y = rmse)) +
geom_col() +
coord_flip() +
labs(
title = "Test RMSE by Model",
x = NULL,
y = "Test RMSE"
)
resid_df <- tibble(
actual = y_test,
`Linear regression` = pred_lm_test,
`Lasso regression` = pred_lasso_test,
`Ridge regression` = pred_ridge_test,
`Regression tree` = pred_tree_test,
`Random forest` = pred_rf_test,
`Gradient-boosted tree` = pred_xgb_test
) |>
pivot_longer(
cols = -actual,
names_to = "model",
values_to = "prediction"
) |>
mutate(abs_error = abs(actual - prediction))
resid_df |>
ggplot(aes(x = model, y = abs_error)) +
geom_boxplot() +
coord_flip() +
labs(
title = "Distribution of Absolute Prediction Errors on Test Data",
x = NULL,
y = "Absolute error"
)
The best model is the gradient-boosted tree, as it achieves the smallest test RMSE (0.2240) and also the lowest MAE. The random forest performs very similarly with an RMSE of 0.2287, indicating that tree-based ensemble methods clearly outperform the other models in this task.
The regression tree improves over the linear models but still has noticeably higher error, with an RMSE of about 0.3117. This suggests that a single tree is not flexible enough compared with ensemble methods.
The linear models (linear regression, Lasso regression, and Ridge regression) perform the worst, with RMSE values around 0.35-0.36. This suggests that the relationship between housing prices and predictors such as income, housing characteristics, and location is likely nonlinear and involves interactions that linear models cannot capture well, even with regularization.
Overall, the results show that ensemble tree methods, especially gradient boosting, provide the best predictive performance because they can effectively model nonlinear patterns and complex interactions in the data.
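To put these differences in perspective, an optional follow-up computation expresses each model's test RMSE relative to the best-performing model, using the model_performance table built above.
# Optional: test RMSE relative to the best (smallest) RMSE
model_performance |>
  mutate(rmse_relative_to_best = round(rmse / min(rmse), 3))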
library(ggmap)
library(ggthemes)
google_api <- "YOUR_GOOGLE_MAP_API_KEY"
register_google(google_api)
ca_map <- get_map(location = "California", zoom = 6)
CA.test.map.agg <- dtest |>
mutate(
Actual = exp(y_test),
`Linear regression` = exp(pred_lm_test),
`Lasso regression` = exp(pred_lasso_test),
`Ridge regression` = exp(pred_ridge_test),
`Regression tree` = exp(pred_tree_test),
`Random forest` = exp(pred_rf_test),
`Gradient-boosted tree` = exp(pred_xgb_test)
) |>
select(
longitude,
latitude,
Actual,
`Linear regression`,
`Lasso regression`,
`Ridge regression`,
`Regression tree`,
`Random forest`,
`Gradient-boosted tree`
) |>
group_by(longitude, latitude) |>
summarise(
across(where(is.numeric), \(x) mean(x, na.rm = TRUE)),
.groups = "drop"
)
CA_error_vis <- dtest |>
mutate(
actual = exp(y_test),
`Linear regression` = exp(pred_lm_test),
`Lasso regression` = exp(pred_lasso_test),
`Ridge regression` = exp(pred_ridge_test),
`Regression tree` = exp(pred_tree_test),
`Random forest` = exp(pred_rf_test),
`Gradient-boosted tree` = exp(pred_xgb_test)
) |>
pivot_longer(
cols = c(
`Linear regression`,
`Lasso regression`,
`Ridge regression`,
`Regression tree`,
`Random forest`,
`Gradient-boosted tree`
),
names_to = "model",
values_to = "prediction"
) |>
mutate(
error = prediction - actual,
model = factor(model,
levels = c(
"Linear regression",
"Lasso regression",
"Ridge regression",
"Regression tree",
"Random forest",
"Gradient-boosted tree"
),
labels = c(
"Linear regression",
"Lasso regression",
"Ridge regression",
"Regression tree",
"Random forest",
"Gradient-boosted tree"
))
)
error_limit <- quantile(abs(CA_error_vis$error), 0.98, na.rm = TRUE)
ggmap(ca_map) +
geom_point(
data = CA_error_vis,
aes(x = longitude, y = latitude, color = error),
size = 0.55,
alpha = 0.55
) +
scale_color_gradient2(
name = "Prediction\nerror ($)",
low = "#005AB5", # underprediction: strong blue
mid = "#F7F7F7", # near-perfect prediction: light/white
high = "#DC3220", # overprediction: strong red
midpoint = 0,
limits = c(-error_limit, error_limit),
oob = squish,
trans = pseudo_log_trans(sigma = 10000),
breaks = c(
-200000, -100000, -50000, -25000,
0,
25000, 50000, 100000, 200000
),
labels = function(x) format(round(x), big.mark = ",", scientific = FALSE),
na.value = "grey50"
) +
facet_wrap(~ model, ncol = 2) +
labs(
title = "Prediction Errors in California Housing Values",
subtitle = "Blue = underprediction; white = close prediction; red = overprediction",
x = NULL,
y = NULL
) +
theme_map() +
theme(
legend.position = "right",
strip.text = element_text(face = "bold")
) +
guides(
color = guide_colorbar(
direction = "vertical",
barheight = 68,
title.position = "top"
)
)
The error maps show where each model tends to make larger prediction errors. In the maps, red points indicate overprediction, blue points indicate underprediction, and white or light-colored points indicate predictions close to the actual value.
The linear regression, lasso regression, and ridge regression models show very similar spatial error patterns. They tend to underpredict high-value coastal areas, especially around the Bay Area, Los Angeles, Orange County, and parts of the San Diego area. They also show some overprediction in inland areas, including parts of the Central Valley. This suggests that linear models have difficulty capturing strong nonlinear location effects in California housing prices.
The regression tree reduces some of the linear-model error patterns but still produces noticeable regional errors. Because a single tree makes predictions using a limited number of split-based regions, it may be too coarse to fully capture smooth variation in housing prices across space.
The random forest and gradient-boosted tree maps show less extreme and less systematic error patterns. They still make some errors in expensive coastal markets, but the errors are more balanced across regions. This supports the test RMSE results, where the ensemble tree models perform better than the single tree and the linear models.
Overall, the map suggests that location is one of the most important drivers of prediction error. The best models are the ones that better capture nonlinear spatial patterns and interactions between location, income, and housing characteristics.