library(tidyverse)
library(broom)
library(stargazer)
library(glmnet)
library(gamlr)
library(Matrix)
library(janitor)
library(rpart)
library(rpart.plot)
library(ranger)
# library(xgboost)
library(vip)
library(pdp)
library(DT)
library(rmarkdown)
library(hrbrthemes)
library(ggthemes)
theme_set(
  theme_ipsum()
)
scale_colour_discrete <- function(...) scale_color_colorblind(...)
scale_fill_discrete <- function(...) scale_fill_colorblind(...)
Homework 4
Predicting Housing Price in California
Direction
Please submit one Quarto Document of Homework 4 to Brightspace using the following file naming convention:
Example:
danl-320-hw4-choe-byeonghak.qmd
Due: April 20, 2026, 11:59 P.M. (ET)
Please send Byeong-Hak an email (bchoe@geneseo.edu) if you have any questions.
Data
df <- read_csv('https://bcdanl.github.io/data/california_housing_cleaned.csv')
- The housing data at the census-tract level in California include:
- Latitude/Longitude of tract centers
- Median home age
- Median Income
- Average room/bedroom numbers
- Average occupancy
- Median home values
- The goal is to predict the log of median housing value for census tracts.
Question 1
- Divide the df DataFrame into training and test DataFrames.
  - Use dtrain and dtest for training and test DataFrames, respectively.
  - 70% of observations in df are assigned to dtrain; the rest are assigned to dtest.
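One way to produce a 70/30 split is sketched below. It is only an illustration: the seed value and the row_id helper column are hypothetical choices that do not come from the assignment, and other splitting approaches (for example, a uniform random draw per row) would work equally well.

```r
library(tidyverse)

df <- read_csv('https://bcdanl.github.io/data/california_housing_cleaned.csv')

# Hypothetical seed, used only so the split is reproducible
set.seed(1234)

# row_id is a helper column (an assumption, not part of the data)
# that lets us recover the rows not sampled into dtrain
df <- df |> mutate(row_id = row_number())

dtrain <- df |> slice_sample(prop = 0.7)          # about 70% of rows
dtest  <- df |> anti_join(dtrain, by = "row_id")  # the remaining rows

# Sanity check: the two pieces together cover df exactly once
nrow(dtrain) + nrow(dtest) == nrow(df)
```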
Question 2
- Consider the following model:
\[ \begin{align} \log(\text{medianHouseValue})_{i} = &\beta_0 + \beta_1 \text{housingMedianAge}_{i} + \beta_2 \text{medianIncome}_{i}\\ &+ \beta_3 \text{AveBedrms}_{i} + \beta_4 \text{AveRooms}_{i} + \beta_5 \text{AveOccupancy}_{i}\\ &+ \beta_{6}\text{longitude}_{i} + \beta_{7}\text{latitude}_{i} + \epsilon_{i} \end{align} \]
- Provide a rationale for why the model includes \(\text{longitude}\) and \(\text{latitude}\) as predictors for \(\log(\text{medianHouseValue})\).
Question 3
- Fit the linear regression model in Question 2.
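A minimal sketch of fitting the Question 2 model with lm() follows. It assumes the column names in the CSV match the variable names in the equation and that medianHouseValue is stored in levels rather than logs; if the file already contains a logged column, the log() call should be dropped. The seed and the row_id helper are hypothetical.

```r
library(tidyverse)
library(broom)

df <- read_csv('https://bcdanl.github.io/data/california_housing_cleaned.csv')

# Hypothetical reproducible 70/30 split (seed and row_id are assumptions)
set.seed(1234)
df <- df |> mutate(row_id = row_number())
dtrain <- df |> slice_sample(prop = 0.7)
dtest  <- df |> anti_join(dtrain, by = "row_id")

# Column names are assumed to match the equation in Question 2
model_lm <- lm(
  log(medianHouseValue) ~ housingMedianAge + medianIncome +
    AveBedrms + AveRooms + AveOccupancy + longitude + latitude,
  data = dtrain
)

tidy(model_lm)    # coefficient table: estimate, std.error, statistic, p.value
glance(model_lm)  # model-level summaries such as R-squared
```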
Question 4
- Interpret \(\hat{\beta_{2}}\), \(\hat{\beta_{4}}\), and \(\hat{\beta_{5}}\), obtained from the linear regression model in Question 3.
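As a general reminder for models with a logged outcome (not a statement about the estimates in this homework): holding the other predictors fixed, a one-unit increase in predictor \(x_k\) multiplies the predicted outcome by \(e^{\hat{\beta}_k}\), which is approximately a \(100 \times \hat{\beta}_k\) percent change when \(\hat{\beta}_k\) is small.

\[ \begin{align} \frac{\widehat{y}(x_k + 1)}{\widehat{y}(x_k)} &= e^{\hat{\beta}_k}\\ \%\Delta \widehat{y} &= 100 \times (e^{\hat{\beta}_k} - 1) \approx 100 \times \hat{\beta}_k \end{align} \]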
Questions 5-6
- Suppose the model allows the relationship between the outcome and each of these predictors (\(\text{housingMedianAge}_{i}\), \(\text{medianIncome}_{i}\), \(\text{AveBedrms}_{i}\), \(\text{AveRooms}_{i}\), and \(\text{AveOccupancy}_{i}\)) to vary by the location of the census tract:
\[ \begin{align} \log(\text{medianHouseValue})_{i} = &\beta_0 + \beta_1 \text{housingMedianAge}_{i} + \beta_2 \text{medianIncome}_{i}\\ &+ \beta_3 \text{AveBedrms}_{i} + \beta_4 \text{AveRooms}_{i} + \beta_5 \text{AveOccupancy}_{i}\\ &+ \beta_6 \text{longitude}_{i} + \beta_7 \text{latitude}_{i}\\ &+ (\beta_8 \text{housingMedianAge}_{i} + \beta_9 \text{medianIncome}_{i}\\ &\qquad+ \beta_{10} \text{AveBedrms}_{i} + \beta_{11} \text{AveRooms}_{i} + \beta_{12} \text{AveOccupancy}_{i} ) \times \text{longitude}_{i}\\ &+ (\beta_{13} \text{housingMedianAge}_{i} + \beta_{14} \text{medianIncome}_{i}\\ &\qquad+ \beta_{15} \text{AveBedrms}_{i} + \beta_{16} \text{AveRooms}_{i} + \beta_{17} \text{AveOccupancy}_{i} ) \times \text{latitude}_{i}\\ &+ (\beta_{18} \text{housingMedianAge}_{i} + \beta_{19} \text{medianIncome}_{i} \\ &\qquad+ \beta_{20} \text{AveBedrms}_{i} + \beta_{21} \text{AveRooms}_{i} + \beta_{22} \text{AveOccupancy}_{i} )\times \text{longitude}_{i} \times \text{latitude}_{i}\\ &+ \epsilon_{i} \end{align} \]
Question 5
Fit a Lasso linear regression model.
Question 6
Fit a Ridge linear regression model.
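The penalized fits in Questions 5 and 6 can be sketched with cv.glmnet(), where alpha = 1 gives Lasso and alpha = 0 gives Ridge. This is one possible approach, not necessarily the intended one (gamlr is also loaded above). The formula removes the longitude:latitude term so that the design matrix has the 22 columns of the interaction model; the column names, the seed, and the row_id helper are assumptions.

```r
library(tidyverse)
library(glmnet)

df <- read_csv('https://bcdanl.github.io/data/california_housing_cleaned.csv')

# Hypothetical reproducible 70/30 split (seed and row_id are assumptions)
set.seed(1234)
df <- df |> mutate(row_id = row_number())
dtrain <- df |> slice_sample(prop = 0.7)
dtest  <- df |> anti_join(dtrain, by = "row_id")

# Interaction model: 5 predictors + longitude + latitude
# + 5 x longitude + 5 x latitude + 5 x longitude x latitude = 22 terms
f <- log(medianHouseValue) ~
  (housingMedianAge + medianIncome + AveBedrms + AveRooms + AveOccupancy) *
  longitude * latitude - longitude:latitude

# glmnet needs a numeric matrix; drop the intercept column it adds itself
x_train <- model.matrix(f, data = dtrain)[, -1]
x_test  <- model.matrix(f, data = dtest)[, -1]
y_train <- log(dtrain$medianHouseValue)

cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1)  # Lasso
cv_ridge <- cv.glmnet(x_train, y_train, alpha = 0)  # Ridge

coef(cv_lasso, s = "lambda.min")  # Lasso may set some coefficients to zero
pred_lasso <- predict(cv_lasso, newx = x_test, s = "lambda.min")
pred_ridge <- predict(cv_ridge, newx = x_test, s = "lambda.min")
```

Using lambda.min picks the penalty with the lowest cross-validated error; lambda.1se is a common, more conservative alternative.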
Questions 7-10
Consider the tree-based model described below:
\[ \begin{align} \log(\text{medianHouseValue})_{i} = f(&\text{housingMedianAge}_{i}, \text{medianIncome}_{i},\\ &\text{AveBedrms}_{i}, \text{AveRooms}_{i}, \text{AveOccupancy}_{i},\\ &\text{latitude}_{i}, \text{longitude}_{i}) \end{align} \]
Question 7
Fit a regression tree model.
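A regression tree for the model above might be fit with rpart() as in the sketch below. The default control settings are used here purely for illustration; the assignment does not specify them, and the column names, seed, and row_id helper are assumptions.

```r
library(tidyverse)
library(rpart)
library(rpart.plot)

df <- read_csv('https://bcdanl.github.io/data/california_housing_cleaned.csv')

# Hypothetical reproducible 70/30 split (seed and row_id are assumptions)
set.seed(1234)
df <- df |> mutate(row_id = row_number())
dtrain <- df |> slice_sample(prop = 0.7)
dtest  <- df |> anti_join(dtrain, by = "row_id")

# method = "anova" requests a regression (not classification) tree
model_tree <- rpart(
  log(medianHouseValue) ~ housingMedianAge + medianIncome +
    AveBedrms + AveRooms + AveOccupancy + latitude + longitude,
  data = dtrain,
  method = "anova"
)

rpart.plot(model_tree)  # draw the fitted tree
printcp(model_tree)     # cross-validated error by complexity parameter

pred_tree <- predict(model_tree, newdata = dtest)
```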
Question 8
Fit a random forest model with num.trees = 500.
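A random forest with 500 trees can be sketched with ranger() as follows. The log outcome is created as an explicit column (log_value, a hypothetical name) rather than transformed inside the formula; the importance setting, the seed, and the column names are likewise assumptions.

```r
library(tidyverse)
library(ranger)

df <- read_csv('https://bcdanl.github.io/data/california_housing_cleaned.csv')

# Hypothetical reproducible 70/30 split (seed and row_id are assumptions)
set.seed(1234)
df <- df |>
  mutate(row_id = row_number(),
         log_value = log(medianHouseValue))  # hypothetical column name
dtrain <- df |> slice_sample(prop = 0.7)
dtest  <- df |> anti_join(dtrain, by = "row_id")

model_rf <- ranger(
  log_value ~ housingMedianAge + medianIncome +
    AveBedrms + AveRooms + AveOccupancy + latitude + longitude,
  data = dtrain,
  num.trees = 500,
  importance = "impurity",  # assumption: store variable importance
  seed = 1234
)

# predict() on a ranger model returns a list; the values are in $predictions
pred_rf <- predict(model_rf, data = dtest)$predictions
sqrt(mean((dtest$log_value - pred_rf)^2))  # test RMSE on the log scale
```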
Question 9
Fit a gradient-boosted tree model using the following hyperparameter grid:
xgb_grid <- tidyr::crossing(
  nrounds = seq(20, 200, by = 20),
  eta = c(0.025, 0.05, 0.1, 0.3),
  max_depth = c(2, 3),
  gamma = c(0),
  colsample_bytree = c(1),
  min_child_weight = c(1),
  subsample = c(1)
)
xgb_grid |>
  paged_table()
Question 10
Compare the prediction performance across the models:
- Linear regression
- Lasso regression
- Ridge regression
- Regression tree
- Random forest
- Gradient-boosted tree
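One common way to compare these models is the root mean squared error (RMSE) on the test data, computed on the log scale since that is the outcome being predicted. The helpers below are a sketch: rmse() and compare_rmse() are hypothetical names, and the numbers in the usage example are toy values, not results from the housing data.

```r
# RMSE between observed and predicted values
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Build a table of RMSEs from a named list of prediction vectors,
# sorted so the best-performing model comes first
compare_rmse <- function(actual, preds) {
  out <- data.frame(
    model = names(preds),
    rmse = vapply(preds, function(p) rmse(actual, p), numeric(1)),
    row.names = NULL
  )
  out[order(out$rmse), ]
}

# Toy usage with made-up numbers (for illustration only)
actual <- c(12.0, 12.5, 13.0)
preds <- list(
  model_a = c(12.1, 12.4, 13.3),
  model_b = c(12.0, 12.6, 13.1)
)
compare_rmse(actual, preds)
```

In practice, actual would be the log of the median house value in dtest, and preds would hold each fitted model's test-set predictions.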