Homework 4

Predicting Housing Price in California

Author

Byeong-Hak Choe

Published

April 13, 2026

Modified

April 13, 2026

Direction

  • Please submit one Quarto Document of Homework 4 to Brightspace using the following file naming convention:

  • Example:

    • danl-320-hw4-choe-byeonghak.qmd
  • Due: April 20, 2026, 11:59 P.M. (ET)

  • Please send Byeong-Hak an email (bchoe@geneseo.edu) if you have any questions.


library(tidyverse)
library(broom)
library(stargazer)

library(glmnet)
library(gamlr)
library(Matrix)

library(janitor)
library(rpart)
library(rpart.plot)
library(ranger)
# library(xgboost)
library(vip)
library(pdp)

library(DT)
library(rmarkdown)
library(hrbrthemes)
library(ggthemes)

theme_set(
  theme_ipsum()
)

scale_colour_discrete <- function(...) scale_color_colorblind(...)
scale_fill_discrete <- function(...) scale_fill_colorblind(...)

Data

df <- read_csv('https://bcdanl.github.io/data/california_housing_cleaned.csv')


  • The census-tract-level housing data for California include:
    • Latitude/longitude of tract centers
    • Median home age
    • Median income
    • Average numbers of rooms and bedrooms
    • Average occupancy
    • Median home values
  • The goal is to predict the log of median housing value for census tracts.



Question 1

  • Divide the df DataFrame into training and test DataFrames.
    • Use dtrain and dtest for training and test DataFrames, respectively.
    • Assign 70% of the observations in df to dtrain and the remaining 30% to dtest.
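
A random 70/30 split can be sketched in base R as follows. This is only an illustration on simulated data: toy_df, train_idx, dtrain_toy, dtest_toy, and the seed value are hypothetical names and choices, not part of the assignment.

```r
# Toy stand-in for df (hypothetical data, used only to show the mechanics)
set.seed(320)  # arbitrary seed so the split is reproducible
toy_df <- data.frame(id = 1:100, x = rnorm(100))

# Randomly pick 70% of the row indices for the training set
n <- nrow(toy_df)
train_idx <- sample(seq_len(n), size = round(0.7 * n))

dtrain_toy <- toy_df[train_idx, ]   # sampled 70% of rows
dtest_toy  <- toy_df[-train_idx, ]  # the remaining 30% of rows
```

Setting a seed before sampling keeps the split the same each time the document is rendered.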



Question 2

  • Consider the following model:

\[ \begin{align} \log(\text{medianHouseValue})_{i} = &\beta_0 + \beta_1 \text{housingMedianAge}_{i} + \beta_2 \text{medianIncome}_{i}\\ &+ \beta_3 \text{AveBedrms}_{i} + \beta_4 \text{AveRooms}_{i} + \beta_5 \text{AveOccupancy}_{i}\\ &+ \beta_{6}\text{longitude}_{i} + \beta_{7}\text{latitude}_{i} + \epsilon_{i} \end{align} \]

  • Provide a rationale for why the model includes \(\text{longitude}\) and \(\text{latitude}\) as predictors for \(\log(\text{medianHouseValue})\).


Question 3

  • Fit the linear regression model in Question 2.


Question 4

  • Interpret \(\hat{\beta_{2}}\), \(\hat{\beta_{4}}\), and \(\hat{\beta_{5}}\), obtained from the linear regression model in Question 3.
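
Because the outcome is in logs, a one-unit increase in a predictor, holding the others fixed, is associated with a change of \(100 \times (e^{\hat{\beta}} - 1)\) percent in the median house value, which is close to \(100 \times \hat{\beta}\) percent when \(\hat{\beta}\) is small. A minimal sketch of that conversion, where pct_change is a hypothetical helper and 0.05 is a made-up coefficient rather than an estimate from the data:

```r
# Convert a coefficient from a log-outcome regression into a percent change
# in the (unlogged) outcome for a one-unit increase in the predictor
pct_change <- function(beta_hat) {
  100 * (exp(beta_hat) - 1)
}

pct_change(0.05)   # exact percent change implied by a coefficient of 0.05
100 * 0.05         # small-coefficient approximation, for comparison
```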


Questions 5-6

  • Suppose the model allows the relationship between the outcome and each of the following predictors (\(\text{housingMedianAge}_{i}\), \(\text{medianIncome}_{i}\), \(\text{AveBedrms}_{i}\), \(\text{AveRooms}_{i}\), and \(\text{AveOccupancy}_{i}\)) to vary by the location of the census tract.

\[ \begin{align} \log(\text{medianHouseValue})_{i} = &\beta_0 + \beta_1 \text{housingMedianAge}_{i} + \beta_2 \text{medianIncome}_{i}\\ &+ \beta_3 \text{AveBedrms}_{i} + \beta_4 \text{AveRooms}_{i} + \beta_5 \text{AveOccupancy}_{i}\\ &+ \beta_6 \text{longitude}_{i} + \beta_7 \text{latitude}_{i}\\ &+ (\beta_8 \text{housingMedianAge}_{i} + \beta_9 \text{medianIncome}_{i}\\ &\qquad+ \beta_{10} \text{AveBedrms}_{i} + \beta_{11} \text{AveRooms}_{i} + \beta_{12} \text{AveOccupancy}_{i} ) \times \text{longitude}_{i}\\ &+ (\beta_{13} \text{housingMedianAge}_{i} + \beta_{14} \text{medianIncome}_{i}\\ &\qquad+ \beta_{15} \text{AveBedrms}_{i} + \beta_{16} \text{AveRooms}_{i} + \beta_{17} \text{AveOccupancy}_{i} ) \times \text{latitude}_{i}\\ &+ (\beta_{18} \text{housingMedianAge}_{i} + \beta_{19} \text{medianIncome}_{i} \\ &\qquad+ \beta_{20} \text{AveBedrms}_{i} + \beta_{21} \text{AveRooms}_{i} + \beta_{22} \text{AveOccupancy}_{i} )\times \text{longitude}_{i} \times \text{latitude}_{i}\\ &+ \epsilon_{i} \end{align} \]
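
Penalized regression functions generally take a numeric predictor matrix rather than a formula, so one possible first step is to build the 22 columns of the model above with model.matrix(). The sketch below uses simulated data. The data frame toy is hypothetical, and the column names are assumed to match the variable names in the equation, which may differ from those in the actual CSV.

```r
# Simulated stand-in data (hypothetical values, for illustration only)
set.seed(320)
n <- 50
toy <- data.frame(
  housingMedianAge = runif(n, 1, 52),
  medianIncome     = runif(n, 0.5, 15),
  AveBedrms        = runif(n, 0.5, 3),
  AveRooms         = runif(n, 2, 10),
  AveOccupancy     = runif(n, 1, 6),
  longitude        = runif(n, -124, -114),
  latitude         = runif(n, 32, 42)
)

# Main effects, plus each of the five predictors interacted with longitude,
# with latitude, and with longitude x latitude. The equation has no
# stand-alone longitude:latitude term, so it is removed explicitly.
f <- ~ (housingMedianAge + medianIncome + AveBedrms + AveRooms + AveOccupancy) *
  longitude * latitude - longitude:latitude

# Drop the intercept column; glmnet-style functions add their own intercept
X <- model.matrix(f, data = toy)[, -1]
ncol(X)  # number of predictor columns (beta_1 through beta_22)
```

A matrix built this way can then be passed, together with the log outcome, to glmnet(), where alpha = 1 gives the Lasso and alpha = 0 gives Ridge.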

Question 5

Fit a Lasso linear regression model.


Question 6

Fit a Ridge linear regression model.


Questions 7-10

Consider the tree-based model described below:

\[ \begin{align} \log(\text{medianHouseValue})_{i} = f(&\text{housingMedianAge}_{i}, \text{medianIncome}_{i},\\ &\text{AveBedrms}_{i}, \text{AveRooms}_{i}, \text{AveOccupancy}_{i},\\ &\text{latitude}_{i}, \text{longitude}_{i}) \end{align} \]

Question 7

Fit a regression tree model.


Question 8

Fit a random forest model with num.trees = 500.


Question 9

Fit a gradient-boosted tree model using the following hyperparameter grid:

xgb_grid <- tidyr::crossing(
  nrounds = seq(20, 200, by = 20),
  eta = c(0.025, 0.05, 0.1, 0.3),
  max_depth = c(2, 3),
  gamma = c(0),
  colsample_bytree = c(1),
  min_child_weight = c(1),
  subsample = c(1)
)

xgb_grid |> 
  paged_table()


Question 10

Compare the prediction performance across the models:

  • Linear regression
  • Lasso regression
  • Ridge regression
  • Regression tree
  • Random forest
  • Gradient-boosted tree
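
One common way to compare prediction performance is the root mean squared error (RMSE) on dtest; the assignment does not prescribe a metric, so this choice is an assumption. The sketch below uses made-up numbers: rmse, actual, preds, model_a, model_b, and rmse_tbl are hypothetical names, and the values are not results from any fitted model.

```r
# Root mean squared error between observed and predicted values
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Made-up log home values and predictions from two hypothetical models
actual <- c(12.0, 12.5, 11.8, 13.1)
preds <- list(
  model_a = c(12.1, 12.4, 11.9, 13.0),
  model_b = c(12.5, 12.0, 12.3, 12.6)
)

# One row per model, sorted so the lowest test RMSE comes first
rmse_tbl <- data.frame(
  model = names(preds),
  rmse  = sapply(preds, rmse, actual = actual)
)
rmse_tbl <- rmse_tbl[order(rmse_tbl$rmse), ]
rmse_tbl
```

Computing the metric on the same test observations for every model keeps the comparison fair.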