Classwork 6

Regularized Linear Regression: Online Shopping

Author

Byeong-Hak Choe

Published

March 4, 2026

Modified

March 4, 2026

Setup

library(Matrix) # to create a sparse matrix

library(tidyverse)
library(broom)
library(stargazer)
library(glmnet) 
library(gamlr)   # a companion package for glmnet
library(skimr)   # provides skim(), used in Checking Data

library(DT)
library(rmarkdown)
library(hrbrthemes)
library(ggthemes)

theme_set(
  theme_ipsum()
)

scale_colour_discrete <- function(...) scale_color_colorblind(...)
scale_fill_discrete <- function(...) scale_fill_colorblind(...)

Load Data

url <- "https://bcdanl.github.io/data/browser-online-shopping.zip"
dest <- file.path(tempdir(), "browser-online-shopping.zip")

download.file(url, destfile = dest, mode = "wb")
df <- readr::read_csv(dest)

Model

  • The browser dataset contains web browsing logs for 10,000 households.
  • The browser dataset includes a year’s worth of their browser logs for the 1,000 most heavily trafficked websites.
  • Each browser in the sample spent at least $1 online in the same year.

\[ \log(\text{spend}_{i}) = \beta_0 + \beta_1 X_{1,i} +\,\cdots\,+ \beta_{1000} X_{1000,i} + \epsilon_i \]

  • \(\text{spend}_{i}\): household \(i\)’s dollar amount spent on online shopping
  • \(X_{p, i}\): household \(i\)’s percentage of visits to the \(p\)-th website

Checking Data

df |> 
  transmute(row_sum = rowSums(across(-1))) |> # Row-wise sum, except for the first column
  skim()

Train-Test Split

set.seed(1)

df <- df |> 
  mutate(rnd = runif(n()))

dtrain <- df |> 
  filter(rnd > .2) |> 
  select(-rnd)
dtest <- df |> 
  filter(rnd <= .2) |> 
  select(-rnd)

# as.matrix() converts a data.frame into a matrix
X_train <- dtrain |> select(-spend) |> as.matrix()
X_test  <- dtest  |> select(-spend) |> as.matrix()

y_train <- log(dtrain$spend)
y_test  <- log(dtest$spend)

Questions

Question 1. Linear Regression

  • Fit the linear regression model using the following formula:
log(spend) ~ .    # including all remaining variables as predictors
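
As a minimal sketch of how this formula behaves, the example below fits lm() on a small simulated data frame. The data frame toy and its columns site_a, site_b, and site_c are hypothetical stand-ins, not the browser data; in the classwork the same call would presumably use dtrain.

```r
# Simulated stand-in for dtrain (hypothetical data, not the browser dataset)
set.seed(1)
n <- 200
toy <- data.frame(
  spend  = exp(rnorm(n, mean = 5)),  # strictly positive, so log() is defined
  site_a = runif(n),
  site_b = runif(n),
  site_c = runif(n)
)

# "." expands to every column in the data other than the response variable
fit_lm <- lm(log(spend) ~ ., data = toy)

summary(fit_lm)
```

If the website shares in the real data sum to the same constant for every household, the predictors are perfectly collinear with the intercept, and lm() would then report NA for one coefficient.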


Question 2. Regularized Linear Regression

  • Fit 5-fold cross-validated lasso and ridge linear regression models using the following options in cv.glmnet():
nfolds = 5    # number of folds in cross validation; 10 is default
  • In your cv.glmnet(), choose one of the following options:
    • intercept = TRUE: Default. Intercept is included in the model, and is not penalized.
    • intercept = FALSE: Intercept is excluded from the model.
  • Note that if the family option is omitted in cv.glmnet(), linear regression (family = "gaussian") is used by default.
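
The options above can be sketched as follows. This is only an illustration on simulated data: X_toy and y_toy are hypothetical stand-ins for X_train and y_train. In glmnet, alpha = 1 selects the lasso penalty and alpha = 0 selects the ridge penalty.

```r
library(glmnet)

# Simulated stand-ins for X_train and y_train (hypothetical data)
set.seed(1)
n <- 200
p <- 30
X_toy <- matrix(rnorm(n * p), n, p,
                dimnames = list(NULL, paste0("site", 1:p)))
y_toy <- drop(X_toy[, 1:3] %*% c(1, -0.5, 0.25)) + rnorm(n)

# 5-fold cross-validation; family = "gaussian" and intercept = TRUE are defaults
cv_lasso <- cv.glmnet(X_toy, y_toy, alpha = 1, nfolds = 5)  # lasso
cv_ridge <- cv.glmnet(X_toy, y_toy, alpha = 0, nfolds = 5)  # ridge

# Lambda values that minimize the cross-validated error
cv_lasso$lambda.min
cv_ridge$lambda.min
```

Fold assignment is random, so calling set.seed() before cv.glmnet() keeps the selected lambda reproducible.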


Question 3. Plots

  • Plot:
    • Cross-validation error curves
    • Coefficient paths
    • The number of non-zero betas as a function of lambda
    • Coefficient plots with the top 20 websites.
    • Coefficient plots with the bottom 20 websites.
    • Residual plots
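
One possible way to produce these plots is sketched below with base R graphics on simulated data. X_toy and y_toy are hypothetical stand-ins for the training data, and the example shows the top and bottom 5 coefficients rather than 20 because the toy data has only 30 predictors. A ggplot2 version would work equally well.

```r
library(glmnet)

# Simulated stand-ins for X_train and y_train (hypothetical data)
set.seed(1)
n <- 200
p <- 30
X_toy <- matrix(rnorm(n * p), n, p,
                dimnames = list(NULL, paste0("site", 1:p)))
y_toy <- drop(X_toy[, 1:3] %*% c(1, -0.5, 0.25)) + rnorm(n)
cv_lasso <- cv.glmnet(X_toy, y_toy, alpha = 1, nfolds = 5)

# Cross-validation error curve
plot(cv_lasso)

# Coefficient paths along log(lambda)
plot(cv_lasso$glmnet.fit, xvar = "lambda")

# Number of non-zero coefficients as a function of lambda
plot(log(cv_lasso$lambda), cv_lasso$nzero, type = "s",
     xlab = "log(lambda)", ylab = "Number of non-zero coefficients")

# Coefficients at lambda.min, intercept dropped, as a named numeric vector
beta <- as.matrix(coef(cv_lasso, s = "lambda.min"))[-1, 1]
top    <- head(sort(beta, decreasing = TRUE), 5)  # largest coefficients
bottom <- head(sort(beta), 5)                     # smallest coefficients
barplot(rev(top), horiz = TRUE, las = 1, main = "Largest coefficients")
barplot(rev(bottom), horiz = TRUE, las = 1, main = "Smallest coefficients")

# Residuals against fitted values
fitted_vals <- drop(predict(cv_lasso, newx = X_toy, s = "lambda.min"))
plot(fitted_vals, y_toy - fitted_vals,
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```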


Question 4. Inferences

  • Interpret the relationship between spending and visits to a website of your choice.
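
As a reminder of how a coefficient reads in a model with a logged outcome, consider changing \(X_{p,i}\) by \(\Delta X_p\) while holding the other predictors fixed. The model above implies

\[ \frac{\text{spend}_{i}^{\text{new}}}{\text{spend}_{i}^{\text{old}}} = e^{\beta_p \, \Delta X_p} \]

For example, with a purely hypothetical estimate \(\beta_p = 0.02\) and \(\Delta X_p = 1\) (one percentage point), the ratio is \(e^{0.02} \approx 1.0202\), that is, spending is about 2% higher. Two caveats apply: this describes an association, not a causal effect, and lasso and ridge estimates are shrunk toward zero, so they tend to understate the magnitude of the unpenalized coefficient.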


Question 5. Prediction and Evaluation

  • Predict the outcome on the test set and evaluate each model using RMSE.
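
A minimal sketch of the evaluation step, assuming simulated data in place of the browser dataset: the helper rmse() and the objects X_toy, y_toy, and is_train are hypothetical names introduced for illustration. In the classwork the same pattern would presumably use X_train, y_train, X_test, and y_test.

```r
library(glmnet)

# Simulated data with an 80/20 split (hypothetical stand-in for the browser data)
set.seed(1)
n <- 300
p <- 30
X_toy <- matrix(rnorm(n * p), n, p,
                dimnames = list(NULL, paste0("site", 1:p)))
y_toy <- drop(X_toy[, 1:3] %*% c(1, -0.5, 0.25)) + rnorm(n)
is_train <- runif(n) > 0.2

# Root mean squared error between actual and predicted values
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

cv_lasso <- cv.glmnet(X_toy[is_train, ], y_toy[is_train], alpha = 1, nfolds = 5)
cv_ridge <- cv.glmnet(X_toy[is_train, ], y_toy[is_train], alpha = 0, nfolds = 5)

# Predict on the held-out rows at the lambda with the lowest CV error
pred_lasso <- drop(predict(cv_lasso, newx = X_toy[!is_train, ], s = "lambda.min"))
pred_ridge <- drop(predict(cv_ridge, newx = X_toy[!is_train, ], s = "lambda.min"))

rmse_lasso <- rmse(y_toy[!is_train], pred_lasso)
rmse_ridge <- rmse(y_toy[!is_train], pred_ridge)
c(lasso = rmse_lasso, ridge = rmse_ridge)
```

Because the outcome is log(spend), the RMSE here is measured in log dollars, so the models are compared on the same scale on which they were fit.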



Discussion

Welcome to our Classwork 6 Discussion Board! 👋

This space is designed for you to engage with your classmates about the material covered in Classwork 6.

Whether you want to delve deeper into the content, share insights, or ask questions, this is the perfect place for you.

If you have any specific questions for Byeong-Hak (@bcdanl) regarding the Classwork 6 materials or need clarification on any points, don’t hesitate to ask here.

All comments will be stored here.

Let’s collaborate and learn from each other!
