library(Matrix) # to create a sparse matrix
library(tidyverse)
library(broom)
library(stargazer)
library(glmnet)
library(gamlr) # a companion package for glmnet
library(DT)
library(rmarkdown)
library(hrbrthemes)
library(ggthemes)
theme_set(
theme_ipsum()
)
scale_colour_discrete <- function(...) scale_color_colorblind(...)
scale_fill_discrete <- function(...) scale_fill_colorblind(...)
Classwork 6
Regularized Linear Regression – Online Shopping
Setup
Load Data
url <- "https://bcdanl.github.io/data/browser-online-shopping.zip"
dest <- file.path(tempdir(), "browser-online-shopping.zip")
download.file(url, destfile = dest, mode = "wb")
df <- readr::read_csv(dest)
Model
- The browser dataset contains web browsing logs for 10,000 households.
- The browser dataset includes a year's worth of their browser logs for the 1,000 most heavily trafficked websites.
- Each browser in the sample spent at least $1 online in the same year.
\[ \log(\text{spend}_{i}) = \beta_0 + \beta_1 X_{1,i} +\,\cdots\,+ \beta_{1000} X_{1000,i} + \epsilon_i \]
- \(\text{spend}_{i}\): household \(i\)'s online shopping spending, in dollars
- \(X_{p, i}\): household \(i\)'s percentage of visits to the \(p\)-th website
Checking Data
df |>
transmute(row_sum = rowSums(across(-1))) |> # Row-wise sum, except for the first column
skimr::skim() # skim() comes from the skimr package, which is not attached above
Train-Test Split
set.seed(1)
df <- df |>
mutate(rnd = runif(n()))
dtrain <- df |>
filter(rnd > .2) |>
select(-rnd)
dtest <- df |>
filter(rnd <= .2) |>
select(-rnd)
# as.matrix() converts a data.frame into a matrix
X_train <- dtrain |> select(-spend) |> as.matrix()
X_test <- dtest |> select(-spend) |> as.matrix()
y_train <- log(dtrain$spend)
y_test <- log(dtest$spend)
Questions
Question 1. Linear Regression
- Fit the linear regression model using the following formula:
log(spend) ~ . # including all remaining variables as predictors
Question 2. Regularized Linear Regression
- Fit 5-fold cross-validated lasso and ridge linear regression models using the following options in cv.glmnet():
nfolds = 5 # number of folds in cross validation; 10 is default
- In your cv.glmnet(), choose one of the following options:
  - intercept = TRUE: Default. Intercept is included in the model, and is not penalized.
  - intercept = FALSE: Intercept is excluded from the model.
- Note that if the family option is skipped in cv.glmnet(), linear regression (family = "gaussian") is set by default.
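The options above can be sketched as follows. This is a minimal illustration on simulated data, not a solution for the browser data: X_sim, y_sim, cv_lasso, and cv_ridge are hypothetical names, and the simulated matrix merely stands in for X_train and y_train. In glmnet, alpha = 1 selects the lasso penalty and alpha = 0 selects the ridge penalty.

```r
library(glmnet)

# Simulated stand-in for X_train / y_train (hypothetical data, not the browser logs)
set.seed(1)
n <- 200
p <- 50
X_sim <- matrix(rnorm(n * p), nrow = n, ncol = p)
colnames(X_sim) <- paste0("site", seq_len(p))
y_sim <- 1 + 2 * X_sim[, 1] - 1.5 * X_sim[, 2] + rnorm(n)

# alpha = 1 is the lasso penalty; alpha = 0 is the ridge penalty.
# family is omitted, so the default family = "gaussian" (linear regression) is used.
cv_lasso <- cv.glmnet(X_sim, y_sim, alpha = 1, nfolds = 5, intercept = TRUE)
cv_ridge <- cv.glmnet(X_sim, y_sim, alpha = 0, nfolds = 5, intercept = TRUE)

# Lambda values that minimize the cross-validated mean squared error
cv_lasso$lambda.min
cv_ridge$lambda.min

# Coefficients at that lambda; the first row is the intercept
coef(cv_lasso, s = "lambda.min")[1:5, ]
```

With the real data, the same calls would presumably take X_train and y_train in place of the simulated inputs.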
Question 3. Plots
- Plot:
- Cross-validation error curves
- Coefficient paths
- The number of non-zero betas as a function of lambda
- Coefficient plots with the top 20 websites.
- Coefficient plots with the bottom 20 websites.
- Residual plots
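One possible way to produce the plots listed above is sketched here with base graphics and the plot methods that glmnet provides. The data are simulated and all object names (X_sim, y_sim, cv_lasso, beta, top5) are hypothetical; the sketch shows only the top 5 coefficients to stay short, and a ggplot2 version would work equally well.

```r
library(glmnet)

# Simulated stand-in for the training data (hypothetical, not the browser logs)
set.seed(1)
n <- 200
p <- 50
X_sim <- matrix(rnorm(n * p), nrow = n, ncol = p)
colnames(X_sim) <- paste0("site", seq_len(p))
y_sim <- 1 + 2 * X_sim[, 1] - 1.5 * X_sim[, 2] + rnorm(n)

cv_lasso <- cv.glmnet(X_sim, y_sim, alpha = 1, nfolds = 5)

# Cross-validation error curve
plot(cv_lasso)

# Coefficient paths along the lambda sequence
plot(cv_lasso$glmnet.fit, xvar = "lambda")

# Number of non-zero coefficients as a function of lambda
plot(log(cv_lasso$lambda), cv_lasso$nzero, type = "s",
     xlab = "log(lambda)", ylab = "Number of non-zero coefficients")

# Largest coefficients at lambda.min (intercept dropped)
beta <- as.matrix(coef(cv_lasso, s = "lambda.min"))[-1, 1]
top5 <- sort(beta, decreasing = TRUE)[1:5]
barplot(rev(top5), horiz = TRUE, las = 1, xlab = "Coefficient")

# Residuals against fitted values
fitted_vals <- as.numeric(predict(cv_lasso, newx = X_sim, s = "lambda.min"))
plot(fitted_vals, y_sim - fitted_vals, xlab = "Fitted", ylab = "Residual")
abline(h = 0, lty = 2)
```

Sorting with decreasing = FALSE instead would give the most negative coefficients, which is one way to read "bottom" websites.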
Question 4. Inferences
- Interpret the relationship between spending and visits to a website of your choice.
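Because the outcome is log(spend), a coefficient can be read as an approximate percentage change: a one-percentage-point increase in visits to website \(p\) is associated with a change in spending of \(100 \times (e^{\beta_p} - 1)\) percent, which is close to \(100 \times \beta_p\) percent when \(\beta_p\) is small. The sketch below uses a made-up coefficient value (beta_p = 0.03 is hypothetical, not an estimate from the data).

```r
# Hypothetical coefficient value, chosen only for illustration
beta_p <- 0.03

# Exact percentage change in spend implied by a log-linear model
exact_pct <- 100 * (exp(beta_p) - 1)

# Common small-coefficient approximation
approx_pct <- 100 * beta_p

c(exact = exact_pct, approx = approx_pct)
```

Note that penalized estimates are shrunk toward zero, so this reading describes an association in the fitted model rather than a causal effect.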
Question 5. Prediction and Evaluation
- Predict the outcome for the test data and evaluate each model using RMSE.
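A minimal sketch of the evaluation step, again on simulated data: rmse() is a hypothetical helper defined here, and X_tr, X_te, y_tr, and y_te stand in for X_train, X_test, y_train, and y_test. The split mirrors the runif()-based 80/20 split used earlier, and an ordinary linear regression is included as a baseline.

```r
library(glmnet)

# Simulated stand-in data (hypothetical, not the browser logs)
set.seed(1)
n <- 300
p <- 50
X_sim <- matrix(rnorm(n * p), nrow = n, ncol = p)
colnames(X_sim) <- paste0("site", seq_len(p))
y_sim <- 1 + 2 * X_sim[, 1] - 1.5 * X_sim[, 2] + rnorm(n)

# Roughly 80/20 train-test split
is_train <- runif(n) > 0.2
X_tr <- X_sim[is_train, ]
X_te <- X_sim[!is_train, ]
y_tr <- y_sim[is_train]
y_te <- y_sim[!is_train]

# Root mean squared error (hypothetical helper)
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Baseline: ordinary linear regression on all predictors
ols <- lm(y ~ ., data = data.frame(y = y_tr, X_tr))
pred_ols <- predict(ols, newdata = data.frame(X_te))

# Cross-validated lasso and ridge, predicting at lambda.min
cv_lasso <- cv.glmnet(X_tr, y_tr, alpha = 1, nfolds = 5)
cv_ridge <- cv.glmnet(X_tr, y_tr, alpha = 0, nfolds = 5)
pred_lasso <- as.numeric(predict(cv_lasso, newx = X_te, s = "lambda.min"))
pred_ridge <- as.numeric(predict(cv_ridge, newx = X_te, s = "lambda.min"))

# Test-set RMSE for each model (on the log scale when y is log(spend))
c(ols   = rmse(y_te, pred_ols),
  lasso = rmse(y_te, pred_lasso),
  ridge = rmse(y_te, pred_ridge))
```

Because the models in this classwork predict log(spend), the RMSE computed this way is in log units.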