Midterm Exam
DANL 320-01: Big Data Analytics
Honor Pledges
I solemnly swear that I will not cheat or engage in any form of academic dishonesty during this exam.
I will not communicate with other students or use unauthorized materials.
I will uphold the integrity of this exam and demonstrate my own knowledge and abilities.
By taking this pledge, I acknowledge that academic dishonesty undermines the academic process and is a violation of the trust placed in me as a student.
I accept the consequences of any violation of this promise.
- Student's Name: [YOUR_NAME_HERE]
The web-link for the exam questions is here
Below are the R packages for this exam:
library(tidyverse)
library(broom)
library(stargazer)
library(skimr)
library(margins)
library(yardstick)
library(WVPlots)
library(pROC)
library(glmnet)
library(Matrix)
library(rmarkdown)
library(hrbrthemes)
library(ggthemes)
Section 0. Data Preparation
Prepare data for regressions.
Tasks 1-2
- Convert the following variables to factors, with levels in the order given below: year, seasons, month, hr, wkday, weather_cond.
# year
c("2011", "2012")
# seasons
c("spring", "summer", "fall", "winter")
# month
c("01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12")
# hr
c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12",
"13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23")
# wkday
c("sunday", "monday", "tuesday", "wednesday",
"thursday", "friday", "saturday")
# weather_cond
c("Clear or Few Cloudy",
"Light Snow or Light Rain",
"Mist or Cloudy")
- Randomly split the bikeshare data.frame into training (dtrain) and test (dtest) data.frames.
  - 70% of observations in the bikeshare data.frame go to dtrain.
  - The remaining 30% of observations in the bikeshare data.frame go to dtest.
  - Ensure that you can replicate the random split.
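The two tasks above follow a common pattern, sketched here on a small made-up data frame rather than on bikeshare. The object names (toy, gp, toy_train, toy_test), the values, and the seed are all assumptions for illustration. Drawing one uniform number per row gives an approximately 70/30 split, not an exact one.

```r
# Toy stand-in for the bikeshare data.frame (hypothetical values, not the exam data)
toy <- data.frame(
  seasons = c("fall", "spring", "winter", "summer", "fall",
              "spring", "summer", "winter", "fall", "spring"),
  cnt     = c(120, 35, 60, 210, 150, 40, 190, 75, 130, 55)
)

# Explicit levels fix the order; the first level becomes the reference category
toy$seasons <- factor(toy$seasons,
                      levels = c("spring", "summer", "fall", "winter"))

# A fixed seed makes the random split reproducible (the seed value is arbitrary)
set.seed(1)
toy$gp <- runif(nrow(toy))          # one uniform draw per row

toy_train <- toy[toy$gp <  0.7, ]   # roughly 70% of rows
toy_test  <- toy[toy$gp >= 0.7, ]   # the remaining rows
```

Re-running the block from set.seed() onward reproduces the same gp values, and therefore the same split.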
Answer
# YOUR CODE HERE
dtrain, dtest, and more
If you were not able to successfully prepare the data for machine learning model estimation, you may use the training (dtrain) and test (dtest) data frames provided below instead.
url <- "https://bcdanl.github.io/data/bikeshare.RDS"
dest <- file.path(tempdir(), "bikeshare.RDS")
download.file(url, destfile = dest, mode = "wb")
bikeshare_ML_ready <- readRDS(dest)
# For Sections 1-2
dtrain <- bikeshare_ML_ready$training
dtest <- bikeshare_ML_ready$test
# For Section 3
y_train <- bikeshare_ML_ready$outcome_train
y_test <- bikeshare_ML_ready$outcome_test
X_train <- bikeshare_ML_ready$input_train
X_test <- bikeshare_ML_ready$input_test
# To remove the following objects:
rm(bikeshare_ML_ready, url, dest)
Section 1. Linear Regression
Q1a
Use R to fit the following linear regression model:
\[ \begin{align} \text{over\_load}_{i} =\ &\beta_{\text{intercept}}\\ &+ \beta_{\text{temp}} \, \text{temp}_{i} + \beta_{\text{hum}} \, \text{hum}_{i} + \beta_{\text{windspeed}} \, \text{windspeed}_{i} \nonumber \\ &+ \beta_{\text{year\_2012}} \, \text{year\_2012}_{i}\\ &+ \beta_{\text{month\_2}} \, \text{month\_2}_{i} + \beta_{\text{month\_3}} \, \text{month\_3}_{i} + \beta_{\text{month\_4}} \, \text{month\_4}_{i} \nonumber \\ &+ \beta_{\text{month\_5}} \, \text{month\_5}_{i} + \beta_{\text{month\_6}} \, \text{month\_6}_{i} + \beta_{\text{month\_7}} \, \text{month\_7}_{i} \nonumber \\ &+ \beta_{\text{month\_8}} \, \text{month\_8}_{i} + \beta_{\text{month\_9}} \, \text{month\_9}_{i} + \beta_{\text{month\_10}} \, \text{month\_10}_{i} \nonumber \\ &+ \beta_{\text{month\_11}} \, \text{month\_11}_{i} + \beta_{\text{month\_12}} \, \text{month\_12}_{i} \nonumber \\ &+ \beta_{\text{hr\_1}} \, \text{hr\_1}_{i} + \beta_{\text{hr\_2}} \, \text{hr\_2}_{i} + \beta_{\text{hr\_3}} \, \text{hr\_3}_{i} + \beta_{\text{hr\_4}} \, \text{hr\_4}_{i} \nonumber \\ &+ \beta_{\text{hr\_5}} \, \text{hr\_5}_{i} + \beta_{\text{hr\_6}} \, \text{hr\_6}_{i} + \beta_{\text{hr\_7}} \, \text{hr\_7}_{i} + \beta_{\text{hr\_8}} \, \text{hr\_8}_{i} \nonumber \\ &+ \beta_{\text{hr\_9}} \, \text{hr\_9}_{i} + \beta_{\text{hr\_10}} \, \text{hr\_10}_{i} + \beta_{\text{hr\_11}} \, \text{hr\_11}_{i} + \beta_{\text{hr\_12}} \, \text{hr\_12}_{i} \nonumber \\ &+ \beta_{\text{hr\_13}} \, \text{hr\_13}_{i} + \beta_{\text{hr\_14}} \, \text{hr\_14}_{i} + \beta_{\text{hr\_15}} \, \text{hr\_15}_{i} + \beta_{\text{hr\_16}} \, \text{hr\_16}_{i} \nonumber \\ &+ \beta_{\text{hr\_17}} \, \text{hr\_17}_{i} + \beta_{\text{hr\_18}} \, \text{hr\_18}_{i} + \beta_{\text{hr\_19}} \, \text{hr\_19}_{i} + \beta_{\text{hr\_20}} \, \text{hr\_20}_{i} \nonumber \\ &+ \beta_{\text{hr\_21}} \, \text{hr\_21}_{i} + \beta_{\text{hr\_22}} \, \text{hr\_22}_{i} + \beta_{\text{hr\_23}} \, \text{hr\_23}_{i} \nonumber \\ &+ 
\beta_{\text{wkday\_monday}} \, \text{wkday\_monday}_{i} + \beta_{\text{wkday\_tuesday}} \, \text{wkday\_tuesday}_{i} \nonumber \\ &+ \beta_{\text{wkday\_wednesday}} \, \text{wkday\_wednesday}_{i} \nonumber \\ &+ \beta_{\text{wkday\_thursday}} \, \text{wkday\_thursday}_{i} + \beta_{\text{wkday\_friday}} \, \text{wkday\_friday}_{i} \nonumber \\ &+ \beta_{\text{wkday\_saturday}} \, \text{wkday\_saturday}_{i} \nonumber \\ &+ \beta_{\text{holiday\_1}} \, \text{holiday\_1}_{i} \nonumber \\ &+ \beta_{\text{seasons\_summer}} \, \text{seasons\_summer}_{i} + \beta_{\text{seasons\_fall}} \, \text{seasons\_fall}_{i} \nonumber \\ &+ \beta_{\text{seasons\_winter}} \, \text{seasons\_winter}_{i} \nonumber \\ &+ \beta_{\text{weather\_cond\_Light\_Snow\_or\_Light\_Rain}} \, \text{weather\_cond\_Light\_Snow\_or\_Light\_Rain}_{i}\nonumber \\ &+ \beta_{\text{weather\_cond\_Mist\_or\_Cloudy}} \, \text{weather\_cond\_Mist\_or\_Cloudy}_{i}\\ &+ \epsilon_{i} \end{align} \]
Answer
# YOUR CODE HERE
Q1b
- Provide beta estimates from the model in Q1a.
Answer
# YOUR CODE HERE
Q1c
- Draw a coefficient plot for the hr variables.
Answer
# YOUR CODE HERE
Q1d
- Using the estimation result in Q1a:
  - Make a prediction on the outcome variable using the test data.frame.
Answer
# YOUR CODE HERE
Q1e
- Draw a residual plot using the test data.frame.
Answer
# YOUR CODE HERE
Section 2. Logistic Regression
Q2a
Use R to fit the following logistic regression model:
\[ \begin{align} \text{Prob}(\text{over\_load}_{i} = 1) =\ G(\;\; &\beta_{\text{intercept}}\\ &+ \beta_{\text{temp}} \, \text{temp}_{i} + \beta_{\text{hum}} \, \text{hum}_{i} + \beta_{\text{windspeed}} \, \text{windspeed}_{i} \nonumber \\ &+ \beta_{\text{year\_2012}} \, \text{year\_2012}_{i}\\ &+ \beta_{\text{month\_2}} \, \text{month\_2}_{i} + \beta_{\text{month\_3}} \, \text{month\_3}_{i} + \beta_{\text{month\_4}} \, \text{month\_4}_{i} \nonumber \\ &+ \beta_{\text{month\_5}} \, \text{month\_5}_{i} + \beta_{\text{month\_6}} \, \text{month\_6}_{i} + \beta_{\text{month\_7}} \, \text{month\_7}_{i} \nonumber \\ &+ \beta_{\text{month\_8}} \, \text{month\_8}_{i} + \beta_{\text{month\_9}} \, \text{month\_9}_{i} + \beta_{\text{month\_10}} \, \text{month\_10}_{i} \nonumber \\ &+ \beta_{\text{month\_11}} \, \text{month\_11}_{i} + \beta_{\text{month\_12}} \, \text{month\_12}_{i} \nonumber \\ &+ \beta_{\text{hr\_1}} \, \text{hr\_1}_{i} + \beta_{\text{hr\_2}} \, \text{hr\_2}_{i} + \beta_{\text{hr\_3}} \, \text{hr\_3}_{i} + \beta_{\text{hr\_4}} \, \text{hr\_4}_{i} \nonumber \\ &+ \beta_{\text{hr\_5}} \, \text{hr\_5}_{i} + \beta_{\text{hr\_6}} \, \text{hr\_6}_{i} + \beta_{\text{hr\_7}} \, \text{hr\_7}_{i} + \beta_{\text{hr\_8}} \, \text{hr\_8}_{i} \nonumber \\ &+ \beta_{\text{hr\_9}} \, \text{hr\_9}_{i} + \beta_{\text{hr\_10}} \, \text{hr\_10}_{i} + \beta_{\text{hr\_11}} \, \text{hr\_11}_{i} + \beta_{\text{hr\_12}} \, \text{hr\_12}_{i} \nonumber \\ &+ \beta_{\text{hr\_13}} \, \text{hr\_13}_{i} + \beta_{\text{hr\_14}} \, \text{hr\_14}_{i} + \beta_{\text{hr\_15}} \, \text{hr\_15}_{i} + \beta_{\text{hr\_16}} \, \text{hr\_16}_{i} \nonumber \\ &+ \beta_{\text{hr\_17}} \, \text{hr\_17}_{i} + \beta_{\text{hr\_18}} \, \text{hr\_18}_{i} + \beta_{\text{hr\_19}} \, \text{hr\_19}_{i} + \beta_{\text{hr\_20}} \, \text{hr\_20}_{i} \nonumber \\ &+ \beta_{\text{hr\_21}} \, \text{hr\_21}_{i} + \beta_{\text{hr\_22}} \, \text{hr\_22}_{i} + \beta_{\text{hr\_23}} \, \text{hr\_23}_{i} \nonumber \\ &+ 
\beta_{\text{wkday\_monday}} \, \text{wkday\_monday}_{i} + \beta_{\text{wkday\_tuesday}} \, \text{wkday\_tuesday}_{i} \nonumber \\ &+ \beta_{\text{wkday\_wednesday}} \, \text{wkday\_wednesday}_{i} \nonumber \\ &+ \beta_{\text{wkday\_thursday}} \, \text{wkday\_thursday}_{i} + \beta_{\text{wkday\_friday}} \, \text{wkday\_friday}_{i} \nonumber \\ &+ \beta_{\text{wkday\_saturday}} \, \text{wkday\_saturday}_{i} \nonumber \\ &+ \beta_{\text{holiday\_1}} \, \text{holiday\_1}_{i} \nonumber \\ &+ \beta_{\text{seasons\_summer}} \, \text{seasons\_summer}_{i} + \beta_{\text{seasons\_fall}} \, \text{seasons\_fall}_{i} \nonumber \\ &+ \beta_{\text{seasons\_winter}} \, \text{seasons\_winter}_{i} \nonumber \\ &+ \beta_{\text{weather\_cond\_Light\_Snow\_or\_Light\_Rain}} \, \text{weather\_cond\_Light\_Snow\_or\_Light\_Rain}_{i}\nonumber \\ &+ \beta_{\text{weather\_cond\_Mist\_or\_Cloudy}} \, \text{weather\_cond\_Mist\_or\_Cloudy}_{i} \;\;) \end{align} \]
where \(G(\,\cdot\,)\) is
\[ G(\,\cdot\,) = \frac{\exp(\,\cdot\,)}{1 + \exp(\,\cdot\,)}. \]
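As a quick check on the link function, here is a minimal sketch that evaluates G at a few points and compares it with base R's plogis(), which computes the same logistic function. The input vector z is made up for illustration.

```r
# G as written in the formula above
G <- function(z) exp(z) / (1 + exp(z))

z <- c(-2, 0, 1.5)   # arbitrary linear-predictor values (assumption)

G(z)                 # probabilities strictly between 0 and 1

# plogis() is base R's logistic distribution function, the same map
all.equal(G(z), plogis(z))
```

A linear predictor of 0 corresponds to a probability of 0.5, and larger values push the probability toward 1.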
- Provide a table that summarizes the estimation results, including beta estimates, AIC, and residual deviance.
Answer
# YOUR CODE HERE
Q2b
- Using the estimation result in Q2a:
  - Calculate the average marginal effects of the hr and wkday dummies.
Answer
# YOUR CODE HERE
Q2c
- Using the estimation result in Q2a:
  - Make a prediction on the training data.
  - Make a prediction on the test data.
Answer
# YOUR CODE HERE
Q2d
- Using the estimation result in Q2a:
  - Visualize a double density plot of predictions from the training data.
    - coord_cartesian(ylim = c(0, 20)) can help identify the threshold.
Answer
# YOUR CODE HERE
Q2e
- Using the estimation result in Q2a, calculate the following:
  - Confusion matrix with the threshold level that clearly separates the double density plots
  - Accuracy
  - Precision
  - Recall
  - Specificity
  - Base rate
  - Enrichment
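For reference, the quantities listed above can all be derived from the four cells of a confusion matrix. The sketch below uses made-up labels and predicted probabilities and an arbitrary threshold of 0.5. It takes enrichment to be precision divided by the base rate, which is an assumption about the intended definition.

```r
# Hypothetical labels and predicted probabilities (not the exam data)
actual <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)
prob   <- c(0.90, 0.70, 0.20, 0.80, 0.30, 0.10, 0.10, 0.05, 0.05, 0.02)

threshold <- 0.5             # arbitrary cut-off for illustration
pred <- prob > threshold     # predicted class

# Confusion matrix: rows are the truth, columns are the prediction
conf_mat <- table(truth = actual, prediction = pred)

TP <- sum(pred & actual)     # true positives
FP <- sum(pred & !actual)    # false positives
TN <- sum(!pred & !actual)   # true negatives
FN <- sum(!pred & actual)    # false negatives

accuracy    <- (TP + TN) / length(actual)
precision   <- TP / (TP + FP)
recall      <- TP / (TP + FN)
specificity <- TN / (TN + FP)
base_rate   <- mean(actual)          # share of positives in the data
enrichment  <- precision / base_rate # assumed definition: precision relative to base rate
```

An enrichment above 1 means that observations flagged by the model are positive more often than observations picked at random.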
Answer
# YOUR CODE HERE
Q2f
- Using the estimation result in Q2a:
  - Visualize the variation in recall and enrichment across different threshold levels.
Answer
# YOUR CODE HERE
Q2g
- Using the estimation result in Q2a:
  - Draw the receiver operating characteristic (ROC) curve.
  - Calculate the area under the curve (AUC).
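To make the AUC concrete: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it from ranks in base R on made-up data. It is not the pROC workflow, only an illustration of what the number measures.

```r
# Hypothetical labels (1 = positive) and predicted probabilities
actual <- c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0)
prob   <- c(0.90, 0.70, 0.20, 0.80, 0.30, 0.10, 0.10, 0.05, 0.05, 0.02)

n_pos <- sum(actual == 1)
n_neg <- sum(actual == 0)

# Rank-based (Mann-Whitney) form of the AUC; average ranks handle ties
r   <- rank(prob)
auc <- (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Direct pairwise check: share of (positive, negative) pairs ranked correctly
pos <- prob[actual == 1]
neg <- prob[actual == 0]
auc_pairs <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
```

Both auc and auc_pairs give the same value. An AUC of 0.5 corresponds to random ranking and 1 to perfect separation.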
Answer
# YOUR CODE HERE
Section 3. Regularized Regression
Q3a
Prepare the data for regularized regression.
Tasks 1-2
Using the dtrain and dtest data.frames, complete the following:
- Convert the logical response variable over_load to an integer vector:
  - Create y_train from over_load in dtrain
  - Create y_test from over_load in dtest
- Construct predictor matrices:
  - Create X_train using all continuous and dummy predictors in dtrain
  - Create X_test using all continuous and dummy predictors in dtest
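The general mechanics of these two steps can be sketched on a toy data frame. The names toy, y_toy, and X_toy and all values are made up for illustration. The sketch assumes the predictor matrix should exclude the intercept column, since glmnet fits its own intercept by default.

```r
# Toy data with a logical outcome, one continuous and one factor predictor
toy <- data.frame(
  over_load = c(TRUE, FALSE, FALSE, TRUE, FALSE, TRUE),
  temp      = c(0.62, 0.30, 0.45, 0.80, 0.25, 0.70),
  wkday     = factor(c("sunday", "monday", "tuesday",
                       "monday", "sunday", "tuesday"),
                     levels = c("sunday", "monday", "tuesday"))
)

# Logical TRUE/FALSE becomes integer 1/0
y_toy <- as.integer(toy$over_load)

# model.matrix() expands factors into dummy columns;
# [, -1] drops the "(Intercept)" column it adds
X_toy <- model.matrix(over_load ~ ., data = toy)[, -1]

colnames(X_toy)
```

The first factor level (here sunday) is the reference category, so it gets no dummy column of its own.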
Answer
# YOUR CODE HERE
Data for Regularization: y_train, y_test, X_train, and X_test
If you were not able to successfully prepare the data for regularized model estimation, you may use the training (y_train and X_train) and test (y_test and X_test) data provided in Section 0.
Q3b
- Fit a 7-fold cross-validated (CV) Lasso logistic regression, providing:
  - Beta estimates
  - Optimal lambdas
  - Plot for CV curve
  - Plot for coefficient path
  - Plot for the number of nonzero betas as a function of lambda
  - Classification performance, including accuracy, precision, recall, and AUC.
Answer
# YOUR CODE HERE
Q3c
- Fit a 7-fold cross-validated (CV) Ridge logistic regression, providing:
  - Beta estimates
  - Optimal lambdas
  - Plot for CV curve
  - Plot for coefficient path
  - Plot for the number of nonzero betas as a function of lambda
  - Classification performance, including accuracy, precision, recall, and AUC.
Answer
# YOUR CODE HERE