Midterm Exam

DANL 320-01: Big Data Analytics

Author

Byeong-Hak Choe

Published

March 11, 2026

Modified

March 11, 2026

Honor Pledges

I solemnly swear that I will not cheat or engage in any form of academic dishonesty during this exam.

I will not communicate with other students or use unauthorized materials.

I will uphold the integrity of this exam and demonstrate my own knowledge and abilities.

By taking this pledge, I acknowledge that academic dishonesty undermines the academic process and is a violation of the trust placed in me as a student.

I accept the consequences of any violation of this promise.

  • Student’s Name: [YOUR_NAME_HERE]


The web link for the exam questions is here


Below are the R packages used in this exam:

library(tidyverse)
library(broom)
library(stargazer)
library(skimr)

library(margins)
library(yardstick)
library(WVPlots)
library(pROC)

library(glmnet)
library(Matrix)

library(rmarkdown)
library(hrbrthemes)
library(ggthemes)

Bikeshare Data

bikeshare <- read_csv('https://bcdanl.github.io/data/bikeshare_cleaned.csv')

# Adding an `over_load` variable
bikeshare <- bikeshare |> 
  mutate(
    over_load = ifelse(cnt > 500,
                       TRUE, FALSE)  # TRUE is equivalent to 1, and FALSE is equivalent to 0.
  )


paged_table(bikeshare)

Variable description

Variable Description
cnt Count of total rental bikes
year Year
month Month
date Date
hr Hour
wkday Weekday
holiday Holiday indicator (1 if holiday, 0 otherwise)
seasons Season
weather_cond Weather condition
temp Temperature (measured in standard deviations from average)
hum Humidity (measured in standard deviations from average)
windspeed Wind speed (measured in standard deviations from average)
over_load 1 if cnt > 500; 0 otherwise


Section 0. Data Preparation

Prepare data for regressions.

Tasks 1-2

  1. Convert the following variables to factor type data with the following order:
  • year
  • seasons
  • month
  • hr
  • wkday
  • weather_cond
# year
c("2011", "2012")

# seasons
c("spring", "summer", "fall", "winter")

# month
c("01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12")

# hr
c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", 
  "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23")

# wkday
c("sunday", "monday", "tuesday", "wednesday", 
  "thursday", "friday", "saturday")

# weather_cond
c("Clear or Few Cloudy", 
  "Light Snow or Light Rain", 
  "Mist or Cloudy")
  2. Randomly split the bikeshare data.frame into training (dtrain) and test (dtest) data.frames.
  • 70% of observations in the bikeshare data.frame go to dtrain.
  • The remaining 30% of observations in the bikeshare data.frame go to dtest.
  • Ensure that you can replicate the random split.
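If it helps to see the shape of a solution, here is a minimal sketch of Tasks 1-2 (assuming the `bikeshare` data frame created above; the seed value 320 and the assumption that `month` is stored as zero-padded strings are mine, not the exam's):

```r
library(tidyverse)

# Task 1: convert the categorical variables to factors with the given level orders
bikeshare <- bikeshare |> 
  mutate(
    year         = factor(as.character(year), levels = c("2011", "2012")),
    seasons      = factor(seasons, levels = c("spring", "summer", "fall", "winter")),
    month        = factor(month,   levels = sprintf("%02d", 1:12)),  # assumes "01".."12"
    hr           = factor(as.character(hr), levels = as.character(0:23)),
    wkday        = factor(wkday, levels = c("sunday", "monday", "tuesday", "wednesday",
                                            "thursday", "friday", "saturday")),
    weather_cond = factor(weather_cond, levels = c("Clear or Few Cloudy",
                                                   "Light Snow or Light Rain",
                                                   "Mist or Cloudy"))
  )

# Task 2: reproducible 70/30 split
set.seed(320)                                   # any fixed seed makes the split replicable
idx    <- sample(seq_len(nrow(bikeshare)),
                 size = floor(0.7 * nrow(bikeshare)))
dtrain <- bikeshare[idx, ]
dtest  <- bikeshare[-idx, ]
```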

Answer

# YOUR CODE HERE


dtrain, dtest, and more

If you were not able to successfully prepare the data for machine learning model estimation, you may use the training (dtrain) and test (dtest) data frames provided below instead.

url <- "https://bcdanl.github.io/data/bikeshare.RDS"
dest <- file.path(tempdir(), "bikeshare.RDS")

download.file(url, destfile = dest, mode = "wb")
bikeshare_ML_ready <- readRDS(dest)

# For Sections 1-2
dtrain <- bikeshare_ML_ready$training
dtest <- bikeshare_ML_ready$test

# For Section 3
y_train <- bikeshare_ML_ready$outcome_train
y_test <- bikeshare_ML_ready$outcome_test
X_train <- bikeshare_ML_ready$input_train
X_test <- bikeshare_ML_ready$input_test

# To remove the following objects:
rm(bikeshare_ML_ready, url, dest)



Section 1. Linear Regression

Q1a

Use R to fit the following linear regression model:

\[ \begin{align} \text{over\_load}_{i} =\ &\beta_{\text{intercept}}\\ &+ \beta_{\text{temp}} \, \text{temp}_{i} + \beta_{\text{hum}} \, \text{hum}_{i} + \beta_{\text{windspeed}} \, \text{windspeed}_{i} \nonumber \\ &+ \beta_{\text{year\_2012}} \, \text{year\_2012}_{i}\\ &+ \beta_{\text{month\_2}} \, \text{month\_2}_{i} + \beta_{\text{month\_3}} \, \text{month\_3}_{i} + \beta_{\text{month\_4}} \, \text{month\_4}_{i} \nonumber \\ &+ \beta_{\text{month\_5}} \, \text{month\_5}_{i} + \beta_{\text{month\_6}} \, \text{month\_6}_{i} + \beta_{\text{month\_7}} \, \text{month\_7}_{i} \nonumber \\ &+ \beta_{\text{month\_8}} \, \text{month\_8}_{i} + \beta_{\text{month\_9}} \, \text{month\_9}_{i} + \beta_{\text{month\_10}} \, \text{month\_10}_{i} \nonumber \\ &+ \beta_{\text{month\_11}} \, \text{month\_11}_{i} + \beta_{\text{month\_12}} \, \text{month\_12}_{i} \nonumber \\ &+ \beta_{\text{hr\_1}} \, \text{hr\_1}_{i} + \beta_{\text{hr\_2}} \, \text{hr\_2}_{i} + \beta_{\text{hr\_3}} \, \text{hr\_3}_{i} + \beta_{\text{hr\_4}} \, \text{hr\_4}_{i} \nonumber \\ &+ \beta_{\text{hr\_5}} \, \text{hr\_5}_{i} + \beta_{\text{hr\_6}} \, \text{hr\_6}_{i} + \beta_{\text{hr\_7}} \, \text{hr\_7}_{i} + \beta_{\text{hr\_8}} \, \text{hr\_8}_{i} \nonumber \\ &+ \beta_{\text{hr\_9}} \, \text{hr\_9}_{i} + \beta_{\text{hr\_10}} \, \text{hr\_10}_{i} + \beta_{\text{hr\_11}} \, \text{hr\_11}_{i} + \beta_{\text{hr\_12}} \, \text{hr\_12}_{i} \nonumber \\ &+ \beta_{\text{hr\_13}} \, \text{hr\_13}_{i} + \beta_{\text{hr\_14}} \, \text{hr\_14}_{i} + \beta_{\text{hr\_15}} \, \text{hr\_15}_{i} + \beta_{\text{hr\_16}} \, \text{hr\_16}_{i} \nonumber \\ &+ \beta_{\text{hr\_17}} \, \text{hr\_17}_{i} + \beta_{\text{hr\_18}} \, \text{hr\_18}_{i} + \beta_{\text{hr\_19}} \, \text{hr\_19}_{i} + \beta_{\text{hr\_20}} \, \text{hr\_20}_{i} \nonumber \\ &+ \beta_{\text{hr\_21}} \, \text{hr\_21}_{i} + \beta_{\text{hr\_22}} \, \text{hr\_22}_{i} + \beta_{\text{hr\_23}} \, \text{hr\_23}_{i} \nonumber \\ &+ 
\beta_{\text{wkday\_monday}} \, \text{wkday\_monday}_{i} + \beta_{\text{wkday\_tuesday}} \, \text{wkday\_tuesday}_{i} \nonumber \\ &+ \beta_{\text{wkday\_wednesday}} \, \text{wkday\_wednesday}_{i} \nonumber \\ &+ \beta_{\text{wkday\_thursday}} \, \text{wkday\_thursday}_{i} + \beta_{\text{wkday\_friday}} \, \text{wkday\_friday}_{i} \nonumber \\ &+ \beta_{\text{wkday\_saturday}} \, \text{wkday\_saturday}_{i} \nonumber \\ &+ \beta_{\text{holiday\_1}} \, \text{holiday\_1}_{i} \nonumber \\ &+ \beta_{\text{seasons\_summer}} \, \text{seasons\_summer}_{i} + \beta_{\text{seasons\_fall}} \, \text{seasons\_fall}_{i} \nonumber \\ &+ \beta_{\text{seasons\_winter}} \, \text{seasons\_winter}_{i} \nonumber \\ &+ \beta_{\text{weather\_cond\_Light\_Snow\_or\_Light\_Rain}} \, \text{weather\_cond\_Light\_Snow\_or\_Light\_Rain}_{i}\nonumber \\ &+ \beta_{\text{weather\_cond\_Mist\_or\_Cloudy}} \, \text{weather\_cond\_Mist\_or\_Cloudy}_{i}\\ &+ \epsilon_{i} \end{align} \]
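Because each factor level in the equation corresponds to a dummy variable that R creates automatically, the whole model can be fit with a single `lm()` call. A sketch (assuming `dtrain` from Section 0; `over_load` is coerced to 0/1 so the fit is a linear probability model):

```r
# Linear probability model: lm() expands each factor into the dummies above
model_lm <- lm(as.numeric(over_load) ~ temp + hum + windspeed +
                 year + month + hr + wkday + holiday + seasons + weather_cond,
               data = dtrain)
```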

Answer

# YOUR CODE HERE


Q1b

  • Provide beta estimates from the model in Q1a.
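A sketch, assuming `model_lm` is the model fit in Q1a; `broom::tidy()` returns the beta estimates as a tidy data frame:

```r
library(broom)
tidy(model_lm)   # one row per term: estimate, std.error, statistic, p.value
```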

Answer

# YOUR CODE HERE


Q1c

  • Draw a coefficient plot for the hr variables.
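One possible approach (assuming `model_lm` from Q1a): filter the tidy coefficient table to the hr dummies, then plot point estimates with 95% intervals.

```r
library(broom)
library(tidyverse)

tidy(model_lm) |> 
  filter(str_detect(term, "^hr")) |>          # keep only the hr dummies
  ggplot(aes(x = estimate, y = term)) +
  geom_point() +
  geom_errorbarh(aes(xmin = estimate - 1.96 * std.error,
                     xmax = estimate + 1.96 * std.error)) +
  labs(title = "Coefficient plot: hr dummies",
       x = "Beta estimate", y = NULL)
```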

Answer

# YOUR CODE HERE


Q1d

  • Using the estimation result in Q1a:
    • Make a prediction on the outcome variable using the test data.frame.
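A sketch (assuming `model_lm` from Q1a and `dtest` from Section 0):

```r
library(tidyverse)

# Predicted probability of over_load from the linear probability model
dtest <- dtest |> 
  mutate(pred = predict(model_lm, newdata = dtest))
```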

Answer

# YOUR CODE HERE


Q1e

  • Draw a residual plot using the test data.frame.
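A sketch, assuming `dtest` already holds a `pred` column from Q1d (residual = actual outcome minus prediction):

```r
library(tidyverse)

dtest |> 
  mutate(resid = as.numeric(over_load) - pred) |> 
  ggplot(aes(x = pred, y = resid)) +
  geom_point(alpha = 0.2) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Prediction", y = "Residual")
```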

Answer

# YOUR CODE HERE




Section 2. Logistic Regression

Q2a

Use R to fit the following logistic regression model:

\[ \begin{align} \text{Prob}(\text{over\_load}_{i} = 1) =\ G(\;\; &\beta_{\text{intercept}}\\ &+ \beta_{\text{temp}} \, \text{temp}_{i} + \beta_{\text{hum}} \, \text{hum}_{i} + \beta_{\text{windspeed}} \, \text{windspeed}_{i} \nonumber \\ &+ \beta_{\text{year\_2012}} \, \text{year\_2012}_{i}\\ &+ \beta_{\text{month\_2}} \, \text{month\_2}_{i} + \beta_{\text{month\_3}} \, \text{month\_3}_{i} + \beta_{\text{month\_4}} \, \text{month\_4}_{i} \nonumber \\ &+ \beta_{\text{month\_5}} \, \text{month\_5}_{i} + \beta_{\text{month\_6}} \, \text{month\_6}_{i} + \beta_{\text{month\_7}} \, \text{month\_7}_{i} \nonumber \\ &+ \beta_{\text{month\_8}} \, \text{month\_8}_{i} + \beta_{\text{month\_9}} \, \text{month\_9}_{i} + \beta_{\text{month\_10}} \, \text{month\_10}_{i} \nonumber \\ &+ \beta_{\text{month\_11}} \, \text{month\_11}_{i} + \beta_{\text{month\_12}} \, \text{month\_12}_{i} \nonumber \\ &+ \beta_{\text{hr\_1}} \, \text{hr\_1}_{i} + \beta_{\text{hr\_2}} \, \text{hr\_2}_{i} + \beta_{\text{hr\_3}} \, \text{hr\_3}_{i} + \beta_{\text{hr\_4}} \, \text{hr\_4}_{i} \nonumber \\ &+ \beta_{\text{hr\_5}} \, \text{hr\_5}_{i} + \beta_{\text{hr\_6}} \, \text{hr\_6}_{i} + \beta_{\text{hr\_7}} \, \text{hr\_7}_{i} + \beta_{\text{hr\_8}} \, \text{hr\_8}_{i} \nonumber \\ &+ \beta_{\text{hr\_9}} \, \text{hr\_9}_{i} + \beta_{\text{hr\_10}} \, \text{hr\_10}_{i} + \beta_{\text{hr\_11}} \, \text{hr\_11}_{i} + \beta_{\text{hr\_12}} \, \text{hr\_12}_{i} \nonumber \\ &+ \beta_{\text{hr\_13}} \, \text{hr\_13}_{i} + \beta_{\text{hr\_14}} \, \text{hr\_14}_{i} + \beta_{\text{hr\_15}} \, \text{hr\_15}_{i} + \beta_{\text{hr\_16}} \, \text{hr\_16}_{i} \nonumber \\ &+ \beta_{\text{hr\_17}} \, \text{hr\_17}_{i} + \beta_{\text{hr\_18}} \, \text{hr\_18}_{i} + \beta_{\text{hr\_19}} \, \text{hr\_19}_{i} + \beta_{\text{hr\_20}} \, \text{hr\_20}_{i} \nonumber \\ &+ \beta_{\text{hr\_21}} \, \text{hr\_21}_{i} + \beta_{\text{hr\_22}} \, \text{hr\_22}_{i} + \beta_{\text{hr\_23}} \, \text{hr\_23}_{i} \nonumber \\ &+ 
\beta_{\text{wkday\_monday}} \, \text{wkday\_monday}_{i} + \beta_{\text{wkday\_tuesday}} \, \text{wkday\_tuesday}_{i} \nonumber \\ &+ \beta_{\text{wkday\_wednesday}} \, \text{wkday\_wednesday}_{i} \nonumber \\ &+ \beta_{\text{wkday\_thursday}} \, \text{wkday\_thursday}_{i} + \beta_{\text{wkday\_friday}} \, \text{wkday\_friday}_{i} \nonumber \\ &+ \beta_{\text{wkday\_saturday}} \, \text{wkday\_saturday}_{i} \nonumber \\ &+ \beta_{\text{holiday\_1}} \, \text{holiday\_1}_{i} \nonumber \\ &+ \beta_{\text{seasons\_summer}} \, \text{seasons\_summer}_{i} + \beta_{\text{seasons\_fall}} \, \text{seasons\_fall}_{i} \nonumber \\ &+ \beta_{\text{seasons\_winter}} \, \text{seasons\_winter}_{i} \nonumber \\ &+ \beta_{\text{weather\_cond\_Light\_Snow\_or\_Light\_Rain}} \, \text{weather\_cond\_Light\_Snow\_or\_Light\_Rain}_{i}\nonumber \\ &+ \beta_{\text{weather\_cond\_Mist\_or\_Cloudy}} \, \text{weather\_cond\_Mist\_or\_Cloudy}_{i} \;\;) \end{align} \]

where \(G(\,\cdot\,)\) is

\[ G(\,\cdot\,) = \frac{\exp(\,\cdot\,)}{1 + \exp(\,\cdot\,)}. \]

  • Provide a table that summarizes the estimation results, including beta estimates, AIC, and residual deviance.
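Since G(·) is the logistic function, this is a `glm()` fit with a binomial family. A sketch (assuming `dtrain` from Section 0):

```r
model_logit <- glm(over_load ~ temp + hum + windspeed +
                     year + month + hr + wkday + holiday + seasons + weather_cond,
                   family = binomial(link = "logit"),
                   data = dtrain)

summary(model_logit)   # beta estimates, AIC, and residual deviance in one table
```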

Answer

# YOUR CODE HERE


Q2b

  • Using the estimation result in Q2a:
    • Calculate the average marginal effect of hr and wkday dummies.
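A sketch using the margins package loaded above (assuming `model_logit` from Q2a; for factor variables, `margins()` reports the average discrete change for each dummy relative to the base level):

```r
library(margins)

# Average marginal effects of the hr and wkday dummies
summary(margins(model_logit, variables = c("hr", "wkday")))
```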

Answer

# YOUR CODE HERE


Q2c

  • Using the estimation result in Q2a:
    • Make a prediction on training data
    • Make a prediction on test data
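A sketch (assuming `model_logit` from Q2a; `type = "response"` returns predicted probabilities rather than log-odds):

```r
library(tidyverse)

dtrain <- dtrain |> 
  mutate(prob = predict(model_logit, newdata = dtrain, type = "response"))
dtest  <- dtest |> 
  mutate(prob = predict(model_logit, newdata = dtest,  type = "response"))
```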

Answer

# YOUR CODE HERE


Q2d

  • Using the estimation result in Q2a:
    • Visualize a double density plot of predictions from the training data.
      • coord_cartesian(ylim = c(0, 20)) can help identify the threshold.
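A sketch using the WVPlots package loaded above (assuming `dtrain` holds a `prob` column from Q2c):

```r
library(WVPlots)

DoubleDensityPlot(dtrain, xvar = "prob", truthVar = "over_load",
                  title = "Distribution of predicted probabilities by over_load") +
  coord_cartesian(ylim = c(0, 20))
```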

Answer

# YOUR CODE HERE


Q2e

  • Using the estimation result in Q2a, calculate the following:
    • Confusion matrix, using a threshold level that clearly separates the two densities in the double density plot
    • Accuracy
    • Precision
    • Recall
    • Specificity
    • Base rate
    • Enrichment
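A sketch of the metric calculations (assuming `dtest` holds a `prob` column from Q2c; the 0.5 threshold is a placeholder to be replaced with the cutoff suggested by the Q2d plot):

```r
threshold <- 0.5   # placeholder; use the cutoff from the double density plot
conf_mat  <- table(pred = dtest$prob > threshold, actual = dtest$over_load)
conf_mat

# Cell counts (assumes both TRUE and FALSE occur in predictions and actuals)
TP <- conf_mat["TRUE",  "TRUE"];  FP <- conf_mat["TRUE",  "FALSE"]
FN <- conf_mat["FALSE", "TRUE"];  TN <- conf_mat["FALSE", "FALSE"]

accuracy    <- (TP + TN) / sum(conf_mat)
precision   <- TP / (TP + FP)
recall      <- TP / (TP + FN)
specificity <- TN / (TN + FP)
base_rate   <- mean(dtest$over_load)      # share of positives in the test data
enrichment  <- precision / base_rate      # how much better than guessing at random
```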

Answer

# YOUR CODE HERE


Q2f

  • Using the estimation result in Q2a
    • Visualize the variation in recall and enrichment across different threshold levels.
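A sketch using `WVPlots::ThresholdPlot()` (assuming `dtest` holds a `prob` column from Q2c):

```r
library(WVPlots)

ThresholdPlot(dtest, xvar = "prob", truthVar = "over_load",
              title = "Recall and enrichment by threshold",
              metrics = c("recall", "enrichment"))
```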

Answer

# YOUR CODE HERE


Q2g

  • Using the estimation result in Q2a
    • Draw the receiver operating characteristic (ROC) curve.
    • Calculate the area under the curve (AUC).
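A sketch using the pROC package loaded above (assuming `dtest` holds a `prob` column from Q2c):

```r
library(pROC)

roc_obj <- roc(response = dtest$over_load, predictor = dtest$prob)
plot(roc_obj)    # ROC curve
auc(roc_obj)     # area under the curve
```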

Answer

# YOUR CODE HERE


Section 3. Regularized Regression

Q3a

Prepare the data for regularized regression.

Tasks 1–2

Using the dtrain and dtest data.frames, complete the following:

  • Convert the logical response variable over_load to an integer vector:
    1. Create y_train from over_load in dtrain
    2. Create y_test from over_load in dtest
  • Construct predictor matrices:
    1. Create X_train using all continuous and dummy predictors in dtrain
    2. Create X_test using all continuous and dummy predictors in dtest
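A sketch of one possible approach (assuming `dtrain` and `dtest` from Section 0; excluding `cnt` and `date` from the predictor matrices is my assumption, since `over_load` is defined from `cnt` and `date` is not among the model's predictors):

```r
library(Matrix)

# Logical response -> integer 0/1 vectors
y_train <- as.integer(dtrain$over_load)
y_test  <- as.integer(dtest$over_load)

# Sparse predictor matrices: factors expand to dummies; [, -1] drops the intercept
X_train <- sparse.model.matrix(over_load ~ . - cnt - date, data = dtrain)[, -1]
X_test  <- sparse.model.matrix(over_load ~ . - cnt - date, data = dtest)[, -1]
```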

Answer

# YOUR CODE HERE


Data for Regularization: y_train, y_test, X_train, and X_test

If you were not able to successfully prepare the data for regularized model estimation, you may use the training (y_train and X_train) and test (y_test and X_test) data provided in Section 0.


Q3b

  • Fit a 7-fold cross-validated (CV) Lasso logistic regression, providing
    • Beta estimates
    • Optimal lambdas
    • Plot for CV curve
    • Plot for coefficient path
    • Plot for the number of nonzero betas as a function of lambda
    • Classification performance, including accuracy, precision, recall, and AUC.
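A sketch covering each deliverable (assuming `X_train`, `y_train`, `X_test`, `y_test` from Q3a or Section 0; the seed and the 0.5 cutoff are placeholders of mine):

```r
library(glmnet)
library(pROC)

set.seed(320)                                  # for reproducible CV folds
cv_lasso <- cv.glmnet(X_train, y_train, family = "binomial",
                      alpha = 1, nfolds = 7)   # alpha = 1 is the Lasso

coef(cv_lasso, s = "lambda.min")               # beta estimates at the optimal lambda
c(cv_lasso$lambda.min, cv_lasso$lambda.1se)    # optimal lambdas

plot(cv_lasso)                                 # CV curve
plot(cv_lasso$glmnet.fit, xvar = "lambda")     # coefficient path
plot(log(cv_lasso$lambda), cv_lasso$nzero, type = "b",
     xlab = "log(lambda)", ylab = "Number of nonzero betas")

# Classification performance on the test set
prob_lasso <- predict(cv_lasso, newx = X_test,
                      s = "lambda.min", type = "response")[, 1]
pred_lasso <- as.integer(prob_lasso > 0.5)     # 0.5 is a placeholder cutoff
cm <- table(pred = pred_lasso, actual = y_test)
accuracy  <- mean(pred_lasso == y_test)
precision <- cm["1", "1"] / sum(cm["1", ])     # assumes both classes are predicted
recall    <- cm["1", "1"] / sum(cm[, "1"])
auc(y_test, prob_lasso)
```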

Answer

# YOUR CODE HERE


Q3c

  • Fit a 7-fold cross-validated (CV) Ridge logistic regression, providing
    • Beta estimates
    • Optimal lambdas
    • Plot for CV curve
    • Plot for coefficient path
    • Plot for the number of nonzero betas as a function of lambda
    • Classification performance, including accuracy, precision, recall, and AUC.
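The workflow mirrors the Lasso sketch above with `alpha = 0`; note that Ridge shrinks coefficients toward zero without setting any exactly to zero, so the nonzero-beta plot stays flat at the full predictor count. A sketch (same assumptions as the Lasso sketch):

```r
library(glmnet)

set.seed(320)
cv_ridge <- cv.glmnet(X_train, y_train, family = "binomial",
                      alpha = 0, nfolds = 7)   # alpha = 0 is Ridge

coef(cv_ridge, s = "lambda.min")               # beta estimates at the optimal lambda
c(cv_ridge$lambda.min, cv_ridge$lambda.1se)    # optimal lambdas

plot(cv_ridge)                                 # CV curve
plot(cv_ridge$glmnet.fit, xvar = "lambda")     # coefficient path
plot(log(cv_ridge$lambda), cv_ridge$nzero, type = "b",
     xlab = "log(lambda)", ylab = "Number of nonzero betas")
```

The classification-performance calculation is identical to the Lasso case: predict with `type = "response"` on `X_test`, threshold the probabilities, and compute accuracy, precision, recall, and AUC.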

Answer

# YOUR CODE HERE


Back to top