Midterm Exam

DANL 320-01: Big Data Analytics

Author

Byeong-Hak Choe

Published

March 11, 2026

Modified

March 11, 2026

Honor Pledges

I solemnly swear that I will not cheat or engage in any form of academic dishonesty during this exam.

I will not communicate with other students or use unauthorized materials.

I will uphold the integrity of this exam and demonstrate my own knowledge and abilities.

By taking this pledge, I acknowledge that academic dishonesty undermines the academic process and is a violation of the trust placed in me as a student.

I accept the consequences of any violation of this promise.

Student’s Name: [YOUR_NAME_HERE]

The web-link for the exam questions is here

Below are the R packages for this exam:

library(tidyverse)
library(broom)
library(stargazer)
library(skimr)

library(margins)
library(yardstick)
library(WVPlots)
library(pROC)

library(glmnet)
library(Matrix)

library(rmarkdown)
library(hrbrthemes)
library(ggthemes)

Bikeshare Data

bikeshare <- read_csv('https://bcdanl.github.io/data/bikeshare_cleaned.csv')

# Adding an `over_load` variable
bikeshare <- bikeshare |> 
  mutate(
    over_load = cnt > 500  # Logical flag: TRUE is equivalent to 1, and FALSE is equivalent to 0.
  )


paged_table(bikeshare)

Variable description

Variable	Description
`cnt`	Count of total rental bikes
`year`	Year
`month`	Month
`date`	Date
`hr`	Hour
`wkday`	Weekday
`holiday`	Holiday indicator (`1` if holiday, `0` otherwise)
`seasons`	Season
`weather_cond`	Weather condition
`temp`	Temperature (measured in standard deviations from average)
`hum`	Humidity (measured in standard deviations from average)
`windspeed`	Wind speed (measured in standard deviations from average)
`over_load`	1 if `cnt` > 500; 0 otherwise

Section 0. Data Preparation

Prepare data for regressions.

Tasks 1-2

Convert the following variables to factors, using the level orders given below:

year
seasons
month
hr
wkday
weather_cond

# year
c("2011", "2012")

# seasons
c("spring", "summer", "fall", "winter")

# month
c("01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12")

# hr
c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", 
  "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23")

# wkday
c("sunday", "monday", "tuesday", "wednesday", 
  "thursday", "friday", "saturday")

# weather_cond
c("Clear or Few Cloudy", 
  "Light Snow or Light Rain", 
  "Mist or Cloudy")

Randomly split the bikeshare data.frame into training (dtrain) and test (dtest) data.frames.

70% of observations in the bikeshare data.frame go to dtrain.
The remaining 30% of observations in the bikeshare data.frame go to dtest.
Ensure that you can replicate the random split.
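
As a minimal sketch of the two tasks above, the following uses a small made-up data.frame rather than the exam data. The `toy` object, its values, and the seed `1234` are illustrative assumptions. `factor()` with an explicit `levels` argument fixes the category order, and calling `set.seed()` before sampling makes the split reproducible. Sampling row indices is only one of several ways to split.

```r
# Toy data standing in for bikeshare (hypothetical values).
toy <- data.frame(
  seasons = c("fall", "spring", "winter", "summer", "spring",
              "fall", "summer", "winter", "spring", "fall"),
  cnt     = c(120, 640, 80, 510, 300, 720, 450, 60, 505, 20)
)

# Fix the level order explicitly; the first level becomes the reference category.
toy$seasons <- factor(toy$seasons,
                      levels = c("spring", "summer", "fall", "winter"))

# Reproducible 70/30 split: set a seed, then sample row indices.
set.seed(1234)                                  # any fixed integer works
n         <- nrow(toy)
train_idx <- sample(n, size = round(0.7 * n))   # 70% of the row indices
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]
```

Rerunning the block from `set.seed()` onward reproduces the same `toy_train` and `toy_test`.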

Answer

# YOUR CODE HERE

`dtrain`, `dtest`, and more

If you were not able to successfully prepare the data for machine learning model estimation, you may use the training (dtrain) and test (dtest) data frames provided below instead.

url <- "https://bcdanl.github.io/data/bikeshare.RDS"
dest <- file.path(tempdir(), "bikeshare.RDS")

download.file(url, destfile = dest, mode = "wb")
bikeshare_ML_ready <- readRDS(dest)

# For Sections 1-2
dtrain <- bikeshare_ML_ready$training
dtest <- bikeshare_ML_ready$test

# For Section 3
y_train <- bikeshare_ML_ready$outcome_train
y_test <- bikeshare_ML_ready$outcome_test
X_train <- bikeshare_ML_ready$input_train
X_test <- bikeshare_ML_ready$input_test

# To remove the following objects:
rm(bikeshare_ML_ready, url, dest)

Section 1. Linear Regression

Q1a

Use R to fit the following linear regression model:

\[ \begin{align} \text{over\_load}_{i} =\ &\beta_{\text{intercept}}\\ &+ \beta_{\text{temp}} \, \text{temp}_{i} + \beta_{\text{hum}} \, \text{hum}_{i} + \beta_{\text{windspeed}} \, \text{windspeed}_{i} \nonumber \\ &+ \beta_{\text{year\_2012}} \, \text{year\_2012}_{i}\\ &+ \beta_{\text{month\_2}} \, \text{month\_2}_{i} + \beta_{\text{month\_3}} \, \text{month\_3}_{i} + \beta_{\text{month\_4}} \, \text{month\_4}_{i} \nonumber \\ &+ \beta_{\text{month\_5}} \, \text{month\_5}_{i} + \beta_{\text{month\_6}} \, \text{month\_6}_{i} + \beta_{\text{month\_7}} \, \text{month\_7}_{i} \nonumber \\ &+ \beta_{\text{month\_8}} \, \text{month\_8}_{i} + \beta_{\text{month\_9}} \, \text{month\_9}_{i} + \beta_{\text{month\_10}} \, \text{month\_10}_{i} \nonumber \\ &+ \beta_{\text{month\_11}} \, \text{month\_11}_{i} + \beta_{\text{month\_12}} \, \text{month\_12}_{i} \nonumber \\ &+ \beta_{\text{hr\_1}} \, \text{hr\_1}_{i} + \beta_{\text{hr\_2}} \, \text{hr\_2}_{i} + \beta_{\text{hr\_3}} \, \text{hr\_3}_{i} + \beta_{\text{hr\_4}} \, \text{hr\_4}_{i} \nonumber \\ &+ \beta_{\text{hr\_5}} \, \text{hr\_5}_{i} + \beta_{\text{hr\_6}} \, \text{hr\_6}_{i} + \beta_{\text{hr\_7}} \, \text{hr\_7}_{i} + \beta_{\text{hr\_8}} \, \text{hr\_8}_{i} \nonumber \\ &+ \beta_{\text{hr\_9}} \, \text{hr\_9}_{i} + \beta_{\text{hr\_10}} \, \text{hr\_10}_{i} + \beta_{\text{hr\_11}} \, \text{hr\_11}_{i} + \beta_{\text{hr\_12}} \, \text{hr\_12}_{i} \nonumber \\ &+ \beta_{\text{hr\_13}} \, \text{hr\_13}_{i} + \beta_{\text{hr\_14}} \, \text{hr\_14}_{i} + \beta_{\text{hr\_15}} \, \text{hr\_15}_{i} + \beta_{\text{hr\_16}} \, \text{hr\_16}_{i} \nonumber \\ &+ \beta_{\text{hr\_17}} \, \text{hr\_17}_{i} + \beta_{\text{hr\_18}} \, \text{hr\_18}_{i} + \beta_{\text{hr\_19}} \, \text{hr\_19}_{i} + \beta_{\text{hr\_20}} \, \text{hr\_20}_{i} \nonumber \\ &+ \beta_{\text{hr\_21}} \, \text{hr\_21}_{i} + \beta_{\text{hr\_22}} \, \text{hr\_22}_{i} + \beta_{\text{hr\_23}} \, \text{hr\_23}_{i} \nonumber \\ &+ 
\beta_{\text{wkday\_monday}} \, \text{wkday\_monday}_{i} + \beta_{\text{wkday\_tuesday}} \, \text{wkday\_tuesday}_{i} \nonumber \\ &+ \beta_{\text{wkday\_wednesday}} \, \text{wkday\_wednesday}_{i} \nonumber \\ &+ \beta_{\text{wkday\_thursday}} \, \text{wkday\_thursday}_{i} + \beta_{\text{wkday\_friday}} \, \text{wkday\_friday}_{i} \nonumber \\ &+ \beta_{\text{wkday\_saturday}} \, \text{wkday\_saturday}_{i} \nonumber \\ &+ \beta_{\text{holiday\_1}} \, \text{holiday\_1}_{i} \nonumber \\ &+ \beta_{\text{seasons\_summer}} \, \text{seasons\_summer}_{i} + \beta_{\text{seasons\_fall}} \, \text{seasons\_fall}_{i} \nonumber \\ &+ \beta_{\text{seasons\_winter}} \, \text{seasons\_winter}_{i} \nonumber \\ &+ \beta_{\text{weather\_cond\_Light\_Snow\_or\_Light\_Rain}} \, \text{weather\_cond\_Light\_Snow\_or\_Light\_Rain}_{i}\nonumber \\ &+ \beta_{\text{weather\_cond\_Mist\_or\_Cloudy}} \, \text{weather\_cond\_Mist\_or\_Cloudy}_{i}\\ &+ \epsilon_{i} \end{align} \]

Answer

# YOUR CODE HERE

Q1b

Provide beta estimates from the model in Q1a.

Answer

# YOUR CODE HERE

Q1c

Draw a coefficient plot for hr variables.

Answer

# YOUR CODE HERE

Q1d

Using the estimation result in Q1a:
- Make a prediction on the outcome variable using the test data.frame.
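
The general pattern of fitting on training data and predicting on held-out data can be sketched on simulated data as follows. The variables `x`, `g`, and `y`, the seed, and the fixed 70/30 row split are hypothetical and unrelated to the bikeshare data. Note how `lm()` expands a factor into one dummy per non-reference level, which is what the dummy terms in the model equation correspond to.

```r
# Simulated data (hypothetical; not the exam data).
set.seed(1)
toy <- data.frame(
  x = rnorm(100),
  g = factor(sample(c("a", "b", "c"), 100, replace = TRUE),
             levels = c("a", "b", "c"))
)
toy$y <- 1 + 2 * toy$x + (toy$g == "c") + rnorm(100)

train <- toy[1:70, ]
test  <- toy[71:100, ]

# The factor g is expanded into dummies gb and gc; level "a" is the reference.
fit <- lm(y ~ x + g, data = train)
names(coef(fit))

# Predict on held-out rows, then compute residuals as actual minus predicted.
test$pred  <- predict(fit, newdata = test)
test$resid <- test$y - test$pred
```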

Answer

# YOUR CODE HERE

Q1e

Draw a residual plot using the test data.frame.

Answer

# YOUR CODE HERE

Section 2. Logistic Regression

Q2a

Use R to fit the following logistic regression model:

\[ \begin{align} \text{Prob}(\text{over\_load}_{i} = 1) =\ G(\;\; &\beta_{\text{intercept}}\\ &+ \beta_{\text{temp}} \, \text{temp}_{i} + \beta_{\text{hum}} \, \text{hum}_{i} + \beta_{\text{windspeed}} \, \text{windspeed}_{i} \nonumber \\ &+ \beta_{\text{year\_2012}} \, \text{year\_2012}_{i}\\ &+ \beta_{\text{month\_2}} \, \text{month\_2}_{i} + \beta_{\text{month\_3}} \, \text{month\_3}_{i} + \beta_{\text{month\_4}} \, \text{month\_4}_{i} \nonumber \\ &+ \beta_{\text{month\_5}} \, \text{month\_5}_{i} + \beta_{\text{month\_6}} \, \text{month\_6}_{i} + \beta_{\text{month\_7}} \, \text{month\_7}_{i} \nonumber \\ &+ \beta_{\text{month\_8}} \, \text{month\_8}_{i} + \beta_{\text{month\_9}} \, \text{month\_9}_{i} + \beta_{\text{month\_10}} \, \text{month\_10}_{i} \nonumber \\ &+ \beta_{\text{month\_11}} \, \text{month\_11}_{i} + \beta_{\text{month\_12}} \, \text{month\_12}_{i} \nonumber \\ &+ \beta_{\text{hr\_1}} \, \text{hr\_1}_{i} + \beta_{\text{hr\_2}} \, \text{hr\_2}_{i} + \beta_{\text{hr\_3}} \, \text{hr\_3}_{i} + \beta_{\text{hr\_4}} \, \text{hr\_4}_{i} \nonumber \\ &+ \beta_{\text{hr\_5}} \, \text{hr\_5}_{i} + \beta_{\text{hr\_6}} \, \text{hr\_6}_{i} + \beta_{\text{hr\_7}} \, \text{hr\_7}_{i} + \beta_{\text{hr\_8}} \, \text{hr\_8}_{i} \nonumber \\ &+ \beta_{\text{hr\_9}} \, \text{hr\_9}_{i} + \beta_{\text{hr\_10}} \, \text{hr\_10}_{i} + \beta_{\text{hr\_11}} \, \text{hr\_11}_{i} + \beta_{\text{hr\_12}} \, \text{hr\_12}_{i} \nonumber \\ &+ \beta_{\text{hr\_13}} \, \text{hr\_13}_{i} + \beta_{\text{hr\_14}} \, \text{hr\_14}_{i} + \beta_{\text{hr\_15}} \, \text{hr\_15}_{i} + \beta_{\text{hr\_16}} \, \text{hr\_16}_{i} \nonumber \\ &+ \beta_{\text{hr\_17}} \, \text{hr\_17}_{i} + \beta_{\text{hr\_18}} \, \text{hr\_18}_{i} + \beta_{\text{hr\_19}} \, \text{hr\_19}_{i} + \beta_{\text{hr\_20}} \, \text{hr\_20}_{i} \nonumber \\ &+ \beta_{\text{hr\_21}} \, \text{hr\_21}_{i} + \beta_{\text{hr\_22}} \, \text{hr\_22}_{i} + \beta_{\text{hr\_23}} \, \text{hr\_23}_{i} \nonumber \\ &+ 
\beta_{\text{wkday\_monday}} \, \text{wkday\_monday}_{i} + \beta_{\text{wkday\_tuesday}} \, \text{wkday\_tuesday}_{i} \nonumber \\ &+ \beta_{\text{wkday\_wednesday}} \, \text{wkday\_wednesday}_{i} \nonumber \\ &+ \beta_{\text{wkday\_thursday}} \, \text{wkday\_thursday}_{i} + \beta_{\text{wkday\_friday}} \, \text{wkday\_friday}_{i} \nonumber \\ &+ \beta_{\text{wkday\_saturday}} \, \text{wkday\_saturday}_{i} \nonumber \\ &+ \beta_{\text{holiday\_1}} \, \text{holiday\_1}_{i} \nonumber \\ &+ \beta_{\text{seasons\_summer}} \, \text{seasons\_summer}_{i} + \beta_{\text{seasons\_fall}} \, \text{seasons\_fall}_{i} \nonumber \\ &+ \beta_{\text{seasons\_winter}} \, \text{seasons\_winter}_{i} \nonumber \\ &+ \beta_{\text{weather\_cond\_Light\_Snow\_or\_Light\_Rain}} \, \text{weather\_cond\_Light\_Snow\_or\_Light\_Rain}_{i}\nonumber \\ &+ \beta_{\text{weather\_cond\_Mist\_or\_Cloudy}} \, \text{weather\_cond\_Mist\_or\_Cloudy}_{i} \;\;) \end{align} \]

where \(G(\,\cdot\,)\) is

\[ G(\,\cdot\,) = \frac{\exp(\,\cdot\,)}{1 + \exp(\,\cdot\,)}. \]
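
For reference, the function \(G\) defined above can be written directly in R. This small sketch (the name `G` and the input vector `z` are chosen here for illustration) also checks it against base R's built-in `plogis()`, which computes the same logistic function.

```r
# Logistic function exactly as defined above.
G <- function(z) exp(z) / (1 + exp(z))

z <- c(-2, 0, 2)   # hypothetical values of the linear index
G(z)               # maps any real number into (0, 1); G(0) is 0.5
plogis(z)          # base R's built-in logistic CDF gives the same values
```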

Provide a table that summarizes the estimation result, including beta estimates, AIC, and residual deviance.

Answer

# YOUR CODE HERE

Q2b

Using the estimation result in Q2a:
- Calculate the average marginal effect of hr and wkday dummies.

Answer

# YOUR CODE HERE

Q2c

Using the estimation result in Q2a:
- Make a prediction on training data
- Make a prediction on test data

Answer

# YOUR CODE HERE

Q2d

Using the estimation result in Q2a:
- Visualize a double density plot of predictions from the training data.
  - coord_cartesian(ylim = c(0, 20)) can help identify the threshold.

Answer

# YOUR CODE HERE

Q2e

Using the estimation result in Q2a, calculate the following:
- Confusion matrix with the threshold level that clearly separates the double density plots
- Accuracy
- Precision
- Recall
- Specificity
- Base rate
- Enrichment
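
The metrics listed above can be illustrated on a tiny hand-made example. The labels, probabilities, and the 0.5 threshold below are hypothetical, and enrichment is taken here to mean precision divided by the base rate, which is an assumption about the intended definition.

```r
# Hypothetical labels and predicted probabilities (not exam data).
actual    <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)
prob      <- c(0.9, 0.7, 0.2, 0.6, 0.1, 0.3, 0.05, 0.2, 0.4, 0.1)
threshold <- 0.5                      # illustrative cutoff
pred      <- prob > threshold

# Confusion matrix: rows are actual classes, columns are predicted classes.
conf <- table(actual = actual, predicted = pred)
conf

TP <- sum(actual & pred)              # true positives
FP <- sum(!actual & pred)             # false positives
FN <- sum(actual & !pred)             # false negatives
TN <- sum(!actual & !pred)            # true negatives

accuracy    <- (TP + TN) / length(actual)
precision   <- TP / (TP + FP)
recall      <- TP / (TP + FN)
specificity <- TN / (TN + FP)
base_rate   <- mean(actual)           # share of positives in the data
enrichment  <- precision / base_rate  # assumed definition: precision relative to base rate
```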

Answer

# YOUR CODE HERE

Q2f

Using the estimation result in Q2a:
- Visualize the variation in recall and enrichment across different threshold levels.

Answer

# YOUR CODE HERE

Q2g

Using the estimation result in Q2a:
- Draw the receiver operating characteristic (ROC) curve.
- Calculate the area under the curve (AUC).
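
To show what the ROC curve and AUC measure, here is a base-R sketch on hypothetical labels and probabilities. It computes the curve's coordinates by sweeping the threshold and obtains the AUC from the rank (Mann-Whitney) formula. This is a hand-rolled illustration, not necessarily the approach expected in the answer.

```r
# Hypothetical labels and predicted probabilities (not exam data).
actual <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)
prob   <- c(0.9, 0.7, 0.2, 0.6, 0.1, 0.3, 0.05, 0.2, 0.4, 0.1)

# ROC coordinates: sweep the threshold from high to low.
thresholds <- sort(unique(c(Inf, prob)), decreasing = TRUE)
tpr <- sapply(thresholds, function(t) mean(prob[actual]  >= t))  # recall
fpr <- sapply(thresholds, function(t) mean(prob[!actual] >= t))  # 1 - specificity

# AUC via the rank formula; average ranks handle tied probabilities.
r   <- rank(prob)
n1  <- sum(actual)    # number of positives
n0  <- sum(!actual)   # number of negatives
auc <- (sum(r[actual]) - n1 * (n1 + 1) / 2) / (n1 * n0)
auc
```

Calling `plot(fpr, tpr, type = "l")` would draw the curve from these coordinates. With real model output, `pROC::roc()` and `pROC::auc()` from the packages loaded above are one alternative.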

Answer

# YOUR CODE HERE

Section 3. Regularized Regression

Q3a

Prepare the data for regularized regression.

Tasks 1–2

Using the dtrain and dtest data.frames, complete the following:

Convert the logical response variable over_load to an integer vector:
1. Create y_train from over_load in dtrain
2. Create y_test from over_load in dtest
Construct predictor matrices:
1. Create X_train using all continuous and dummy predictors in dtrain
2. Create X_test using all continuous and dummy predictors in dtest
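
One possible way to build such objects is sketched below on a toy data.frame (hypothetical columns and values). It uses base R's `model.matrix()`, which expands factors into dummy columns, and drops the intercept column because the regularized-regression fitter is assumed to add its own.

```r
# Toy data (hypothetical; not the exam data).
toy <- data.frame(
  over_load = c(TRUE, FALSE, FALSE, TRUE, FALSE, TRUE),
  temp      = c(0.5, -1.2, 0.1, 1.3, -0.4, 0.8),
  seasons   = factor(c("summer", "winter", "spring", "fall", "spring", "summer"),
                     levels = c("spring", "summer", "fall", "winter"))
)

# Logical response to integer: TRUE becomes 1L, FALSE becomes 0L.
y <- as.integer(toy$over_load)

# Predictor matrix: continuous columns stay as-is, factors become dummies.
# [, -1] drops the "(Intercept)" column.
X <- model.matrix(over_load ~ ., data = toy)[, -1]
colnames(X)
dim(X)
```

For large data, `Matrix::sparse.model.matrix()` offers a sparse analogue of the same step.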

Answer

# YOUR CODE HERE

Data for Regularization: `y_train`, `y_test`, `X_train`, and `X_test`

If you were not able to successfully prepare the data for regularized model estimation, you may use the training (y_train and X_train) and test (y_test and X_test) data provided in Section 0.

Q3b

Fit a 7-fold cross-validated (CV) Lasso logistic regression, providing
- Beta estimates
- Optimal lambdas
- Plot for CV curve
- Plot for coefficient path
- Plot for the number of nonzero betas as a function of lambda
- Classification performance, including accuracy, precision, recall, and AUC.
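
A minimal sketch of the cross-validation workflow with `glmnet`, run on simulated data rather than the exam data, might look like the following. The matrix `X`, the outcome `y`, the seed, and the 0.5 cutoff are all illustrative assumptions.

```r
library(glmnet)

# Simulated predictors and binary outcome (hypothetical; not the exam data).
set.seed(2026)
n <- 200
X <- matrix(rnorm(n * 5), nrow = n, ncol = 5,
            dimnames = list(NULL, paste0("x", 1:5)))
y <- rbinom(n, size = 1, prob = plogis(X[, "x1"] - X[, "x2"]))

# 7-fold cross-validated Lasso logistic regression (alpha = 1 selects the Lasso penalty).
cv_fit <- cv.glmnet(X, y, family = "binomial", alpha = 1, nfolds = 7)

cv_fit$lambda.min                 # lambda minimizing the CV error
cv_fit$lambda.1se                 # largest lambda within one SE of the minimum
coef(cv_fit, s = "lambda.min")    # beta estimates at lambda.min

# Predicted probabilities and a simple in-sample accuracy check.
prob <- predict(cv_fit, newx = X, s = "lambda.min", type = "response")
pred <- as.integer(prob > 0.5)    # illustrative cutoff
mean(pred == y)
```

In this setup, `plot(cv_fit)` draws the CV curve, `plot(cv_fit$glmnet.fit, xvar = "lambda")` draws the coefficient path, and `cv_fit$nzero` holds the number of nonzero coefficients at each lambda. Setting `alpha = 0` instead gives the Ridge penalty.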

Answer

# YOUR CODE HERE

Q3c

Fit a 7-fold cross-validated (CV) Ridge logistic regression, providing
- Beta estimates
- Optimal lambdas
- Plot for CV curve
- Plot for coefficient path
- Plot for the number of nonzero betas as a function of lambda
- Classification performance, including accuracy, precision, recall, and AUC.

Answer

# YOUR CODE HERE
