K-fold Cross-Validation; Regularized Regression
March 26, 2025
Partitioning Data for 3-fold Cross-Validation
Training Data:
The portion of data used to fit the model.
Within this set, k‑fold cross-validation is applied.
Validation Data:
Temporary splits within the training set during cross-validation, used to tune hyperparameters and assess performance.
Test Data:
A held-out dataset that is never used during model tuning.
Provides an unbiased evaluation of the final model’s performance.
Workflow: split the full data into training and test sets, run k-fold cross-validation within the training set to tune hyperparameters (e.g., the regularization strength), and evaluate the final model once on the held-out test set.
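As a concrete illustration of this workflow, here is a minimal sketch of 3-fold cross-validation inside a training set, using scikit-learn's KFold on a small synthetic dataset (the data and the linear model here are assumptions made only for illustration):
import numpy as np
from sklearn.model_selection import train_test_split, KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Synthetic data for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)
# Hold out a test set that is never touched during tuning
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# 3-fold cross-validation within the training set
kf = KFold(n_splits=3, shuffle=True, random_state=0)
cv_mse = []
for train_idx, val_idx in kf.split(X_train):
    model = LinearRegression().fit(X_train[train_idx], y_train[train_idx])
    pred = model.predict(X_train[val_idx])
    cv_mse.append(mean_squared_error(y_train[val_idx], pred))
print("CV MSE per fold:", cv_mse)
The test set is used only once, after all tuning is finished, to report an unbiased performance estimate.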
Regularized regression can resolve problems such as overfitting and unstable coefficient estimates when predictors are (nearly) collinear.
These situations usually happen when the model is too complex (e.g., it has large or many beta coefficients).
We will discuss three regularized regression methods: lasso regression, ridge regression, and elastic net.
Linear regression chooses the coefficients \(b_0, b_1, \ldots, b_p\) so that \[ f(x_i) = b_0 + b_1 x_{1,i} + b_2 x_{2,i} + \,\cdots\, + b_p x_{p,i} \] is as close as possible to \(y_i\) for all the training data by minimizing the sum of squared errors (SSE) between \(y\) and \(f(x)\) over observations \(i = 1, \cdots, N\), where the SSE is
\[ (y_1 - f(x_1))^2 + (y_2 - f(x_2))^2 + \,\cdots\, + (y_N - f(x_N))^2 \]
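As a quick numeric check of this objective, the following sketch (on synthetic data, an assumption made only for illustration) fits ordinary least squares and computes the SSE explicitly:
import numpy as np
from sklearn.linear_model import LinearRegression
# Synthetic data for illustration only
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = 2.0 + 1.5 * X[:, 0] - 0.7 * X[:, 1] + rng.normal(size=50)
ols = LinearRegression().fit(X, y)
f_x = ols.predict(X)
# SSE = (y_1 - f(x_1))^2 + ... + (y_N - f(x_N))^2
sse = np.sum((y - f_x) ** 2)
print("SSE:", sse)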
Lasso regression chooses the coefficients so that \[ f(x_i) = b_0 + b_1 x_{1,i} + b_2 x_{2,i} + \,\cdots\, + b_p x_{p,i} \] is as close as possible to \(y_i\) for all the training data by minimizing the sum of squared errors (SSE) plus the sum of the absolute values of the beta parameters multiplied by the alpha parameter:
\[ \begin{align} &(y_1 - f(x_1))^2 + (y_2 - f(x_2))^2 + \,\cdots\, + (y_N - f(x_N))^2 \\ &+ \alpha \times(| \beta_1 | + |\beta_2 | + \,\cdots\, + |\beta_{p}|) \end{align} \]
- When \(\alpha = 0\), this reduces to regular linear regression.
When variables are nearly collinear, lasso regression tends to drive one or more of them to zero.
In a regression of \(\text{income}\), for example, lasso might give zero weight to one of the two nearly collinear variables \(\texttt{age}\) and \(\texttt{years_of_workforce}\).
For this reason, lasso regression is often used as a form of model/variable selection.
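A minimal sketch of this behavior, using synthetic age and years_of_workforce variables that are nearly collinear (the data-generating process and variable names are assumptions made only for illustration):
import numpy as np
from sklearn.linear_model import LassoCV
# Two nearly collinear predictors: years_of_workforce is roughly age - 22 plus noise
rng = np.random.default_rng(2)
age = rng.uniform(22, 65, size=500)
years_of_workforce = age - 22 + rng.normal(scale=0.5, size=500)
income = 20000 + 1000 * age + rng.normal(scale=5000, size=500)
X = np.column_stack([age, years_of_workforce])
lasso = LassoCV(cv=5).fit(X, income)
# Lasso often drives one of the two collinear coefficients to (or near) zero
print("Coefficients (age, years_of_workforce):", lasso.coef_)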
Ridge regression chooses the coefficients so that \[ f(x_i) = b_0 + b_1 x_{1,i} + b_2 x_{2,i} + \,\cdots\, + b_p x_{p,i} \] is as close as possible to \(y_i\) for all the training data by minimizing the sum of squared errors (SSE) plus the sum of the squared beta parameters multiplied by the alpha parameter:
\[ \begin{align} &(y_1 - f(x_1))^2 + (y_2 - f(x_2))^2 + \,\cdots\, + (y_N - f(x_N))^2 \\ &+ \alpha \times (\beta_1^2 + \beta_2^2 + \,\cdots\, + \beta_{p}^2) \end{align} \]
- When \(\alpha = 0\), this reduces to regular linear regression.
Elastic net chooses the coefficients so that \[ f(x_i) = b_0 + b_1 x_{1,i} + b_2 x_{2,i} + \,\cdots\, + b_p x_{p,i} \] is as close as possible to \(y_i\) for all the training data by minimizing the sum of squared errors (SSE) plus a linear combination of the lasso and ridge penalties, weighted by the \(\lambda\) parameter:
\[ \begin{align} &(y_1 - f(x_1))^2 + (y_2 - f(x_2))^2 + \,\cdots\, + (y_N - f(x_N))^2 \\ &+ \lambda \times(| \beta_1 | + |\beta_2 | + \,\cdots\, + |\beta_{p}|)\\ &+ (1-\lambda) \times(\beta_1^2 + \beta_2^2 + \,\cdots\, + \beta_{p}^2) \end{align} \]
where \(0 \leq \lambda \leq 1\).
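For reference, scikit-learn's linear ElasticNet uses a slightly different parameterization: a single alpha scales the overall penalty, and l1_ratio plays the role of \(\lambda\) above (1 = lasso, 0 = ridge). A minimal sketch on synthetic data (the data and parameter values are assumptions made only for illustration):
import numpy as np
from sklearn.linear_model import ElasticNet
# Synthetic data for illustration only
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(size=200)
# alpha scales the overall penalty; l1_ratio mixes the lasso (1) and ridge (0) penalties
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print("Elastic net coefficients:", enet.coef_)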
Regularized Logistic Regression with scikit-learn
Here are the required libraries and modules for logistic regression with regularization:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import (confusion_matrix, accuracy_score, precision_score,
recall_score, roc_curve, roc_auc_score)
The data are split with train_test_split(), and categorical predictors are one-hot encoded with pd.get_dummies():
cars = pd.read_csv('https://bcdanl.github.io/data/car-data.csv')
# Split into train and test (70% train, 30% test)
# Using a fixed random state for reproducibility (seed = 24351)
cars_train, cars_test = train_test_split(cars, test_size=0.3, random_state=24351)
# Define predictors: all columns except "rating" and "fail"
predictors = [col for col in cars.columns if col not in ['rating', 'fail']]
# One-hot encode categorical predictors.
X_train = pd.get_dummies(cars_train[predictors])
X_test = pd.get_dummies(cars_test[predictors])
# Ensure that the test set has the same dummy columns as the training set;
# fill any dummy columns missing from the test set with 0
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
# Outcome variable
y_train = cars_train['fail'].astype(int)
y_test = cars_test['fail'].astype(int)
LogisticRegressionCV(penalty='l1', solver='saga')
lasso_cv = LogisticRegressionCV(
    Cs=100, cv=5, penalty='l1', solver='saga', max_iter=1000, scoring='neg_log_loss'
)
lasso_cv.fit(X_train, y_train)
- Cs=100: Tries 100 different values of the regularization strength (1/C)
- cv=5: Uses 5-fold cross-validation
- penalty='l1': Applies L1 regularization (Lasso)
- solver='saga': Supports L1 regularization
- max_iter=1000: Sets the maximum number of iterations for the solver to converge
- scoring='neg_log_loss': Uses deviance (negative log-loss) as the CV performance metric
- .fit(X_train, y_train): Fits the model and selects the best \(\alpha = \frac{1}{C}\)
# Note: solver='saga' supports L1 regularization.
lasso_cv = LogisticRegressionCV(
Cs=100, cv=5, penalty='l1', solver='saga', max_iter=1000, scoring='neg_log_loss'
)
lasso_cv.fit(X_train, y_train)
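For parity with the ridge example below, the selected regularization strength can also be inspected (a small addition using the C_ attribute that LogisticRegressionCV exposes):
print("Lasso Regression - Best C (inverse of regularization strength):", lasso_cv.C_[0])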
intercept = float(lasso_cv.intercept_[0])
coef_lasso = pd.DataFrame({
'predictor': list(X_train.columns),
'coefficient': list(lasso_cv.coef_[0])
})
print("Lasso Regression Coefficients:")
print(coef_lasso)
# Force an order for the y-axis (using the feature names as they appear in coef_lasso)
order = coef_lasso['predictor'].tolist()
plt.figure(figsize=(8,6))
ax = sns.pointplot(x="coefficient", y="predictor", data=coef_lasso, order=order, join=False)
plt.title("Coefficients of Lasso Logistic Regression Model")
plt.xlabel("Coefficient value")
plt.ylabel("Predictor")
# Draw horizontal lines from 0 to each coefficient.
for _, row in coef_lasso.iterrows():
    # Get the y-axis position from the order list.
    y_pos = order.index(row['predictor'])
    plt.hlines(y=y_pos, xmin=0, xmax=row['coefficient'], color='gray', linestyle='--')
# Draw a vertical line at 0.
plt.axvline(0, color='black', linestyle='--')
plt.show()
# Prediction and evaluation for lasso model
y_pred_prob_lasso = lasso_cv.predict_proba(X_test)[:, 1]
y_pred_lasso = (y_pred_prob_lasso > 0.5).astype(int)
ctab_lasso = confusion_matrix(y_test, y_pred_lasso)
accuracy_lasso = accuracy_score(y_test, y_pred_lasso)
precision_lasso = precision_score(y_test, y_pred_lasso)
recall_lasso = recall_score(y_test, y_pred_lasso)
auc_lasso = roc_auc_score(y_test, y_pred_prob_lasso)
print("Confusion Matrix (Lasso):\n", ctab_lasso)
print("Lasso Accuracy:", accuracy_lasso)
print("Lasso Precision:", precision_lasso)
print("Lasso Recall:", recall_lasso)
# Plot ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob_lasso)
plt.figure(figsize=(8,6))
plt.plot(fpr, tpr, label=f'Lasso (AUC = {auc_lasso:.2f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random Guess')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve for Lasso Logistic Regression Model')
plt.legend(loc='best')
plt.show()
LogisticRegressionCV(penalty='l2')
ridge_cv = LogisticRegressionCV(
    Cs=100, cv=5, penalty='l2', solver='lbfgs', max_iter=1000, scoring='neg_log_loss'
)
ridge_cv.fit(X_train, y_train)
- penalty='l2': Applies L2 regularization (Ridge)
- solver='lbfgs': Supports L2 regularization (this is the default solver, so specifying it is optional)
# LogisticRegressionCV automatically selects the best regularization strength.
ridge_cv = LogisticRegressionCV(
Cs=100, cv=5, penalty='l2', solver='lbfgs', max_iter=1000, scoring='neg_log_loss'
)
ridge_cv.fit(X_train, y_train)
print("Ridge Regression - Best C (inverse of regularization strength):", ridge_cv.C_[0])
intercept = float(ridge_cv.intercept_[0])
coef_ridge = pd.DataFrame({
'predictor': list(X_train.columns),
'coefficient': list(ridge_cv.coef_[0])
})
print("Ridge Regression Coefficients:")
print(coef_ridge)
# Force an order for the y-axis (using the feature names as they appear in coef_ridge)
order = coef_ridge['predictor'].tolist()
plt.figure(figsize=(8,6))
ax = sns.pointplot(x="coefficient", y="predictor", data=coef_ridge, order=order, join=False)
plt.title("Coefficients of Ridge Logistic Regression Model")
plt.xlabel("Coefficient value")
plt.ylabel("Predictor")
# Draw horizontal lines from 0 to each coefficient.
for _, row in coef_ridge.iterrows():
    # Get the y-axis position from the order list.
    y_pos = order.index(row['predictor'])
    plt.hlines(y=y_pos, xmin=0, xmax=row['coefficient'], color='gray', linestyle='--')
# Draw a vertical line at 0.
plt.axvline(0, color='black', linestyle='--')
plt.show()
# Prediction and evaluation for ridge model
y_pred_prob_ridge = ridge_cv.predict_proba(X_test)[:, 1]
y_pred_ridge = (y_pred_prob_ridge > 0.5).astype(int)
ctab_ridge = confusion_matrix(y_test, y_pred_ridge)
accuracy_ridge = accuracy_score(y_test, y_pred_ridge)
precision_ridge = precision_score(y_test, y_pred_ridge)
recall_ridge = recall_score(y_test, y_pred_ridge)
auc_ridge = roc_auc_score(y_test, y_pred_prob_ridge)
print("Confusion Matrix (Ridge):\n", ctab_ridge)
print("Ridge Accuracy:", accuracy_ridge)
print("Ridge Precision:", precision_ridge)
print("Ridge Recall:", recall_ridge)
# Plot ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob_ridge)
plt.figure(figsize=(8,6))
plt.plot(fpr, tpr, label=f'Ridge (AUC = {auc_ridge:.2f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random Guess')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve for Ridge Logistic Regression Model')
plt.legend(loc='best')
plt.show()
LogisticRegressionCV(penalty='elasticnet', solver='saga')
enet_cv = LogisticRegressionCV(
Cs=10, cv=5, penalty='elasticnet', solver='saga',
l1_ratios=[0.5, 0.7, 0.9], max_iter=1000, scoring='neg_log_loss'
)
enet_cv.fit(X_train, y_train)
- Cs=10: Tries 10 different values of the regularization strength (1/C)
- cv=5: Uses 5-fold cross-validation
- penalty='elasticnet': Applies Elastic Net regularization
- l1_ratios=[0.5, 0.7, 0.9]: Candidate mixes between L1 and L2; in this model, we try 0.5, 0.7, and 0.9
  - l1_ratio = 1 → Lasso
  - l1_ratio = 0 → Ridge
# LogisticRegressionCV supports the elastic net penalty with solver 'saga'.
# l1_ratio specifies the mix between L1 and L2 (0 = ridge, 1 = lasso).
enet_cv = LogisticRegressionCV(
Cs=10, cv=5, penalty='elasticnet', solver='saga',
l1_ratios=[0.5, 0.7, 0.9], max_iter=1000, scoring='neg_log_loss'
)
enet_cv.fit(X_train, y_train)
print("Elastic Net Regression - Best C:", enet_cv.C_[0])
print("Elastic Net Regression - Best l1 ratio:", enet_cv.l1_ratio_[0])
intercept = float(enet_cv.intercept_[0])
coef_enet = pd.DataFrame({
'predictor': list(X_train.columns),
'coefficient': list(enet_cv.coef_[0])
})
print("Elastic Net Regression Coefficients:")
print(coef_enet)
# Force an order for the y-axis (using the feature names as they appear in coef_enet)
order = coef_enet['predictor'].tolist()
plt.figure(figsize=(8,6))
ax = sns.pointplot(x="coefficient", y="predictor", data=coef_enet, order=order, join=False)
plt.title("Coefficients of Elastic Net Logistic Regression Model")
plt.xlabel("Coefficient value")
plt.ylabel("Predictor")
# Draw horizontal lines from 0 to each coefficient.
for _, row in coef_enet.iterrows():
    # Get the y-axis position from the order list.
    y_pos = order.index(row['predictor'])
    plt.hlines(y=y_pos, xmin=0, xmax=row['coefficient'], color='gray', linestyle='--')
# Draw a vertical line at 0.
plt.axvline(0, color='black', linestyle='--')
plt.show()
# Prediction and evaluation for elastic net model
y_pred_prob_enet = enet_cv.predict_proba(X_test)[:, 1]
y_pred_enet = (y_pred_prob_enet > 0.5).astype(int)
ctab_enet = confusion_matrix(y_test, y_pred_enet)
accuracy_enet = accuracy_score(y_test, y_pred_enet)
precision_enet = precision_score(y_test, y_pred_enet)
recall_enet = recall_score(y_test, y_pred_enet)
print("Confusion Matrix (Elastic Net):\n", ctab_enet)
print("Elastic Net Accuracy:", accuracy_enet)
print("Elastic Net Precision:", precision_enet)
print("Elastic Net Recall:", recall_enet)
Lasso Linear Regression with scikit-learn
\[ \log(\text{spend}_{i}) = \beta_0 + \beta_1 X_{1,i} +\,\cdots\,+ \beta_{1000} X_{1000,i} + \epsilon_i \]
- \(\text{spend}_{i}\): household \(i\)'s amount of dollars spent on online shopping
- \(X_{p, i}\): household \(i\)'s percentage of visits to website \(p\)
from google.colab import data_table
data_table.enable_dataframe_formatter() # Enabling an interactive DataFrame display
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import scale # zero mean & one s.d.
from sklearn.linear_model import LassoCV, lasso_path
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
df = pd.read_csv("https://bcdanl.github.io/data/browser-online-shopping.zip")
X = df.drop('spend', axis = 1)
y = df['spend']
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train_np = X_train.values
X_test_np = X_test.values
y_train_np = y_train.values
y_test_np = y_test.values
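The step that fits lasso_cv is not shown in this excerpt, but the code below relies on it. A minimal sketch of that step, assuming 5-fold cross-validation and \(\log(\text{spend})\) as the target (as the comments further below indicate):
# Fit LassoCV on the training data with log(spend) as the outcome
lasso_cv = LassoCV(cv=5, max_iter=100000)
lasso_cv.fit(X_train, np.log(y_train))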
# Best alpha
print("LassoCV - Best alpha:", lasso_cv.alpha_)
# Create a DataFrame including the intercept and the coefficients:
coef_lasso = pd.DataFrame({
'predictor': list(X_train.columns),
'coefficient': list(lasso_cv.coef_),
'exp_coefficient': np.exp( list(lasso_cv.coef_) )
})
# Evaluate on the test set (the model predicts log(spend), so compare against np.log(y_test))
y_pred_lasso = lasso_cv.predict(X_test)
mse_lasso = mean_squared_error(np.log(y_test), y_pred_lasso)
print("LassoCV - MSE:", mse_lasso)
# Keep only the predictors with nonzero coefficients
coef_lasso = coef_lasso.query('coefficient != 0')
# Number of predictors selected by the lasso
coef_lasso.shape[0]
# Sort the selected predictors by coefficient size
coef_lasso.sort_values('coefficient', ascending = False)
# Compute the mean and standard deviation of the CV errors for each alpha.
mean_cv_errors = np.mean(lasso_cv.mse_path_, axis=1)
std_cv_errors = np.std(lasso_cv.mse_path_, axis=1)
plt.figure(figsize=(8, 6))
plt.errorbar(lasso_cv.alphas_, mean_cv_errors, yerr=std_cv_errors, marker='o', linestyle='-', capsize=5)
plt.xscale('log')
plt.xlabel('Alpha')
plt.ylabel('Mean CV Error (MSE)')
plt.title('Cross-Validation Error vs. Alpha')
plt.axvline(x=lasso_cv.alpha_, color='red', linestyle='--', label='Best alpha')
plt.legend()
plt.show()
# Compute the lasso path. Note: we use np.log(y_train) because that is the target used in LassoCV.
alphas, coefs, _ = lasso_path(X_train, np.log(y_train), alphas=lasso_cv.alphas_, max_iter=100000)
plt.figure(figsize=(8, 6))
# Iterate over each predictor and plot its coefficient path.
for i, col in enumerate(X_train.columns):
    plt.plot(alphas, coefs[i, :], label=col)
plt.xscale('log')
plt.xlabel('Alpha')
plt.ylabel('Coefficient value')
plt.title('Lasso Coefficient Paths')
plt.axvline(x=lasso_cv.alpha_, color='red', linestyle='--', label='Best alpha')
plt.show()
# Compute the coefficient path over the alpha grid that LassoCV used
alphas, coefs, _ = lasso_path(X_train, np.log(y_train),
alphas=lasso_cv.alphas_,
max_iter=100000)
# Count nonzero coefficients for each alpha (coefs shape: (n_features, n_alphas))
nonzero_counts = np.sum(coefs != 0, axis=0)
# Plot the number of nonzero coefficients versus alpha
plt.figure(figsize=(8,6))
plt.plot(alphas, nonzero_counts, marker='o', linestyle='-')
plt.xscale('log')
plt.xlabel('Alpha')
plt.ylabel('Number of nonzero coefficients')
plt.title('Nonzero Coefficients vs. Alpha')
#plt.gca().invert_xaxis() # Lower alphas (less regularization) on the right
plt.axvline(x=lasso_cv.alpha_, color='red', linestyle='--', label='Best alpha')
plt.legend()
plt.show()
The data comprise play-by-play NHL game data for regular-season and playoff games over the 11 seasons from 2002-2003 through 2013-2014.
There were p = 2,439 players involved in n = 69,449 goals.
The data contain information on seasons, home and away teams, team configuration (e.g., a 5-on-4 powerplay), which players are on and off the ice when a goal is scored, and so on.
Unfortunately, Python's scikit-learn is not optimized to handle all seasons' data at once, so here scikit-learn handles one season at a time. (In R, the glmnet package is commonly used for this kind of regression with cross-validation.)
The variables are:
- homegoal: an indicator (0 or 1) for the home team scoring
- player_name: entries for who was on the ice for each goal
- team: indicators for each team
- config: special teams info; e.g., S5v4 is a 5-on-4 powerplay
The regression model with config, team, and player_name is:
\[ \begin{align} \log\left(\frac{\text{Prob}(\text{home-goal}_{i})}{\text{Prob}(\text{away-goal}_{i})}\right) &= \beta_0 + \sum_{j=1}^{J} \beta_{\text{team}_{j}}\text{team}_{j, i} + \sum_{k=1}^{K} \beta_{\text{config}_{k}}\text{config}_{k, i} \\ &\qquad + \sum_{m=1}^{M} \beta_{\text{home-player}_{m}}\text{home-player}_{m, i}\\ &\qquad - \sum_{n=1}^{N} \beta_{\text{away-player}_{n}}\text{away-player}_{n, i} \end{align} \]
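A sketch of how one season of this model might be set up in scikit-learn, assuming a hypothetical DataFrame nhl_season with the columns described above (homegoal, team, config); the data source, column layout, and the +1/-1 player design matrix are assumptions here, not something given in the text:
import pandas as pd
from sklearn.linear_model import LogisticRegressionCV
# Hypothetical column names; a full implementation would also add
# +1/-1 indicators for the home/away players on the ice for each goal.
X_season = pd.get_dummies(nhl_season[['team', 'config']])
y_season = nhl_season['homegoal'].astype(int)
nhl_cv = LogisticRegressionCV(Cs=20, cv=5, penalty='l1', solver='saga',
                              max_iter=5000, scoring='neg_log_loss')
nhl_cv.fit(X_season, y_season)
print("NHL model - Best C:", nhl_cv.C_[0])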
Omitted variable bias (OVB): bias in the estimated coefficients that arises from omitting an important predictor that is correlated with the included predictor(s).
Let’s use an orange juice (OJ) example to demonstrate the OVB.
What happens if we omit an important predictor such as brand or ad_status?
OVB is the difference in beta estimates between short- and long-form regressions.
Short regression: The regression model with fewer predictors
\[ \begin{align} \log(\text{sales}_i) = \beta_0 + \beta_1\log(\text{price}_i) + \epsilon_i \end{align} \]
Long regression: The regression model with more predictors
\[ \begin{align} \log(\text{sales}_i) =& \beta_0 + \beta_{1}\log(\text{price}_i) \\ &+ \beta_{2}\text{minute.maid}_i + \beta_{3}\text{tropicana}_i + \epsilon_i \end{align} \]
- OVB for \(\beta_1\) is:
\[ \text{OVB} = \widehat{\beta_{1}^{short}} - \widehat{\beta_{1}^{long}} \]
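A minimal sketch of computing this OVB numerically, assuming a hypothetical OJ DataFrame oj with columns sales, price, and brand dummies minute_maid and tropicana (these names are assumptions made only for illustration):
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
# Short regression: log(sales) on log(price) only
short = LinearRegression().fit(np.log(oj[['price']]), np.log(oj['sales']))
# Long regression: log(sales) on log(price) and the brand dummies
X_long = pd.concat([np.log(oj[['price']]), oj[['minute_maid', 'tropicana']]], axis=1)
long = LinearRegression().fit(X_long, np.log(oj['sales']))
# OVB for the price coefficient is the difference between the two estimates
ovb = short.coef_[0] - long.coef_[0]
print("beta1 (short):", short.coef_[0], " beta1 (long):", long.coef_[0], " OVB:", ovb)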
Consider the following short- and long- regressions:
The error term in the short regression can be written as: \[ {\epsilon_{short}} = \beta_{2}X_2 + \epsilon_{long} \]
If variable \(X_1\) is correlated with \(X_2\), the exogeneity assumption (that the error term is uncorrelated with the predictors) is violated in the short regression model, because \(X_2\) is absorbed into \(\epsilon_{short}\).
In the first stage, regress \(\log(\text{price})\) on brand:
\[ \log(\text{price}) = \beta_0 + \beta_1\text{minute_maid} + \beta_2\text{tropicana} + \epsilon_{1st} \]
Then, calculate the residual: \[ \widehat{\epsilon_{1st}} = \log(\text{price}) - \widehat{\log(\text{price})} \]
The residual represents the log of OJ price after its correlation with brand has been removed!
In the second stage, regress \(\log(\text{sales})\) on residual \(\widehat{\epsilon_{1st}}\):
\[ \log(\text{sales}) = \beta_0 + \beta_1\widehat{\epsilon_{1st}} + \epsilon_{2nd} \]
Regression finds the coefficients on the part of each predictor that is independent from the other predictors.
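A sketch of this two-stage idea, continuing with the hypothetical oj DataFrame from the OVB sketch above; by the Frisch-Waugh-Lovell result, the second-stage coefficient on the residual should match the long-regression price coefficient:
import numpy as np
from sklearn.linear_model import LinearRegression
# First stage: regress log(price) on the brand dummies and keep the residual
stage1 = LinearRegression().fit(oj[['minute_maid', 'tropicana']], np.log(oj['price']))
resid_1st = np.log(oj['price']) - stage1.predict(oj[['minute_maid', 'tropicana']])
# Second stage: regress log(sales) on the first-stage residual
stage2 = LinearRegression().fit(resid_1st.to_frame('resid'), np.log(oj['sales']))
print("Second-stage coefficient on the residual:", stage2.coef_[0])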
What can we do to deal with OVB problems?