K-fold Cross-Validation; Bias-Variance Trade-off; Regularized Regression
February 25, 2026
Partitioning Data for 3-fold Cross-Validation
Training Data:
The portion of data used to fit the model.
Within this set, k‑fold cross-validation is applied.
Validation Data:
Temporary splits within the training set during cross-validation, used to tune hyperparameters and assess performance.
Test Data:
A held-out dataset that is never used during model tuning.
Provides an unbiased evaluation of the final model’s performance.
Workflow:
Split the data into training and test sets, apply k-fold cross-validation within the training set to tune hyperparameters, then evaluate the final model once on the held-out test set.
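This workflow can be sketched in Python with scikit-learn (a stand-in for the course's R tooling; the synthetic data, the candidate penalty grid, and the choice of a ridge model below are all illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(size=200)

# 1. Hold out a test set that is never touched during tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 2. 3-fold cross-validation within the training set to pick a hyperparameter
#    (here the ridge penalty, which sklearn calls `alpha`).
cv = KFold(n_splits=3, shuffle=True, random_state=0)
best_alpha, best_score = None, -np.inf
for alpha in [0.01, 0.1, 1.0, 10.0]:
    score = cross_val_score(Ridge(alpha=alpha), X_train, y_train, cv=cv).mean()
    if score > best_score:
        best_alpha, best_score = alpha, score

# 3. Refit on the full training set; evaluate exactly once on the test set.
final_model = Ridge(alpha=best_alpha).fit(X_train, y_train)
test_r2 = final_model.score(X_test, y_test)
```

The test set enters only in the very last line, which is what makes `test_r2` an unbiased estimate of generalization performance.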
For a prediction \(\hat{y}\), the expected test MSE can be decomposed as
\[ \mathbb{E}\big[(y - \hat{y})^{2}\big] = \text{(Bias)}^{2} + \text{(Variance)} + \text{(Irreducible Noise)} \]
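The decomposition can be checked numerically. The sketch below (a toy setup, not from the lecture) fits a deliberately too-simple straight line to data from a sine curve over many independent training sets, then compares the Monte Carlo MSE at one test point against bias² + variance + noise:

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * x)   # true regression function
sigma = 0.3                   # irreducible noise standard deviation
x0 = 1.0                      # fixed test point

preds = []
for _ in range(2000):                         # many independent training sets
    x = rng.uniform(-1, 1, size=30)
    y = f(x) + rng.normal(0, sigma, size=30)
    coef = np.polyfit(x, y, deg=1)            # underfit: a line, hence biased
    preds.append(np.polyval(coef, x0))
preds = np.array(preds)

bias2 = (preds.mean() - f(x0)) ** 2
variance = preds.var()
noise = rng.normal(0, sigma, size=preds.size)  # fresh test-time noise
mse = ((f(x0) + noise - preds) ** 2).mean()
# mse comes out close to bias2 + variance + sigma**2
```

Swapping `deg=1` for a higher-degree polynomial shifts the balance: bias shrinks while variance grows, with the noise term fixed.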
Warning
Regularized regression can mitigate overfitting and the unstable coefficient estimates caused by (nearly) collinear regressors.
These problems usually arise when the model is too complex (e.g., has many regressors or large beta coefficients).
We will discuss three regularized regression methods: ridge, lasso, and elastic net regression.
When variables are nearly collinear, lasso regression tends to drive one or more of them to zero.
In a regression predicting \(\text{income}\), lasso regression might estimate a zero coefficient for one of the two nearly collinear variables, \(\texttt{age}\) and \(\texttt{years_of_workforce}\).
For this reason, lasso regression is often used as a form of model/variable selection.
For given \(\alpha\) and \(\lambda\), elastic net regression finds the beta parameters \(\beta_0, \beta_1, \beta_2, \,\cdots\, \beta_{p}\) such that \[ f(x_i) = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \,\cdots\, + \beta_p x_{p,i} \] is as close as possible to \(y_i\) for all the training data, by minimizing the sum of the squared errors (SSE) plus a combination of the lasso and ridge penalties weighted by the \(\alpha\) parameter: \[ \begin{align} &(y_1 - f(x_1))^2 + (y_2 - f(x_2))^2 + \,\cdots\, + (y_N - f(x_N))^2 \\ &+ \alpha \times \lambda \times(| \beta_1 | + |\beta_2 | + \,\cdots\, + |\beta_{p}|)\\ &+ (1-\alpha)\times \lambda \times(\beta_1^2 + \beta_2^2 + \,\cdots\, + \beta_{p}^2) \end{align} \] where \(0 \leq \alpha \leq 1\). The hyperparameters \(\alpha\) and \(\lambda\) are chosen by cross-validation, not by the minimization itself.
When \(\alpha = 1\), this reduces to lasso regression.
When \(\alpha = 0\), this reduces to ridge regression.
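A quick sketch with scikit-learn's `ElasticNet` (standing in for R's `glmnet`; in sklearn the mixing weight is called `l1_ratio`, with `l1_ratio=1` giving the pure lasso penalty and `l1_ratio=0` the pure ridge penalty). The `income`/`age`/`years_of_workforce` data below are simulated assumptions echoing the earlier example:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(2)
n = 300
age = rng.normal(40, 10, size=n)
years_of_workforce = age - 22 + rng.normal(0, 0.5, size=n)  # nearly collinear
income = 2.0 * age + rng.normal(0, 5, size=n)
X = np.column_stack([age, years_of_workforce])

# l1_ratio=1 -> pure lasso penalty: expect one of the two nearly collinear
# coefficients to be driven to (essentially) zero.
lasso = ElasticNet(alpha=1.0, l1_ratio=1.0).fit(X, income)

# 0 < l1_ratio < 1 -> a genuine elastic net mix of both penalties.
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, income)
```

In both fits the coefficients sum to roughly the true combined effect of 2, but only the lasso concentrates that effect on a single variable, which is why it doubles as a variable-selection tool.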
R’s \(\texttt{glmnet}\) package supports sparse matrices.
A sparse matrix is a matrix in which most entries are zero.
Sparse storage is almost essential in big data analysis because of its lower storage cost and faster computation.
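A minimal illustration of the storage savings using SciPy (Python's sparse matrices play the same role as the ones \(\texttt{glmnet}\) consumes in R; the matrix here is made up):

```python
import numpy as np
from scipy import sparse

# A 1000 x 1000 matrix with only 100 nonzero entries.
dense = np.zeros((1000, 1000))
dense[::100, ::100] = 1.0

csr = sparse.csr_matrix(dense)  # compressed sparse row storage

dense_bytes = dense.nbytes  # 8,000,000 bytes: every zero is stored explicitly
sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes
# sparse storage keeps only the nonzeros plus a little index bookkeeping
```

Here the sparse representation is more than 100x smaller, and matrix-vector products skip the zeros entirely.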
Omitted variable bias (OVB): bias in the model because of omitting an important regressor that is correlated with existing regressor(s).
Let’s use an orange juice (OJ) example to demonstrate the OVB.
| Variable | Description |
|---|---|
| sales | Quantity of OJ cartons sold |
| price | Price of OJ |
| brand | Brand of OJ |
| ad | Advertisement status |
OVB is the difference in beta estimates between short- and long-form regressions.
Short regression: The regression model with fewer regressors
\[ \begin{align} \log(\text{sales}_i) = \beta_0 + \beta_1\log(\text{price}_i) + \epsilon_i \end{align} \]
Long regression: The regression model with more regressors
\[ \begin{align} \log(\text{sales}_i) =& \beta_0 + \beta_{1}\log(\text{price}_i) \\ &+ \beta_{2}\text{minute.maid}_i + \beta_{3}\text{tropicana}_i + \epsilon_i \end{align} \]
\[ \text{OVB} = \widehat{\beta_{1}^{short}} - \widehat{\beta_{1}^{long}} \]
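OVB can be demonstrated with a small simulation (hypothetical data standing in for the OJ example; the true coefficients and the 0.8 correlation coefficient below are assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)       # omitted regressor, correlated with x1
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

def ols(X, y):
    """OLS with an intercept; returns [intercept, slopes...]."""
    X = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0]

beta_short = ols(x1, y)[1]                        # omits x2 -> biased estimate
beta_long = ols(np.column_stack([x1, x2]), y)[1]  # includes x2
ovb = beta_short - beta_long
# ovb is approximately beta2 * cov(x1, x2) / var(x1) = 3 * 0.8 = 2.4
```

The short regression credits \(x_1\) with part of \(x_2\)'s effect, and the size of that bias follows the classic OVB formula: the omitted coefficient times the slope from regressing the omitted variable on the included one.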
Consider a generic pair of short and long regressions: \[ \text{short: } y = \beta_0 + \beta_1 X_1 + \epsilon_{short}, \qquad \text{long: } y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon_{long} \]
The error in the short form can then be written as: \[ {\epsilon_{short}} = \beta_{2}X_2 + \epsilon_{long} \]
If \(X_1\) is correlated with \(X_2\), the exogeneity assumption \(\mathbb{E}[\epsilon \mid X_1] = 0\) is violated in the short regression, because \(\epsilon_{short}\) contains \(\beta_2 X_2\), which is correlated with \(X_1\).
In the first stage, regress \(\log(\text{price})\) on the brand dummies to strip out the correlation between price and brand: \[ \log(\text{price}) = \beta_0 + \beta_1\text{minute_maid} + \beta_2\text{tropicana} + \epsilon_{1st} \]
Then, calculate the residual: \[ \widehat{\epsilon_{1st}} = \log(\text{price}) - \widehat{\log(\text{price})} \]
The residual represents the log of OJ price after its correlation with brand has been removed!
In the second stage, regress \(\log(\text{sales})\) on residual \(\widehat{\epsilon_{1st}}\):
\[ \log(\text{sales}) = \beta_0 + \beta_1\widehat{\epsilon_{1st}} + \epsilon_{2nd} \]
Regression finds the coefficient on the part of each predictor that is independent of the other predictors.
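This two-stage (Frisch-Waugh-Lovell) logic can be verified on simulated data (made-up coefficients, not the actual OJ dataset): the slope from regressing \(\log(\text{sales})\) on the first-stage residual matches the long regression's coefficient on \(\log(\text{price})\).

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000
brand = rng.integers(0, 3, size=n)  # 0 = dominicks, 1 = minute.maid, 2 = tropicana
D = np.column_stack([brand == 1, brand == 2]).astype(float)  # brand dummies
log_price = 0.5 + 0.3 * D[:, 0] + 0.6 * D[:, 1] + rng.normal(0, 0.1, size=n)
log_sales = (10.0 - 2.0 * log_price + 1.0 * D[:, 0] + 1.5 * D[:, 1]
             + rng.normal(0, 0.2, size=n))

def ols(X, y):
    """OLS with an intercept; returns [intercept, slopes...]."""
    X = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# First stage: strip the brand-related variation out of log(price).
resid = log_price - np.column_stack([np.ones(n), D]) @ ols(D, log_price)

# Second stage: regress log(sales) on the residual alone.
beta_fwl = ols(resid, log_sales)[1]

# Long regression for comparison: same coefficient on log(price).
beta_long = ols(np.column_stack([log_price, D]), log_sales)[1]
```

The two slopes agree to numerical precision, which is exactly the sense in which multiple regression isolates each predictor's independent contribution.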
What can we do to deal with OVB problems?