Omitted Variable Bias; Bad Controls
January 26, 2026
Omitted variable bias (OVB): bias in the coefficient estimates caused by omitting a regressor that both affects the outcome and is correlated with the included regressor(s).
Let’s use an orange juice (OJ) example to demonstrate the OVB.
| Variable | Description |
|---|---|
| sales | Quantity of OJ cartons sold |
| price | Price of OJ |
| brand | Brand of OJ |
| ad | Advertisement status |
OVB is the difference in beta estimates between short- and long-form regressions.
Consider the following short and long regressions. The short regression is the model with fewer regressors:
\[ \begin{align} \log(\text{sales}_i) = \beta_0 + \beta_1\log(\text{price}_i) + \epsilon_i \end{align} \]
The long regression adds the brand dummies:
\[ \begin{align} \log(\text{sales}_i) =& \beta_0 + \beta_{1}\log(\text{price}_i) \\ &+ \beta_{2}\text{minute.maid}_i + \beta_{3}\text{tropicana}_i + \epsilon_i \end{align} \]
The OVB is the difference between the two estimates of \(\beta_1\):
\[ \text{OVB} = \widehat{\beta_{1}^{short}} - \widehat{\beta_{1}^{long}} \]
The error term of the short regression absorbs the omitted regressor: \[ \epsilon_{short} = \beta_{2}X_2 + \epsilon_{long} \]
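We can check the short-vs-long comparison numerically. The sketch below simulates OJ-style data in which brand shifts both price and sales; all coefficients, the sample size, and the variable names are invented for illustration and are not estimates from real OJ data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Hypothetical data-generating process (coefficients made up for
# illustration): brand shifts both price and sales, so omitting brand
# biases the price coefficient.
brand = rng.integers(0, 3, n)              # 0 = base brand; 1, 2 = premium brands
mm = (brand == 1).astype(float)            # minute.maid dummy
trop = (brand == 2).astype(float)          # tropicana dummy
log_price = 0.5 + 0.3 * mm + 0.5 * trop + 0.2 * rng.standard_normal(n)
log_sales = 10 - 2.0 * log_price + 0.6 * mm + 1.0 * trop + rng.standard_normal(n)

def ols(X, y):
    """OLS coefficients with an intercept prepended."""
    X = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_short = ols(log_price, log_sales)                              # omits brand
b_long = ols(np.column_stack([log_price, mm, trop]), log_sales)  # includes brand

print("short beta1:", b_short[1])   # biased toward zero
print("long  beta1:", b_long[1])    # close to the true value -2
print("OVB:", b_short[1] - b_long[1])
```

Because the premium brands carry both higher prices and higher demand, the short regression attributes part of the brand premium to price, pulling the elasticity estimate toward zero.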
If the included regressor \(X_1\) is correlated with the omitted regressor \(X_2\), the exogeneity assumption \(E[\epsilon_{short} \mid X_1] = 0\) is violated in the short regression, so the OLS estimate of \(\beta_1\) is biased.
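Substituting the short-regression error into the OLS estimator gives the classic OVB expression (shown here for a single omitted regressor \(X_2\) with coefficient \(\beta_2\)):

\[ \widehat{\beta_{1}^{short}} \;\xrightarrow{p}\; \beta_1 + \beta_2\,\frac{\operatorname{Cov}(X_1, X_2)}{\operatorname{Var}(X_1)} \]

so the bias vanishes only when \(X_1\) and \(X_2\) are uncorrelated, or when \(\beta_2 = 0\) (i.e., the omitted variable does not matter for the outcome).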
To see the relationship between price and brand, first regress \(\log(\text{price})\) on the brand dummies: \[ \log(\text{price}) = \beta_0 + \beta_1\text{minute.maid} + \beta_2\text{tropicana} + \epsilon_{1st} \]
Then, calculate the residual: \[ \widehat{\epsilon_{1st}} = \log(\text{price}) - \widehat{\log(\text{price})} \]
The residual represents the log of OJ price after its correlation with brand has been removed!
In the second stage, regress \(\log(\text{sales})\) on residual \(\widehat{\epsilon_{1st}}\):
\[ \log(\text{sales}) = \beta_0 + \beta_1\widehat{\epsilon_{1st}} + \epsilon_{2nd} \]
Regression finds the coefficients on the part of each regressor that is independent from the other regressors.
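The two-stage procedure above (a version of the Frisch–Waugh–Lovell theorem) can be verified numerically. The sketch below re-simulates illustrative OJ-style data (all numbers invented), residualizes \(\log(\text{price})\) on brand, and shows that the stage-2 slope matches the price coefficient from the long regression.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Hypothetical simulated data in the spirit of the OJ example.
mm = rng.binomial(1, 0.3, n).astype(float)
trop = rng.binomial(1, 0.3, n) * (1.0 - mm)   # mutually exclusive brand dummies
log_price = 0.5 + 0.3 * mm + 0.5 * trop + 0.2 * rng.standard_normal(n)
log_sales = 10 - 2.0 * log_price + 0.6 * mm + 1.0 * trop + rng.standard_normal(n)

def ols(X, y):
    """OLS coefficients with an intercept prepended."""
    X = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Stage 1: regress log(price) on the brand dummies and keep the residual.
Z = np.column_stack([mm, trop])
g = ols(Z, log_price)
resid = log_price - np.column_stack([np.ones(n), Z]) @ g

# Stage 2: regress log(sales) on the residual alone.
b_two_stage = ols(resid, log_sales)

# Compare with the price coefficient from the long regression:
b_long = ols(np.column_stack([log_price, mm, trop]), log_sales)
print(b_two_stage[1], b_long[1])   # the two slopes agree
```

The agreement is exact (up to floating-point error), which is precisely what "the coefficient on the part of the regressor that is independent of the others" means.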
What can we do to deal with OVB? A natural response is to add more controls.
A common mistake in regression is thinking:
🚫 “If I add more controls, the model must get better.”
But some controls are bad controls — they can actually bias your estimate.
A bad control is a variable that is itself a consequence of \(x\): it sits on the causal path from \(x\) to \(y\).
Including a bad control can “control away” the variation in \(y\) that is connected to \(x\), making the estimated relationship look smaller than it really is.
A good control is a variable that:
✅ helps explain variation in the outcome \(y\)
✅ is related to the explanatory variable \(x\)
✅ is not a consequence of \(x\) (it exists “in the background”)
That’s the key difference.
Suppose we want to estimate:
Does education increase wages?
\[ wage_i = \beta_0 + \beta_1 education_i + u_i \]
Maybe add good controls like age, gender, region, or baseline ability.
But what if we control for occupation?
\[ wage_i = \beta_0 + \beta_1 education_i + \beta_2 occupation_i + u_i \]
Why is that bad?
Because education affects occupation:
\[ education \rightarrow occupation \rightarrow wage \]
So controlling for occupation:
🚫 blocks part of the education effect
→ we underestimate the impact of education
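The education example can be simulated directly. In the sketch below the data-generating process is entirely invented: education raises wages both directly and through occupation, so the total causal effect is \(1.0 + 2.0 \times 0.8 = 2.6\). Controlling for occupation leaves only the direct part.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000

# Hypothetical DGP (all numbers invented for illustration):
# education raises wages directly AND through occupation.
education = rng.normal(12, 2, n)
occupation = 0.8 * education + rng.standard_normal(n)  # occupation "quality" score
wage = 5 + 1.0 * education + 2.0 * occupation + rng.standard_normal(n)
# Total causal effect of education: 1.0 + 2.0 * 0.8 = 2.6

def ols(X, y):
    """OLS coefficients with an intercept prepended."""
    X = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_total = ols(education, wage)                               # no bad control
b_bad = ols(np.column_stack([education, occupation]), wage)  # controls for occupation

print("without occupation:", b_total[1])  # near 2.6, the full effect
print("with occupation:   ", b_bad[1])    # near 1.0, the direct effect only
```

Note that both regressions are "correct" statistically; they just answer different questions. Only the one without occupation estimates the total effect of education.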
We want to estimate:
Does fertilizer increase crop yield?
But fertilizer changes the plant’s growth:
\[ fertilizer \rightarrow height \]
And height is strongly related to yield:
\[ height \rightarrow yield \]
Also, some plots are simply better than others because of soil quality:
\[ \begin{align} &soil\ quality \rightarrow height\\ &soil\ quality \rightarrow yield \end{align} \]
So soil quality is a hidden factor that affects both height and yield.
But what if we control for plant height?
\[ yield_i = \beta_0 + \beta_1 fertilizer_i + \beta_2 height_i + u_i \]
Why is that bad?
Because fertilizer affects plant height:
\[ fertilizer \rightarrow height \rightarrow yield \]
So controlling for height:
🚫 blocks an important pathway through which fertilizer raises yield
→ we underestimate the impact of fertilizer
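The fertilizer example is even more dramatic in simulation, because height is both a mediator (fertilizer → height → yield) and a collider with soil quality (fertilizer → height ← soil). Conditioning on height blocks the causal path and links fertilizer to soil, so the estimate does not just shrink; in the invented data-generating process below it flips sign.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000

# Hypothetical DGP: fertilizer is randomized; it raises yield ONLY through
# height; soil quality affects both height and yield (all numbers invented).
fertilizer = rng.binomial(1, 0.5, n).astype(float)
soil = rng.standard_normal(n)
height = 1.0 * fertilizer + 1.0 * soil + 0.5 * rng.standard_normal(n)
crop_yield = 2.0 * height + 1.0 * soil + 0.5 * rng.standard_normal(n)
# Total causal effect of fertilizer on yield: 2.0 * 1.0 = 2.0

def ols(X, y):
    """OLS coefficients with an intercept prepended."""
    X = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_total = ols(fertilizer, crop_yield)                           # correct: ~2.0
b_bad = ols(np.column_stack([fertilizer, height]), crop_yield)  # bad control

print("without height:", b_total[1])  # near 2.0
print("with height:   ", b_bad[1])    # badly biased (here even negative)
```

Intuitively: among plots with the same height, the fertilized ones must have worse soil, and worse soil lowers yield, so fertilizer looks harmful.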
| Type | Example | Why it matters |
|---|---|---|
| ✅ Good control | Age, gender, region, baseline ability | Helps reduce omitted variable bias by holding constant pre-existing differences |
| 🚫 Bad control | Occupation (when studying education → wages) | Can remove part of the education effect by controlling for something education influences |
| 🚫 Bad control | Plant height (when studying fertilizer → yield) | Can remove part of the fertilizer effect by controlling for something fertilizer influences |
Bad controls can cause two problems:
First, overcontrol bias: you unintentionally remove the pathway
\[ x \rightarrow control \rightarrow y \]
So the estimated \(\hat{\beta}_1\) becomes too small.
Second, collider bias: sometimes the control is influenced by both \(x\) and \(y\):
\[ x \rightarrow control \leftarrow y \]
Then controlling for it creates a fake association between \(x\) and \(y\)
even if none existed before.
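Collider bias is easy to demonstrate with simulated data: below, \(x\) and \(y\) are generated completely independently, and a control variable caused by both is invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20_000

# x and y are completely independent by construction.
x = rng.standard_normal(n)
y = rng.standard_normal(n)
collider = x + y + 0.5 * rng.standard_normal(n)   # caused by BOTH x and y

def ols(X, target):
    """OLS coefficients with an intercept prepended."""
    X = np.column_stack([np.ones(len(target)), X])
    return np.linalg.lstsq(X, target, rcond=None)[0]

b_plain = ols(x, y)                             # slope near 0: no relation
b_bad = ols(np.column_stack([x, collider]), y)  # fake negative relation appears

print("without collider:", b_plain[1])
print("with collider:   ", b_bad[1])
```

The intuition: holding the collider fixed, a high \(x\) must be offset by a low \(y\), which manufactures a negative association out of nothing.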
Before adding a variable \(z\) as a control, ask: is \(z\) determined before \(x\), or is it a consequence of \(x\)? If \(z\) sits on the path from \(x\) to \(y\), or is influenced by both, leave it out.