Lecture 2

Omitted Variable Bias; Bad Controls

Byeong-Hak Choe

SUNY Geneseo

January 26, 2026

Omitted Variable Bias

Omitted Variable Bias

  • Omitted variable bias (OVB): bias in the estimated coefficients caused by omitting an important regressor that affects the outcome and is correlated with the included regressor(s).

  • Let’s use an orange juice (OJ) example to demonstrate the OVB.

    • OJ price elasticity estimates vary across models, depending on whether brand or ad status is taken into account.

| Variable | Description                 |
|----------|-----------------------------|
| sales    | Quantity of OJ cartons sold |
| price    | Price of OJ                 |
| brand    | Brand of OJ                 |
| ad       | Advertisement status        |

📏 Short- and Long-Form Regressions

  • OVB is the difference in beta estimates between short- and long-form regressions.

  • Short regression: The regression model with fewer regressors.

\[ \begin{align} \log(\text{sales}_i) = \beta_0 + \beta_1\log(\text{price}_i) + \epsilon_i \end{align} \]

  • Long regression: The regression model that adds additional regressor(s) to the short one.

\[ \begin{align} \log(\text{sales}_i) =& \beta_0 + \beta_{1}\log(\text{price}_i) \\ &+ \beta_{2}\text{minute.maid}_i + \beta_{3}\text{tropicana}_i + \epsilon_i \end{align} \]

  • OVB for \(\beta_1\) is:

\[ \text{OVB} = \widehat{\beta_{1}^{short}} - \widehat{\beta_{1}^{long}} \]

OVB formula

  • Consider the following short and long regressions:

    • Short: \(Y_i = \beta_0 + \beta_{1}^{short}X_1 + \epsilon_{short}\)
    • Long: \(Y_i = \beta_0 + \beta_{1}^{long}X_1 +\beta_{2}X_2 + \epsilon_{long}\)
  • Error in short form can be represented as: \[ {\epsilon_{short}} = \beta_{2}X_2 + \epsilon_{long} \]

  • If variable \(X_1\) is correlated with \(X_2\), then \(\epsilon_{short}\) is correlated with \(X_1\), so the short regression violates two assumptions of the classical linear model:

    • Errors are uncorrelated with the regressors.
    • Errors have a mean value of 0.
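The decomposition above implies the classic OVB identity: the gap between the short and long estimates equals \(\widehat{\beta_2}\) times the slope \(\widehat{\gamma}\) from an auxiliary regression of \(X_2\) on \(X_1\). A minimal numpy sketch on simulated data (all coefficient values are made up for illustration) verifies this:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Simulated data (invented coefficients): x2 affects y AND is correlated with x1
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

def ols(y, X):
    """Least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y))] + list(X))
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_short = ols(y, [x1])[1]            # short: y ~ x1
_, b_long, b2 = ols(y, [x1, x2])     # long:  y ~ x1 + x2
gamma = ols(x2, [x1])[1]             # auxiliary: x2 ~ x1

ovb = b_short - b_long
print(round(ovb, 3), round(b2 * gamma, 3))   # the two numbers agree
```

Note that the identity \(\widehat{\beta_1^{short}} - \widehat{\beta_1^{long}} = \widehat{\beta_2}\,\widehat{\gamma}\) holds exactly in the sample, not just on average.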

❓🔍 How does an OVB happen in regression?

  • In the first stage, consider the relationship between price and brand:

\[ \log(\text{price}) = \beta_0 + \beta_1\text{minute.maid} + \beta_2\text{tropicana} + \epsilon_{1st} \]

  • Then, calculate the residual: \[ \widehat{\epsilon_{1st}} = \log(\text{price}) - \widehat{\log(\text{price})} \]

  • The residual represents the log of OJ price after its correlation with brand has been removed!

  • In the second stage, regress \(\log(\text{sales})\) on residual \(\widehat{\epsilon_{1st}}\):

\[ \log(\text{sales}) = \beta_0 + \beta_1\widehat{\epsilon_{1st}} + \epsilon_{2nd} \]
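This two-stage procedure (the Frisch–Waugh–Lovell result) can be sketched with simulated OJ-style data; the brand effects and price levels below are invented, not estimates from real data:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30_000

# Hypothetical OJ data: brand shifts both price and sales (made-up effect sizes)
brand = rng.integers(0, 3, size=n)            # 0 = base brand, 1 = minute.maid, 2 = tropicana
mm = (brand == 1).astype(float)
trop = (brand == 2).astype(float)
log_price = 0.5 + 0.3 * mm + 0.6 * trop + 0.1 * rng.normal(size=n)
log_sales = 10.0 - 3.0 * log_price + 1.0 * mm + 1.5 * trop + rng.normal(size=n)

def ols(y, X):
    """Least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y))] + list(X))
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Stage 1: strip the brand-related variation out of log price
fit = ols(log_price, [mm, trop])
resid = log_price - (fit[0] + fit[1] * mm + fit[2] * trop)

# Stage 2: regress log sales on the residualized price
b_fwl = ols(log_sales, [resid])[1]

# Same number as the price coefficient in the long regression
b_long = ols(log_sales, [log_price, mm, trop])[1]
print(round(b_fwl, 3), round(b_long, 3))   # both ≈ -3
```

The stage-2 slope equals the long-regression price coefficient exactly, which is why regression is said to use only the part of each regressor that is independent of the others.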

Regression Sensitivity Analysis

  • Regression finds the coefficients on the part of each regressor that is independent from the other regressors.

  • What can we do to deal with OVB problems?

    • Because we can never be sure whether a given set of controls is enough to eliminate OVB, it’s important to ask how sensitive regression results are to changes in the list of controls.
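One simple sensitivity check is to re-estimate the model under several control sets and watch how the coefficient of interest moves. A hypothetical sketch (simulated data, invented coefficients):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

# Made-up data: brand and ad status shift both price and sales
mm = rng.integers(0, 2, size=n).astype(float)    # minute.maid dummy
ad = rng.integers(0, 2, size=n).astype(float)    # advertised dummy
log_price = 0.5 + 0.4 * mm - 0.2 * ad + 0.1 * rng.normal(size=n)
log_sales = 10.0 - 3.0 * log_price + 1.2 * mm + 0.8 * ad + rng.normal(size=n)

def ols(y, X):
    """Least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y))] + list(X))
    return np.linalg.lstsq(X, y, rcond=None)[0]

results = {}
for name, controls in [("none", []), ("brand", [mm]), ("brand + ad", [mm, ad])]:
    results[name] = ols(log_sales, [log_price] + controls)[1]
    print(f"{name:>10}: estimated elasticity = {results[name]:.2f}")
```

If the estimate keeps moving as plausible controls are added, remaining OVB is a live concern; if it stabilizes, the result is at least robust to that list of controls.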

🚫 Bad Controls

🚫 Bad Controls: When “Controlling More” Makes Things Worse

A common mistake in regression is thinking:

“If I add more controls, the model must get better.”

But some controls are bad controls — they can actually bias your estimate.

🚫 What is a “Bad Control”?

A bad control is a variable that:

  • is influenced by the explanatory variable \(x\), or
  • is part of the process that links \(x\) to the outcome \(y\)

Including a bad control can “control away” the variation in \(y\) that is connected to \(x\), making the estimated relationship look smaller than it really is.

🧠 Intuition: What Should a “Good Control” Look Like?

A good control is a variable that:

✅ helps explain variation in the outcome \(y\)
✅ is related to the explanatory variable \(x\)
✅ is not a consequence of \(x\) (it exists “in the background”)

That’s the key difference.

🔗 Example 1: Education and Wages 💼

Suppose we want to estimate:

Does education increase wages?

✅ Good regression idea:

\[ wage_i = \beta_0 + \beta_1 education_i + u_i \]

Maybe add good controls like:

  • age
  • gender
  • region
  • parent education (if available)

🚫 Bad Control: Occupation or Job Type

But what if we control for occupation?

\[ wage_i = \beta_0 + \beta_1 education_i + \beta_2 occupation_i + u_i \]

Why is that bad?

Because education affects occupation:

\[ education \rightarrow occupation \rightarrow wage \]

So controlling for occupation:

🚫 blocks part of the education effect
→ we underestimate the impact of education
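A quick simulation (all effect sizes invented for illustration) shows the mechanism: the short regression recovers the total effect of education, while adding the occupation mediator leaves only the direct part:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000

# Hypothetical data: education -> occupation -> wage (made-up coefficients)
educ = rng.normal(size=n)                 # years of schooling (standardized)
occ = 0.7 * educ + rng.normal(size=n)     # occupation "quality", partly caused by education
wage = 1.0 * educ + 2.0 * occ + rng.normal(size=n)

def ols(y, X):
    """Least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y))] + list(X))
    return np.linalg.lstsq(X, y, rcond=None)[0]

total = ols(wage, [educ])[1]          # total effect: 1.0 + 2.0 * 0.7 = 2.4
blocked = ols(wage, [educ, occ])[1]   # mediator controlled: only the direct 1.0 remains
print(round(total, 2), round(blocked, 2))
```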

🌱 Example 2: Fertilizer, Plant Height, and Crop Yield

We want to estimate:

Does fertilizer increase crop yield?

But fertilizer changes the plant’s growth:

\[ fertilizer \rightarrow height \]

And height is strongly related to yield:

\[ height \rightarrow yield \]

Also, some plots are simply better than others because of soil quality:

\[ \begin{align} &soil\ quality \rightarrow height\\ &soil\ quality \rightarrow yield \end{align} \]

So soil quality is a hidden factor that affects both height and yield.

🚫 Bad Control: Plant Height

But what if we control for plant height?

\[ yield_i = \beta_0 + \beta_1 fertilizer_i + \beta_2 height_i + u_i \]

Why is that bad?

Because fertilizer affects plant height:

\[ fertilizer \rightarrow height \rightarrow yield \]

So controlling for height:

🚫 blocks an important pathway through which fertilizer raises yield
→ we underestimate the impact of fertilizer
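A simulation (all effect sizes invented) makes this concrete. Here fertilizer is randomized and works only through height, so the simple regression recovers the true effect. Controlling for height both blocks that pathway and, because height is also driven by unobserved soil quality, lets soil quality contaminate the fertilizer coefficient:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

# Hypothetical field data (made-up coefficients)
fert = rng.normal(size=n)        # randomly assigned fertilizer dose
soil = rng.normal(size=n)        # unobserved soil quality
height = 1.0 * fert + 1.0 * soil + rng.normal(size=n)
crop = 2.0 * height + 1.0 * soil + rng.normal(size=n)   # fertilizer acts only via height

def ols(y, X):
    """Least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y))] + list(X))
    return np.linalg.lstsq(X, y, rcond=None)[0]

good = ols(crop, [fert])[1]           # total effect ≈ 2.0 (fertilizer is randomized)
bad = ols(crop, [fert, height])[1]    # height controlled: biased, even wrong-signed here
print(round(good, 2), round(bad, 2))
```

In this setup the bad-control estimate is not just attenuated: conditioning on height ties fertilizer to soil quality, so the coefficient can even flip sign.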

✅ Good Controls vs. Bad Controls (Big Picture)

| Type | Example | Why it matters |
|------|---------|----------------|
| ✅ Good control | Age, gender, region, baseline ability | Helps reduce omitted variable bias by holding constant pre-existing differences |
| 🚫 Bad control | Occupation (when studying education → wages) | Can remove part of the education effect by controlling for something education influences |
| 🚫 Bad control | Plant height (when studying fertilizer → yield) | Can remove part of the fertilizer effect by controlling for something fertilizer influences |

🎯 Why Bad Controls Are Dangerous

Bad controls can cause two problems:

1) ❌ Block part of the true effect

You unintentionally remove the pathway:

\[ x \rightarrow control \rightarrow y \]

So the estimated \(\hat{\beta}_1\) becomes too small.

2) ❌ Create new bias

Sometimes the control is influenced by both \(x\) and \(y\) (such a variable is called a collider):

\[ x \rightarrow control \leftarrow y \]

Then controlling for it creates a fake association between \(x\) and \(y\)
even if none existed before.
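A simulation shows this second danger (variable names and sizes are made up): \(x\) and \(y\) are generated independently, yet conditioning on a variable they both influence manufactures a relationship between them:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000

x = rng.normal(size=n)
y = rng.normal(size=n)                   # y is independent of x by construction
collider = x + y + rng.normal(size=n)    # influenced by BOTH x and y

def ols(y_, X):
    """Least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y_))] + list(X))
    return np.linalg.lstsq(X, y_, rcond=None)[0]

no_control = ols(y, [x])[1]              # ≈ 0: no real relationship
with_control = ols(y, [x, collider])[1]  # clearly nonzero: fake association
print(round(no_control, 2), round(with_control, 2))
```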

✅ Quick Rule: Should We Control for It?

Before adding a variable \(z\) as a control, ask:

  • Does \(z\) exist before \(x\) is determined?
    ✅ If yes → usually safe to control for
  • Does \(z\) change after \(x\) changes?
    🚫 If yes → it may be a bad control
  • Is \(z\) a pre-existing characteristic that is related to both \(x\) and \(y\)?
    ✅ If yes → it is often a useful control

🧾 Summary: “More Controls” ≠ “Better”

  • ✅ Controls are helpful when they fix omitted variable bias
  • 🚫 Controls are harmful when they “control away” the effect you want
  • The goal is not to control for everything.
  • The goal is to control for the right things.