
Linear Regression
January 26, 2026
Modern data sets are big in both the number of observations (size \(n\)) and in the number of variables (dimension \(p\)). In these settings, some ML tools are straight out of previous statistics classes (linear regression) and some are totally new (ensemble models, principal component analysis); both matter as \(n\) and \(p\) get really big.

Linear regression assumes a linear relationship for \(Y = f(X_{1})\): \[Y_{i} \,=\, \beta_{0} \,+\, \beta_{1} X_{1, i} \,+\, \epsilon_{i}\] for \(i \,=\, 1, 2, \dots, n\), where \(i\) is the \(i\)-th observation in the data.
\(Y_i\) is the \(i\)-th value for the outcome/dependent/response/target variable \(Y\).
\(X_{1, i}\) is the \(i\)-th value for the explanatory/independent/predictor/input variable or feature \(X_{1}\).
\(\beta_0\) is an unknown true value of an intercept: average value for \(Y\) if \(X_{1} = 0\)
\(\beta_1\) is an unknown true value of a slope: increase in average value for \(Y\) for each one-unit increase in \(X_{1}\)
\(\epsilon_i\) is a random noise, or a statistical error: \[ \epsilon_i \sim N(0, \sigma^2) \]
Linear regression finds the beta estimates \(( \hat{\beta_{0}}, \hat{\beta_{1}} )\) such that:
– The linear function \(f(X_{1}) = \hat{\beta_{0}} + \hat{\beta_{1}}X_{1}\) is as near as possible to \(Y\) for all \((X_{1, i}\,,\, Y_{i})\) pairs in the data.
We use the hat notation \((\,\hat{\ }\,)\) to distinguish true values from estimated/predicted values.
The value of true beta coefficient is denoted by \(\beta_{1}\).
The value of estimated beta coefficient is denoted by \(\hat{\beta_{1}}\).
The \(i\)-th value of true outcome variable is denoted by \(Y_{i}\).
The \(i\)-th value of predicted outcome variable is denoted by \(\hat{Y_{i}}\).
1. Finding the relationship between \(X_{1}\) and \(Y\) via \(\hat{\beta_{1}}\): How is an increase in \(X_1\) by one unit associated with a change in \(Y\) on average?
2. Making a prediction on \(Y\) via \(\hat{Y}\): For an unseen data point of \(X_1\), what is the predicted value of the outcome, \(\hat{Y}\)?
For each property i, we want to predict sale_price[i] based on gross_square_feet[i]. Linear regression assumes that sale_price[i] is linearly related to gross_square_feet[i]: \[\texttt{sale_price[i]} \;=\quad \texttt{b0} \,+\, \texttt{b1*gross_square_feet[i]} \,+\, \texttt{e[i]}\] where e[i] is a statistical error term.
How well does the fitted line for sale_price on gross_square_feet match the data? The sum of squared residuals (SSR) and mean squared error (MSE) measure this: \[ \begin{align} MSE &= SSR\, / \, n\\ { }\\ SSR &\,=\, (\texttt{Residual_Error}_{1})^{2}\\ &\quad \,+\, (\texttt{Residual_Error}_{2})^{2}\\ &\quad\,+\, \cdots + (\texttt{Residual_Error}_{n})^{2} \end{align} \]
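As a sketch, SSR and MSE can be computed directly from the residual errors in R. The sales data are not reproduced here, so this example uses R's built-in `cars` data as a stand-in:

```r
# Sketch: computing SSR and MSE for a fitted model.
# Uses R's built-in cars data in place of the property sales data.
m <- lm(dist ~ speed, data = cars)

res <- cars$dist - predict(m, newdata = cars)  # residual errors
SSR <- sum(res^2)           # sum of squared residuals
MSE <- SSR / nrow(cars)     # MSE = SSR / n
sqrt(MSE)                   # RMSE, in the same units as the outcome
```

The square root of MSE (RMSE) is often reported because it is in the same units as the outcome variable.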
R-squared is a measure of how well the model “fits” the data, or its “goodness of fit.”
R-squared measures the fraction of \(y\)'s variation that is explained by the explanatory variables. We want R-squared to be fairly large, and we want R-squared values that are similar on the training and test data.
Caution: R-squared will be higher for models with more explanatory variables, regardless of whether the additional explanatory variables actually improve the model or not.
The fitted model serves two purposes. First, it describes the relationship between gross_square_feet and sale_price by estimating the true value of b1; the estimate of b1 is denoted by \(\hat{\texttt{b1}}\). Second, it predicts sale_price[i] for a new property i; the predicted sale_price[i] is denoted by \(\widehat{\texttt{sale_price}}\texttt{[i]}\), where \[\widehat{\texttt{sale_price}}\texttt{[i]} \;=\quad \hat{\texttt{b0}} \,+\, \hat{\texttt{b1}}\texttt{*gross_square_feet[i]}\]
Training data: When we’re building a linear regression model, we need data to train the model.
Test data: We also need data to test whether the model works well on new data.

The probability density function for the uniform distribution looks like:
With the uniform distribution, any value of \(x\) between 0 and 1 is equally likely to be drawn.

The split should not depend on the outcome: for example, putting the observations with sale_price > 10^6 in the training data and the observations with sale_price <= 10^6 in the test data would make the two samples systematically different.
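A minimal sketch of a random 80/20 train-test split using the uniform distribution; the data frame name `d` is illustrative (here filled with R's built-in `cars` data):

```r
# Sketch of a random ~80/20 train-test split using Uniform(0, 1) draws.
# `d` is an assumed name for the full data set.
set.seed(123)
d <- cars
u <- runif(nrow(d))        # one uniform draw per observation
dtrain <- d[u <  0.8, ]    # ~80% of rows, chosen at random
dtest  <- d[u >= 0.8, ]    # the remaining ~20%
nrow(dtrain); nrow(dtest)
```

Because each draw is equally likely to fall anywhere in [0, 1], each observation independently has an 80% chance of landing in the training set.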
We will use the data for residential property sales from September 2017 to August 2018 in NYC.
Each recorded sale contains a number of interesting variables, but here we focus on the following:
- sale_price: a property's sale price;
- gross_square_feet: a property's size;
- age: a property's age;
- borough_name: the borough where a property is located.

Use summary statistics and visualization to explore the data.
Call:
lm(formula = sale_price ~ gross_square_feet, data = dtrain)
Residuals:
Min 1Q Median 3Q Max
-1677677 -208583 -41403 135661 8667468
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -36232.101 14615.843 -2.479 0.0132 *
gross_square_feet 460.373 8.957 51.400 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 461900 on 7640 degrees of freedom
Multiple R-squared: 0.2569, Adjusted R-squared: 0.2569
F-statistic: 2642 on 1 and 7640 DF, p-value: < 2.2e-16
Regression results can also be presented as a formatted table using the stargazer package.
Linear regression assumes a linear relationship for \(Y = f(X_{1}, X_{2})\): \[Y_{i} \,=\, \beta_{0} \,+\, \beta_{1} X_{1, i} \,+\,\beta_{2} X_{2, i} \,+\, \epsilon_{i}\] for \(i \,=\, 1, 2, \dots, n\), where \(i\) is the \(i\)-th observation in data.
\(\beta_0\) is an unknown true value of an intercept: average value for \(Y\) if \(X_{1} = 0\) and \(X_{2} = 0\)
\(\beta_1\) is an unknown true value of a slope: increase in average value for \(Y\) for each one-unit increase in \(X_{1}\)
\(\beta_2\) is an unknown true value of a slope: increase in average value for \(Y\) for each one-unit increase in \(X_{2}\)
\(\epsilon_i\) is a random noise, or a statistical error: \[ \epsilon_i \sim N(0, \sigma^2) \]
All else being equal, an increase in gross_square_feet by one unit is associated with an increase in sale_price by \(\hat{\beta_{1}}\).

Linear regression models require numerical predictors, but many variables are categorical.
The Approach:
Convert categorical variables into numerical format using dummy variables.
Why Do This?
Definition: Binary indicators (0 or 1) representing categories.
Purpose: Transform qualitative data into a quantitative form for regression analysis.
Example: \[ D_i = \begin{cases} 1, & \text{if the observation belongs to the category} \\ 0, & \text{otherwise} \end{cases} \]
Consider a regression model including a dummy variable: \[ y_i = \beta_0 + \beta_1 x_i + \beta_2 D_i + \epsilon_i \]
\(x_i\): A continuous predictor.
\(D_i\): Dummy variable (e.g., political party affiliation, type of car).
Interpretation: \(\beta_2\) captures the difference in the response \(y\) when the category is present (i.e., \(D_i=1\)) versus absent.
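As a sketch of how dummies are constructed, R's `model.matrix()` shows how a factor is expanded into 0/1 columns; the borough values below are just illustrative:

```r
# Sketch: how R expands a factor into dummy (0/1) variables.
# The borough values are illustrative examples.
borough <- factor(c("Bronx", "Brooklyn", "Manhattan", "Bronx"))

# model.matrix() builds the design matrix: an intercept column plus
# one dummy column per non-reference level (the first level is dropped).
model.matrix(~ borough)
```

With three levels, only two dummy columns appear; the reference level (here Bronx, the first level alphabetically) is captured by the intercept.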
Multicollinearity
Problem: For a categorical variable with \(k\) levels, using all \(k\) dummies causes perfect multicollinearity: \(D_{1,i} + D_{2,i} + \cdots + D_{k,i} = 1\) for every observation \(i\).
This is problematic, because one dummy is completely predictable from the others.
The intercept already captures the constant part (1), making one of the dummy variables redundant.
Solution: Drop one dummy (choose a reference category)
The reference category is the case in which every included \(\texttt{borough}\) dummy variable equals 0.
Proper model: \[ y_i = \beta_0 + \beta_1 D_{1, i} + \beta_2 D_{2, i} + \cdots + \beta_{k-1} D_{(k-1), i} + \epsilon_i \]
Interpretation: each \(\beta_j\) is the difference in the average outcome between category \(j\) and the reference category.
In R, factors automatically generate dummies and avoid the dummy variable trap.
# Set reference level (omit dummy for Manhattan)
dtrain <- dtrain |>
mutate(borough_name = factor(borough_name)) |>
mutate(borough_name = relevel(borough_name, ref = "Manhattan"))
dtest <- dtest |>
mutate(borough_name = factor(borough_name)) |>
mutate(borough_name = relevel(borough_name, ref = "Manhattan"))
m3 <- lm(sale_price ~ gross_square_feet + age + borough_name, data = dtrain)
stargazer(
m3,
type = "html",
title = "Multivariate Regression with Borough Dummies (Reference = Manhattan)",
dep.var.labels = "sale_price",
digits = 3
)

The error assumptions: \[ \epsilon_i \sim N(0, \sigma^2) \]
Errors have a mean value of 0 with constant variance \(\sigma^2\).
Errors are uncorrelated with \(X_{1,i}\) and with \(X_{2, i}\).
If we re-arrange the simple regression equation, \[\begin{align} {\epsilon}_{i} \,=\, Y_{i} \,-\, (\, {\beta}_{0} \,+\, {\beta}_{1}X_{1,i} \,). \end{align}\]
\(\texttt{residual_error}_{i}\) can be thought of as an estimate of \(\epsilon_{i}\), denoted by \(\hat{\epsilon_{i}}\).
\[ \begin{align} \hat{\epsilon_{i}} \,=\, &Y_i \,-\, \hat{Y}_i\\ \,=\, &Y_{i} \,-\, (\, \hat{\beta_{0}} \,+\, \hat{\beta_{1}}X_{1,i} \,) \end{align} \]
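A quick check of this identity in R; a sketch using the built-in `cars` data rather than the sales data:

```r
# Sketch: residuals stored by lm() equal Y_i minus the fitted values.
# Uses R's built-in cars data in place of the property sales data.
m <- lm(dist ~ speed, data = cars)

manual <- cars$dist - unname(fitted(m))  # Y_i - Y_hat_i
all.equal(unname(resid(m)), manual)      # TRUE
```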
Residual plot: scatterplot of the residuals \(\hat{\epsilon_{i}}\) (vertical axis) against the fitted values \(\hat{Y}_{i}\) (horizontal axis).
A residual plot can be used to diagnose the quality of model results.
We assume that \(\epsilon_{i}\) have a mean value of 0 with constant variance \(\sigma^2\):
Unbiased: mean residual is ~0 within thin vertical strips
Homoskedastic: similar spread of residuals across fitted values
If residual variance changes with fitted values ⇒ heteroskedasticity
aug_m2 <- augment(m2) # broom::augment()
ggplot(aug_m2, aes(x = .fitted, y = .resid)) +
geom_point(alpha = 0.25) +
geom_hline(yintercept = 0, linetype = "dashed") +
geom_smooth(se = FALSE, method = "loess") +
labs(
x = "Fitted values (Model 2)",
y = "Residuals",
title = "Residual Plot for Model 2"
)

Goal: to determine whether an independent variable has a statistically significant effect on the dependent variable in a linear regression model.
Consider the following linear regression model: \[ y_{i} = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \dots + \beta_k x_{k,i} + \epsilon_{i} \]
\(y\): Outcome
\(x_1, x_2, \dots, x_k\): Predictors
\(\beta_0\): Intercept
\(\beta_1, \beta_2, \dots, \beta_k\): Coefficients
\(\epsilon_{i}\): Error term
We test whether a specific coefficient \(\beta_j\) significantly differs from zero: \(H_{0}: \beta_j = 0\) versus \(H_{1}: \beta_j \neq 0\).
The t-statistic is used to test each coefficient: \[ t = \frac{\hat{\beta_j} - 0}{SE(\hat{\beta_j})} \]
\(\hat{\beta_j}\): Estimated coefficient
\(SE(\hat{\beta_j})\): Standard error of the estimate
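A sketch verifying the t-statistic by hand, again using R's built-in `cars` data for illustration:

```r
# Sketch: reproducing the t value reported by summary(lm(...)).
m   <- lm(dist ~ speed, data = cars)  # built-in example data
tab <- summary(m)$coefficients

# t = estimate / standard error, for each coefficient
t_by_hand <- tab[, "Estimate"] / tab[, "Std. Error"]
all.equal(t_by_hand, tab[, "t value"])  # TRUE
```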
Significance codes: * (10%); ** (5%); *** (1%).

The model equation is \[\begin{align}
\texttt{sale_price[i]} \;=\;\, &\texttt{b0} \,+\,\\ &\texttt{b1*gross_square_feet[i]} \,+\,\texttt{b2*age[i]}\,+\,\\ &\texttt{b3*Bronx[i]} \,+\,\texttt{b4*Brooklyn[i]} \,+\,\\&\texttt{b5*Queens[i]} \,+\,\texttt{b6*Staten Island[i]}\,+\,\\ &\texttt{e[i]}
\end{align}\] - The reference level of borough_name variables is Manhattan.
Interpreting the coefficient on gross_square_feet: consider the predicted sale prices of two houses, A and B. A and B are in the Bronx and have the same age. gross_square_feet of house A is 2001, while that of house B is 2000. All else being equal, an increase in gross_square_feet by one unit is associated with an increase in sale_price by \(\hat{\texttt{b1}}\).
Writing out the two predicted prices: \[ \begin{align}\widehat{\texttt{sale_price[A]}} \;=\quad& \hat{\texttt{b0}} \,+\, \hat{\texttt{b1}}\texttt{*gross_square_feet[A]} \,+\, \hat{\texttt{b2}}\texttt{*age[A]} \,+\,\\ &\hat{\texttt{b3}}\texttt{*Bronx[A]}\,+\,\hat{\texttt{b4}}\texttt{*Brooklyn[A]} \,+\,\\ &\hat{\texttt{b5}}\texttt{*Queens[A]}\,+\, \hat{\texttt{b6}}\texttt{*Staten Island[A]}\\ \widehat{\texttt{sale_price[B]}} \;=\quad& \hat{\texttt{b0}} \,+\, \hat{\texttt{b1}}\texttt{*gross_square_feet[B]} \,+\, \hat{\texttt{b2}}\texttt{*age[B]}\,+\,\\ &\hat{\texttt{b3}}\texttt{*Bronx[B]}\,+\, \hat{\texttt{b4}}\texttt{*Brooklyn[B]} \,+\,\\ &\hat{\texttt{b5}}\texttt{*Queens[B]}\,+\, \hat{\texttt{b6}}\texttt{*Staten Island[B]} \end{align} \]
\[ \begin{align}\Leftrightarrow\qquad&\widehat{\texttt{sale_price[A]}} \,-\, \widehat{\texttt{sale_price[B]}}\qquad \\ \;=\quad &\hat{\texttt{b1}}\texttt{*}(\texttt{gross_square_feet[A]} - \texttt{gross_square_feet[B]})\\ \;=\quad &\hat{\texttt{b1}}\texttt{*}\texttt{(2001 - 2000)} \,=\, \hat{\texttt{b1}}\qquad\qquad\quad\;\; \end{align} \]
Interpreting the coefficient on borough_nameBronx: consider the predicted sale prices of two houses, A and C.
A and C have the same age and the same gross_square_feet. A is in the Bronx, and C is in Manhattan. All else being equal, an increase in borough_nameBronx by one unit is associated with a change in sale_price by \(\hat{\texttt{b3}}\).
Equivalently, all else being equal, being in the Bronx relative to being in Manhattan is associated with a decrease in sale_price by \(|\hat{\texttt{b3}}|\) (since \(\hat{\texttt{b3}}\) is negative here).
Writing out the two predicted prices: \[ \begin{align}\widehat{\texttt{sale_price[A]}} \;=\quad& \hat{\texttt{b0}} \,+\, \hat{\texttt{b1}}\texttt{*gross_square_feet[A]} \,+\, \hat{\texttt{b2}}\texttt{*age[A]} \,+\,\\ &\hat{\texttt{b3}}\texttt{*Bronx[A]}\,+\, \hat{\texttt{b4}}\texttt{*Brooklyn[A]} \,+\,\\ &\hat{\texttt{b5}}\texttt{*Queens[A]}\,+\, \hat{\texttt{b6}}\texttt{*Staten Island[A]}\\ \widehat{\texttt{sale_price[C]}} \;=\quad& \hat{\texttt{b0}} \,+\, \hat{\texttt{b1}}\texttt{*gross_square_feet[C]} \,+\, \hat{\texttt{b2}}\texttt{*age[C]}\,+\,\\ &\hat{\texttt{b3}}\texttt{*Bronx[C]}\,+\, \hat{\texttt{b4}}\texttt{*Brooklyn[C]} \,+\,\\ &\hat{\texttt{b5}}\texttt{*Queens[C]}\,+\, \hat{\texttt{b6}}\texttt{*Staten Island[C]} \end{align} \]
\[ \begin{align}\Leftrightarrow\qquad&\widehat{\texttt{sale_price[A]}} \,-\, \widehat{\texttt{sale_price[C]}}\qquad \\ \;=\quad &\hat{\texttt{b3}}\texttt{*}\texttt{Bronx[A]} \\ \;=\quad &\hat{\texttt{b3}}\qquad\qquad\qquad\qquad\quad\;\;\;\,\end{align} \]
coef_df <- tidy(m3, conf.int = TRUE)
ggplot(coef_df |> filter(term != "(Intercept)"),
aes(x = reorder(term, estimate), y = estimate, ymin = conf.low, ymax = conf.high)) +
geom_pointrange() +
geom_hline(yintercept = 0, linetype = "dashed") +
coord_flip() +
labs(
x = "Terms",
y = "Estimate (95% CI)",
title = "Coefficient Plot (Model with Borough Dummies)"
)

The model equation with log-transformed \(\texttt{sale.price[i]}\) is \[\begin{align} \log(\texttt{sale.price[i]}) \;=\;\, &\texttt{b0} \,+\,\\ &\texttt{b1*gross.square.feet[i]} \,+\,\texttt{b2*age[i]}\,+\,\\ &\texttt{b3*Bronx[i]} \,+\,\texttt{b4*Brooklyn[i]} \,+\,\\&\texttt{b5*Queens[i]} \,+\,\texttt{b6*Staten Island[i]}\,+\,\\ &\texttt{e[i]}. \end{align}\]
The reference level of borough_name is Manhattan.

Interpreting the coefficient on gross.square.feet: for two houses A and B that are identical except that gross.square.feet differs by one unit, \[\begin{align}&\log(\widehat{\texttt{sale.price}}\texttt{[A]}) - \log(\widehat{\texttt{sale.price}}\texttt{[B]}) \\ \,=\, &\hat{\texttt{b1}}\,*\,(\texttt{gross.square.feet[A]} \,-\, \texttt{gross.square.feet[B]})\\ \,=\, &\hat{\texttt{b1}}\end{align}\]
So we can have the following: \[ \begin{align} &\Leftrightarrow\qquad\frac{\widehat{\texttt{sale.price[A]}}}{ \widehat{\texttt{sale.price[B]}}} \;=\; \texttt{exp(}\hat{\texttt{b1}}\texttt{)}\\ \quad&\Leftrightarrow\qquad\widehat{\texttt{sale.price[A]}} \;=\; \widehat{\texttt{sale.price[B]}} * \texttt{exp(}\hat{\texttt{b1}}\texttt{)} \end{align} \]
Interpreting the coefficient on borough_nameBronx: A is in the Bronx, and C is in Manhattan; A and C have the same age and the same gross.square.feet. Applying the log()-exp() rules to \(\widehat{\texttt{sale.price}}\texttt{[A]}\) and \(\widehat{\texttt{sale.price}}\texttt{[C]}\),\[\begin{align}&\log(\widehat{\texttt{sale.price}}\texttt{[A]}) - \log(\widehat{\texttt{sale.price}}\texttt{[C]}) \\ \,=\, &\hat{\texttt{b3}}\,*\,(\texttt{borough_Bronx[A]} \,-\, \texttt{borough_Bronx[C]})\,=\, \hat{\texttt{b3}}\end{align}\]
So we can have the following: \[ \begin{align}&\Leftrightarrow\qquad\frac{\widehat{\texttt{sale.price[A]}}}{ \widehat{\texttt{sale.price[C]}}} \;=\; \texttt{exp(}\hat{\texttt{b3}}\texttt{)}\\ \quad&\Leftrightarrow\qquad\,\widehat{\texttt{sale.price[A]}} \;=\; \widehat{\texttt{sale.price[C]}} * \texttt{exp(}\hat{\texttt{b3}}\texttt{)} \end{align} \]
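As a numeric sketch of this conversion from a log-scale coefficient to a percent change: `b3_hat` below is an illustrative value, not the actual estimate from the data, chosen so the implied percent change is roughly 71.8%:

```r
# Sketch: converting a log-outcome coefficient into a percent change.
# b3_hat is an illustrative value, not an estimate from the sales data.
b3_hat <- -1.265
exp(b3_hat)               # ratio of predicted prices, Bronx vs. Manhattan
100 * (1 - exp(b3_hat))   # percent decrease, roughly 71.8%
```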
All else being equal, being in the Bronx relative to being in Manhattan is associated with a decrease in \(\texttt{sale.price}\) by 71.78%.

Does the relationship between sale.price and gross.square.feet vary by borough_name?
The linear regression with an interaction between predictors \(X_{1}\) and \(X_{2}\) is: \[Y_{\texttt{i}} \,=\, b_{0} \,+\, b_{1}\,X_{1,\texttt{i}} \,+\, b_{2}\,X_{2,\texttt{i}} \,+\, b_{3}\,X_{1,\texttt{i}}\times \color{Red}{X_{2,\texttt{i}}} \,+\, e_{\texttt{i}}\;.\]
When \(X_{2}\in\{\,0, 1\,\}\) is binary, the model is \[Y_{\texttt{i}} \,=\, b_{0} \,+\, b_{1}\,X_{1,\texttt{i}} \,+\, b_{2}\,X_{2,\texttt{i}} \,+\, b_{3}\,X_{1,\texttt{i}}\times \color{Red}{X_{2,\texttt{i}}} \,+\, e_{\texttt{i}},\] where \(X_{\,2, \texttt{i}}\) is either 0 or 1.
For \(\texttt{i}\) such that \(X_{\,2, \texttt{i}} = 0\), the model is \[Y_{\texttt{i}} \,=\, b_{0} \,+\, b_{1}\,X_{1,\texttt{i}} \,+\, e_{\texttt{i}}\qquad\qquad\qquad\qquad\qquad\quad\;\;\]
For \(\texttt{i}\) such that \(X_{\,2, \texttt{i}} = 1\), the model is \[Y_{\texttt{i}} \,=\, (\,b_{0} \,+\, b_{2}\,) \,+\, (\,b_{1}\,+\, b_{3}\,)\,X_{1,\texttt{i}} \,+\, e_{\texttt{i}}\qquad\qquad\]
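A sketch of fitting such an interaction in R with simulated data (all names and true coefficient values here are illustrative):

```r
# Sketch: interaction between a continuous x1 and a binary x2,
# with simulated data where the true coefficients are known.
set.seed(1)
n  <- 500
x1 <- runif(n)
x2 <- rbinom(n, 1, 0.5)
y  <- 2 + 3 * x1 + 1 * x2 + 4 * x1 * x2 + rnorm(n, sd = 0.1)

# In an R formula, y ~ x1 * x2 expands to x1 + x2 + x1:x2
m <- lm(y ~ x1 * x2)
coef(m)  # estimates close to b0 = 2, b1 = 3, b2 = 1, b3 = 4
```

The `x1:x2` estimate recovers \(b_3\): the extra slope on \(x_1\) when \(x_2 = 1\).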
First, how is sale.price related to gross.square.feet? \[
\begin{align}
\texttt{sale_price[i]} \;=\;\, &\texttt{b0} \,+\,\\
&\texttt{b1*Bronx[i]} \,+\,\texttt{b2*Brooklyn[i]} \,+\,\\&\texttt{b3*Queens[i]} \,+\,\texttt{b4*Staten Island[i]}\,+\,\\
&\texttt{b5*age[i]}\,+\,\\
&\texttt{b6*gross_square_feet[i]} \,+\,\texttt{e[i]}
\end{align}
\]

Next, does the relationship between sale.price and gross.square.feet vary by borough_name? \[
\begin{align}
\texttt{sale_price[i]} \;=\;\, &\texttt{b0} \,+\,\\
&\texttt{b1*Bronx[i]} \,+\,\texttt{b2*Brooklyn[i]} \,+\,\\
&\texttt{b3*Queens[i]} \,+\,\texttt{b4*Staten Island[i]}\,+\, \\
&\texttt{b5*age[i]}\,+\,\\
&\texttt{b6*gross_square_feet[i]} \,+\,\\
&\texttt{b7*gross_square_feet[i]*Bronx[i]} \,+\, \\
&\texttt{b8*gross_square_feet[i]*Brooklyn[i]} \,+\, \\
&\texttt{b9*gross_square_feet[i]*Queens[i]} \,+\, \\
&\texttt{b10*gross_square_feet[i]*Staten Island[i]} \,+\, \texttt{e[i]} \\
\end{align}
\]

The OJ data contain weekly price and sales (in number of cartons "sold") for three OJ brands (Tropicana, Minute Maid, Dominick's), along with ad, showing whether each brand was advertised (in store or flyer) that week.

| Variable | Description |
|---|---|
| sales | Quantity of OJ cartons sold |
| price | Price of OJ |
| brand | Brand of OJ |
| ad | Advertisement status |
The following model estimates the price elasticity of demand for a carton of OJ: \[\log(\texttt{sales}_{\texttt{i}}) \,=\, \quad\;\; b_{\texttt{intercept}} \,+\, b_{\,\texttt{mm}}\,\texttt{brand}_{\,\texttt{mm}, \texttt{i}} \,+\, b_{\,\texttt{tr}}\,\texttt{brand}_{\,\texttt{tr}, \texttt{i}}\\ \,+\, b_{\texttt{price}}\,\log(\texttt{price}_{\texttt{i}}) \,+\, e_{\texttt{i}}\]
When \(\texttt{brand}_{\,\texttt{tr}, \texttt{i}}\,=\,0\) and \(\texttt{brand}_{\,\texttt{mm}, \texttt{i}}\,=\,0\), the beta coefficient for the intercept \(b_{\texttt{intercept}}\) gives the value of Dominick’s log sales at \(\log(\,\texttt{price[i]}\,) = 0\).
The beta coefficient \(b_{\texttt{price}}\) is the price elasticity of demand.
For small changes in variable \(x\) from \(x_{0}\) to \(x_{1}\), the following equation holds: \[\Delta \log(x) \,= \, \log(x_{1}) \,-\, \log(x_{0}) \approx\, \frac{x_{1} \,-\, x_{0}}{x_{0}} \,=\, \frac{\Delta\, x}{x_{0}}.\]
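A quick numeric check of this approximation for a 1% change:

```r
# Numeric check: for small changes, the log difference is close to
# the proportional change (x1 - x0) / x0.
x0 <- 100
x1 <- 101
log(x1) - log(x0)   # the log difference, about 0.00995
(x1 - x0) / x0      # the exact proportional change, 0.01
```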
The coefficient on \(\log(\texttt{price}_{\texttt{i}})\), \(b_{\texttt{price}}\), is therefore \[b_{\texttt{price}} \,=\, \frac{\Delta \log(\texttt{sales}_{\texttt{i}})}{\Delta \log(\texttt{price}_{\texttt{i}})}\,=\, \frac{\frac{\Delta \texttt{sales}_{\texttt{i}}}{\texttt{sales}_{\texttt{i}}}}{\frac{\Delta \texttt{price}_{\texttt{i}}}{\texttt{price}_{\texttt{i}}}}.\]
All else being equal, an increase in \(\texttt{price}\) by 1% is associated with a change in \(\texttt{sales}\) of \(b_{\texttt{price}}\)% (a decrease when \(b_{\texttt{price}} < 0\), as is typical for demand).
Describe the relationship between log(price) and log(sales) by brand.
Describe the relationship between log(price) and log(sales) by brand and ad.
Recall model_1: \[
\begin{align}
\log(\texttt{sales}_{\texttt{i}}) \,=\, &\quad\;\; b_{\texttt{intercept}} \\ &\,+\,b_{\,\texttt{mm}}\,\texttt{brand}_{\,\texttt{mm}, \texttt{i}} \,+\, b_{\,\texttt{tr}}\,\texttt{brand}_{\,\texttt{tr}, \texttt{i}}\\
&\,+\, b_{\texttt{price}}\,\log(\texttt{price}_{\texttt{i}}) \,+\, e_{\texttt{i}}
\end{align}
\]

How does the relationship between log(sales) and log(price) vary by brand?
Consider model_2, which addresses the above question: \[
\begin{align}
\log(\texttt{sales}_{\texttt{i}}) \,=\,&\;\; \quad b_{\texttt{intercept}} \,+\, \color{Green}{b_{\,\texttt{mm}}\,\texttt{brand}_{\,\texttt{mm}, \texttt{i}}} \,+\, \color{Blue}{b_{\,\texttt{tr}}\,\texttt{brand}_{\,\texttt{tr}, \texttt{i}}}\\
&\,+\, b_{\texttt{price}}\,\log(\texttt{price}_{\texttt{i}}) \\
&\, +\, b_{\texttt{price*mm}}\,\log(\texttt{price}_{\texttt{i}})\,\times\,\color{Green} {\texttt{brand}_{\,\texttt{mm}, \texttt{i}}} \\
&\,+\, b_{\texttt{price*tr}}\,\log(\texttt{price}_{\texttt{i}})\,\times\,\color{Blue} {\texttt{brand}_{\,\texttt{tr}, \texttt{i}}} \,+\, e_{\texttt{i}}
\end{align}
\]

For \(\texttt{i}\) such that \(\color{Green}{\texttt{brand}_{\,\texttt{mm}, \texttt{i}}} = 0\) and \(\color{Blue}{\texttt{brand}_{\,\texttt{tr}, \texttt{i}}} = 0\), the model equation is: \[\log(\texttt{sales}_{\texttt{i}}) \,=\, \; \,b_{\texttt{intercept}}\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\\ \qquad\,+\, b_{\texttt{price}} \,\log(\texttt{price}_{\texttt{i}}) \,+\, e_{\texttt{i}}\,.\qquad\qquad\;\]
For \(\texttt{i}\) such that \(\color{Green}{\texttt{brand}_{\,\texttt{mm}, \texttt{i}}} = 1\) and \(\color{Blue}{\texttt{brand}_{\,\texttt{tr}, \texttt{i}}} = 0\), the model equation is: \[\log(\texttt{sales}_{\texttt{i}}) \,=\, \; (\,b_{\texttt{intercept}} \,+\, b_{\,\texttt{mm}}\,)\qquad\qquad\qquad\qquad\qquad\qquad\qquad\\ \qquad\!\,+\,(\, b_{\texttt{price}} \,+\, \color{Green}{b_{\texttt{price*mm}}}\,)\,\log(\texttt{price}_{\texttt{i}}) \,+\, e_{\texttt{i}}\,.\]
For \(\texttt{i}\) such that \(\color{Green}{\texttt{brand}_{\,\texttt{mm}, \texttt{i}}} = 0\) and \(\color{Blue}{\texttt{brand}_{\,\texttt{tr}, \texttt{i}}} = 1\), the model equation is: \[\log(\texttt{sales}_{\texttt{i}}) \,=\, \; (\,b_{\texttt{intercept}} \,+\, b_{\,\texttt{tr}}\,)\qquad\qquad\qquad\qquad\qquad\qquad\qquad\\ \qquad\!\,+\,(\, b_{\texttt{price}} \,+\, \color{Blue}{b_{\texttt{price*tr}}}\,)\,\log(\texttt{price}_{\texttt{i}}) \,+\, e_{\texttt{i}}\,.\]
Finally, consider model_3: \[
\begin{align}
\log(\texttt{sales}_{\texttt{i}}) \,=\,\quad\;\;& b_{\texttt{intercept}} \,+\, \color{Green}{b_{\,\texttt{mm}}\,\texttt{brand}_{\,\texttt{mm}, \texttt{i}}} \,+\, \color{Blue}{b_{\,\texttt{tr}}\,\texttt{brand}_{\,\texttt{tr}, \texttt{i}}} \\
&\,+\; b_{\,\texttt{ad}}\,\color{Orange}{\texttt{ad}_{\,\texttt{i}}} \qquad\qquad\qquad\qquad\quad \\
&\,+\, b_{\texttt{mm*ad}}\,\color{Green} {\texttt{brand}_{\,\texttt{mm}, \texttt{i}}}\,\times\, \color{Orange}{\texttt{ad}_{\,\texttt{i}}}\,+\, b_{\texttt{tr*ad}}\,\color{Blue} {\texttt{brand}_{\,\texttt{tr}, \texttt{i}}}\,\times\, \color{Orange}{\texttt{ad}_{\,\texttt{i}}} \\
&\,+\; b_{\texttt{price}}\,\log(\texttt{price}_{\texttt{i}}) \qquad\qquad\qquad\;\;\;\;\, \\
&\,+\, b_{\texttt{price*mm}}\,\log(\texttt{price}_{\texttt{i}})\,\times\,\color{Green} {\texttt{brand}_{\,\texttt{mm}, \texttt{i}}}\qquad\qquad\qquad\;\, \\
&\,+\, b_{\texttt{price*tr}}\,\log(\texttt{price}_{\texttt{i}})\,\times\,\color{Blue} {\texttt{brand}_{\,\texttt{tr}, \texttt{i}}}\qquad\qquad\qquad\;\, \\
& \,+\, b_{\texttt{price*ad}}\,\log(\texttt{price}_{\texttt{i}})\,\times\,\color{Orange}{\texttt{ad}_{\,\texttt{i}}}\qquad\qquad\qquad\;\;\, \\
&\,+\, b_{\texttt{price*mm*ad}}\,\log(\texttt{price}_{\texttt{i}}) \,\times\,\,\color{Green} {\texttt{brand}_{\,\texttt{mm}, \texttt{i}}}\,\times\, \color{Orange}{\texttt{ad}_{\,\texttt{i}}} \\
&\,+\, b_{\texttt{price*tr*ad}}\,\log(\texttt{price}_{\texttt{i}}) \,\times\,\,\color{Blue} {\texttt{brand}_{\,\texttt{tr}, \texttt{i}}}\,\times\, \color{Orange}{\texttt{ad}_{\,\texttt{i}}} \,+\, e_{\texttt{i}}
\end{align}
\]

Describe how the distribution of brand varies by ad using stacked bar charts.
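A sketch of such a stacked bar chart with ggplot2; `oj` is an assumed name for the OJ data frame, simulated here since the data are not included:

```r
library(ggplot2)

# Sketch: stacked bar chart of brand composition by ad status.
# `oj` is an assumed data frame with columns brand and ad (simulated here).
set.seed(1)
oj <- data.frame(
  brand = sample(c("dominicks", "minute.maid", "tropicana"), 300, replace = TRUE),
  ad    = sample(c(0, 1), 300, replace = TRUE)
)

# position = "fill" stacks each bar to height 1, showing proportions
ggplot(oj, aes(x = factor(ad), fill = brand)) +
  geom_bar(position = "fill") +
  labs(x = "Advertised (ad)", y = "Share of observations", fill = "Brand")
```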
Describe how the relationship between price and sales can vary by ad.

How would you explain the different estimation results across the models?
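To compare the models side by side, all three can be fit with R formulas, where `*` expands to main effects plus interactions. This is a sketch; the data frame `oj` is an assumed name and is simulated here with an illustrative elasticity of -2:

```r
# Sketch: fitting the three OJ models. `oj` is an assumed data frame
# with columns sales, price, brand, and ad (simulated here).
set.seed(1)
n  <- 1000
oj <- data.frame(
  brand = factor(sample(c("dominicks", "minute.maid", "tropicana"), n, replace = TRUE)),
  ad    = sample(c(0, 1), n, replace = TRUE),
  price = runif(n, 1, 4)
)
# Simulated log-linear demand with elasticity -2 (illustrative only)
oj$sales <- exp(8 - 2 * log(oj$price) + rnorm(n, sd = 0.3))

model_1 <- lm(log(sales) ~ brand + log(price), data = oj)
model_2 <- lm(log(sales) ~ brand * log(price), data = oj)
model_3 <- lm(log(sales) ~ brand * ad * log(price), data = oj)

# brand * ad * log(price) expands to all main effects, all two-way
# interactions, and the three-way interaction, matching model_3 above.
coef(model_1)
```

On the real data, comparing the brand-specific and ad-specific slopes across the three models shows how pooling or splitting the elasticity changes the estimates.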
Which model do you prefer? Why?