Linear Regression
February 17, 2025
Machine learning problems are often big in both the number of observations (size n) and in the number of variables (dimension p). Some ML tools are straight out of previous statistics classes (linear regression) and some are totally new (ensemble models, principal component analysis), built for settings where n and p get really big.
The simple linear regression model is
\[Y_{i} \,=\, \beta_{0} \,+\, \beta_{1} X_{1, i} \,+\, \epsilon_{i}\] for \(i \,=\, 1, 2, \dots, n\), where \(i\) is the \(i\)-th observation in data.
\(Y_i\) is the \(i\)-th value for the outcome/dependent/response/target variable \(Y\).
\(X_{1, i}\) is the \(i\)-th value for the explanatory/independent/predictor/input variable or feature \(X_{1}\).
\(\beta_0\) is an unknown true value of an intercept: average value for \(Y\) if \(X_{1} = 0\)
\(\beta_1\) is an unknown true value of a slope: increase in average value for \(Y\) for each one-unit increase in \(X_{1}\)
\[ \epsilon_i \sim N(0, \sigma^2) \]
Linear regression finds the beta estimates \(( \hat{\beta_{0}}, \hat{\beta_{1}} )\) such that:
– The linear function \(f(X_{1}) = \hat{\beta_{0}} + \hat{\beta_{1}}X_{1}\) is as near as possible to \(Y\) for all \((X_{1, i}\,,\, Y_{i})\) pairs in the data.
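Concretely (a standard way to state this, not specific to these notes), the estimates minimize the sum of squared vertical distances between the data and the line:
\[ (\hat{\beta_{0}}, \hat{\beta_{1}}) \,=\, \arg\min_{b_{0},\, b_{1}} \sum_{i=1}^{n} \big( Y_{i} \,-\, b_{0} \,-\, b_{1} X_{1, i} \big)^{2} \]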
We use the hat notation \((\,\hat{\texttt{ }_{\,}}\,)\) to distinguish true values and estimated/predicted values.
The value of true beta coefficient is denoted by \(\beta_{1}\).
The value of estimated beta coefficient is denoted by \(\hat{\beta_{1}}\).
The \(i\)-th value of true outcome variable is denoted by \(Y_{i}\).
The \(i\)-th value of predicted outcome variable is denoted by \(\hat{Y_{i}}\).
1. Finding the relationship between \(X_{1}\) and \(Y\) via \(\hat{\beta_{1}}\): How is an increase in \(X_1\) by one unit associated with a change in \(Y\) on average? Positive? Negative? Independent? How strong?
2. Making a prediction on \(Y\) via \(\hat{Y}\): For an unseen data point of \(X_1\), what is the predicted value of the outcome, \(\hat{Y}\)?
For each property i, we want to predict sale_price[i] based on gross_square_feet[i]; gross_square_feet[i] is associated with sale_price[i].
Linear regression assumes that sale_price[i] is linearly related to the input gross_square_feet[i]: \[\texttt{sale_price[i]} \;=\quad \texttt{b0} \,+\, \texttt{b1*gross_square_feet[i]} \,+\, \texttt{e[i]}\] where e[i] is a statistical error term.
Linear regression chooses the estimates of b0 and b1 that best fit the relationship between sale_price and gross_square_feet by minimizing the mean squared error (MSE) on the training data: \[ MSE = SSR / n \] \[ \begin{align} SSR &\,=\, (\texttt{Residual_Error}_{1})^{2}\\ &\quad \,+\, (\texttt{Residual_Error}_{2})^{2}\\ &\quad\,+\, \cdots + (\texttt{Residual_Error}_{n})^{2} \end{align} \]
R-squared is a measure of how well the model “fits” the data, or its “goodness of fit.”
R-squared measures the fraction of y's variation that is explained by the explanatory variables. We want R-squared to be fairly large and R-squareds that are similar on testing and training data.
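As a reference (the standard definition, not specific to this dataset), R-squared can be written as
\[ R^{2} \,=\, 1 \,-\, \frac{SSR}{\sum_{i=1}^{n} (Y_{i} - \bar{Y})^{2}}, \]
where the denominator is the total variation of \(Y\) around its mean \(\bar{Y}\).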
Caution: R-squared will be higher for models with more explanatory variables, regardless of whether the additional explanatory variables actually improve the model or not.
1. Finding the relationship between gross_square_feet and sale_price by estimating a true value of b1. The estimate of b1 is denoted by \(\hat{\texttt{b1}}\).
2. Making a prediction on sale_price[i] for a new property i. The predicted sale_price[i] is denoted by \(\widehat{\texttt{sale_price}}\texttt{[i]}\), where\[\widehat{\texttt{sale_price}}\texttt{[i]} \;=\quad \hat{\texttt{b0}} \,+\, \hat{\texttt{b1}}\texttt{*gross_square_feet[i]}\]
Training data: When we’re building a linear regression model, we need data to train the model.
Test data: We also need data to test whether the model works well on new data.
The probability density function for the uniform distribution on [0, 1] is flat: any value of \(x\) between 0 and 1 is equally likely to be drawn.
A non-random split, say one where the observations with sale_price > 10^6 are in the training data and the observations with sale_price <= 10^6 are in the test data, would not work: the test data should look like the new data the model will face, so we split the observations at random.
We will use the data for residential property sales in NYC from September 2017 to August 2018.
Each recorded sale contains a number of interesting variables, but here we focus on the following:
- sale_price: a property's sale price;
- gross_square_feet: a property's size;
- age: a property's age;
- borough_name: the borough where the property is located.
We use summary statistics and visualization to explore the data.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import rand, col, pow, mean, when
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
spark = SparkSession.builder.master("local[*]").getOrCreate()
# 1. Read CSV data from URL
df_pd = pd.read_csv('https://bcdanl.github.io/data/home_sales_nyc.csv')
sale_df = spark.createDataFrame(df_pd)
sale_df.show()
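As a quick sketch of initial exploration (not part of the original code), describe() gives summary statistics for the key columns, and a group-by count shows how sales are distributed across boroughs:
sale_df.select("sale_price", "gross_square_feet", "age").describe().show()
sale_df.groupBy("borough_name").count().show()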
rand(seed = ANY_NUMBER) adds an independent draw from the uniform distribution on [0, 1) for each row; fixing the seed makes the split reproducible.
# 2. Split data into training and testing sets by creating a random column "gp"
sale_df = sale_df.withColumn("gp", rand(seed=123)) # seed is set for replication
# Splits 60-40 into training and test sets
dtrain = sale_df.filter(col("gp") >= 0.4)
dtest = sale_df.filter(col("gp") < 0.4)
# Or simply,
dtrain, dtest = sale_df.randomSplit([0.6, 0.4], seed = 123)
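To sanity-check the split, we can count the rows in each piece (roughly 60% and 40% of the data):
print(dtrain.count(), dtest.count())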
VectorAssembler()
# Now assemble predictors using the renamed column
assembler1 = VectorAssembler(
inputCols=["gross_square_feet"],
outputCol="predictors")
VectorAssembler is a transformer in PySpark’s ML library that is used to combine multiple columns into a single vector column. VectorAssembler is often one of the first steps in a Spark ML pipeline.
VectorAssembler()
dtrain1 = assembler1.transform(dtrain) # training data
dtest1 = assembler1.transform(dtest) # test data
dtrain1.select("predictors", "sale_price").show()
dtest1.select("predictors", "sale_price").show()
VectorAssembler.transform() returns a DataFrame with a new column whose name is specified in outputCol in VectorAssembler().
LinearRegression().fit()
# Fit linear regression model using the new label column "sale_price"
model1 = (
LinearRegression(
featuresCol = "predictors",
labelCol = "sale_price")
.fit(dtrain1)
)
LinearRegression(featuresCol="predictors", labelCol="sale_price") creates an instance of the LinearRegression class; .fit() trains (fits) the linear regression model using the training DataFrame dtrain1.
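Once fitted, the estimated intercept, coefficients, and in-sample fit statistics can be inspected directly from the model object (a quick sketch, not from the original notes):
print(model1.intercept)
print(model1.coefficients)
print(model1.summary.r2, model1.summary.rootMeanSquaredError)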
dtest1 = model1.transform(dtest1)
model1.transform(dtest1) adds a new column, prediction, to the dtest1 DataFrame: it uses model1 to predict an outcome for the test dataset dtest1.
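To preview the predictions next to the actual prices (a quick sketch):
dtest1.select("gross_square_feet", "sale_price", "prediction").show(5)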
import numpy as np
import scipy.stats as stats
from tabulate import tabulate
def regression_table(model, assembler):
"""
Creates a formatted regression table from a fitted LinearRegression model and its VectorAssembler,
and inserts a dashed horizontal line after the Intercept row. The table includes separate columns
for the 95% confidence interval lower and upper bounds for each coefficient (computed at the 5% significance level)
and an "Observations" row (using model.summary.numInstances) above the R² row.
The RMSE row is placed as the last row.
The columns are ordered as:
Metric | Value | Significance | Std. Error | p-value | 95% CI Lower | 95% CI Upper
For the "Value", "Std. Error", "95% CI Lower", and "95% CI Upper" columns, commas are inserted every three digits,
with 3 decimal places (except for Observations which is formatted as an integer with commas).
Parameters:
model: A fitted LinearRegression model (with a .summary attribute).
assembler: The VectorAssembler used to assemble the features for the model.
Returns:
A formatted string containing the regression table.
"""
# Extract coefficients and standard errors as NumPy arrays
coeffs = model.coefficients.toArray()
std_errors_all = np.array(model.summary.coefficientStandardErrors)
# Check if the intercept's standard error is included (one extra element)
    if len(std_errors_all) == len(coeffs) + 1:
        # In Spark ML, the intercept's standard error is the last element
        intercept_se = std_errors_all[-1]
        std_errors = std_errors_all[:-1]
else:
intercept_se = None
std_errors = std_errors_all
# Compute t-statistics for feature coefficients (t = beta / SE(beta))
# t_stats = coeffs / std_errors
t_stats = model.summary.tValues
# Degrees of freedom: number of instances minus number of predictors minus 1 (for intercept)
df = model.summary.numInstances - len(coeffs) - 1
# Compute the t-critical value for a 95% confidence interval (two-tailed, 5% significance)
t_critical = stats.t.ppf(0.975, df)
# Compute two-tailed p-values for each feature coefficient
# p_values = [2 * (1 - stats.t.cdf(np.abs(t), df)) for t in t_stats]
p_values = model.summary.pValues
# Function to assign significance stars based on p-value
def significance_stars(p):
if p < 0.01:
return "***"
elif p < 0.05:
return "**"
elif p < 0.1:
return "*"
else:
return ""
# Build the table rows.
# Order: Metric, Value, Significance, Std. Error, p-value, 95% CI Lower, 95% CI Upper.
table = []
for feature, beta, se, p in zip(assembler.getInputCols(), coeffs, std_errors, p_values):
ci_lower = beta - t_critical * se
ci_upper = beta + t_critical * se
table.append([
"Beta: " + feature, # Metric name
beta, # Beta estimate (Value)
significance_stars(p), # Significance stars
se, # Standard error
p, # p-value
ci_lower, # 95% CI lower bound
ci_upper # 95% CI upper bound
])
# Compute and add the intercept row with its SE, p-value, significance, and CI (if available)
if intercept_se is not None:
intercept_t = model.intercept / intercept_se
intercept_p = 2 * (1 - stats.t.cdf(np.abs(intercept_t), df))
intercept_sig = significance_stars(intercept_p)
ci_intercept_lower = model.intercept - t_critical * intercept_se
ci_intercept_upper = model.intercept + t_critical * intercept_se
else:
intercept_se = ""
intercept_p = ""
intercept_sig = ""
ci_intercept_lower = ""
ci_intercept_upper = ""
table.append([
"Intercept",
model.intercept,
intercept_sig,
intercept_se,
intercept_p,
ci_intercept_lower,
ci_intercept_upper
])
# Append overall model metrics:
# Insert an Observations row using model.summary.numInstances,
# then an R² row, and finally the RMSE row as the last row.
table.append(["Observations", model.summary.numInstances, "", "", "", "", ""])
table.append(["R²", model.summary.r2, "", "", "", "", ""])
table.append(["RMSE", model.summary.rootMeanSquaredError, "", "", "", "", ""])
# Format the table.
# For the "Value" (index 1), "Std. Error" (index 3), "95% CI Lower" (index 5), and "95% CI Upper" (index 6) columns,
# format with commas and 3 decimal places, except for Observations which should be an integer with commas.
# For the p-value (index 4), format to 3 decimal places.
formatted_table = []
for row in table:
formatted_row = []
for i, item in enumerate(row):
if row[0] == "Observations" and i == 1 and isinstance(item, (int, float, np.floating)) and item != "":
# Format Observations as integer with commas, no decimals.
formatted_row.append(f"{int(item):,}")
elif isinstance(item, (int, float, np.floating)) and item != "":
if i in [1, 3, 5, 6]:
formatted_row.append(f"{item:,.3f}")
elif i == 4:
formatted_row.append(f"{item:.3f}")
else:
formatted_row.append(f"{item:.3f}")
else:
formatted_row.append(item)
formatted_table.append(formatted_row)
# Generate the table string using tabulate.
table_str = tabulate(
formatted_table,
headers=["Metric", "Value", "Sig.", "Std. Error", "p-value", "95% CI Lower", "95% CI Upper"],
tablefmt="pretty",
colalign=("left", "right", "center", "right", "right", "right", "right")
)
# Insert a dashed line after the Intercept row for clarity.
lines = table_str.split("\n")
dash_line = '-' * len(lines[0])
for i, line in enumerate(lines):
if "Intercept" in line and not line.strip().startswith('+'):
lines.insert(i+1, dash_line)
break
return "\n".join(lines)
# Example usage:
# print(regression_table(lr_model, assembler))
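For example, applied to the simple model fitted earlier:
print(regression_table(model1, assembler1))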
\[Y_{i} \,=\, \beta_{0} \,+\, \beta_{1} X_{1, i} \,+\,\beta_{2} X_{2, i} \,+\, \epsilon_{i}\] for \(i \,=\, 1, 2, \dots, n\), where \(i\) is the \(i\)-th observation in data.
\(\beta_0\) is an unknown true value of an intercept: average value for \(Y\) if \(X_{1} = 0\) and \(X_{2} = 0\)
\(\beta_1\) is an unknown true value of a slope: increase in average value for \(Y\) for each one-unit increase in \(X_{1}\)
\(\beta_2\) is an unknown true value of a slope: increase in average value for \(Y\) for each one-unit increase in \(X_{2}\)
\[ \epsilon_i \sim N(0, \sigma^2) \]
Linear regression finds the beta estimates \(( \hat{\beta_{0}}, \hat{\beta_{1}}, \hat{\beta_{2}} )\) such that:
– The linear function \(f(X_{1}, X_{2}) = \hat{\beta_{0}} + \hat{\beta_{1}}X_{1} + \hat{\beta_{2}}X_{2}\) is as near as possible to \(Y\) for all \((X_{1, i}\,,\,X_{2, i}\,,\, Y_{i})\) pairs in the data.
All else being equal, an increase in gross_square_feet by one unit is associated with an increase in sale_price by \(\hat{\beta_{1}}\). Let's add another variable, age, to the model.
assembler2 = VectorAssembler(
inputCols=["gross_square_feet", "age"],
outputCol="predictors")
dtrain2 = assembler2.transform(dtrain)
dtest2 = assembler2.transform(dtest)
model2 = (
    LinearRegression(
        featuresCol="predictors",
        labelCol="sale_price")
    .fit(dtrain2)
)
dtest2 = model2.transform(dtest2)
print(regression_table(model2, assembler2))
Linear regression models require numerical predictors, but many variables are categorical.
The approach: convert categorical variables into numerical format using dummy variables.
Why do this? Linear regression can only use numerical predictors, so dummy variables are how categorical information enters the model.
Example: a dummy variable for a given category is defined as
\[ D_i = \begin{cases} 1, & \text{if the observation belongs to the category} \\ 0, & \text{otherwise} \end{cases} \]
Consider a regression model including a dummy variable:
\[ y_i = \beta_0 + \beta_1 x_i + \beta_2 D_i + \epsilon_i \]
Interpretation:
\(\beta_2\) captures the difference in the response \(y\) when the category is present (i.e., \(D_i=1\)) versus absent.
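To see why, compare the model's expected response when the category is present and when it is absent, holding \(x_i\) fixed:
\[ (\beta_0 + \beta_1 x_i + \beta_2 \cdot 1) \,-\, (\beta_0 + \beta_1 x_i + \beta_2 \cdot 0) \,=\, \beta_2 \]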
Problem: Including a dummy for every category introduces redundancy.
For a categorical variable with \(k\) levels:
\[ D_{1i} + D_{2i} + \cdots + D_{ki} = 1 \quad \text{(for each observation)} \]
This is problematic, because one dummy is completely predictable from the others.
The intercept already captures the constant part (1), making one of the dummy variables redundant.
Solution: drop one dummy, often called the reference category.
The reference category is the case in which all of the included dummies equal 0; its effect is absorbed into the intercept.
Proper model:
\[ y_i = \beta_0 + \beta_1 D_{1, i} + \beta_2 D_{2, i} + \cdots + \beta_{k-1} D_{(k-1), i} + \epsilon_i \]
add_dummy_variables(var_name, reference_level) converts a categorical variable into its dummy variables:
- var_name: a string with the name of the categorical variable;
- reference_level: the index position of the reference category among the alphabetically sorted categories.
The function adds the dummy columns to the dtrain and dtest DataFrames (which it accesses as globals).
def add_dummy_variables(var_name, reference_level, category_order=None):
"""
Creates dummy variables for the specified column in the global DataFrames dtrain and dtest.
Allows manual setting of category order.
Parameters:
var_name (str): The name of the categorical column (e.g., "borough_name").
reference_level (int): Index of the category to be used as the reference (dummy omitted).
category_order (list, optional): List of categories in the desired order. If None, categories are sorted.
Returns:
dummy_cols (list): List of dummy column names excluding the reference category.
ref_category (str): The category chosen as the reference.
"""
global dtrain, dtest
# Get distinct categories from the training set.
categories = dtrain.select(var_name).distinct().rdd.flatMap(lambda x: x).collect()
# Convert booleans to strings if present.
categories = [str(c) if isinstance(c, bool) else c for c in categories]
# Use manual category order if provided; otherwise, sort categories.
if category_order:
# Ensure all categories are present in the user-defined order
missing = set(categories) - set(category_order)
if missing:
raise ValueError(f"These categories are missing from your custom order: {missing}")
categories = category_order
else:
categories = sorted(categories)
# Validate reference_level
if reference_level < 0 or reference_level >= len(categories):
raise ValueError(f"reference_level must be between 0 and {len(categories) - 1}")
# Define the reference category
ref_category = categories[reference_level]
print("Reference category (dummy omitted):", ref_category)
# Create dummy variables for all categories
for cat in categories:
dummy_col_name = var_name + "_" + str(cat).replace(" ", "_")
dtrain = dtrain.withColumn(dummy_col_name, when(col(var_name) == cat, 1).otherwise(0))
dtest = dtest.withColumn(dummy_col_name, when(col(var_name) == cat, 1).otherwise(0))
# List of dummy columns, excluding the reference category
dummy_cols = [var_name + "_" + str(cat).replace(" ", "_") for cat in categories if cat != ref_category]
return dummy_cols, ref_category
# Example usage without category_order:
# dummy_cols_year, ref_category_year = add_dummy_variables('year', 0)
# Example usage with category_order:
# custom_order_wkday = ['sunday', 'monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday']
# dummy_cols_wkday, ref_category_wkday = add_dummy_variables('wkday', reference_level=0, category_order = custom_order_wkday)
add_dummy_variables(var_name, reference_level):
# 0: Bronx, 1: Brooklyn, 2: Manhattan, 3: Queens, 4: Staten Island
dummy_cols, ref_category = add_dummy_variables("borough_name", 2)
# To see dummy variables
dtrain.select(['borough_name'] + dummy_cols).show()
dtrain.select(['borough_name'] + dummy_cols).filter(col('borough_name') == "Manhattan").show()
# Define the list of predictor columns:
# Two numeric predictors plus the dummy variables (excluding the reference dummy)
conti_cols = ["gross_square_feet", "age"]
assembler_predictors = conti_cols + dummy_cols
assembler3 = VectorAssembler(
inputCols = assembler_predictors,
outputCol = "predictors"
)
dtrain3 = assembler3.transform(dtrain)
dtest3 = assembler3.transform(dtest)
model3 = LinearRegression(featuresCol="predictors", labelCol="sale_price").fit(dtrain3)
dtest3 = model3.transform(dtest3)
# For model3 and assembler3:
print(regression_table(model3, assembler3))
\[ \epsilon_i \sim N(0, \sigma^2) \] - Errors have a mean value of 0 with constant variance \(\sigma^2\).
If we re-arrange the simple regression equation, \[\begin{align} {\epsilon}_{i} \,=\, Y_{i} \,-\, (\, {\beta}_{0} \,+\, {\beta}_{1}X_{1,i} \,). \end{align}\]
\(\texttt{residual_error}_{i}\) can be thought of as an estimate of \(\epsilon_{i}\), denoted by \(\hat{\epsilon_{i}}\). \[\begin{align} \hat{\epsilon_{i}} \,=\, Y_{i} \,-\, (\, \hat{\beta_{0}} \,+\, \hat{\beta_{1}}X_{1,i} \,) \end{align}\]
A residual plot is a scatterplot of fitted values against residuals.
A residual plot can be used to diagnose the quality of model results.
Because we assume that \(\epsilon_{i}\) have a mean value of 0 with constant variance \(\sigma^2\), a well-behaved residual plot should bounce randomly and form a cloud roughly at the level of zero residual, the perfect prediction line.
From the residual plot, we should ask ourselves two questions: are the residuals centered around zero across the range of fitted values, and is their spread roughly constant?
Convert the Spark DataFrame into a Pandas DataFrame by using .toPandas(), and then use matplotlib.pyplot to draw a residual plot:
import matplotlib.pyplot as plt
import statsmodels.api as sm # for lowess smoothing
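# The plotting code below assumes a Pandas DataFrame `rdf` with columns
# "prediction" and "residual". A minimal sketch of how it could be built
# from the test-set predictions (assuming dtest2 from model2.transform()):
rdf = (
    dtest2
    .withColumn("residual", col("sale_price") - col("prediction"))
    .select("prediction", "residual")
    .toPandas()
)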
plt.scatter(rdf["prediction"], rdf["residual"], alpha=0.2, color="darkgray")
# Use lowess smoothing for the trend line
smoothed = sm.nonparametric.lowess(rdf["residual"], rdf["prediction"])
plt.plot(smoothed[:, 0], smoothed[:, 1], color="darkblue")
# Perfect prediction line
# A horizontal line at residual = 0
plt.axhline(y=0, color="red", linestyle="--")
# Labeling
plt.xlabel("Predicted sale.price (Model 2)")
plt.ylabel("Residuals")
plt.title("Residual Plot for Model 2")
plt.show()
The purpose of a t-test is to determine whether an independent variable has a statistically significant effect on the dependent variable in a linear regression model.
Consider the following linear regression model: \[ y_{i} = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \dots + \beta_k x_{k,i} + \epsilon_{i} \]
\(y\): Outcome
\(x_1, x_2, \dots, x_k\): Predictors
\(\beta_0\): Intercept
\(\beta_1, \beta_2, \dots, \beta_k\): Coefficients
\(\epsilon_{i}\): Error term
The t-statistic for coefficient \(j\) tests the null hypothesis that \(\beta_j = 0\): \[ t = \frac{\hat{\beta_j} - 0}{SE(\hat{\beta_j})} \]
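In PySpark, the t-statistics and p-values used by the regression_table() function come from the fitted model's summary; they can also be inspected directly (a quick sketch):
print(model3.summary.tValues)
print(model3.summary.pValues)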
Significance stars indicate the level at which a coefficient is statistically significant: * (10%); ** (5%); *** (1%).
The model equation is \[\begin{align}
\texttt{sale_price[i]} \;=\;\, &\texttt{b0} \,+\,\\ &\texttt{b1*gross_square_feet[i]} \,+\,\texttt{b2*age[i]}\,+\,\\ &\texttt{b3*Bronx[i]} \,+\,\texttt{b4*Brooklyn[i]} \,+\,\\&\texttt{b5*Queens[i]} \,+\,\texttt{b6*Staten Island[i]}\,+\,\\ &\texttt{e[i]}
\end{align}\]
The reference level of the borough_name variables is Manhattan.
Interpretation of the coefficient on gross_square_feet: consider the predicted sale prices of two houses, A and B. A and B are in the Bronx and have the same age. The gross_square_feet of house A is 2001, while that of house B is 2000. All else being equal, an increase in gross_square_feet by one unit is associated with an increase in sale_price by \(\hat{\beta_{1}}\).
To see this, compare the two predicted prices:
\[ \begin{align}\widehat{\texttt{sale_price[A]}} \;=\quad& \hat{\texttt{b0}} \,+\, \hat{\texttt{b1}}\texttt{*gross_square_feet[A]} \,+\, \hat{\texttt{b2}}\texttt{*age[A]} \,+\,\\ &\hat{\texttt{b3}}\texttt{*Bronx[A]}\,+\,\hat{\texttt{b4}}\texttt{*Brooklyn[A]} \,+\,\\ &\hat{\texttt{b5}}\texttt{*Queens[A]}\,+\, \hat{\texttt{b6}}\texttt{*Staten Island[A]}\\ \widehat{\texttt{sale_price[B]}} \;=\quad& \hat{\texttt{b0}} \,+\, \hat{\texttt{b1}}\texttt{*gross_square_feet[B]} \,+\, \hat{\texttt{b2}}\texttt{*age[B]}\,+\,\\ &\hat{\texttt{b3}}\texttt{*Bronx[B]}\,+\, \hat{\texttt{b4}}\texttt{*Brooklyn[B]} \,+\,\\ &\hat{\texttt{b5}}\texttt{*Queens[B]}\,+\, \hat{\texttt{b6}}\texttt{*Staten Island[B]} \end{align} \]
\[ \begin{align}\Leftrightarrow\qquad&\widehat{\texttt{sale_price[A]}} \,-\, \widehat{\texttt{sale_price[B]}}\qquad \\ \;=\quad &\hat{\texttt{b1}}\texttt{*}(\texttt{gross_square_feet[A]} - \texttt{gross_square_feet[B]})\\ \;=\quad &\hat{\texttt{b1}}\texttt{*}\texttt{(2001 - 2000)} \,=\, \hat{\texttt{b1}}\qquad\qquad\quad\;\; \end{align} \]
Interpretation of the coefficient on borough_nameBronx: consider the predicted sale prices of the two houses, A and C. A and C have the same age and the same gross_square_feet. A is in the Bronx, and C is in Manhattan. All else being equal, an increase in borough_nameBronx by one unit is associated with an increase in sale_price by \(\hat{\texttt{b3}}\). Equivalently, all else being equal, being in the Bronx relative to being in Manhattan is associated with a decrease in sale_price by \(|\hat{\texttt{b3}}|\).
To see this, compare the two predicted prices:
\[ \begin{align}\widehat{\texttt{sale_price[A]}} \;=\quad& \hat{\texttt{b0}} \,+\, \hat{\texttt{b1}}\texttt{*gross_square_feet[A]} \,+\, \hat{\texttt{b2}}\texttt{*age[A]} \,+\,\\ &\hat{\texttt{b3}}\texttt{*Bronx[A]}\,+\, \hat{\texttt{b4}}\texttt{*Brooklyn[A]} \,+\,\\ &\hat{\texttt{b5}}\texttt{*Queens[A]}\,+\, \hat{\texttt{b6}}\texttt{*Staten Island[A]}\\ \widehat{\texttt{sale_price[C]}} \;=\quad& \hat{\texttt{b0}} \,+\, \hat{\texttt{b1}}\texttt{*gross_square_feet[C]} \,+\, \hat{\texttt{b2}}\texttt{*age[C]}\,+\,\\ &\hat{\texttt{b3}}\texttt{*Bronx[C]}\,+\, \hat{\texttt{b4}}\texttt{*Brooklyn[C]} \,+\,\\ &\hat{\texttt{b5}}\texttt{*Queens[C]}\,+\, \hat{\texttt{b6}}\texttt{*Staten Island[C]} \end{align} \]
\[ \begin{align}\Leftrightarrow\qquad&\widehat{\texttt{sale_price[A]}} \,-\, \widehat{\texttt{sale_price[C]}}\qquad \\ \;=\quad &\hat{\texttt{b3}}\texttt{*}\texttt{Bronx[A]} \\ \;=\quad &\hat{\texttt{b3}}\qquad\qquad\qquad\qquad\quad\;\;\;\,\end{align} \]
import matplotlib.pyplot as plt
import pandas as pd
# Create a Pandas DataFrame from model3's summary information
terms = assembler3.getInputCols()
coefs = model3.coefficients.toArray()[:len(terms)]
stdErrs = model3.summary.coefficientStandardErrors[:len(terms)]
df_summary = pd.DataFrame({
"term": terms,
"estimate": coefs,
"std_error": stdErrs
})
# Plot using the DataFrame columns
plt.errorbar(df_summary["term"], df_summary["estimate"],
yerr = 1.96 * df_summary["std_error"], fmt='o', capsize=5)
plt.xlabel("Terms")
plt.ylabel("Coefficient Estimate")
plt.title("Coefficient Estimates (Model 2)")
plt.axhline(0, color="red", linestyle="--") # Add horizontal line at 0
plt.xticks(rotation=45)
plt.show()
The model equation with log-transformed \(\texttt{sale.price[i]}\) is \[\begin{align} \log(\texttt{sale.price[i]}) \;=\;\, &\texttt{b0} \,+\,\\ &\texttt{b1*gross.square.feet[i]} \,+\,\texttt{b2*age[i]}\,+\,\\ &\texttt{b3*Bronx[i]} \,+\,\texttt{b4*Brooklyn[i]} \,+\,\\&\texttt{b5*Queens[i]} \,+\,\texttt{b6*Staten Island[i]}\,+\,\\ &\texttt{e[i]}. \end{align}\]
The reference level of borough_name is Manhattan.
Interpretation of the coefficient on gross_square_feet in the log model: comparing the same two houses A and B as before (house A has one more square foot than house B), \[\begin{align}&\log(\widehat{\texttt{sale.price}}\texttt{[A]}) - \log(\widehat{\texttt{sale.price}}\texttt{[B]}) \\ \,=\, &\hat{\texttt{b1}}\,*\,(\texttt{gross.square.feet[A]} \,-\, \texttt{gross.square.feet[B]})\\ \,=\, &\hat{\texttt{b1}}\end{align}\]
So we can have the following: \[ \begin{align} &\Leftrightarrow\qquad\frac{\widehat{\texttt{sale.price[A]}}}{ \widehat{\texttt{sale.price[B]}}} \;=\; \texttt{exp(}\hat{\texttt{b1}}\texttt{)}\\ \quad&\Leftrightarrow\qquad\widehat{\texttt{sale.price[A]}} \;=\; \widehat{\texttt{sale.price[B]}} * \texttt{exp(}\hat{\texttt{b1}}\texttt{)} \end{align} \]
Interpretation of the coefficient on borough_nameBronx in the log model: consider again the two houses A and C. A is in the Bronx, and C is in Manhattan. A and C's age are the same. A and C's gross.square.feet are the same. Applying the log()-exp() rules to \(\widehat{\texttt{sale.price}}\texttt{[A]}\) and \(\widehat{\texttt{sale.price}}\texttt{[C]}\),\[\begin{align}&\log(\widehat{\texttt{sale.price}}\texttt{[A]}) - \log(\widehat{\texttt{sale.price}}\texttt{[C]}) \\ \,=\, &\hat{\texttt{b3}}\,*\,(\texttt{borough_Bronx[A]} \,-\, \texttt{borough_Bronx[C]})\,=\, \hat{\texttt{b3}}\end{align}\]
So we can have the following: \[ \begin{align}&\Leftrightarrow\qquad\frac{\widehat{\texttt{sale.price[A]}}}{ \widehat{\texttt{sale.price[C]}}} \;=\; \texttt{exp(}\hat{\texttt{b3}}\texttt{)}\\ \quad&\Leftrightarrow\qquad\,\widehat{\texttt{sale.price[A]}} \;=\; \widehat{\texttt{sale.price[C]}} * \texttt{exp(}\hat{\texttt{b3}}\texttt{)} \end{align} \]
In this example, all else being equal, being in the Bronx relative to being in Manhattan is associated with a decrease in sale.price by 71.78%.
To summarize the interpretation in the log model:
- All else being equal, an increase in gross.square.feet by one unit is associated with an increase in log(sale.price) by b1; equivalently, with an increase in sale.price by (exp(b1) - 1) * 100%.
- All else being equal, an increase in borough_nameBronx by one unit is associated with an increase in log(sale.price) by b3; equivalently, being in the Bronx is associated with a decrease in sale.price by |(exp(b3) - 1)| * 100% relative to being in Manhattan.
We can use the log() function to add a log-transformed variable:
from pyspark.sql.functions import log
sale_df = sale_df.withColumn("log_sale_price",
log( sale_df['sale_price'] ) )
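A minimal sketch of refitting the regression on the log scale, assuming we add the log column to the existing training and test splits and reuse assembler3 (the names dtrain4, dtest4, and model4 are illustrative):
dtrain = dtrain.withColumn("log_sale_price", log(col("sale_price")))
dtest = dtest.withColumn("log_sale_price", log(col("sale_price")))

dtrain4 = assembler3.transform(dtrain)
dtest4 = assembler3.transform(dtest)

model4 = (
    LinearRegression(featuresCol="predictors", labelCol="log_sale_price")
    .fit(dtrain4)
)
dtest4 = model4.transform(dtest4)
print(regression_table(model4, assembler3))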
To convert estimates from the log scale back into percentage changes in sale_price, we can use the exp() function from numpy:
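import numpy as np

# A sketch, assuming the log-outcome model is named model4 as in the sketch above
# and gross_square_feet is the first assembled feature:
b1_hat = model4.coefficients[0]
print((np.exp(b1_hat) - 1) * 100)  # % change in sale_price per extra square foot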