Lecture 7

Linear Regression

Byeong-Hak Choe

SUNY Geneseo

February 17, 2025

Big Data and Machine Learning

Big Data and Machine Learning (ML)

  • The term “Big Data” originated from computer scientists working with datasets too large to fit on a single machine.
    • As aggregation evolved into analysis, Big Data became closely linked with statistics and machine learning (ML).
    • Scalability of algorithms is crucial for handling large datasets.
  • We will use ML to:
    ✅ Identify patterns in big data
    ✅ Make data-driven decisions

What does it mean to be “big”?

  • Big in both the number of observations (size n) and in the number of variables (dimension p).

  • In these settings, we cannot:

    • Look at each individual variable and make a decision.
    • Choose among a small set of candidate models.
    • Plot every variable to look for interactions or transformations.
  • Some ML tools are straight out of previous statistics classes (linear regression) and some are totally new (ensemble models, principal component analysis).

    • All require a different approach when n and p get really big.

ML topics

  • Regression: inference and prediction
  • Regularization: cross-validation
  • Principal Component Analysis: dimension reduction
  • Tree-based models: decision trees, random forest, XGBoost
  • Classification: kNN
  • Clustering: k-means, association rules
  • Text Mining: sentiment analysis; topic models

Linear Regression Model

Models and Assumptions

Linear Model

  • Linear regression assumes a linear relationship for \(Y = f(X_{1})\):

\[Y_{i} \,=\, \beta_{0} \,+\, \beta_{1} X_{1, i} \,+\, \epsilon_{i}\] for \(i \,=\, 1, 2, \dots, n\), where \(i\) is the \(i\)-th observation in data.

  • \(Y_i\) is the \(i\)-th value for the outcome/dependent/response/target variable \(Y\).

  • \(X_{1, i}\) is the \(i\)-th value for the explanatory/independent/predictor/input variable or feature \(X_{1}\).

Models and Assumptions

Beta coefficients

  • Linear regression assumes a linear relationship for \(Y = f(X_{1})\):

\[Y_{i} \,=\, \beta_{0} \,+\, \beta_{1} X_{1, i} \,+\, \epsilon_{i}\] for \(i \,=\, 1, 2, \dots, n\), where \(i\) is the \(i\)-th observation in data.

  • \(\beta_0\) is an unknown true value of an intercept: average value for \(Y\) if \(X_{1} = 0\)

  • \(\beta_1\) is an unknown true value of a slope: increase in average value for \(Y\) for each one-unit increase in \(X_{1}\)

Models and Assumptions

Random Noises

  • Linear regression assumes a linear relationship for \(Y = f(X_{1})\):

\[Y_{i} \,=\, \beta_{0} \,+\, \beta_{1} X_{1, i} \,+\, \epsilon_{i}\] for \(i \,=\, 1, 2, \dots, n\), where \(i\) is the \(i\)-th observation in data.

  • \(\epsilon_i\) is a random noise, or a statistical error:

\[ \epsilon_i \sim N(0, \sigma^2) \]

  • Errors have a mean value of 0 with constant variance \(\sigma^2\).
  • Errors are uncorrelated with \(X_{1, i}\).

What Is Linear Regression Doing?

Best Fitting Line

  • Linear regression finds the beta estimates \(( \hat{\beta_{0}}, \hat{\beta_{1}} )\) such that:

    – The linear function \(f(X_{1}) = \hat{\beta_{0}} + \hat{\beta_{1}}X_{1}\) is as near as possible to \(Y\) for all \((X_{1, i}\,,\, Y_{i})\) pairs in the data.

    • It is the best fitting line, or the predicted outcome, \(\hat{Y_{\,}} = \hat{\beta_{0}} + \hat{\beta_{1}}X_{1}\).

What Is Linear Regression Doing?

Residual errors

  • The estimated beta coefficients are chosen to minimize the sum of squares of the residual errors \((SSR)\): \[ \begin{align} SSR &\,=\, (\texttt{Residual_Error}_{1})^{2}\\ &\quad \,+\, (\texttt{Residual_Error}_{2})^{2}\\ &\quad\,+\, \cdots + (\texttt{Residual_Error}_{n})^{2}\\ \text{where}\qquad\qquad\qquad\qquad&\\ \texttt{Residual_Error}_{i} &\,=\, Y_{i} \,-\, \hat{Y_{i}},\\ \texttt{Predicted_Outcome}_{i}: \hat{Y_{i}} &\,=\, \hat{\beta_{0}} \,+\, \hat{\beta_{1}}X_{1, i} \end{align} \]
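
To make the SSR criterion concrete, here is a minimal NumPy sketch with made-up data that computes the closed-form least-squares estimates and the resulting SSR (all names and values are illustrative):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical X_1 values
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # hypothetical Y values

# Closed-form least-squares estimates (the minimizers of SSR)
b1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0_hat = y.mean() - b1_hat * x.mean()

y_hat = b0_hat + b1_hat * x        # predicted outcomes
residuals = y - y_hat              # residual errors
ssr = np.sum(residuals ** 2)       # sum of squared residual errors
print(b0_hat, b1_hat, ssr)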

What Is Linear Regression Doing?

Hat Notation

  • We use the hat notation \((\,\hat{\texttt{ }_{\,}}\,)\) to distinguish true values from estimated/predicted values.

  • The true value of the beta coefficient is denoted by \(\beta_{1}\).

  • The estimated value of the beta coefficient is denoted by \(\hat{\beta_{1}}\).

  • The \(i\)-th value of the true outcome variable is denoted by \(Y_{i}\).

  • The \(i\)-th value of the predicted outcome variable is denoted by \(\hat{Y_{i}}\).

What Is Linear Regression Doing?

Relationship

1. Finding the relationship between \(X_{1}\) and \(Y\), summarized by \(\hat{\beta_{1}}\): How is an increase in \(X_1\) by one unit associated with a change in \(Y\) on average?
  • Positive? Negative? Independent?
  • How strong?

Statistical Significance in Estimated Beta Coefficients

  • What does it mean for a beta estimate \(\hat{\beta_{\,}}\) to be statistically significant at 5% level?
    • It means that the null hypothesis \(H_{0}: \beta = 0\) is rejected for a given significance level 5%.
    • “2 standard error rule” of thumb: an approximate 95% confidence interval for \(\beta\) is \((\, \hat{\beta_{\,}} - 2 * \texttt{Std. Error}\;,\; \hat{\beta_{\,}} + 2 * \texttt{Std. Error} \,)\); intervals built this way cover the true \(\beta\) about 95% of the time.
    • The standard error tells us how uncertain our beta estimate is.
    • We should look for the stars!

What Is Linear Regression Doing?

Prediction

2. Making a prediction on \(Y\), summarized by \(\hat{Y_{\,}}\): For an unseen data point of \(X_1\), what is the predicted value of the outcome, \(\hat{Y_{\,}}\)?

  • E.g., For \(X_{1} = 2\), the predicted outcome is \(\hat{Y_{\,}} = \hat{\beta_{0}} + \hat{\beta_{1}} \times 2\).
  • E.g., For \(X_{1} = 3\), the predicted outcome is \(\hat{Y_{\,}} = \hat{\beta_{0}} + \hat{\beta_{1}} \times 3\).
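
A tiny sketch of these predictions, with hypothetical beta estimates (the values 1.5 and 0.8 are made up for illustration):

# Hypothetical beta estimates for illustration only
b0_hat, b1_hat = 1.5, 0.8

for x1 in [2, 3]:
    y_hat = b0_hat + b1_hat * x1
    print(f"X1 = {x1}: predicted Y = {y_hat:.2f}")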

Linear Regression - Example

  • Suppose we want to predict a property’s sales price based on the property size.
    • In other words, for some house sale i, we want to predict sale_price[i] based on gross_square_feet[i].
  • We also want to focus on the relationship between a property’s sales price and a property size.
    • In other words, we estimate how an increase in gross_square_feet[i] is associated with sale_price[i].

Linear Relationship

  • Linear regression assumes that:

    • The outcome sale_price[i] is linearly related to the input gross_square_feet[i]:

\[\texttt{sale_price[i]} \;=\quad \texttt{b0} \,+\, \texttt{b1*gross_square_feet[i]} \,+\, \texttt{e[i]}\] where e[i] is a statistical error term.

The Linear Relationship between sale_price and gross_square_feet

Best Fitting Line

  • What do the vertical lines visualize?

Model Evaluation

Mean squared error (MSE)

  • One of the most common metrics used to measure the prediction accuracy of a linear regression model is MSE, which stands for mean squared error.
    • \(MSE\) is \(SSR\) divided by \(n\) (the number of observations in the data that are used in making predictions).

\[ MSE = SSR / n \] \[ \begin{align} SSR &\,=\, (\texttt{Residual_Error}_{1})^{2}\\ &\quad \,+\, (\texttt{Residual_Error}_{2})^{2}\\ &\quad\,+\, \cdots + (\texttt{Residual_Error}_{n})^{2} \end{align} \]

  • The lower the MSE, the higher the accuracy of the model.
  • The root MSE (RMSE) is the square root of MSE.
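
A minimal NumPy sketch of these formulas with made-up observed and predicted values (all names are illustrative):

import numpy as np

y     = np.array([2.1, 3.9, 6.2, 8.1, 9.8])    # hypothetical observed outcomes
y_hat = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # hypothetical predictions

ssr  = np.sum((y - y_hat) ** 2)   # sum of squared residual errors
mse  = ssr / len(y)               # mean squared error
rmse = np.sqrt(mse)               # root mean squared error
print(mse, rmse)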

Model Evaluation

Mean squared error (MSE)

  • The root MSE (RMSE) represents the overall deviation of \(Y_{i}\) from the best fitting regression line.

Model Evaluation

R-squared

  • R-squared is a measure of how well the model “fits” the data, or its “goodness of fit.”

    • R-squared can be thought of as the fraction of the variation in \(Y\) that is explained by the explanatory variables.
  • We want R-squared to be fairly large, and we want the training and test R-squared values to be similar.

  • Caution: R-squared will be higher for models with more explanatory variables, regardless of whether the additional explanatory variables actually improve the model or not.
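
A minimal NumPy sketch of the “fraction of variation explained” definition, reusing the hypothetical y and y_hat values from the MSE sketch above:

import numpy as np

y     = np.array([2.1, 3.9, 6.2, 8.1, 9.8])    # hypothetical observed outcomes
y_hat = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # hypothetical predictions

ssr = np.sum((y - y_hat) ** 2)        # unexplained variation (residuals)
sst = np.sum((y - y.mean()) ** 2)     # total variation of Y
r_squared = 1 - ssr / sst
print(r_squared)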

Goals of Linear Regression

  • The goals of linear regression here are to:
  1. Modeling for explanation: Find the relationship between gross_square_feet and sale_price by estimating the true value of b1.
  • The estimated value of b1 is denoted by \(\hat{\texttt{b1}}\).
  2. Modeling for prediction: Make a prediction on sale_price[i] for a new property i.
  • The predicted value of sale_price[i] is denoted by \(\widehat{\texttt{sale_price}}\texttt{[i]}\), where

\[\widehat{\texttt{sale_price}}\texttt{[i]} \;=\quad \hat{\texttt{b0}} \,+\, \hat{\texttt{b1}}\texttt{*gross_square_feet[i]}\]

Training and Test Data

  • Training data: When we’re building a linear regression model, we need data to train the model.

  • Test data: We also need data to test whether the model works well on new data.

  • So, when building a linear regression model, we start by splitting a given DataFrame into training and test DataFrames.

Training and Test Data

Training vs. Test

  • We use training data to train/fit the linear regression model.
    • We then make a prediction using test data, which are unseen/new from the viewpoint of the trained linear regression model.
  • In this way, we can test whether our model performs well in the real world, where unseen data points exist.

Training and Test Data

Overfitting

Training and Test Data

Model Construction and Evaluation

Splitting Data into Training and Testing Data

A Little Bit of Statistics for the Uniform Distribution

  • The probability density function for the uniform distribution looks like:

  • With the uniform distribution, any value of \(x\) between 0 and 1 is equally likely to be drawn.

  • We will use the uniform distribution when splitting data into training and testing data sets.
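
A minimal NumPy sketch of this idea (the 0.4 cutoff mirrors the 60-40 split used later; all names are illustrative):

import numpy as np

rng = np.random.default_rng(123)    # seed set for replication
u = rng.uniform(0, 1, size=10)      # every value in [0, 1] is equally likely
in_training = u >= 0.4              # roughly 60% of observations fall in training
print(np.round(u, 2))
print(in_training)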

Splitting Data into Training and Testing Data

Randomization in the Sampling Process

  • Why do we randomize when splitting given data into training and test data?
    • Randomizing the sampling process ensures that the training and test data sets are representative of the population.
    • If the sample does not properly represent the entire population, the model result is biased toward the sample.
  • Suppose the splitting process is not randomized, so that the observations with sale_price > 10^6 are in the training data and the observations with sale_price <= 10^6 are in the test data.
    • What would be the model result then?

Linear Regression using PySpark

Example of Linear Regression using PySpark

  • We will use the data for residential property sales from September 2017 to August 2018 in NYC.

  • Each sales record contains a number of interesting variables, but here we focus on the following:

    • sale_price: a property’s sales price;
    • gross_square_feet: a property’s size;
    • age: a property’s age;
    • borough_name: a borough where a property is located.
  • Use summary statistics and visualization to explore the data.

Splitting Data into Training and Testing Data

Step 1. Importing Modules and Reading DataFrames

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import rand, col, pow, mean, when
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
spark = SparkSession.builder.master("local[*]").getOrCreate()

# 1. Read CSV data from URL
df_pd = pd.read_csv('https://bcdanl.github.io/data/home_sales_nyc.csv')
sale_df = spark.createDataFrame(df_pd)
sale_df.show()

Splitting Data into Training and Testing Data

Step 2. rand(seed = ANY_NUMBER)

# 2. Split data into training and testing sets by creating a random column "gp"
sale_df = sale_df.withColumn("gp", rand(seed=123)) # seed is set for replication

# Splits 60-40 into training and test sets 
dtrain = sale_df.filter(col("gp") >= 0.4) 
dtest = sale_df.filter(col("gp") < 0.4)

# Or simply,
dtrain, dtest = sale_df.randomSplit([0.6, 0.4], seed = 123)

Building an ML DataFrame using VectorAssembler()

# Now assemble predictors using the renamed column
assembler1 = VectorAssembler(
    inputCols=["gross_square_feet"], 
    outputCol="predictors")
  • VectorAssembler is a transformer in PySpark’s ML library that is used to combine multiple columns into a single vector column.
    • Many ML algorithms in Spark require the predictors to be represented as a single vector.
    • VectorAssembler is often one of the first steps in a Spark ML pipeline.

Building an ML DataFrame using VectorAssembler()

dtrain1 = assembler1.transform(dtrain) # training data
dtest1  = assembler1.transform(dtest)  # test data

dtrain1.select("predictors", "sale_price").show()
dtest1.select("predictors", "sale_price").show()
  • VectorAssembler.transform() returns a DataFrame with a new column, specified in outputCol in VectorAssembler().

Building a Linear Regression Model using LinearRegression().fit()

# Fit linear regression model using the new label column "sale_price"
model1 = (
    LinearRegression(
        featuresCol = "predictors", 
        labelCol = "sale_price")
    .fit(dtrain1)
)
  • LinearRegression(featuresCol="predictors", labelCol="sale_price")
    • This creates an instance of the LinearRegression class.
    • The features (independent variables) are in a column named “predictors”.
    • The label (dependent variable) is in a column named “sale_price”.
  • .fit() trains (fits) the linear regression model using the training DataFrame dtrain1.
    • This training process estimates the beta coefficients that best predict the label from the features.

Building a Linear Regression Model using LinearRegression().fit()

# Adding a "prediction" column to the dtest1 DataFrame
dtest1  = model1.transform(dtest1)
  • dtest1 = model1.transform(dtest1): Adds a new column prediction to dtest1 DataFrame.
    • This new column contains the outcomes predicted by the trained model1 for the test dataset dtest1.
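
Once dtest1 contains a prediction column, test-set accuracy can be checked with PySpark's RegressionEvaluator; a minimal sketch (assuming the dtest1 DataFrame from above):

from pyspark.ml.evaluation import RegressionEvaluator

# Test-set RMSE: compare the "prediction" column with the observed "sale_price"
evaluator = RegressionEvaluator(
    labelCol="sale_price",
    predictionCol="prediction",
    metricName="rmse")
test_rmse = evaluator.evaluate(dtest1)
print(test_rmse)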

Summary of the Regression Result

model1.intercept
model1.coefficients
model1.summary.coefficientStandardErrors
model1.summary.rootMeanSquaredError
model1.summary.r2

Summary of the Regression Result

Example of a Regression Table

Summary of the Regression Result

Make the Summary Pretty

import numpy as np
import scipy.stats as stats
from tabulate import tabulate

def regression_table(model, assembler):
    """
    Creates a formatted regression table from a fitted LinearRegression model and its VectorAssembler,
    and inserts a dashed horizontal line after the Intercept row. The table includes separate columns
    for the 95% confidence interval lower and upper bounds for each coefficient (computed at the 5% significance level)
    and an "Observations" row (using model.summary.numInstances) above the R² row.
    The RMSE row is placed as the last row.

    The columns are ordered as:
        Metric | Value | Significance | Std. Error | p-value | 95% CI Lower | 95% CI Upper

    For the "Value", "Std. Error", "95% CI Lower", and "95% CI Upper" columns, commas are inserted every three digits,
    with 3 decimal places (except for Observations which is formatted as an integer with commas).

    Parameters:
        model: A fitted LinearRegression model (with a .summary attribute).
        assembler: The VectorAssembler used to assemble the features for the model.

    Returns:
        A formatted string containing the regression table.
    """
    # Extract coefficients and standard errors as NumPy arrays
    coeffs = model.coefficients.toArray()
    std_errors_all = np.array(model.summary.coefficientStandardErrors)

    # Check if the intercept's standard error is included (one extra element).
    # In Spark, when fitIntercept=True, the intercept's standard error is the LAST element.
    if len(std_errors_all) == len(coeffs) + 1:
        intercept_se = std_errors_all[-1]
        std_errors = std_errors_all[:-1]
    else:
        intercept_se = None
        std_errors = std_errors_all

    # Compute t-statistics for feature coefficients (t = beta / SE(beta))
    # t_stats = coeffs / std_errors
    t_stats = model.summary.tValues

    # Degrees of freedom: number of instances minus number of predictors minus 1 (for intercept)
    df = model.summary.numInstances - len(coeffs) - 1

    # Compute the t-critical value for a 95% confidence interval (two-tailed, 5% significance)
    t_critical = stats.t.ppf(0.975, df)

    # Compute two-tailed p-values for each feature coefficient
    # p_values = [2 * (1 - stats.t.cdf(np.abs(t), df)) for t in t_stats]
    p_values = model.summary.pValues

    # Function to assign significance stars based on p-value
    def significance_stars(p):
        if p < 0.01:
            return "***"
        elif p < 0.05:
            return "**"
        elif p < 0.1:
            return "*"
        else:
            return ""

    # Build the table rows.
    # Order: Metric, Value, Significance, Std. Error, p-value, 95% CI Lower, 95% CI Upper.
    table = []
    for feature, beta, se, p in zip(assembler.getInputCols(), coeffs, std_errors, p_values):
        ci_lower = beta - t_critical * se
        ci_upper = beta + t_critical * se
        table.append([
            "Beta: " + feature,       # Metric name
            beta,                     # Beta estimate (Value)
            significance_stars(p),    # Significance stars
            se,                       # Standard error
            p,                        # p-value
            ci_lower,                 # 95% CI lower bound
            ci_upper                  # 95% CI upper bound
        ])

    # Compute and add the intercept row with its SE, p-value, significance, and CI (if available)
    if intercept_se is not None:
        intercept_t = model.intercept / intercept_se
        intercept_p = 2 * (1 - stats.t.cdf(np.abs(intercept_t), df))
        intercept_sig = significance_stars(intercept_p)
        ci_intercept_lower = model.intercept - t_critical * intercept_se
        ci_intercept_upper = model.intercept + t_critical * intercept_se
    else:
        intercept_se = ""
        intercept_p = ""
        intercept_sig = ""
        ci_intercept_lower = ""
        ci_intercept_upper = ""

    table.append([
        "Intercept",
        model.intercept,
        intercept_sig,
        intercept_se,
        intercept_p,
        ci_intercept_lower,
        ci_intercept_upper
    ])

    # Append overall model metrics:
    # Insert an Observations row using model.summary.numInstances,
    # then an R² row, and finally the RMSE row as the last row.
    table.append(["Observations", model.summary.numInstances, "", "", "", "", ""])
    table.append(["R²", model.summary.r2, "", "", "", "", ""])
    table.append(["RMSE", model.summary.rootMeanSquaredError, "", "", "", "", ""])

    # Format the table.
    # For the "Value" (index 1), "Std. Error" (index 3), "95% CI Lower" (index 5), and "95% CI Upper" (index 6) columns,
    # format with commas and 3 decimal places, except for Observations which should be an integer with commas.
    # For the p-value (index 4), format to 3 decimal places.
    formatted_table = []
    for row in table:
        formatted_row = []
        for i, item in enumerate(row):
            if row[0] == "Observations" and i == 1 and isinstance(item, (int, float, np.floating)) and item != "":
                # Format Observations as integer with commas, no decimals.
                formatted_row.append(f"{int(item):,}")
            elif isinstance(item, (int, float, np.floating)) and item != "":
                if i in [1, 3, 5, 6]:
                    formatted_row.append(f"{item:,.3f}")
                elif i == 4:
                    formatted_row.append(f"{item:.3f}")
                else:
                    formatted_row.append(f"{item:.3f}")
            else:
                formatted_row.append(item)
        formatted_table.append(formatted_row)

    # Generate the table string using tabulate.
    table_str = tabulate(
        formatted_table,
        headers=["Metric", "Value", "Sig.", "Std. Error", "p-value", "95% CI Lower", "95% CI Upper"],
        tablefmt="pretty",
        colalign=("left", "right", "center", "right", "right", "right", "right")
    )

    # Insert a dashed line after the Intercept row for clarity.
    lines = table_str.split("\n")
    dash_line = '-' * len(lines[0])
    for i, line in enumerate(lines):
        if "Intercept" in line and not line.strip().startswith('+'):
            lines.insert(i+1, dash_line)
            break

    return "\n".join(lines)

# Example usage:
# print(regression_table(lr_model, assembler))

Summary of the Regression Result

Make the Summary Pretty

# Using the UDF, regression_table(model, assembler)
print(regression_table(model1, assembler1))

Linear Regression Model with Multiple Predictors

Multiple Regression

  • What if the regression were missing something?
    • Maybe prices are not just about size, but maybe there are certain parts of NYC that are categorically more expensive than other parts of NYC.
    • Maybe Manhattan is just more expensive than Bronx.
    • Maybe apartments are different than non-apartments.
    • Maybe old houses are different than new houses.
  • It is often helpful to bring in multiple explanatory variables: a multiple regression.

Models and Assumptions

  • Linear regression assumes a linear relationship for \(Y = f(X_{1}, X_{2})\):

\[Y_{i} \,=\, \beta_{0} \,+\, \beta_{1} X_{1, i} \,+\,\beta_{2} X_{2, i} \,+\, \epsilon_{i}\] for \(i \,=\, 1, 2, \dots, n\), where \(i\) is the \(i\)-th observation in data.

  • \(\beta_0\) is an unknown true value of an intercept: average value for \(Y\) if \(X_{1} = 0\) and \(X_{2} = 0\)

  • \(\beta_1\) is an unknown true value of a slope: increase in average value for \(Y\) for each one-unit increase in \(X_{1}\)

  • \(\beta_2\) is an unknown true value of a slope: increase in average value for \(Y\) for each one-unit increase in \(X_{2}\)

Models and Assumptions

Random Noises

  • Linear regression assumes a linear relationship for \(Y = f(X_{1}, X_{2})\):

\[Y_{i} \,=\, \beta_{0} \,+\, \beta_{1} X_{1, i}\,+\, \beta_{2} X_{2, i} \,+\, \epsilon_{i}\] for \(i \,=\, 1, 2, \dots, n\), where \(i\) is the \(i\)-th observation in data.

  • \(\epsilon_i\) is a random noise, or a statistical error:

\[ \epsilon_i \sim N(0, \sigma^2) \]

  • Errors have a mean value of 0 with constant variance \(\sigma^2\).
  • Errors are uncorrelated with \(X_{1, i}\) and with \(X_{2, i}\).

Models and Assumptions

Best Fitting Plane

  • Linear regression finds the beta estimates \(( \hat{\beta_{0}}, \hat{\beta_{1}}, \hat{\beta_{2}} )\) such that:

    – The linear function \(f(X_{1}, X_{2}) = \hat{\beta_{0}} + \hat{\beta_{1}}X_{1} + \hat{\beta_{2}}X_{2}\) is as near as possible to \(Y\) for all \((X_{1, i}\,,\,X_{2, i}\,,\, Y_{i})\) pairs in the data.

    • It is the best fitting plane, or the predicted outcome, \(\hat{Y_{\,}} = \hat{\beta_{0}} + \hat{\beta_{1}}X_{1} + \hat{\beta_{2}}X_{2}\)

Models and Assumptions

Residual Errors

  • The estimated beta coefficients are chosen to minimize the sum of squares of the residual errors \((SSR)\): \[ \begin{align} SSR &\,=\, (\texttt{Residual_Error}_{1})^{2}\\ &\quad \,+\, (\texttt{Residual_Error}_{2})^{2}\\ &\quad\,+\, \cdots + (\texttt{Residual_Error}_{n})^{2}\\ \text{where}\qquad\qquad\qquad\qquad&\\ \texttt{Residual_Error}_{i} &\,=\, Y_{i} \,-\, \hat{Y_{i}},\\ \texttt{Predicted_Outcome}_{i}: \hat{Y_{i}} &\,=\, \hat{\beta_{0}} \,+\, \hat{\beta_{1}}X_{1, i} \,+\, \hat{\beta_{2}}X_{2, i} \end{align} \]

Multiple Regression - Best Fitting Plane

  • All else being equal, an increase in gross_square_feet by one unit is associated with an increase in sale_price by \(\hat{\beta_{1}}\).

Multiple Regression using PySpark

  • Let’s add a new predictor, age, to the model.
assembler2 = VectorAssembler(
                    inputCols=["gross_square_feet", "age"], 
                    outputCol="predictors")
dtrain2 = assembler2.transform(dtrain)
dtest2  = assembler2.transform(dtest)

model2 = (
    LinearRegression(
        featuresCol="predictors",
        labelCol="sale_price")
    .fit(dtrain2)
)
dtest2 = model2.transform(dtest2) 

print(regression_table(model2, assembler2))

Dummy Variables in Linear Regression

Motivation: Treating Categorical Variables in Linear Regression

  • Linear regression models require numerical predictors, but many variables are categorical.

    • e.g., Consider two houses that are equivalent except for their location: one in Manhattan and the other in the Bronx.
  • The Approach:
    Convert categorical variables into numerical format using dummy variables.

  • Why Do This?

    • Allows the model to compare different categories.
    • Each dummy variable indicates the presence (1) or absence (0) of a category.

What are Dummy Variables?

  • Definition: Binary indicators (0 or 1) representing categorical data.
  • Purpose: Transform qualitative data into a quantitative form for regression analysis.

Example:

\[ D_i = \begin{cases} 1, & \text{if the observation belongs to the category} \\ 0, & \text{otherwise} \end{cases} \]

Dummy Variables in Regression Models

Consider a regression model including a dummy variable:

\[ y_i = \beta_0 + \beta_1 x_i + \beta_2 D_i + \epsilon_i \]

  • \(x_i\): A continuous predictor.
  • \(D_i\): Dummy variable (e.g., political party affiliation, type of car).

Interpretation:
\(\beta_2\) captures the difference in the response \(y\) when the category is present (i.e., \(D_i=1\)) versus absent.

The Dummy Variable Trap

  • Problem: Including a dummy for every category introduces redundancy.

  • For a categorical variable with \(k\) levels:

    • If you include all \(k\) dummy variables in the model, their values always sum to 1:

    \[ D_{1i} + D_{2i} + \cdots + D_{ki} = 1 \quad \text{(for each observation)} \]

  • This is problematic, because one dummy is completely predictable from the others.

  • The intercept already captures the constant part (1), making one of the dummy variables redundant.

Avoiding the Dummy Variable Trap

  • Solution: Drop one dummy (often called the reference category).

  • The reference category is implied by the remaining dummies and the intercept.

    • The dummy for the reference category would be 1 when all of the included dummies are 0.
    • The dummy for the reference category would be 0 otherwise.

Proper model:

\[ y_i = \beta_0 + \beta_1 D_{1, i} + \beta_2 D_{2, i} + \cdots + \beta_{k-1} D_{(k-1), i} + \epsilon_i \]

  • Interpretation:
    • Each \(\beta_j\) (for \(j=1,2,\ldots,k-1\)) represents the difference from the reference category.
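
As a minimal sketch of the idea (assuming the NYC sale_df DataFrame with a borough_name column and Manhattan as the reference category), dummies can be created one at a time with when(); the UDF on the following slides generalizes this:

from pyspark.sql.functions import col, when

# 0/1 dummies for each borough except the reference category (Manhattan)
for b in ["Bronx", "Brooklyn", "Queens", "Staten Island"]:
    sale_df = sale_df.withColumn(
        "borough_name_" + b.replace(" ", "_"),
        when(col("borough_name") == b, 1).otherwise(0))

sale_df.select("borough_name", "borough_name_Bronx", "borough_name_Staten_Island").show(5)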

Dummy Variable Regression using PySpark

  • The UDF add_dummy_variables(var_name, reference_level) converts a categorical variable into its dummy variables:
    • var_name: a string with the categorical variable's name
    • reference_level: the index position of the reference category among the alphabetically sorted categories
    • This function requires the global dtrain and dtest DataFrames.
def add_dummy_variables(var_name, reference_level, category_order=None):
    """
    Creates dummy variables for the specified column in the global DataFrames dtrain and dtest.
    Allows manual setting of category order.

    Parameters:
        var_name (str): The name of the categorical column (e.g., "borough_name").
        reference_level (int): Index of the category to be used as the reference (dummy omitted).
        category_order (list, optional): List of categories in the desired order. If None, categories are sorted.

    Returns:
        dummy_cols (list): List of dummy column names excluding the reference category.
        ref_category (str): The category chosen as the reference.
    """
    global dtrain, dtest

    # Get distinct categories from the training set.
    categories = dtrain.select(var_name).distinct().rdd.flatMap(lambda x: x).collect()

    # Convert booleans to strings if present.
    categories = [str(c) if isinstance(c, bool) else c for c in categories]

    # Use manual category order if provided; otherwise, sort categories.
    if category_order:
        # Ensure all categories are present in the user-defined order
        missing = set(categories) - set(category_order)
        if missing:
            raise ValueError(f"These categories are missing from your custom order: {missing}")
        categories = category_order
    else:
        categories = sorted(categories)

    # Validate reference_level
    if reference_level < 0 or reference_level >= len(categories):
        raise ValueError(f"reference_level must be between 0 and {len(categories) - 1}")

    # Define the reference category
    ref_category = categories[reference_level]
    print("Reference category (dummy omitted):", ref_category)

    # Create dummy variables for all categories
    for cat in categories:
        dummy_col_name = var_name + "_" + str(cat).replace(" ", "_")
        dtrain = dtrain.withColumn(dummy_col_name, when(col(var_name) == cat, 1).otherwise(0))
        dtest = dtest.withColumn(dummy_col_name, when(col(var_name) == cat, 1).otherwise(0))

    # List of dummy columns, excluding the reference category
    dummy_cols = [var_name + "_" + str(cat).replace(" ", "_") for cat in categories if cat != ref_category]

    return dummy_cols, ref_category
    
  
# Example usage without category_order:
# dummy_cols_year, ref_category_year = add_dummy_variables('year', 0)

# Example usage with category_order:
# custom_order_wkday = ['sunday', 'monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday']
# dummy_cols_wkday, ref_category_wkday = add_dummy_variables('wkday', reference_level=0, category_order = custom_order_wkday)

Dummy Variable Regression using PySpark

  • Let’s check which categories are in borough_name and how many there are:
# Distinct categories:
(
    dtrain
    .select("borough_name")
    .distinct()
    .orderBy("borough_name")
    .show()
)
# Number of categories
(
    dtrain
    .select("borough_name")
    .distinct()
    .count()
)
  • Let’s convert the borough_name variable into its dummies using the UDF, add_dummy_variables(var_name, reference_level):
# 0: Bronx, 1: Brooklyn, 2: Manhattan, 3: Queens, 4: Staten Island
dummy_cols, ref_category = add_dummy_variables("borough_name", 2)
# To see dummy variables
dtrain.select(['borough_name'] + dummy_cols).show()
dtrain.select(['borough_name'] + dummy_cols).filter(col('borough_name') == "Manhattan").show()

Dummy Variable Regression using PySpark

  • Now we’re ready to run a linear regression with dummy variables:
# Define the list of predictor columns:
# Two numeric predictors plus the dummy variables (excluding the reference dummy)
conti_cols = ["gross_square_feet", "age"]
assembler_predictors = conti_cols + dummy_cols

assembler3 = VectorAssembler(
    inputCols = assembler_predictors,
    outputCol = "predictors"
)
dtrain3 = assembler3.transform(dtrain)
dtest3  = assembler3.transform(dtest)
model3 = LinearRegression(featuresCol="predictors", labelCol="sale_price").fit(dtrain3)
dtest3 = model3.transform(dtest3)

# For model3 and assembler3:
print(regression_table(model3, assembler3))

Residual Plots

Residuals

  • Model equation: \(Y_{i} \,=\, \beta_{0} \,+\, \beta_{1}X_{1,i} \,+\, \beta_{2}X_{2,i} \,+\, \epsilon_{i}\)
    • \(\epsilon_i\) is a random noise, or a statistical error:

\[ \epsilon_i \sim N(0, \sigma^2) \]

  • Errors have a mean value of 0 with constant variance \(\sigma^2\).
  • Errors are uncorrelated with \(X_{1,i}\) and with \(X_{2, i}\).

Residuals

  • If we re-arrange the simple regression equation, \[\begin{align} {\epsilon}_{i} \,=\, Y_{i} \,-\, (\, {\beta}_{0} \,+\, {\beta}_{1}X_{1,i} \,). \end{align}\]

  • \(\texttt{residual_error}_{i}\) can be thought of as an estimate of \(\epsilon_{i}\), denoted by \(\hat{\epsilon_{i}}\). \[\begin{align} \hat{\epsilon_{i}} \,=\, Y_{i} \,-\, (\, \hat{\beta_{0}} \,+\, \hat{\beta_{1}}X_{1,i} \,) \end{align}\]

Residual Plots

  • A residual plot is a scatterplot of fitted values versus residuals.

    • Fitted values on the x-axis
    • Residuals on the y-axis
  • A residual plot can be used to diagnose the quality of model results.

  • Because we assume that the \(\epsilon_{i}\) have a mean value of 0 with constant variance \(\sigma^2\), a well-behaved residual plot should bounce randomly and form a cloud roughly centered on the zero-residual level, the perfect prediction line.

  • From the residual plot, we should ask ourselves the following two questions:

    • On average, are the predictions correct?
    • Are there systematic errors?

Residual Plots

  • On average, are the predictions correct?
  • Are there systematic errors?

Residual Plots

Unbiasedness and Homoskedasticity

  • We would like a residual plot to be
    • Unbiased: have an average value of zero in any thin vertical strip;
    • Homoskedastic, which means “same stretch”: it is ideal to have the same spread of the residuals in any thin vertical strip.
    • When the variance of residuals changes across predicted values (e.g., residuals get larger as predicted values increase), the model suffers from heteroscedasticity.

Residual Plots

Examples

Residual Plots

What happens if biased?

  • The model consistently overpredicts or underpredicts for certain values.
    • Indicates that the model might be misspecified—perhaps missing important variables or using an incorrect functional form.
    • Leads to biased parameter estimates, meaning the coefficients are systematically off, reducing the validity of predictions and inferences.

Residual Plots

What happens if heteroscedasticity is present?

  • Consequences of heteroscedasticity:
    • 📉 Inefficient coefficient estimates: Estimates remain unbiased but are no longer efficient (i.e., they don’t have the smallest variance).
    • Biased standard errors: Leads to unreliable p-values and confidence intervals, potentially resulting in invalid hypothesis tests.
    • ⚠️ Misleading inferences: Predictors may appear statistically significant or insignificant incorrectly.
    • 🎯 Poor predictive performance: The model might perform poorly on future data, especially if the residual variance grows with higher predicted values.

Residual Plots in Python

  • PySpark itself does not have built-in visualization capabilities.
    • First, we convert the PySpark DataFrame into a Pandas DataFrame using .toPandas()
# Residual plot for Model 2:
# Convert test predictions to Pandas
rdf = dtest2.select(["prediction", "sale_price"]).toPandas()
rdf["residual"] = rdf["sale_price"] - rdf["prediction"]

Residual Plots in Python

  • We then use matplotlib.pyplot to draw a residual plot:
import matplotlib.pyplot as plt
import statsmodels.api as sm  # for lowess smoothing

plt.scatter(rdf["prediction"], rdf["residual"], alpha=0.2, color="darkgray")

# Use lowess smoothing for the trend line
smoothed = sm.nonparametric.lowess(rdf["residual"], rdf["prediction"])
plt.plot(smoothed[:, 0], smoothed[:, 1], color="darkblue")

# Perfect prediction line
#   A horizontal line at residual = 0
plt.axhline(y=0, color="red", linestyle="--")

# Labeling
plt.xlabel("Predicted sale.price (Model 2)")
plt.ylabel("Residuals")
plt.title("Residual Plot for Model 2")
plt.show()

Hypothesis Testing on Beta Coefficient

Hypothesis Testing on Beta Coefficient

  • Hypothesis testing is used to determine whether an independent variable has a statistically significant effect on the dependent variable in a linear regression model.

  • Consider the following linear regression model: \[ y_{i} = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \dots + \beta_k x_{k,i} + \epsilon_{i} \]

  • \(y\): Outcome

  • \(x_1, x_2, \dots, x_k\): Predictors

  • \(\beta_0\): Intercept

  • \(\beta_1, \beta_2, \dots, \beta_k\): Coefficients

  • \(\epsilon_{i}\): Error term

Hypotheses and Test Statistic

  • We test whether a specific coefficient \(\beta_j\) significantly differs from zero:
    • Null Hypothesis (\(H_0\)): \(\beta_j = 0\) (No effect)
    • Alternative Hypothesis (\(H_A\)): \(\beta_j \neq 0\) (Significant effect)
  • The t-statistic is used to test each coefficient:

\[ t = \frac{\hat{\beta_j} - 0}{SE(\hat{\beta_j})} \]

  • \(\hat{\beta_j}\): Estimated coefficient
  • \(SE(\hat{\beta_j})\): Standard error of the estimate
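
As a minimal sketch with hypothetical numbers (beta_hat, se, and df are made up), the t-statistic and its two-tailed p-value can be computed with scipy.stats, mirroring the regression_table() UDF shown earlier:

import numpy as np
import scipy.stats as stats

# Hypothetical numbers for illustration only
beta_hat = 450.0    # estimated coefficient
se       = 12.0     # its standard error
df       = 13000    # degrees of freedom: n - k - 1

t_stat = (beta_hat - 0) / se
p_value = 2 * (1 - stats.t.cdf(np.abs(t_stat), df))
print(t_stat, p_value)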

Hypothesis Testing - Decision Rule and Interpretation

  1. Calculate the p-value based on the t-distribution with \(n - k - 1\) degrees of freedom.
  2. Compare p-value with significance level \(\alpha\) (e.g., 0.05):
    • If \(p \leq \alpha\): Reject \(H_0\) (Significant)
    • If \(p > \alpha\): Fail to reject \(H_0\) (Not significant)
  • In our course, stars in a regression table mean a significance level:
    • * (10%); ** (5%); *** (1%)
  • Reject \(H_0\): There is sufficient evidence to suggest a statistically significant relationship between \(x_j\) and \(y\).
  • Fail to reject \(H_0\): There is no statistically significant evidence that \(x_j\) affects \(y\).

Interpreting Beta Estimates

Interpreting Estimated Beta Coefficients

Example

The model equation is \[\begin{align} \texttt{sale_price[i]} \;=\;\, &\texttt{b0} \,+\,\\ &\texttt{b1*gross_square_feet[i]} \,+\,\texttt{b2*age[i]}\,+\,\\ &\texttt{b3*Bronx[i]} \,+\,\texttt{b4*Brooklyn[i]} \,+\,\\&\texttt{b5*Queens[i]} \,+\,\texttt{b6*Staten Island[i]}\,+\,\\ &\texttt{e[i]} \end{align}\]

  • The reference level of the borough_name dummies is Manhattan.

Interpreting Estimated Beta Coefficients

1. gross_square_feet

  • Consider the predicted sales prices of the two houses, A and B.
    • Both A and B are in Bronx and with the same age.
    • gross_square_feet of house A is 2001, while that of house B is 2000.
  • All else being equal, an increase in gross_square_feet by one unit is associated with an increase in sale_price by \(\hat{\beta_{1}}\).
    • Why?

Interpreting Estimated Beta Coefficients

1. gross_square_feet

\[ \begin{align}\widehat{\texttt{sale_price[A]}} \;=\quad& \hat{\texttt{b0}} \,+\, \hat{\texttt{b1}}\texttt{*gross_square_feet[A]} \,+\, \hat{\texttt{b2}}\texttt{*age[A]} \,+\,\\ &\hat{\texttt{b3}}\texttt{*Bronx[A]}\,+\,\hat{\texttt{b4}}\texttt{*Brooklyn[A]} \,+\,\\ &\hat{\texttt{b5}}\texttt{*Queens[A]}\,+\, \hat{\texttt{b6}}\texttt{*Staten Island[A]}\\ \widehat{\texttt{sale_price[B]}} \;=\quad& \hat{\texttt{b0}} \,+\, \hat{\texttt{b1}}\texttt{*gross_square_feet[B]} \,+\, \hat{\texttt{b2}}\texttt{*age[B]}\,+\,\\ &\hat{\texttt{b3}}\texttt{*Bronx[B]}\,+\, \hat{\texttt{b4}}\texttt{*Brooklyn[B]} \,+\,\\ &\hat{\texttt{b5}}\texttt{*Queens[B]}\,+\, \hat{\texttt{b6}}\texttt{*Staten Island[B]} \end{align} \]

\[ \begin{align}\Leftrightarrow\qquad&\widehat{\texttt{sale_price[A]}} \,-\, \widehat{\texttt{sale_price[B]}}\qquad \\ \;=\quad &\hat{\texttt{b1}}\texttt{*}(\texttt{gross_square_feet[A]} - \texttt{gross_square_feet[B]})\\ \;=\quad &\hat{\texttt{b1}}\texttt{*}\texttt{(2001 - 2000)} \,=\, \hat{\texttt{b1}}\qquad\qquad\quad\;\; \end{align} \]

Interpreting Estimated Beta Coefficients

2. borough_nameBronx

  • Consider the predicted sales prices of the two houses, A and C.

    • Both A and C are with the same age and the same gross_square_feet.
    • A is in Bronx, and C is in Manhattan.
  • All else being equal, an increase in borough_nameBronx by one unit is associated with a change in sale_price by b3.

  • Equivalently, all else being equal, being in Bronx relative to being in Manhattan is associated with a decrease in sale_price by |b3|.

    • Why?

Interpreting Estimated Beta Coefficients

2. borough_nameBronx

\[ \begin{align}\widehat{\texttt{sale_price[A]}} \;=\quad& \hat{\texttt{b0}} \,+\, \hat{\texttt{b1}}\texttt{*gross_square_feet[A]} \,+\, \hat{\texttt{b2}}\texttt{*age[A]} \,+\,\\ &\hat{\texttt{b3}}\texttt{*Bronx[A]}\,+\, \hat{\texttt{b4}}\texttt{*Brooklyn[A]} \,+\,\\ &\hat{\texttt{b5}}\texttt{*Queens[A]}\,+\, \hat{\texttt{b6}}\texttt{*Staten Island[A]}\\ \widehat{\texttt{sale_price[C]}} \;=\quad& \hat{\texttt{b0}} \,+\, \hat{\texttt{b1}}\texttt{*gross_square_feet[C]} \,+\, \hat{\texttt{b2}}\texttt{*age[C]}\,+\,\\ &\hat{\texttt{b3}}\texttt{*Bronx[C]}\,+\, \hat{\texttt{b4}}\texttt{*Brooklyn[C]} \,+\,\\ &\hat{\texttt{b5}}\texttt{*Queens[C]}\,+\, \hat{\texttt{b6}}\texttt{*Staten Island[C]} \end{align} \]

\[ \begin{align}\Leftrightarrow\qquad&\widehat{\texttt{sale_price[A]}} \,-\, \widehat{\texttt{sale_price[C]}}\qquad \\ \;=\quad &\hat{\texttt{b3}}\texttt{*}\texttt{Bronx[A]} \\ \;=\quad &\hat{\texttt{b3}}\qquad\qquad\qquad\qquad\quad\;\;\;\,\end{align} \]

Coefficient Plots in Python

  • A coefficient plot shows beta estimates and their confidence intervals.
import matplotlib.pyplot as plt
import pandas as pd

# Create a Pandas DataFrame from model3's summary information
terms = assembler3.getInputCols()
coefs = model3.coefficients.toArray()[:len(terms)]
stdErrs = model3.summary.coefficientStandardErrors[:len(terms)]

df_summary = pd.DataFrame({
    "term": terms,
    "estimate": coefs,
    "std_error": stdErrs
})

# Plot using the DataFrame columns
plt.errorbar(df_summary["term"], df_summary["estimate"],
             yerr = 1.96 * df_summary["std_error"], fmt='o', capsize=5)
plt.xlabel("Terms")
plt.ylabel("Coefficient Estimate")
plt.title("Coefficient Estimates (Model 3)")
plt.axhline(0, color="red", linestyle="--")  # Add horizontal line at 0
plt.xticks(rotation=45)
plt.show()

Linear Regression with Log Transformation

Linear Regression with Log Transformation

The model equation with log-transformed \(\texttt{sale.price[i]}\) is \[\begin{align} \log(\texttt{sale.price[i]}) \;=\;\, &\texttt{b0} \,+\,\\ &\texttt{b1*gross.square.feet[i]} \,+\,\texttt{b2*age[i]}\,+\,\\ &\texttt{b3*Bronx[i]} \,+\,\texttt{b4*Brooklyn[i]} \,+\,\\&\texttt{b5*Queens[i]} \,+\,\texttt{b6*Staten Island[i]}\,+\,\\ &\texttt{e[i]}. \end{align}\]

  • Note that the reference level for borough_name is Manhattan.

Beta Estimates for Log-transformed Variables

1. gross_square_feet

  • Let’s re-consider the two properties \(\texttt{A}\) and \(\texttt{B}\).
    • \(\texttt{gross.square.feet[A]} = 2001\) and \(\texttt{gross.square.feet[B]} = 2000\).
    • Both are in the same borough.
    • Both properties’ ages are the same.

Beta Estimates for Log-transformed Variables

1. gross_square_feet

  • If we apply the rule above for \(\widehat{\texttt{sale.price}}\texttt{[A]}\) and \(\widehat{\texttt{sale.price}}\texttt{[B]}\),

\[\begin{align}&\log(\widehat{\texttt{sale.price}}\texttt{[A]}) - \log(\widehat{\texttt{sale.price}}\texttt{[B]}) \\ \,=\, &\hat{\texttt{b1}}\,*\,(\texttt{gross.square.feet[A]} \,-\, \texttt{gross.square.feet[B]})\\ \,=\, &\hat{\texttt{b1}}\end{align}\]

So we can have the following: \[ \begin{align} &\Leftrightarrow\qquad\frac{\widehat{\texttt{sale.price[A]}}}{ \widehat{\texttt{sale.price[B]}}} \;=\; \texttt{exp(}\hat{\texttt{b1}}\texttt{)}\\ \quad&\Leftrightarrow\qquad\widehat{\texttt{sale.price[A]}} \;=\; \widehat{\texttt{sale.price[B]}} * \texttt{exp(}\hat{\texttt{b1}}\texttt{)} \end{align} \]

Beta Estimates for Log-transformed Variables

1. gross_square_feet

  • Suppose \(\texttt{exp(}\hat{\texttt{b1}}\texttt{)} = 1.000431\).
    • Then \(\widehat{\texttt{sale.price[A]}}\) is \(1.000431\times\widehat{\texttt{sale.price[B]}}\).
    • All else being equal, an increase in \(\texttt{gross.square.feet}\) by one unit is associated with an increase in \(\texttt{sale.price}\) by 0.0431%.

Beta Estimates for Log-transformed Variables

2. borough_nameBronx

  • Let’s re-consider the two properties \(\texttt{A}\) and \(\texttt{C}\).
    • A is in Bronx, and C is in Manhattan.
    • Both A and C’s age are the same.
    • Both A and C’s gross.square.feet are the same.

Beta Estimates for Log-transformed Variables

2. borough_nameBronx

  • If we apply the log()-exp() rules for \(\widehat{\texttt{sale.price}}\texttt{[A]}\) and \(\widehat{\texttt{sale.price}}\texttt{[C]}\),

\[\begin{align}&\log(\widehat{\texttt{sale.price}}\texttt{[A]}) - \log(\widehat{\texttt{sale.price}}\texttt{[C]}) \\ \,=\, &\hat{\texttt{b3}}\,*\,(\texttt{borough_Bronx[A]} \,-\, \texttt{borough_Bronx[C]})\,=\, \hat{\texttt{b3}}\end{align}\]

So we can have the following: \[ \begin{align}&\Leftrightarrow\qquad\frac{\widehat{\texttt{sale.price[A]}}}{ \widehat{\texttt{sale.price[C]}}} \;=\; \texttt{exp(}\hat{\texttt{b3}}\texttt{)}\\ \quad&\Leftrightarrow\qquad\,\widehat{\texttt{sale.price[A]}} \;=\; \widehat{\texttt{sale.price[C]}} * \texttt{exp(}\hat{\texttt{b3}}\texttt{)} \end{align} \]

Beta Estimates for Log-transformed Variables

2. borough_nameBronx

  • Suppose \(\texttt{exp(}\hat{\texttt{b3}}\texttt{)} = 0.2831691\).
    • Then \(\widehat{\texttt{sale.price[A]}}\) is \(0.2831691\times\widehat{\texttt{sale.price[C]}}\).
    • All else being equal, an increase in \(\texttt{borough_Bronx}\) by one unit is associated with a decrease in \(\texttt{sale.price}\) by (1 - 0.2831691) = 71.68%.
    • All else being equal, being in Bronx relative to being in Manhattan is associated with a decrease in \(\texttt{sale.price}\) by 71.68%.

Beta Estimates for Log-transformed Variables

  • All else being equal, an increase in gross.square.feet by one unit is associated with an increase in log(sale.price) by b1.
    • All else being equal, an increase in gross.square.feet by one unit is associated with an increase in sale.price by (exp(b1) - 1) * 100%.
  • All else being equal, an increase in borough_Bronx by one unit is associated with a change in log(sale.price) by b3.
    • All else being equal, being in Bronx is associated with a decrease in sale.price by |(exp(b3) - 1)| * 100% relative to being in Manhattan.
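
As a minimal sketch with hypothetical estimates (b1_hat and b3_hat are made-up values chosen to match the examples above), the percentage interpretations can be computed with NumPy's exp():

import numpy as np

# Hypothetical estimates for illustration only
b1_hat = 0.000431    # coefficient on gross_square_feet (log-price model)
b3_hat = -1.2617     # coefficient on the Bronx dummy (Manhattan is the reference)

pct_per_sqft = (np.exp(b1_hat) - 1) * 100   # roughly +0.0431% per extra square foot
pct_bronx    = (np.exp(b3_hat) - 1) * 100   # roughly -71.7% relative to Manhattan
print(pct_per_sqft, pct_bronx)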

Log-transformed Variables in PySpark

  • We use the log() function to add a log-transformed variable:
from pyspark.sql.functions import log
sale_df = sale_df.withColumn("log_sale_price", 
                              log( sale_df['sale_price'] ) ) 
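
Since dtrain1 was assembled before the log variable existed, one simple option is to add the log outcome directly to the assembled training DataFrame and refit; a minimal sketch (dtrain1_log and model_log are illustrative names, not part of the lecture code):

from pyspark.sql.functions import col, log

# Add the log outcome to the already-assembled training data and refit
dtrain1_log = dtrain1.withColumn("log_sale_price", log(col("sale_price")))

model_log = (
    LinearRegression(
        featuresCol="predictors",
        labelCol="log_sale_price")
    .fit(dtrain1_log)
)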
  • To use an exponential function, we can use the exp() function from NumPy:
import numpy as np

np.exp(1)
np.exp([1, 2, 3])