Classwork 11
Addressing Quasi-Separation in Logistic Regression with Regularization
Setup for PySpark, UDFs, and Plots
Required Libraries and SparkSession Entry Point
# Below is for an interactive display of Pandas DataFrame in Colab
from google.colab import data_table
data_table.enable_dataframe_formatter()
import pandas as pd
import numpy as np
from tabulate import tabulate # for table summary
import scipy.stats as stats
from scipy.stats import norm
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm # for lowess smoothing
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import roc_curve
from pyspark.sql import SparkSession
from pyspark.sql.functions import rand, col, pow, mean, avg, when, log, sqrt, exp
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression, GeneralizedLinearRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
spark = SparkSession.builder.master("local[*]").getOrCreate()
UDF for Adding Dummy Variables
def add_dummy_variables(var_name, reference_level, category_order=None):
    """
    Creates dummy variables for the specified column in the global DataFrames dtrain and dtest.
    Allows manual setting of category order.

    Parameters:
        var_name (str): The name of the categorical column (e.g., "borough_name").
        reference_level (int): Index of the category to be used as the reference (dummy omitted).
        category_order (list, optional): List of categories in the desired order. If None, categories are sorted.

    Returns:
        dummy_cols (list): List of dummy column names excluding the reference category.
        ref_category (str): The category chosen as the reference.
    """
    global dtrain, dtest

    # Get distinct categories from the training set.
    categories = dtrain.select(var_name).distinct().rdd.flatMap(lambda x: x).collect()

    # Convert booleans to strings if present.
    categories = [str(c) if isinstance(c, bool) else c for c in categories]

    # Use manual category order if provided; otherwise, sort categories.
    if category_order:
        # Ensure all categories are present in the user-defined order
        missing = set(categories) - set(category_order)
        if missing:
            raise ValueError(f"These categories are missing from your custom order: {missing}")
        categories = category_order
    else:
        categories = sorted(categories)

    # Validate reference_level
    if reference_level < 0 or reference_level >= len(categories):
        raise ValueError(f"reference_level must be between 0 and {len(categories) - 1}")

    # Define the reference category
    ref_category = categories[reference_level]
    print("Reference category (dummy omitted):", ref_category)

    # Create dummy variables for all categories
    for cat in categories:
        dummy_col_name = var_name + "_" + str(cat).replace(" ", "_")
        dtrain = dtrain.withColumn(dummy_col_name, when(col(var_name) == cat, 1).otherwise(0))
        dtest = dtest.withColumn(dummy_col_name, when(col(var_name) == cat, 1).otherwise(0))

    # List of dummy columns, excluding the reference category
    dummy_cols = [var_name + "_" + str(cat).replace(" ", "_") for cat in categories if cat != ref_category]

    return dummy_cols, ref_category

# Example usage without category_order:
# dummy_cols_year, ref_category_year = add_dummy_variables('year', 0)

# Example usage with category_order:
# custom_order_wkday = ['sunday', 'monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday']
# dummy_cols_wkday, ref_category_wkday = add_dummy_variables('wkday', reference_level=0, category_order = custom_order_wkday)
Setup for scikit-learn and Plots
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import (confusion_matrix, accuracy_score, precision_score,
recall_score, roc_curve, roc_auc_score)
Question 1
dfpd = pd.read_csv('https://bcdanl.github.io/data/car-data.csv')
- Convert the `dfpd` Pandas DataFrame into the PySpark DataFrame object with the name `df` (a sketch of the conversion appears after the variable description below).
Variable description

| Variable | Description |
|---|---|
| buying | Buying price of the car (vhigh, high, med, low) |
| maint | Maintenance cost (vhigh, high, med, low) |
| doors | Number of doors (2, 3, 4, 5more) |
| persons | Capacity in terms of persons to carry (2, 4, more) |
| lug_boot | Size of luggage boot (small, med, big) |
| safety | Estimated safety of the car (low, med, high) |
| rating | Car acceptability (unacc, acc, good, vgood) |
| fail | TRUE if the car is unacceptable (unacc), otherwise FALSE |
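A minimal sketch of the conversion, assuming the `spark` session created in the setup above:

# Convert the Pandas DataFrame to a PySpark DataFrame named df
df = spark.createDataFrame(dfpd)
df.show(5)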
Question 2
- Divide the `df` DataFrame into training and test DataFrames (see the sketch below).
  - Use `dtrain` and `dtest` for the training and test DataFrames, respectively.
  - 70% of the observations in `df` are assigned to `dtrain`; the rest are assigned to `dtest`.
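A minimal sketch of the split; the seed value is an arbitrary choice for reproducibility:

# 70/30 random split into training and test DataFrames
dtrain, dtest = df.randomSplit([0.7, 0.3], seed = 1234)
print(dtrain.count(), dtest.count())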
Question 3
Fit the following regression model:
\[
\begin{align}
&\quad\;\; \text{Prob}(\text{fail}_{i} = 1) \\
&= G\Big(\beta_{0} \\
&\qquad\quad\;\;\; \,+\, \beta_{1} \text{buying\_med}_{i} \,+\, \beta_{2} \text{buying\_high}_{i} \,+\, \beta_{3} \text{buying\_vhigh}_{i} \\
&\qquad\quad\;\;\; \,+\, \beta_{4} \text{maint\_med}_{i} \,+\, \beta_{5} \text{maint\_high}_{i} \,+\, \beta_{6} \text{maint\_vhigh}_{i} \\
&\qquad\quad\;\;\; \,+\, \beta_{7} \text{persons\_4}_{i} \,+\, \beta_{8} \text{persons\_more}_{i} \\
&\qquad\quad\;\;\; \,+\, \beta_{9} \text{lug\_boot\_med}_{i} \,+\, \beta_{10} \text{lug\_boot\_big}_{i} \\
&\qquad\quad\;\;\; \,+\, \beta_{11} \text{safety\_med}_{i} \,+\, \beta_{12} \text{safety\_high}_{i} \Big),
\end{align}
\]
where \(G(\,\cdot\,)\) is
\[ G(\,\cdot\,) = \frac{\exp(\,\cdot\,)}{1 + \exp(\,\cdot\,)}. \]
Provide the summary of the regression result.
- Set the reference levels accordingly; the model above omits buying_low, maint_low, persons_2, lug_boot_small, and safety_low, so those serve as the reference levels. A sketch of one way to fit the model follows below.
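A sketch of one way to fit this model with PySpark's GeneralizedLinearRegression, assuming the add_dummy_variables UDF and the dtrain/dtest split from above; the category orders below are assumptions based on the variable description:

# If fail is read as a boolean, cast it to 0/1 so it can serve as the label
dtrain = dtrain.withColumn("fail", col("fail").cast("int"))
dtest = dtest.withColumn("fail", col("fail").cast("int"))

# Dummies with reference levels low / low / 2 / small / low, matching the model above
x_cols = []
for var, order in [('buying',   ['low', 'med', 'high', 'vhigh']),
                   ('maint',    ['low', 'med', 'high', 'vhigh']),
                   ('persons',  ['2', '4', 'more']),
                   ('lug_boot', ['small', 'med', 'big']),
                   ('safety',   ['low', 'med', 'high'])]:
    dummy_cols, _ = add_dummy_variables(var, reference_level=0, category_order=order)
    x_cols += dummy_cols

# Assemble the dummies into a feature vector and fit a binomial GLM (logistic regression)
assembler = VectorAssembler(inputCols=x_cols, outputCol="features")
dtrain_a = assembler.transform(dtrain)
dtest_a = assembler.transform(dtest)

model = GeneralizedLinearRegression(featuresCol="features", labelCol="fail",
                                    family="binomial", link="logit").fit(dtrain_a)
print(model.summary)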
Question 4
- How are the coefficient estimates? Do any unusually large estimates or standard errors suggest quasi-separation? A sketch for tabulating them follows below.
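A sketch for collecting the estimates into one table, assuming the model and x_cols from the Question 3 sketch (in Spark's GLM training summary, the intercept's standard error is reported as the last entry):

# Tabulate estimates and standard errors; very large values can signal quasi-separation
terms = x_cols + ["Intercept"]
estimates = list(model.coefficients) + [model.intercept]
std_errors = model.summary.coefficientStandardErrors  # intercept is the last entry

coef_table = pd.DataFrame({"term": terms, "estimate": estimates, "std_error": std_errors})
print(coef_table)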
Question 5
- Calculate the following (a sketch of one approach appears after this list):
  - Confusion matrix with the appropriate threshold level
  - Accuracy
  - Precision
  - Recall
  - Specificity
  - Average rate of unacceptable cars (fail = TRUE)
  - Enrichment
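A sketch of one approach on the test set, assuming the model and dtest_a from the Question 3 sketch; the 0.5 threshold is an illustrative choice:

# Predicted probabilities on the test set (for a binomial GLM, prediction is the probability)
pdf = model.transform(dtest_a).select("fail", "prediction").toPandas()

threshold = 0.5  # illustrative; choose the threshold that fits the application
pdf["pred"] = (pdf["prediction"] >= threshold).astype(int)

# Confusion-matrix counts
tp = ((pdf.pred == 1) & (pdf.fail == 1)).sum()
fp = ((pdf.pred == 1) & (pdf.fail == 0)).sum()
tn = ((pdf.pred == 0) & (pdf.fail == 0)).sum()
fn = ((pdf.pred == 0) & (pdf.fail == 1)).sum()

accuracy    = (tp + tn) / (tp + fp + tn + fn)
precision   = tp / (tp + fp)
recall      = tp / (tp + fn)
specificity = tn / (tn + fp)
avg_rate    = pdf.fail.mean()       # average rate of unacceptable cars in the test set
enrichment  = precision / avg_rate  # precision relative to the baseline rate

print(tabulate([[accuracy, precision, recall, specificity, avg_rate, enrichment]],
               headers=["Accuracy", "Precision", "Recall", "Specificity",
                        "Avg rate", "Enrichment"]))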
Question 6
Visualize the variation in recall and enrichment across different threshold levels.
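A sketch of the threshold plot, assuming the pdf DataFrame of test-set probabilities from the Question 5 sketch:

# Recall and enrichment as functions of the classification threshold
prec, rec, thr = precision_recall_curve(pdf["fail"], pdf["prediction"])
enrich = prec / pdf["fail"].mean()

fig, ax = plt.subplots()
ax.plot(thr, rec[:-1], label="Recall")
ax.plot(thr, enrich[:-1], label="Enrichment")
ax.set_xlabel("Threshold")
ax.legend()
plt.show()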
Question 7
- Draw the receiver operating characteristic (ROC) curve.
- Calculate the area under the curve (AUC).
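A sketch using the same pdf of test-set probabilities from the Question 5 sketch:

# ROC curve and AUC on the test set
fpr, tpr, _ = roc_curve(pdf["fail"], pdf["prediction"])
auc = roc_auc_score(pdf["fail"], pdf["prediction"])

fig, ax = plt.subplots()
ax.plot(fpr, tpr, label=f"ROC curve (AUC = {auc:.3f})")
ax.plot([0, 1], [0, 1], linestyle="--")  # random-classifier reference line
ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate")
ax.legend()
plt.show()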
Question 8
- Use `sklearn` to fit a Lasso logistic regression (see the sketch below).
- Repeat Questions 3-7.
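A sketch of the Lasso fit, assuming a pandas-side design matrix built with pd.get_dummies (note that drop_first=True drops the alphabetically first level, which may differ from the reference levels used above); C = 1.0 is an arbitrary starting point that could be tuned, e.g., with LogisticRegressionCV:

# One-hot encode the predictors and split 70/30
X = pd.get_dummies(dfpd[["buying", "maint", "persons", "lug_boot", "safety"]], drop_first=True)
y = dfpd["fail"].astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)

# L1 (Lasso) penalty; smaller C means stronger regularization
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
lasso.fit(X_train, y_train)

print(pd.Series(lasso.coef_[0], index=X.columns))  # some coefficients may shrink to exactly zero
print("Test accuracy:", lasso.score(X_test, y_test))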
Question 9
- Use `sklearn` to fit a Ridge logistic regression (see the sketch below).
- Repeat Questions 3-7.
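A sketch of the Ridge fit, reusing the X_train/X_test split from the Question 8 sketch; C = 1.0 is again an arbitrary starting point:

# L2 (Ridge) penalty; coefficients shrink toward zero but rarely become exactly zero
ridge = LogisticRegression(penalty="l2", solver="lbfgs", C=1.0, max_iter=1000)
ridge.fit(X_train, y_train)

print(pd.Series(ridge.coef_[0], index=X.columns))
print("Test accuracy:", ridge.score(X_test, y_test))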
Question 10
- Use `sklearn` to fit an Elastic Net logistic regression (see the sketch below).
- Repeat Questions 3-7.
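A sketch of the Elastic Net fit, again reusing the Question 8 split; l1_ratio = 0.5 is an arbitrary mix of the L1 and L2 penalties:

# Elastic Net requires the saga solver; l1_ratio blends the Lasso and Ridge penalties
enet = LogisticRegression(penalty="elasticnet", solver="saga",
                          l1_ratio=0.5, C=1.0, max_iter=5000)
enet.fit(X_train, y_train)

print(pd.Series(enet.coef_[0], index=X.columns))
print("Test accuracy:", enet.score(X_test, y_test))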