Classwork 4
Logistic Regression I
0.1 Packages
library(tidyverse)
library(broom)
library(stargazer)
library(margins)
library(yardstick)
library(WVPlots)
library(pROC)
0.2 Data
titanic <- read_csv("https://bcdanl.github.io/data/titanic_details.csv")
titanic |>
  rmarkdown::paged_table()
0.3 Part A. Quick EDA (Warm-up)
0.3.1 A1) Outcome rate
- Compute the overall survival rate.
- Report it as a proportion and as a percent.
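One way to do this is a minimal `dplyr` sketch, assuming the `titanic` tibble has a 0/1 `survived` column (if it is coded `"yes"`/`"no"`, convert it first):

```r
library(tidyverse)

# Overall survival rate as a proportion and as a percent
titanic |>
  summarize(
    rate = mean(survived),       # proportion of passengers who survived
    pct  = 100 * mean(survived)  # same quantity expressed as a percent
  )
```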
0.3.2 A2) Basic group comparisons
Compute survival rates by:
- gender
- class
Question: Which group(s) appear to have higher survival rates?
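A sketch of the group comparison, again assuming 0/1 `survived` and columns named `gender` and `class`:

```r
library(tidyverse)

# Survival rate by gender
titanic |>
  group_by(gender) |>
  summarize(rate = mean(survived), n = n())

# Survival rate by class
titanic |>
  group_by(class) |>
  summarize(rate = mean(survived), n = n())
```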
0.4 Part B. Logistic Regression Models
We model:
\[ \Pr(\text{survived}=1 \mid X) = \frac{\exp(z)}{1 + \exp(z)} \] where \(z\) is a linear combination of predictors.
0.4.1 B1) Create a split
Use an 80/20 split with a fixed seed.
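A base-R sketch of the split (the seed value `1234` is an arbitrary choice; `rsample::initial_split()` is an alternative):

```r
set.seed(1234)  # fixed seed so the split is reproducible

n <- nrow(titanic)
train_idx <- sample(n, size = round(0.8 * n))  # 80% of rows for training

train <- titanic[train_idx, ]   # training set
test  <- titanic[-train_idx, ]  # held-out 20% test set
```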
0.4.2 B2) Baseline model
Fit a logistic regression model model1:
- outcome: survived
- predictors: gender + class
Questions:
- What does a positive coefficient mean in a logistic regression?
- Why is it hard to interpret coefficients directly in probability units?
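A sketch of the fit, assuming the `train` split from B1. Note that `glm()` coefficients are on the log-odds scale, which is why they are hard to read directly in probability units:

```r
# Baseline logistic regression: survived on gender and class
model1 <- glm(survived ~ gender + class,
              data = train, family = binomial)

# Tidy coefficient table (estimates are log-odds)
broom::tidy(model1)
```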
0.5 Part C. Average Marginal Effects (AME)
Recall: an AME is the average change in predicted probability when a predictor increases by 1 unit (or changes from 0 to 1 for a dummy), averaging over the sample.
0.5.1 C1) Compute AMEs
Compute AMEs for gender and class.
Questions:
- Interpret the AME for gender (be precise about which category is the reference).
- Interpret the AME for class.
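One way to compute the AMEs with the margins package, assuming `model1` from B2:

```r
library(margins)

# Average marginal effects for model1
ame1 <- margins(model1)
summary(ame1)  # the AME column is the average change in predicted probability
```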
0.5.2 C2) AMEs for a richer model
Fit a richer model model2 with the following predictors:
- gender
- class
- age
- fare
Compute AMEs for gender, class, age, fare:
Questions:
- Interpret the AME for age in percentage points per year.
- Interpret the AME for fare. Rescale your interpretation to a meaningful change (e.g., per $10 increase in fare).
- Did the AME for gender change from model1 to model2? Why might it change?
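A sketch of the richer model and its AMEs, assuming `train` from B1:

```r
library(margins)

# Richer logistic regression with four predictors
model2 <- glm(survived ~ gender + class + age + fare,
              data = train, family = binomial)

summary(margins(model2))
# To rescale: multiply the fare AME by 10 to read it as the
# percentage-point change per $10 increase in fare
```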
0.6 Part D. Prediction
Use model2 to predict probabilities on the test set.
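A sketch, assuming `model2` and the `test` split from earlier parts. `type = "response"` returns probabilities rather than log-odds:

```r
library(tidyverse)

# Predicted survival probabilities on the test set
test <- test |>
  mutate(pred_prob = predict(model2, newdata = test, type = "response"))
```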
0.7 Part E. Classification at a Threshold
We convert probabilities into class predictions:
\[ \hat{y} = \begin{cases} 1 & \text{if } \hat{p} \ge t \\\\ 0 & \text{if } \hat{p} < t \end{cases} \]
0.7.1 E1) Confusion matrix at \(t=0.5\)
Compute:
- accuracy
- sensitivity (recall for class 1)
- specificity
- precision
Questions:
- If the dataset is imbalanced, why can accuracy be misleading?
- Which error is "worse" here: a false positive or a false negative? Explain.
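A base-R sketch of the confusion matrix and the four metrics, assuming `test$pred_prob` from Part D (the yardstick package offers these metrics as well):

```r
library(tidyverse)

# Classify at threshold t = 0.5
test <- test |>
  mutate(pred_class = if_else(pred_prob >= 0.5, 1, 0))

conf <- table(truth = test$survived, prediction = test$pred_class)
conf

TP <- conf["1", "1"]; TN <- conf["0", "0"]
FP <- conf["0", "1"]; FN <- conf["1", "0"]

accuracy    <- (TP + TN) / sum(conf)
sensitivity <- TP / (TP + FN)  # recall for class 1
specificity <- TN / (TN + FP)
precision   <- TP / (TP + FP)
```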
0.7.2 E2) Try different thresholds
Repeat the metrics for thresholds:
- 0.3
- 0.5
- 0.7
Questions:
- What happens to recall when you lower the threshold?
- What happens to precision when you raise the threshold?
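One way to sweep the three thresholds, reusing `test$pred_prob` (the `factor(..., levels = c(0, 1))` calls keep the table 2x2 even if a threshold produces no predictions in one class):

```r
# Recompute accuracy, recall, and precision at each threshold
for (t in c(0.3, 0.5, 0.7)) {
  pred <- as.integer(test$pred_prob >= t)
  conf <- table(factor(test$survived, levels = c(0, 1)),
                factor(pred, levels = c(0, 1)))
  TP <- conf["1", "1"]; TN <- conf["0", "0"]
  FP <- conf["0", "1"]; FN <- conf["1", "0"]
  cat(sprintf("t=%.1f  acc=%.3f  recall=%.3f  precision=%.3f\n",
              t, (TP + TN) / sum(conf), TP / (TP + FN), TP / (TP + FP)))
}
```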
0.8 Part F. ROC Curve and AUC
Plot the ROC curve for model2 on the test set and compute the AUC.
Questions:
- What does AUC measure in plain English?
- Can a model have a good AUC but poor accuracy at threshold 0.5? Why?
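A sketch with the pROC package (loaded in the Packages section), assuming `test$pred_prob` from Part D:

```r
library(pROC)

# ROC curve and AUC for model2's test-set probabilities
roc_obj <- roc(response = test$survived, predictor = test$pred_prob)
plot(roc_obj)   # ROC curve: sensitivity vs. specificity across thresholds
auc(roc_obj)    # area under the curve
```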
1 Discussion
Welcome to our Classwork 4 Discussion Board!
This space is designed for you to engage with your classmates about the material covered in Classwork 4.
Whether you are looking to delve deeper into the material, share insights, or ask questions, this is the perfect place for you.
If you have any specific questions for Byeong-Hak (@bcdanl) regarding the Classwork 4 materials or need clarification on any points, don't hesitate to ask here.
All comments will be stored here.
Let's collaborate and learn from each other!