Logistic Regression

Machine Learning Lab

Author

Byeong-Hak Choe

Modified

May 8, 2024



Logistic Regression Model

  • In a regression model, we model the conditional mean for \(y\) given \(x\) as

\[ \begin{align} \mathbb{E}[\, y \,|\, \mathbf{X} \,] &= \beta_{0} \,+\, \beta_{1}\,x_{1} \,+\, \cdots \,+\, \beta_{p}\,x_{p}\\ { }\\ y_{i} &= \beta_{0} \,+\, \beta_{1}\,x_{1, i} \,+\, \cdots \,+\, \beta_{p}\,x_{p, i} + \epsilon_{i} \quad \text{for } i = 1, 2, ..., n \end{align} \]

  • Logistic regression is used to model a binary response:

  • \(y\) is either 1 or 0 (e.g., True or False).

  • \(\epsilon_{i}\) is the random noise from the logistic distribution, whose cumulative distribution function is as follows:

knitr::include_graphics('https://bcdanl.github.io/lec_figs/mba-1-11.png')

where \(z\) is a linear combination of explanatory variables \(\mathbf{X}\).

\[ f(z) = \frac{e^{z}}{1 + e^{z}} \]

  • Since \(y\) is either 0 or 1, the conditional mean value of \(y\) is the probability:

\[ \begin{align} \mathbb{E}[\, y \,|\, \mathbf{X} \,] &= \text{Pr}(y = 1 | \mathbf{X}) \times 1 \,+\,\text{Pr}(y = 0 | \mathbf{X}) \times 0\\ &= \text{Pr}(y = 1 | \mathbf{X}) \\ { }\\ \text{Pr}(y = 1 | \mathbf{X}) &= f(\beta_{0} \,+\, \beta_{1}\,x_{1} \,+\, \cdots \,+\, \beta_{p}\,x_{p})\\ &= \frac{e^{\beta_{0} \,+\, \beta_{1}\,x_{1} \,+\, \cdots \,+\, \beta_{p}\,x_{p}}}{1 + e^{\beta_{0} \,+\, \beta_{1}\,x_{1} \,+\, \cdots \,+\, \beta_{p}\,x_{p}}} \\ \end{align} \]

  • Logistic regression finds the beta coefficient estimates, \(b_{0}, b_{1}, \cdots\), such that the logistic function, which ranges from 0 to 1,

\[ \begin{align} f( b_{0} + b_{1}\,x_{i,1} + b_{2}\,x_{i,2} + \cdots )\notag \end{align} \]

is the best possible estimate of the binary outcome \(y_{i}\).
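
A minimal sketch of the logistic function in base R (the grid of \(z\) values is purely illustrative):

logistic <- function(z) exp(z) / (1 + exp(z))  # f(z) = e^z / (1 + e^z)

z <- seq(-6, 6, by = 0.1)        # illustrative grid of z values
plot(z, logistic(z), type = "l",
     xlab = "z", ylab = "f(z) = Pr(y = 1)")
abline(h = c(0, 1), lty = 2)     # f(z) stays between 0 and 1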


Interpretation of Beta Estimates

  • In logistic regression, the effect of \(x_{1}\) on \(\text{Pr}(y_{i} = 1 \,|\, \mathbf{X})\) is different across observations \(i = 1, 2, \cdots\):
knitr::include_graphics('https://bcdanl.github.io/lec_figs/effect-linear-logit.png')
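
For a continuous predictor \(x_{1}\), this follows from differentiating the logistic function: the marginal effect depends on observation \(i\) through \(f(z_{i})\),

\[ \frac{\partial \, \text{Pr}(y_{i} = 1 \,|\, \mathbf{X})}{\partial x_{1}} = \beta_{1}\, f(z_{i})\,\left(1 - f(z_{i})\right), \quad \text{where } z_{i} = \beta_{0} + \beta_{1}\,x_{1,i} + \cdots + \beta_{p}\,x_{p,i}, \]

so the same coefficient translates into different probability changes for different observations.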


The Goals of Logistic Regression:

  1. Modeling for prediction (\(\text{Pr}({y} | \mathbf{X})\)): When we want to predict an outcome variable \(y = 0, 1\) based on the information contained in a set of predictor variables \(\mathbf{X}\).
  • We are estimating the conditional expectation (mean) for \(y\): \[ \text{Pr}(\, y \,|\, \mathbf{X} \,) = f(\beta_{0} \,+\, \beta_{1}\,x_{1} \,+\, \cdots \,+\, \beta_{p}\,x_{p}). \]

  • This is the probability that \(y = 1\) given the values of \(\mathbf{X}\), obtained from the logistic function.

  • Prediction from the logistic regression with a threshold on the probabilities can be used as a classifier.

    • If the probability that newborn baby \(i\) is at risk is greater than the threshold \(\theta \in (0, 1)\) (i.e., \(\text{Pr}(y_{i} = 1 | \mathbf{X}) > \theta\)), baby \(i\) is classified as at-risk.
  • We can evaluate the performance of classifiers with the following measures:

    • Accuracy: When the classifier says this newborn baby is at risk or is not at risk, what is the probability that the model is correct?
    • Precision: If the classifier says this newborn baby is at risk, what's the probability that the baby is really at risk?
    • Recall: Of all the babies at risk, what fraction did the classifier detect?
    • There is a trade-off between recall and precision (a sketch of computing these measures from a threshold follows this list).
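
A minimal sketch of threshold-based classification metrics (the outcome and probability vectors below are hypothetical; in this note, predicted probabilities are obtained later with predict()):

# hypothetical outcomes and predicted probabilities, for illustration only
actual <- c(TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE)
prob   <- c(0.80, 0.10, 0.40, 0.65, 0.05, 0.30, 0.55)

theta      <- 0.5            # illustrative classification threshold
pred_class <- prob > theta   # classify as at-risk when prob exceeds theta

table(pred = pred_class, actual = actual)                # confusion matrix
accuracy  <- mean(pred_class == actual)                  # share of correct calls
precision <- sum(pred_class & actual) / sum(pred_class)  # correct among predicted positives
recall    <- sum(pred_class & actual) / sum(actual)      # detected among actual positives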


  2. Modeling for explanation (\(\hat{\beta}\)): When we want to explicitly describe and quantify the relationship between the outcome variable \(y\) and a set of explanatory variables \(\mathbf{X}\).
  • We can average the marginal effects across the training data (average marginal effect, or AME).

  • We can obtain the marginal effect at an average observation or representative observations in the training data (marginal effect at the mean or at representative values).

  • We can also consider (see the sketch after this list):

    • The AME for a subgroup of the data.

    • The marginal effect at the mean value of a variable of interest.
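
A minimal sketch of these variations with the margins package, assuming the fitted logistic regression object model used later in this note:

# AME within a subgroup of the training data (premature births only)
margins(model, data = subset(train, GESTREC3 == "< 37 weeks"))

# marginal effects evaluated at a representative value of a continuous predictor
margins(model, at = list(PWGT = mean(train$PWGT)))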


Example: Newborn Babies at Risk

  • We'll use a sample dataset from the 2010 CDC natality public-use data.
    • The data set records information about all US births, including risk factors about the mother and father, and about the delivery.
    • Newborn babies are assessed at one and five minutes after birth to determine if a baby needs immediate emergency care or extra medical attention.

Task 1. Identify the effects of several risk factors on the probability of atRisk == TRUE.

Task 2. Classify ahead of time babies with a higher probability of atRisk == TRUE.


Load Packages and Data

# install.packages("margins")
library(tidyverse)
library(margins)    # for AME
library(skimr)      # for skim()
library(hrbrthemes) # for ggplot theme, theme_ipsum()
library(stargazer)

theme_set(theme_ipsum()) # set theme_ipsum() as the default ggplot theme
load(url("https://bcdanl.github.io/data/NatalRiskData.rData"))

# 50:50 split between training and testing data
train <- filter(sdata,
                ORIGRANDGROUP <= 5)
test <- filter(sdata,
                ORIGRANDGROUP > 5)
skim(train)
Data summary
Name train
Number of rows 14212
Number of columns 15
_______________________
Column type frequency:
factor 2
logical 9
numeric 4
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
GESTREC3 0 1 FALSE 2 >= : 12651, < 3: 1561
DPLURAL 0 1 FALSE 3 sin: 13761, twi: 424, tri: 27

Variable type: logical

skim_variable n_missing complete_rate mean count
CIG_REC 0 1 0.09 FAL: 12913, TRU: 1299
ULD_MECO 0 1 0.05 FAL: 13542, TRU: 670
ULD_PRECIP 0 1 0.03 FAL: 13854, TRU: 358
ULD_BREECH 0 1 0.06 FAL: 13316, TRU: 896
URF_DIAB 0 1 0.05 FAL: 13450, TRU: 762
URF_CHYPER 0 1 0.01 FAL: 14046, TRU: 166
URF_PHYPER 0 1 0.04 FAL: 13595, TRU: 617
URF_ECLAM 0 1 0.00 FAL: 14181, TRU: 31
atRisk 0 1 0.02 FAL: 13939, TRU: 273

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
PWGT 0 1 153.28 38.87 74 125 145 172 375 ▆▇▂▁▁
UPREVIS 0 1 11.17 4.02 0 9 11 13 49 ▃▇▁▁▁
DBWT 0 1 3276.02 582.91 265 2985 3317 3632 6165 ▁▁▇▂▁
ORIGRANDGROUP 0 1 2.53 1.70 0 1 3 4 5 ▇▅▅▅▅


Linear Probability Model (LPM)

# linear probability model
lpm <- lm(atRisk ~ CIG_REC + GESTREC3 + DPLURAL + 
               ULD_MECO + ULD_PRECIP + ULD_BREECH + 
               URF_DIAB + URF_CHYPER + URF_PHYPER + URF_ECLAM, 
             data = train)
  • The LPM often works well for estimating average marginal effects (AME).

  • Caveats

    • The predicted probability of \(y = 1\) can fall outside [0, 1].
    • The error term cannot be normally distributed when \(y_{i} \in \{0, 1\}\).
    • When using the LPM, we should make the standard errors of the beta estimates robust to heteroskedasticity (non-constant error variances across observations), as sketched below.
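
A minimal sketch of heteroskedasticity-robust standard errors for the LPM, assuming the sandwich and lmtest packages are installed (they are not loaded elsewhere in this note):

# install.packages(c("sandwich", "lmtest"))
library(sandwich)
library(lmtest)

# heteroskedasticity-robust (HC1) standard errors for the LPM coefficients
coeftest(lpm, vcov. = vcovHC(lpm, type = "HC1"))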


Likelihood Function

  • Likelihood is the probability of our data given the model.
knitr::include_graphics('https://bcdanl.github.io/lec_figs/logistic_likelihood.png')

  • The probability that the seven data points would be observed: \(L = (1-P_{1})(1-P_{2})\,P_{3}\,(1-P_{4})\,P_{5}\,P_{6}\,P_{7}\).
    • The log of the likelihood: \(\log(L) = \log(1-P_{1}) + \log(1-P_{2}) + \log(P_{3}) + \log(1-P_{4}) + \log(P_{5}) + \log(P_{6}) + \log(P_{7})\)
  • In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed probability distribution, given some observed data.
    • This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the observed data is most probable.


Deviance

  • Deviance measures the distance between the data and the fit.

  • The null deviance is similar to the variance of the data around the average rate of positive examples.
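
The deviance calculations below reference a fitted logistic regression object named model, which is not defined elsewhere in this note; a minimal sketch of how it could be fit with glm() follows (the set of predictors is inferred from the LPM above and the AME table below, so treat the formula as an assumption):

# logistic regression fit; predictors inferred from the LPM and the AME output
model <- glm(atRisk ~ PWGT + UPREVIS + CIG_REC + GESTREC3 + DPLURAL +
               ULD_MECO + ULD_PRECIP + ULD_BREECH +
               URF_DIAB + URF_CHYPER + URF_PHYPER + URF_ECLAM,
             data = train,
             family = binomial(link = "logit"))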

# likelihood function for logistic regression
## y: the outcome in numeric form, either 1 or 0
## py: the predicted probability that y == 1

loglikelihood <- function(y, py) {
  sum( y * log(py) + (1-y)*log(1 - py) )
}

# the rate of positive example in the dataset
pnull <- mean( as.numeric(train$atRisk) )

# the null deviance
null.dev <- -2 *loglikelihood(as.numeric(train$atRisk), pnull)

# the null deviance from summary(model)
model$null.deviance
[1] 2698.716
# the predicted probability for the training data
pred <- predict(model, newdata=train, type = "response")

# the residual deviance
resid.dev <- -2 * loglikelihood(as.numeric(train$atRisk), pred)

# the residual deviance from summary(model)
model$deviance
[1] 2462.992


AIC, AICc, BIC and pseudo R-squared

  • The AIC, or the Akaike information criterion, is the log likelihood adjusted for the number of coefficients.
    • The corrected AIC, AICc, adjusts the AIC for the sample size and the degrees of freedom; it tends to perform better than the AIC in small samples.
  • The BIC, or Bayesian information criterion, attempts to approximate the posterior probability that each model is best.
    • The BIC can be useful only in small-sample settings.
  • The pseudo R-squared is a goodness-of-fit measure of how much of the deviance is "explained" by the model.
# the AIC
AIC <- 2 * ( length( model$coefficients ) -
             loglikelihood( as.numeric(train$atRisk), pred) )
AIC
[1] 2490.992
model$aic
[1] 2490.992
# the pseudo R-squared
pseudo_r2 <- 1 - (resid.dev / null.dev)
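
Base R also provides AIC() and BIC() helpers for fitted glm objects; a quick check against the values computed above:

AIC(model)  # should match model$aic computed above
BIC(model)  # uses a log(n) penalty per coefficient instead of 2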
  • The logistic regression preserves the marginal probabilities: the sum of the predicted probabilities equals the observed number of positive outcomes, both over the whole training data and within subgroups.
train$pred <- predict(model, newdata=train, type = "response")
test$pred <- predict(model, newdata=test, type="response")

sum(train$atRisk == TRUE)
[1] 273
sum(train$pred)
[1] 273
premature <- subset(train, GESTREC3 == "< 37 weeks")
sum(premature$atRisk == TRUE)
[1] 112
sum(premature$pred)
[1] 112


Average Marginal Effects

m <- margins(model)
ame_result <- summary(m)
ame_result
                   factor     AME     SE       z      p   lower   upper
                  CIG_REC  0.0064 0.0043  1.5000 0.1336 -0.0020  0.0148
 DPLURALtriplet or higher  0.0484 0.0290  1.6677 0.0954 -0.0085  0.1052
              DPLURALtwin  0.0064 0.0056  1.1480 0.2510 -0.0045  0.0173
       GESTREC3< 37 weeks  0.0450 0.0062  7.2235 0.0000  0.0328  0.0571
                     PWGT  0.0001 0.0000  2.5096 0.0121  0.0000  0.0001
               ULD_BREECH  0.0181 0.0056  3.2510 0.0012  0.0072  0.0290
                 ULD_MECO  0.0211 0.0082  2.5718 0.0101  0.0050  0.0371
               ULD_PRECIP  0.0038 0.0077  0.4946 0.6209 -0.0113  0.0189
                  UPREVIS -0.0012 0.0003 -4.0631 0.0000 -0.0017 -0.0006
               URF_CHYPER  0.0131 0.0115  1.1437 0.2528 -0.0094  0.0356
                 URF_DIAB -0.0055 0.0040 -1.3887 0.1649 -0.0133  0.0023
                URF_ECLAM  0.0114 0.0220  0.5196 0.6033 -0.0317  0.0545
               URF_PHYPER  0.0031 0.0052  0.6066 0.5441 -0.0070  0.0133
ggplot(data = ame_result) +
  geom_point( aes(factor, AME) ) +
  geom_errorbar(aes(x = factor, ymin = lower, ymax = upper), 
                width = .5) +
  geom_hline(yintercept = 0) +
  coord_flip()


