Regularized Logistic Regression

Hockey Player Evaluation

Author

Byeong-Hak Choe

Published

March 9, 2026

Modified

March 9, 2026

Setup

library(tidyverse)
library(broom)
library(stargazer)

library(glmnet) 
library(gamlr)   # a companion package for glmnet
library(Matrix)  # to create a sparse matrix

library(DT)
library(rmarkdown)
library(hrbrthemes)
library(ggthemes)

theme_set(
  theme_ipsum()
)

scale_colour_discrete <- function(...) scale_color_colorblind(...)
scale_fill_discrete <- function(...) scale_fill_colorblind(...)

Load Data

url <- "https://bcdanl.github.io/data/hockey.RDS"
dest <- file.path(tempdir(), "hockey.RDS")

download.file(url, destfile = dest, mode = "wb")
hockey <- readRDS(dest)

config <- hockey$config
goal   <- hockey$goal
player <- hockey$player  # dgCMatrix: a sparse matrix
team   <- hockey$team

rm(dest, url, hockey)   # remove unnecessary objects

Background and Motivation

The player “plus-minus” (PM) is a common hockey performance metric.
The classic PM is a function of goals scored while that player is on the ice:
- (the number of goals for his team) minus (the number against).
The limits of this approach are obvious: there is no accounting for teammates or opponents.
In hockey, where players tend to be grouped together on “lines” and coaches will “line match” against opponents, a player’s PM can be artificially inflated or deflated by the play of his opponents and peers.

Data

The data comprise play-by-play NHL data for regular-season and playoff games over the 11 seasons from 2002-2003 through 2013-2014.
There were p = 2,439 players involved in n = 69,449 goals.
The data contain information on seasons, home and away teams, team configurations such as a 5-on-4 power play, and which players were on the ice when each goal was scored.
homegoal: an indicator (0 or 1) for the home team scoring
player_name: entries for who was on the ice for each goal
team: indicators for each team
config: Special teams info. E.g., S5v4 is a 5 on 4 powerplay
The values of config, team, and player_name are:
- 1 if the entry applies to the home team
- -1 if the entry applies to the away team
- 0 otherwise
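The signed encoding above can be illustrated on a tiny made-up example. This is only a sketch: the player names, the three goals, and the `toy_` objects are hypothetical and are not taken from the hockey data.

```r
# Hypothetical toy data: 3 goals, 4 players (names are made up).
# Players A and B skate for the home team; C and D for the away team.
players <- c("PLAYER_A", "PLAYER_B", "PLAYER_C", "PLAYER_D")

toy_player <- matrix(
  c( 1,  1, -1, -1,   # goal 1: all four players on the ice
     1,  0, -1,  0,   # goal 2: only A and C on the ice
     0,  1,  0, -1),  # goal 3: only B and D on the ice
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("goal_", 1:3), players)
)

# 1 = home-team goal, 0 = away-team goal
toy_homegoal <- c(1, 0, 1)

toy_player

# Number of goals each player was on the ice for
colSums(abs(toy_player))
```

Each row is one goal event, and a zero simply means the player was not on the ice for that goal. In the real data most entries are zero, which is why a sparse matrix is used.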

Game Configurations (config)

5-on-5 (Even Strength): The most common game state, where each team has five skaters on the ice, excluding goalies. Corsi and Fenwick in this situation are considered the most reliable indicators of player performance because even-strength play minimizes the influence of special teams.
5-on-4 (Power Play): A situation where the player’s team has a one-man advantage due to an opponent’s penalty. Metrics in this scenario provide insight into how effectively a player contributes to creating offensive opportunities.

The code below converts each sparse matrix into a data frame (tibble):

df_player <- player |> 
  as.matrix() |> 
  as_tibble(.name_repair = "unique")

df_config <- config |> 
  as.matrix() |> 
  as_tibble(.name_repair = "unique")

df_team <- team |> 
  as.matrix() |> 
  as_tibble(.name_repair = "unique")

Model

Consider constructing a binary response for every goal, equal to 1 for home-team goals and 0 for away-team goals:

\[ \begin{align} \log\left(\frac{\text{Prob}(\text{home-goal}_{i})}{\text{Prob}(\text{away-goal}_{i})}\right) &= \beta_0 + \sum_{j=1}^{J} \beta_{\text{team}_{j}}\text{team}_{j, i} + \sum_{k=1}^{K} \beta_{\text{config}_{k}}\text{config}_{k, i} \\ &\qquad + \sum_{m=1}^{M} \beta_{\text{home-player}_{m}}\text{home-player}_{m, i}\\ &\qquad - \sum_{n=1}^{N} \beta_{\text{away-player}_{n}}\text{away-player}_{n, i} \end{align} \]

For each goal event \(i\), the model predicts whether the goal was scored by the home team (\(1\)) or the away team (\(0\)) so that we can evaluate each player’s contribution to goal scoring while accounting for teammates, opponents, teams, and game situations.

The left-hand side is the log-odds that goal \(i\) was scored by the home team rather than the away team.
\(\beta_0\) is the baseline log-odds of a home-team goal.
The team terms control for the strengths of the teams involved.
The configuration terms control for the game situation, such as even strength or power play.
The player terms measure how the players on the ice affect the odds of a home-team goal, after controlling for team strength and game configuration.

So, this model says that the chance a goal is scored by the home team depends on:

which teams are playing,
the game configuration, and
which home and away players are on the ice.

A positive player coefficient means that the player is associated with a greater likelihood that a goal is scored by his team, after accounting for teammates, opponents, team effects, and special-team situations.

This makes the model-based plus-minus more informative than the traditional plus-minus, because it adjusts for context rather than simply counting net goals while a player is on the ice.
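To make the log-odds equation concrete, here is a small sketch that evaluates the linear predictor for a single goal event. All coefficient values are hypothetical and chosen only for illustration, and the sketch assumes the signed encoding from the Data section (+1 for home, -1 for away), under which the separate home-player and away-player sums collapse into one signed sum.

```r
# Hypothetical coefficients, chosen only for illustration (not estimates).
beta_0      <- 0.08   # baseline log-odds of a home-team goal
beta_config <- 0.30   # e.g., a power-play configuration term
beta_star   <- 0.50   # a strong player
beta_avg    <- 0.00   # an average player (coefficient shrunk to zero)

# One goal event: home team on the power play (config = +1),
# the strong player is on the ice for the home team (+1),
# the average player is on the ice for the away team (-1).
eta <- beta_0 + beta_config * 1 + beta_star * 1 + beta_avg * (-1)

# plogis() is the inverse logit: exp(eta) / (1 + exp(eta))
p_home <- plogis(eta)

# Same event, but the strong player now skates for the away team (-1)
eta_away    <- beta_0 + beta_config * 1 + beta_star * (-1) + beta_avg * 1
p_home_away <- plogis(eta_away)

c(p_home = p_home, p_home_star_away = p_home_away)
```

Moving the strong player from the home side to the away side flips the sign of his contribution, so the modeled probability of a home-team goal drops.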

Lasso Logistic Regression

# A vector for the outcome variable
y <- goal$homegoal

# Sparse matrices for explanatory variables
x <- cbind(config, team, player) 
# A vector of penalty indicators (one value per column in x)
# - 1  => penalize this coefficient (default behavior)
# - 0  => do NOT penalize this coefficient (it will be kept "unshrunk")
p.fac <- rep(1, ncol(config) + ncol(team) + ncol(player))

# Set penalty.factor = 0 for config + team:
# config + team columns will be unpenalized (like "must-keep" controls),
# while the player columns remain penalized (eligible for shrinkage/selection).
p.fac[1:(ncol(config) + ncol(team))] <- 0

# Fit cross-validated lasso/logit (alpha default is 1 unless you set it)

# - penalty.factor applies column-by-column to x
# - standardize=FALSE => do NOT standardize predictors internally
# - standardize=TRUE => (default option) standardize predictors internally

set.seed(1)
model_lasso <- cv.glmnet(
  x, y,
  penalty.factor = p.fac,
  family = "binomial",
  standardize = FALSE
)

Scaling (`standardize = FALSE`)?

We do not want to standardize player indicators here.
- By default, glmnet internally standardizes predictors (standardize = TRUE): \[ X_j^{*}=\frac{X_j-\bar X_j}{SD(X_j)}. \] The lasso penalty is applied to coefficients on this standardized scale. Converting back to the original scale implies an effective penalty like \[ \lambda \times SD(X_j)\,|\beta_j|. \]
For player columns, \(X_{\text{player}}\) is close to a 0/1 indicator (or a signed on-ice indicator).
- Small \(SD(X_{\text{player}})\) typically means the player appears rarely (almost always 0).
- Large \(SD(X_{\text{player}})\) typically means the player appears often (more nonzero entries).
Standardizing would therefore inflate (divide by a small SD) the columns for players who rarely play, making their “on-ice” entries look large on the standardized scale.
- As a result, those rarely used player variables can be shrunk less and may appear more influential than is justified by their limited ice time.
Setting standardize = FALSE helps keep the comparison on the original exposure (ice-time / appearance) scale.
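The scaling argument can be checked with a quick calculation. The sketch below uses two hypothetical 0/1 on-ice indicators (the counts are made up) and compares their standard deviations, which act as the effective penalty weights under standardization. It uses base R's `sd()`, so the exact denominator may differ slightly from what glmnet uses internally.

```r
n_goals <- 10000  # hypothetical number of goal events

# Hypothetical 0/1 on-ice indicators: one player appears for 20 goals,
# another for 2,000 goals.
x_rare     <- c(rep(1, 20),   rep(0, n_goals - 20))
x_frequent <- c(rep(1, 2000), rep(0, n_goals - 2000))

sd_rare     <- sd(x_rare)
sd_frequent <- sd(x_frequent)

# Under standardization the effective penalty on the original-scale
# coefficient is proportional to SD(X_j), so this ratio shows how much
# more lightly the rare player would be penalized.
c(sd_rare = sd_rare,
  sd_frequent = sd_frequent,
  penalty_ratio = sd_rare / sd_frequent)
```

In this made-up case the rarely used player would face roughly one ninth of the penalty applied to the frequent player, which is the distortion that `standardize = FALSE` avoids.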

Note

In this model, our goal is not prediction. Instead, we focus on inference: how a player’s presence is associated with the odds that a goal is scored by the home team rather than the away team.

Because we are not evaluating out-of-sample predictive performance, we do not split the data into training and test sets here.

Beta Estimates

beta_lasso_1se <- coef(model_lasso, 
                       s = "lambda.1se")  # default
beta_lasso_min <- coef(model_lasso, 
                       s = "lambda.min")

betas <- data.frame(
  term = rownames(beta_lasso_1se),
  beta_lasso_1se = as.numeric(beta_lasso_1se),
  beta_lasso_min = as.numeric(beta_lasso_min)
) |> 
  mutate(
  exp_beta_lasso_1se_minus_1 = exp(beta_lasso_1se) - 1,
  exp_beta_lasso_min_minus_1 = exp(beta_lasso_min) - 1
  )

betas |> 
  paged_table()

Home-ice Advantage

How can we estimate the home-team effect on odds?
This is the effect on odds that a goal is home rather than away, regardless of any information about what teams are playing or who is on ice.

hometeam_effect <- betas |> 
  filter(str_detect(term, "Intercept"))

hometeam_effect |> 
  select(term, 
         exp_beta_lasso_1se_minus_1, exp_beta_lasso_min_minus_1) |> 
  paged_table()

We can look at the intercept for a home-ice advantage.
- With every team, configuration, and player term set to zero, the odds that a given goal is scored by the home team rather than the away team are about 8% higher than even.
- This suggests a substantial home-ice advantage.
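As a rough check on what that intercept means in probability terms, the sketch below converts the odds into a probability. It assumes the approximate 8% figure quoted above rather than the exact fitted value.

```r
# Assumes the roughly 8% figure quoted above: exp(beta_0) - 1 = 0.08
odds_home <- 1 + 0.08

# Convert odds to a probability: p = odds / (1 + odds)
p_home <- odds_home / (1 + odds_home)

# Equivalent route through the log-odds scale
beta_0 <- log(odds_home)
stopifnot(isTRUE(all.equal(p_home, plogis(beta_0))))

round(p_home, 3)
```

Odds of 1.08 correspond to a probability of about 0.52 that a given goal is scored by the home team, compared with 0.50 if there were no home-ice effect.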

Player’s Impact

Peter Forsberg

How can we estimate the player effect on odds?

betas_players <- betas |> 
  filter(term %in% colnames(player))

top10_players <- betas_players |> 
  arrange(-beta_lasso_1se) |> 
  head(10)

top10_players  |> 
  select(term, 
         exp_beta_lasso_1se_minus_1, exp_beta_lasso_min_minus_1) |> 
  paged_table()

Whenever a goal is scored, the odds that PETER_FORSBERG’s team scored it, rather than allowed it, are about 97% higher when PETER_FORSBERG is on the ice!

Traditional Plus‑Minus vs. Expected Plus‑Minus

Traditional Plus‑Minus (PM)
- Definition: Sum of on-ice contributions; +1 for a goal scored for the player’s team and -1 for a goal against.
- Limitations: Can be noisy due to team context, limited ice time, or random variation.
Expected Plus‑Minus (ppm)
- Definition: A model-based estimate of a player’s impact.
- Calculation:
  - Convert a player’s effect (\(\beta\)) to a probability:
  \[p = \frac{e^{\beta}}{1 + e^{\beta}}\]
  - Compute expected PM as:
  \[\text{ppm} = ng \times p - ng \times (1 - p)\]
  - \(ng\): the total number of goals, for or against, scored while the player was on the ice.
- Benefits: Smooths out noise and adjusts for context.
Observed vs. Modeled:
- PM is a raw, observed measure.
- Expected PM leverages model estimates to predict performance.
Variability:
- PM may fluctuate due to external factors.
- Expected PM attempts to isolate a player’s true impact.
Applications:
- Expected PM can help identify under- or over-performing players and guide strategic decisions.
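The ppm formula above can be checked on made-up numbers. In this sketch `beta` and `ng` are hypothetical values, not estimates from the fitted model.

```r
# Hypothetical inputs, for illustration only
beta <- 0.2   # a player's lasso coefficient (log-odds scale)
ng   <- 100   # goals (for or against) with the player on the ice

p   <- plogis(beta)            # same as exp(beta) / (1 + exp(beta))
ppm <- ng * p - ng * (1 - p)   # expected goals for minus expected against

# Algebraically, ppm simplifies to ng * (2 * p - 1)
stopifnot(isTRUE(all.equal(ppm, ng * (2 * p - 1))))

# A coefficient of zero gives p = 0.5 and therefore ppm = 0
ppm_zero <- ng * (2 * plogis(0) - 1)

c(p = p, ppm = ppm, ppm_zero = ppm_zero)
```

Because the lasso shrinks many player coefficients exactly to zero, those players get an expected plus-minus of zero regardless of how many goals they were on the ice for.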
How can we calculate PM and ppm?


# 2013-2014 season indicator
season_1314 <- goal$season == "20132014"

# either -1 or 1 from c(-1, 1)
#   since `y[season_1314] + 1` is either 1 or 2.
#   so c(-1, 1)[1] or c(-1, 1)[2]
goal_sign <- c(-1, 1)[ y[season_1314] + 1 ]

# Traditional plus-minus
pm <- player[season_1314, betas_players$term] |>
  as.matrix()

# colSums(): the sum of values for each column of a matrix
pm <- colSums(pm * goal_sign)

# Total number of goals involving each player
ng <- player[season_1314, betas_players$term] |>
  as.matrix() |>
  abs() |>
  colSums()

# data.frame summary for PM and ppm:
pm_ppm <- betas_players |>
  mutate(
    # Total number of goals involving each player
    ng = ng[term],
    
    # Traditional plus minus
    pm = pm[term],
    
    # Probability that a given goal is for the player's team
    p_1se = exp(beta_lasso_1se) / (1+exp(beta_lasso_1se)), 
    p_min = exp(beta_lasso_min) / (1+exp(beta_lasso_min)),

    # Expected plus-minus (ppm)
    ppm_1se = ng * p_1se - ng * (1 - p_1se),
    ppm_min = ng * p_min - ng * (1 - p_min)
  ) |>
  select(term, pm, ppm_1se, ppm_min, beta_lasso_1se, beta_lasso_min, everything()) |> 
  arrange(-ppm_1se, -pm)

pm_ppm |> 
  paged_table()
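The traditional plus-minus step relies on R recycling a vector down the columns of a matrix. A toy version, with hypothetical players and goals, may make that easier to verify by hand:

```r
# Hypothetical toy data (player names are made up)
toy_player <- matrix(
  c( 1, -1,
     1, -1,
    -1,  1),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("PLAYER_A", "PLAYER_B"))
)
toy_y <- c(1, 0, 0)  # 1 = home-team goal, 0 = away-team goal

# Map 0/1 to -1/+1, as in the code above
toy_sign <- c(-1, 1)[toy_y + 1]

# A vector with one entry per row is recycled down each column,
# so row i of the matrix is multiplied by toy_sign[i].
toy_pm <- colSums(toy_player * toy_sign)
toy_ng <- colSums(abs(toy_player))

toy_pm
toy_ng
```

PLAYER_A was on the scoring side for goals 1 and 3 and on the conceding side for goal 2, giving a plus-minus of 1; PLAYER_B is the mirror image at -1. Both were on the ice for all three goals.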

HENRIK_LUNDQVIST has a modest traditional plus-minus (pm = 9), but his expected plus-minus is roughly twice as high (ppm_1se ≈ 17.54).

This suggests that the traditional plus-minus may understate his contribution because it does not control for teammates, opponents, team effects, or game situations. By contrast, expected plus-minus adjusts for these factors, so Lundqvist’s stronger ppm appears more consistent with our evaluation of him as an elite player.

Notice that pm and ppm are positively correlated, but not perfectly so:

cor(pm_ppm$pm, pm_ppm$ppm_1se)

[1] 0.4548581

cor(pm_ppm$pm, pm_ppm$ppm_min)

[1] 0.5036819
