Classwork 7

Regularized Logistic Regression — Hockey Player Evaluation

Author

Byeong-Hak Choe

Published

March 9, 2026

Modified

March 9, 2026

Setup

library(tidyverse)
library(broom)
library(stargazer)

library(glmnet) 
library(gamlr)   # a companion package for glmnet
library(Matrix)  # to create a sparse matrix

library(DT)
library(rmarkdown)
library(hrbrthemes)
library(ggthemes)

theme_set(
  theme_ipsum()
)

scale_colour_discrete <- function(...) scale_color_colorblind(...)
scale_fill_discrete <- function(...) scale_fill_colorblind(...)

Load Data

url <- "https://bcdanl.github.io/data/hockey.RDS"
dest <- file.path(tempdir(), "hockey.RDS")

download.file(url, destfile = dest, mode = "wb")
hockey <- readRDS(dest)

config <- hockey$config
goal   <- hockey$goal
player <- hockey$player  # dgCMatrix: a sparse matrix
team   <- hockey$team

rm(dest, url, hockey)   # remove unnecessary objects

Background and Motivation

  • The player “plus-minus” (PM) is a common hockey performance metric.
  • The classic PM is a function of goals scored while that player is on the ice:
    • (the number of goals for his team) minus (the number against).
  • The limits of this approach are obvious: there is no accounting for teammates or opponents.
  • In hockey, where players tend to be grouped together on “lines” and coaches will “line match” against opponents, a player’s PM can be artificially inflated or deflated by the play of his opponents and peers.

Data

  • The data comprise play-by-play NHL game data for regular-season and playoff games across the 11 seasons from 2002-2003 through 2013-2014 (the 2004-2005 season was lost to a lockout).

  • There were p = 2,439 players involved in n = 69,449 goals.

  • The data contain information on the season, the home and away teams, the team configuration (e.g., a 5-on-4 power play), and which players are on the ice when each goal is scored.

  • homegoal: an indicator (0 or 1) for the home team scoring

  • player_name: entries for who was on the ice for each goal

  • team: indicators for each team

  • config: Special teams info. E.g., S5v4 is a 5 on 4 powerplay

  • The values of config, team, and player_name are:

    • 1 if it is for the home-team
    • -1 for the away team
    • 0 otherwise
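A minimal toy illustration of this signed coding (hypothetical player names and goals, not the real data):

```r
# One row per goal; +1 = on the ice for the home team, -1 = for the away team,
# 0 = not on the ice. Player names here are made up for illustration.
toy <- data.frame(
  homegoal      = c(1, 0, 1),
  HOME_PLAYER_A = c( 1,  0, 1),   # hypothetical home player, on ice for goals 1 and 3
  AWAY_PLAYER_B = c(-1, -1, 0)    # hypothetical away player, on ice for goals 1 and 2
)
toy
```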

Game Configurations (config)

  • 5-on-5 (Even Strength): The most common game state, with five skaters per team on the ice (goalies excluded). Shot-attempt metrics such as Corsi and Fenwick are considered most reliable in this state because it minimizes the influence of special teams.
  • 5-on-4 (Power Play): The player’s team has a one-skater advantage due to an opponent’s penalty. Metrics in this state show how effectively a player helps create offensive opportunities.

The code below converts each sparse matrix into a data frame:

df_player <- player |> 
  as.matrix() |> 
  as_tibble(.name_repair = "unique")

df_config <- config |> 
  as.matrix() |> 
  as_tibble(.name_repair = "unique")

df_team <- team |> 
  as.matrix() |> 
  as_tibble(.name_repair = "unique")

Model

Consider constructing a binary response for every goal, equal to 1 for home-team goals and 0 for away-team goals:

\[ \begin{align} \log\left(\frac{\text{Prob}(\text{home-goal}_{i})}{\text{Prob}(\text{away-goal}_{i})}\right) &= \beta_0 + \sum_{j=1}^{J} \beta_{\text{team}_{j}}\text{team}_{j, i} + \sum_{k=1}^{K} \beta_{\text{config}_{k}}\text{config}_{k, i} \\ &\qquad + \sum_{m=1}^{M} \beta_{\text{home-player}_{m}}\text{home-player}_{m, i}\\ &\qquad - \sum_{n=1}^{N} \beta_{\text{away-player}_{n}}\text{away-player}_{n, i} \end{align} \]

Lasso Logistic Regression

# A vector for the outcome variable
y <- goal$homegoal

# Sparse matrices for explanatory variables
x <- cbind(config, team, player) 
# A vector of penalty indicators (one value per column in x)
# - 1  => penalize this coefficient (default behavior)
# - 0  => do NOT penalize this coefficient (it will be kept "unshrunk")
p.fac <- rep(1, ncol(config) + ncol(team) + ncol(player))

# Set penalty.factor = 0 for config + team:
# config + team columns will be unpenalized (like "must-keep" controls),
# while the player columns remain penalized (eligible for shrinkage/selection).
p.fac[1:(ncol(config) + ncol(team))] <- 0

# Fit cross-validated lasso/logit (alpha default is 1 unless you set it)

# - penalty.factor applies column-by-column to x
# - standardize=FALSE => do NOT standardize predictors internally
# - standardize=TRUE => (default option) standardize predictors internally

set.seed(1)
model_lasso <- cv.glmnet(
  x, y,
  penalty.factor = p.fac,
  family = "binomial",
  standardize = FALSE
)
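One way to inspect the fit is to pull out the coefficients at the cross-validation-selected lambda. A sketch, assuming the column names of x (and hence of player) carry through to coef():

```r
# Coefficients at the CV-selected lambda (lambda.min); a sparse column vector
beta_hat <- coef(model_lasso, s = "lambda.min")

# Player coefficients are the rows named after the columns of `player`
beta_player <- beta_hat[colnames(player), ]

# How many players does the lasso keep (nonzero effect)?
sum(beta_player != 0)

# Top 10 players by estimated effect on the log-odds of a home goal
head(sort(beta_player, decreasing = TRUE), 10)
```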

Scaling (standardize = FALSE)?

  • We do not want to standardize player indicators here.
    • By default, glmnet internally standardizes predictors (standardize = TRUE): \[ X_j^{*}=\frac{X_j-\bar X_j}{SD(X_j)}. \] The lasso penalty is applied to coefficients on this standardized scale. Converting back to the original scale implies an effective penalty of \[ \lambda \times SD(X_j)\,|\beta_j|. \]
  • For player columns, \(X_{\text{player}}\) is close to a 0/1 indicator (or a signed on-ice indicator).
    • Small \(SD(X_{\text{player}})\) typically means the player appears rarely (almost always 0).
    • Large \(SD(X_{\text{player}})\) typically means the player appears often (more nonzero entries).
  • Standardizing would therefore inflate (divide by a small SD) the columns for players who rarely play, making their “on-ice” entries look large on the standardized scale.
    • As a result, those rarely used player variables can be shrunk less and may appear more influential than is justified by their limited ice time.
  • Setting standardize = FALSE helps keep the comparison on the original exposure (ice-time / appearance) scale.
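A toy demonstration of the point above (simulated 0/1 indicators, not the real player columns): standardizing divides by the SD, so a rarely appearing player's single "on-ice" entry is inflated far more than a regular player's.

```r
# 100 goals; one player on the ice for 1 of them, another for 50 of them
x_rare   <- c(rep(0, 99), 1)
x_common <- rep(c(0, 1), 50)

sd(x_rare)    # 0.1   -> small SD (rare appearances)
sd(x_common)  # ~0.50 -> large SD (frequent appearances)

# A nonzero entry on the standardized scale:
z_rare   <- (1 - mean(x_rare)) / sd(x_rare)      # ~9.9
z_common <- (1 - mean(x_common)) / sd(x_common)  # ~0.99
```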
Note

In this model, our goal is not prediction. Instead, we focus on inference: how a player’s presence is associated with the odds of a goal—specifically, the probability that the home team scores relative to the probability that the away team scores.

Because we are not evaluating out-of-sample predictive performance, we do not split the data into training and test sets here.

Home-ice Advantage

  • How can we estimate the home-team effect on odds?

  • This is the effect on odds that a goal is home rather than away, regardless of any information about what teams are playing or who is on ice.
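Because config and team enter the model unpenalized and the player columns are signed, the intercept picks up the baseline home-ice effect. A sketch of recovering it from the fitted model above:

```r
# Intercept = baseline log-odds that a goal is a home goal
b0 <- coef(model_lasso, s = "lambda.min")["(Intercept)", ]

# Multiplicative effect on the odds of a home goal (home-ice advantage);
# a value above 1 means home teams score more, other things equal
exp(b0)

# Implied baseline probability that a given goal is scored by the home team
exp(b0) / (1 + exp(b0))
```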

Player’s Impact

Peter Forsberg

  • How can we estimate the player effect on odds?
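A sketch for a single player. The exact row name below is an assumption; check colnames(player) for the data's actual naming convention:

```r
# Player effect on the log-odds of a home goal; the name is assumed for illustration
b_forsberg <- coef(model_lasso, s = "lambda.min")["PETER_FORSBERG", ]

# Multiplicative effect on the odds that a goal is for (rather than against)
# the player's team while he is on the ice
exp(b_forsberg)
```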

Traditional Plus‑Minus vs. Expected Plus‑Minus

  • Traditional Plus‑Minus (PM)
    • Definition: Sum of on-ice contributions; +1 for a goal scored for the player’s team and -1 for a goal against.
    • Limitations: Can be noisy due to team context, limited ice time, or random variation.
  • Expected Plus‑Minus (ppm)
    • Definition: A model-based estimate of a player’s impact.

    • Calculation:

      • Convert a player’s effect (\(\beta\)) to a probability:

      \[p = \frac{e^{\beta}}{1 + e^{\beta}}\]

      • Compute expected PM as:

      \[\text{ppm} = ng \times p - ng \times (1-p)\]

      • \(ng\): the total number of goals for which the player was on the ice.
    • Benefits: Smooths out noise and adjusts for context.

  • Observed vs. Modeled:
    • PM is a raw, observed measure.
    • Expected PM leverages model estimates to predict performance.
  • Variability:
    • PM may fluctuate due to external factors.
    • Expected PM attempts to isolate a player’s true impact.
  • Applications:
    • Expected PM can help identify under- or over-performing players and guide strategic decisions.
  • How can we calculate PM and ppm?
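Putting the two definitions together, here is a sketch using the objects defined above (it assumes the coefficient rows of coef() are named after the columns of player; verify against your fitted model):

```r
# Model-based player effects at the CV-selected lambda
beta_hat    <- coef(model_lasso, s = "lambda.min")
beta_player <- beta_hat[colnames(player), ]   # assumes names carry through

# Traditional PM: player entries are +1 (home) / -1 (away), so multiplying
# each row by +1 for a home goal and -1 for an away goal gives each player's
# per-goal contribution; summing over goals yields PM.
sign_goal <- ifelse(goal$homegoal == 1, 1, -1)
pm <- colSums(player * sign_goal)

# Expected PM (ppm): ng is the number of goals with the player on the ice
ng  <- colSums(player != 0)
p   <- exp(beta_player) / (1 + exp(beta_player))
ppm <- ng * p - ng * (1 - p)   # equivalently ng * (2p - 1)

# Compare observed vs. model-based plus-minus
df_pm <- tibble(player = colnames(player), pm = pm, ng = ng, ppm = ppm)
```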



Discussion

Welcome to our Classwork 7 Discussion Board! 👋

This space is designed for you to engage with your classmates about the material covered in Classwork 7.

Whether you are looking to delve deeper into the material, share insights, or ask questions, this is the place for you.

If you have any specific questions for Byeong-Hak (@bcdanl) regarding the Classwork 7 materials or need clarification on any points, don’t hesitate to ask here.

All comments will be stored here.

Let’s collaborate and learn from each other!
