library(tidyverse)
library(broom)
library(stargazer)
library(glmnet)
library(gamlr) # a companion package for glmnet
library(Matrix) # to create a sparse matrix
library(DT)
library(rmarkdown)
library(hrbrthemes)
library(ggthemes)
theme_set(
  theme_ipsum()
)
scale_colour_discrete <- function(...) scale_color_colorblind(...)
scale_fill_discrete <- function(...) scale_fill_colorblind(...)
Classwork 7
Regularized Logistic Regression — Hockey Player Evaluation
Setup
Load Data
url <- "https://bcdanl.github.io/data/hockey.RDS"
dest <- file.path(tempdir(), "hockey.RDS")
download.file(url, destfile = dest, mode = "wb")
hockey <- readRDS(dest)
config <- hockey$config
goal <- hockey$goal
player <- hockey$player # dgCMatrix: a sparse matrix
team <- hockey$team
rm(dest, url, hockey) # remove unnecessary objects
Background and Motivation
- The player “plus-minus” (PM) is a common hockey performance metric.
- The classic PM is a function of goals scored while that player is on the ice:
- (the number of goals for his team) minus (the number against).
- The limits of this approach are obvious: there is no accounting for teammates or opponents.
- In hockey, where players tend to be grouped together on “lines” and coaches will “line match” against opponents, a player’s PM can be artificially inflated or deflated by the play of his opponents and peers.
Data
The data consist of play-by-play NHL records for regular-season and playoff games over 11 seasons, from 2002-2003 through 2013-2014.
There were p = 2,439 players involved in n = 69,449 goals.
The data record the season, the home and away teams, the team configuration (such as a 5-on-4 power play), and which players were on the ice when each goal was scored.
- homegoal: an indicator (0 or 1) for the home team scoring
- player_name: entries for who was on the ice for each goal
- team: indicators for each team
- config: special-teams information; e.g., S5v4 is a 5-on-4 power play
The values of config, team, and player_name are:
- 1 if it is for the home team
- -1 for the away team
- 0 otherwise
Examples of game situations (config):
- 5-on-5 (Even Strength): The most common game state, where each team has five skaters on the ice, excluding goalies. Corsi and Fenwick in this situation are considered the most reliable indicators of player performance, since even-strength play minimizes the influence of special teams.
- 5-on-4 (Power Play): A situation where the player’s team has a one-man advantage due to an opponent’s penalty. Metrics in this scenario provide insight into how effectively a player contributes to creating offensive opportunities.
The code below converts each sparse matrix into a data frame (tibble):
df_player <- player |>
  as.matrix() |>
  as_tibble(.name_repair = "unique")
df_config <- config |>
  as.matrix() |>
  as_tibble(.name_repair = "unique")
df_team <- team |>
  as.matrix() |>
  as_tibble(.name_repair = "unique")
Model
Consider constructing a binary response for every goal, equal to 1 for home-team goals and 0 for away-team goals:
\[ \begin{align} \log\left(\frac{\text{Prob}(\text{home-goal}_{i})}{\text{Prob}(\text{away-goal}_{i})}\right) &= \beta_0 + \sum_{j=1}^{J} \beta_{\text{team}_{j}}\text{team}_{j, i} + \sum_{k=1}^{K} \beta_{\text{config}_{k}}\text{config}_{k, i} \\ &\qquad + \sum_{m=1}^{M} \beta_{\text{home-player}_{m}}\text{home-player}_{m, i}\\ &\qquad - \sum_{n=1}^{N} \beta_{\text{away-player}_{n}}\text{away-player}_{n, i} \end{align} \]
Lasso Logistic Regression
# A vector for the outcome variable
y <- goal$homegoal
# Sparse matrices for explanatory variables
x <- cbind(config, team, player)
# A vector of penalty indicators (one value per column in x)
# - 1 => penalize this coefficient (default behavior)
# - 0 => do NOT penalize this coefficient (it will be kept "unshrunk")
p.fac <- rep(1, ncol(config) + ncol(team) + ncol(player))
# Set penalty.factor = 0 for config + team:
# config + team columns will be unpenalized (like "must-keep" controls),
# while the player columns remain penalized (eligible for shrinkage/selection).
p.fac[1:(ncol(config) + ncol(team))] <- 0
# Fit cross-validated lasso/logit (alpha default is 1 unless you set it)
# - penalty.factor applies column-by-column to x
# - standardize=FALSE => do NOT standardize predictors internally
# - standardize=TRUE => (default option) standardize predictors internally
set.seed(1)
model_lasso <- cv.glmnet(
  x, y,
  penalty.factor = p.fac,
  family = "binomial",
  standardize = FALSE
)
Scaling (standardize = FALSE)?
- We do not want to standardize player indicators here.
- By default, glmnet internally standardizes predictors (standardize = TRUE): \[ X_j^{*}=\frac{X_j-\bar X_j}{SD(X_j)}. \] The lasso penalty is applied to coefficients on this standardized scale. Converting back to the original scale implies an effective penalty like \[ \lambda \times SD(X_j)\,|\beta_j|. \]
- For player columns, \(X_{\text{player}}\) is close to a 0/1 indicator (or a signed on-ice indicator).
- Small \(SD(X_{\text{player}})\) typically means the player appears rarely (almost always 0).
- Large \(SD(X_{\text{player}})\) typically means the player appears often (more nonzero entries).
- Standardizing would therefore inflate (divide by a small SD) the columns for players who rarely play, making their “on-ice” entries look large on the standardized scale.
- As a result, those rarely used player variables can be shrunk less and may appear more influential than is justified by their limited ice time.
- Setting standardize = FALSE helps keep the comparison on the original exposure (ice-time / appearance) scale.
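To make the scaling argument concrete, here is a minimal base-R sketch that uses simulated 0/1 indicators rather than the hockey data. The appearance rates (1% and 30%) and all variable names are assumptions chosen only for illustration.

```r
# Simulated on-ice indicators (hypothetical rates, not taken from the hockey data)
set.seed(1)
n <- 10000
rare_player   <- rbinom(n, 1, 0.01)  # on the ice for about 1% of goals
common_player <- rbinom(n, 1, 0.30)  # on the ice for about 30% of goals

# A rarely appearing indicator has a much smaller standard deviation
sd_rare   <- sd(rare_player)
sd_common <- sd(common_player)

# Value that an "on-ice" entry (x = 1) would take after standardization
z_rare   <- (1 - mean(rare_player)) / sd_rare
z_common <- (1 - mean(common_player)) / sd_common

# The effective penalty on the original scale is proportional to SD(X_j),
# so the rare player's coefficient would be penalized less under standardization
round(c(sd_rare = sd_rare, sd_common = sd_common,
        z_rare = z_rare, z_common = z_common), 3)
```

In this simulation the rare indicator has the smaller SD and the larger standardized "on-ice" value, which is the pattern that standardize = FALSE is meant to avoid.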
In this model, our goal is not prediction. Instead, we focus on inference: how a player’s presence is associated with the odds of a goal, specifically the probability that the home team scores relative to the probability that the away team scores.
Because we are not evaluating out-of-sample predictive performance, we do not split the data into training and test sets here.
Home-ice Advantage
How can we estimate the home-team effect on odds?
This is the effect on odds that a goal is home rather than away, regardless of any information about what teams are playing or who is on ice.
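One way to read this effect off the fitted model is to exponentiate the intercept, since the log odds reduce to \(\beta_0\) when all team, configuration, and player terms are zero. The sketch below uses a placeholder intercept (an assumption, not an estimate from the data). The commented line shows how the value might be pulled from model_lasso; the choice of s = "lambda.min" is also an assumption.

```r
# Placeholder intercept on the log-odds scale (hypothetical value, NOT an estimate)
b0 <- 0.08
# With the fitted model from above, the intercept could be extracted as:
# b0 <- coef(model_lasso, s = "lambda.min")["(Intercept)", 1]

home_odds <- exp(b0)     # odds that a goal is a home goal rather than an away goal
home_prob <- plogis(b0)  # implied probability that a goal is a home goal

c(odds = home_odds, prob = home_prob)
```

A positive intercept gives odds above 1, meaning a goal is more likely to be a home goal than an away goal before any team or player information is considered.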
Player’s Impact
Peter Forsberg
- How can we estimate the player effect on odds?
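Under the signed coding, a player's coefficient raises the log odds of a home goal by \(\beta\) when he is on the ice for the home team and lowers it by \(\beta\) when he is on the ice for the away team. So \(e^{\beta}\) is the multiplier on the odds that his own team scored the goal. The sketch below works through this with placeholder coefficients; the numeric values and the column name in the commented line are assumptions.

```r
# Placeholder coefficients on the log-odds scale (hypothetical, NOT estimates)
b0          <- 0.08  # intercept (home-ice effect)
beta_player <- 0.40  # one player's effect
# With the fitted model, a player's coefficient could be looked up by column
# name, e.g. (the spelling "PETER_FORSBERG" is an assumption):
# beta_player <- coef(model_lasso, s = "lambda.min")["PETER_FORSBERG", 1]

# Signed on-ice indicator: 1 = on ice for home, -1 = on ice for away, 0 = off ice
x <- c(home = 1, away = -1, off = 0)

log_odds  <- b0 + beta_player * x  # all other terms held at zero
odds_mult <- exp(beta_player * x)  # multiplier on the odds of a home goal
prob_home <- plogis(log_odds)      # probability that the goal is a home goal

data.frame(x, log_odds, odds_mult, prob_home)
```

The home and away multipliers are reciprocals of each other, which is why a single coefficient per player is enough to describe his effect on either side of the ice.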
Traditional Plus‑Minus vs. Expected Plus‑Minus
- Traditional Plus‑Minus (PM)
- Definition: Sum of on-ice contributions; +1 for a goal scored for the player’s team and -1 for a goal against.
- Limitations: Can be noisy due to team context, limited ice time, or random variation.
- Expected Plus‑Minus (ppm)
- Definition: A model-based estimate of a player’s impact.
- Calculation:
- Convert a player’s effect (\(\beta\)) to a probability:
\[p = \frac{e^{\beta}}{1 + e^{\beta}}\]
- Compute expected PM as:
\[\text{ppm} = ng \times p - ng \times (1 - p)\]
- \(ng\): the total number of goals scored while the player was on the ice.
- Benefits: Smooths out noise and adjusts for context.
- Observed vs. Modeled:
- PM is a raw, observed measure.
- Expected PM leverages model estimates to predict performance.
- Variability:
- PM may fluctuate due to external factors.
- Expected PM attempts to isolate a player’s true impact.
- Applications:
- Expected PM can help identify under- or over-performing players and guide strategic decisions.
- How can we calculate PM and ppm?
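The two measures can be sketched on a toy example in base R. Everything below is hypothetical: the five-goal on-ice matrix, the player names, and the coefficient values are made up for illustration, and the real calculation would use the player matrix, goal$homegoal, and the coefficients from the fitted model.

```r
# Toy on-ice matrix: rows are goals, columns are two hypothetical players.
# Coding follows the data: 1 = on ice for home, -1 = on ice for away, 0 = off ice.
on_ice <- matrix(
  c( 1,  0,
     1, -1,
    -1,  1,
     0,  1,
     1, -1),
  ncol = 2, byrow = TRUE,
  dimnames = list(NULL, c("PLAYER_A", "PLAYER_B"))
)
homegoal <- c(1, 1, 0, 1, 0)                # 1 = home goal, 0 = away goal
beta <- c(PLAYER_A = 0.4, PLAYER_B = 0)     # hypothetical player effects

# Traditional PM: +1 when the player's own team scored, -1 when it conceded
goal_sign <- ifelse(homegoal == 1, 1, -1)   # +1 for home goals, -1 for away goals
pm <- colSums(on_ice * goal_sign)

# ng: number of goals for which the player was on the ice (either side)
ng <- colSums(abs(on_ice))

# Expected PM: ppm = ng * p - ng * (1 - p), with p = exp(beta) / (1 + exp(beta))
p   <- plogis(beta[colnames(on_ice)])
ppm <- ng * p - ng * (1 - p)

data.frame(ng, pm, ppm = round(ppm, 2))
```

A player whose coefficient is shrunk to zero gets p = 0.5 and therefore ppm = 0, regardless of how many goals he was on the ice for.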