Simple PWHL xG Model
Introduction
This document describes an expected goals model for the PWHL.
Important features:
the model is a Multivariate Adaptive Regression Splines model (a “MARS model”); and
the variables used to predict goals are shot distance, shot angle, and shot rebound status.
That’s not many variables, obviously. There isn’t much PWHL data available (72 regular season games) so I used only the most important variables for an xG model.
Data for the model were pulled from the PWHL’s API (using functions that I posted on GitHub). There are anomalies and errors in the data. The steps I took to “fix” the data are set out in painful detail below. Most people will have no interest in those details and can skip them. However, if you plan to use data from the PWHL’s API then you must be aware of the issues with the “raw” data.
Basic Setup
Load the packages and the raw play-by-play data (this assumes the data are saved in the working directory).
#install.packages("tidymodels")
#install.packages("readr")
#install.packages("vip")
#install.packages("kableExtra")
library(tidymodels)
library(readr)
library(vip)
library(kableExtra)
raw_data <- read_rds("season_one_pbp.rds")
Functions
Set out below are four functions that adjust and augment the raw play-by-play data.
Adjust Event Location
The x | y location data from the PWHL’s API are on a scale of 600:300 (which does not match the dimensions of a regulation rink).
This function converts the location data to a scale of 200:85 and places center ice at 0,0. After the conversion the axes represent distance in feet. This adjustment might not be appropriate in every case; however, it seems to produce reasonable results.
adjust_event_location <- function(pbp_data) {
  pbp_data <- pbp_data |>
    mutate(x_location = (x_location - 300) * 0.3333,
           y_location = (y_location - 150) * 0.2833)
  return(pbp_data)
}
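As a quick check on the conversion factors, the arithmetic can be sketched in base R. The helper name `to_feet` is hypothetical and is not part of the original code; it uses the exact ratios 200/600 and 85/300 rather than the rounded constants 0.3333 and 0.2833, so the function above lands a hair short of these values (99.99 and 42.495 at the far corner).

```r
# Scale factors implied by mapping a 600 x 300 grid onto a 200 ft x 85 ft rink
x_scale <- 200 / 600   # roughly 0.3333
y_scale <- 85 / 300    # roughly 0.2833

# Hypothetical helper mirroring the arithmetic in adjust_event_location()
to_feet <- function(x_raw, y_raw) {
  c(x = (x_raw - 300) * x_scale,
    y = (y_raw - 150) * y_scale)
}

centre <- to_feet(300, 150)   # middle of the raw grid -> center ice (0, 0)
corner <- to_feet(600, 300)   # far corner of the raw grid -> (100, 42.5)
```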
Add Shot Distance
This function adds a shot distance variable to the play-by-play data. The distance is measured in feet.
add_shot_distance <- function(pbp_data) {
  # Infer which end each team attacks in each game from its mean shot x_location
  find_sides <- pbp_data |>
    filter(event == "shot") |>
    group_by(game_id,
             event_team) |>
    summarize(mean_shot = mean(x_location,
                               na.rm = TRUE),
              .groups = "drop")
  pbp_data <- pbp_data |>
    left_join(find_sides,
              by = join_by(game_id,
                           event_team))
  # Distance to the net at x = 89 or x = -89; events that are not shots get NA
  pbp_data <- pbp_data |>
    mutate(distance = case_when(
      mean_shot > 0 & event == "shot" ~ round(abs(sqrt((x_location - 89)^2 + (y_location)^2)), 1),
      mean_shot < 0 & event == "shot" ~ round(abs(sqrt((x_location - (-89))^2 + (y_location)^2)), 1)),
      .after = y_location)
  # Drop the helper column
  pbp_data <- pbp_data |>
    select(-mean_shot)
  return(pbp_data)
}
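The distance formula is easy to check by hand with a 3-4-5 triangle. The sketch below is a standalone illustration, not the function above: the name `shot_distance` and its `attacking_positive_x` argument are hypothetical, and the net position of 89 ft is taken from the constant used in `add_shot_distance()`.

```r
# Assumed net position, taken from the constant used in add_shot_distance()
net_x <- 89

# Hypothetical standalone version of the distance calculation
shot_distance <- function(x, y, attacking_positive_x = TRUE) {
  goal_x <- if (attacking_positive_x) net_x else -net_x
  round(sqrt((x - goal_x)^2 + y^2), 1)
}

# 30 ft out from the goal line and 40 ft wide: a 3-4-5 triangle, so 50 ft
d_right <- shot_distance(59, 40)

# The mirror image for a team attacking the net at x = -89
d_left <- shot_distance(-59, -40, attacking_positive_x = FALSE)
```

Because `sqrt()` never returns a negative number, the sketch omits the `abs()` wrapper without changing the result.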
Add Shot Angle
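This function adds a shot angle variable to the play-by-play data. The angle is measured in degrees from the center of the net: 0 is straight in front of the net, 90 is along the goal line, and values above 90 are shots from behind the goal line.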
add_shot_angle <- function(pbp_data) {
  # Infer which end each team attacks in each game from its mean shot x_location
  find_sides <- pbp_data |>
    filter(event == "shot") |>
    group_by(game_id,
             event_team) |>
    summarize(mean_shot = mean(x_location,
                               na.rm = TRUE),
              .groups = "drop")
  pbp_data <- pbp_data |>
    left_join(find_sides,
              by = join_by(game_id,
                           event_team))
  # Angle to the net at x = 89 or x = -89; shots from behind the goal line become 180 - angle
  pbp_data <- pbp_data |>
    mutate(angle = case_when(
      mean_shot > 0 & event == "shot" ~ round(abs(atan((0-y_location) / (89-x_location)) * (180 / pi)), 1),
      mean_shot < 0 & event == "shot" ~ round(abs(atan((0-y_location) / (-89-x_location)) * (180 / pi)), 1)),
      .after = distance) |>
    mutate(angle = ifelse((mean_shot > 0 & x_location > 89) | (mean_shot < 0 & x_location < -89), 180 - angle, angle))
  # Drop the helper column
  pbp_data <- pbp_data |>
    select(-mean_shot)
  return(pbp_data)
}
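A few hand-checkable cases show how the angle behaves, including the adjustment for shots taken from behind the goal line. This is a standalone sketch for a team attacking the net at x = 89 only; the name `shot_angle` is hypothetical and not part of the original code.

```r
# Assumed net position, taken from the constant used in add_shot_angle()
net_x <- 89

# Hypothetical standalone version for a team attacking the net at x = +89
shot_angle <- function(x, y) {
  angle <- abs(atan((0 - y) / (net_x - x)) * (180 / pi))
  # Behind the goal line the angle is reflected so that it exceeds 90 degrees
  if (x > net_x) angle <- 180 - angle
  round(angle, 1)
}

straight_on <- shot_angle(60, 0)    # directly in front of the net: 0 degrees
diagonal    <- shot_angle(60, 29)   # 29 ft out and 29 ft wide: 45 degrees
behind_net  <- shot_angle(95, 6)    # 6 ft behind the goal line, 6 ft wide: 135 degrees
```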
Add Rebound Shots
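This function adds a shot rebound logical variable to the play-by-play data. A shot is considered a rebound opportunity if the immediately preceding event is also a shot and it is taken within 2 seconds of that shot (the code tests for a game_seconds difference of less than 3).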
add_rebound <- function(pbp_data) {
  pbp_data <- pbp_data |>
    mutate(is_rebound = if_else((event == "shot" & lag(event) == "shot") & (game_seconds - lag(game_seconds)) < 3, TRUE, FALSE),
           .after = is_goal)
  return(pbp_data)
}
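To see the rebound rule in action, here is a small sketch on made-up events. The toy data are invented for illustration, and the `.after = is_goal` argument is dropped because the toy table has no `is_goal` column.

```r
library(dplyr)

# Invented events for illustration only
toy_pbp <- tibble(
  event        = c("faceoff", "shot", "shot", "shot", "hit", "shot"),
  game_seconds = c(0, 10, 12, 20, 21, 22)
)

# Same condition as add_rebound(): previous event is a shot and under 3 seconds have elapsed
toy_pbp <- toy_pbp |>
  mutate(is_rebound = if_else(
    (event == "shot" & lag(event) == "shot") &
      (game_seconds - lag(game_seconds)) < 3,
    TRUE, FALSE))
```

Only the shot at 12 seconds is flagged: the shot at 20 seconds comes 8 seconds after the previous shot, and the shot at 22 seconds follows a hit rather than a shot. Note that `lag()` compares consecutive rows regardless of game or team, so if several games are stacked in one table, grouping by game_id first would keep the comparison from crossing game boundaries.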
Data Cleaning | EDA
Adjust the x | y locations and add shot distance to the play-by-play data - this will be helpful for exploring and “fixing” the data.
clean_data <- raw_data |>
  adjust_event_location() |>
  add_shot_distance()
Mean x_location
Check for anomalies in the average x_location for shots.
There are some odd results here, especially in game_id 3, 9, 23, 35, 38, 40, 43, and 71.
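A check of this kind might be computed along the following lines. This is a sketch on invented data, not the code behind the original plot. It assumes each team's shots are recorded as attacking a single end for the whole game (which is what `add_shot_distance()` relies on), so a mean x_location near zero hints that some shots are on the wrong side. The 25 ft cut-off is an arbitrary assumption.

```r
library(dplyr)

# Invented shots: game 1 looks normal, game 2 has shots at both ends for each team
toy_shots <- tibble(
  game_id    = c(1, 1, 1, 1, 2, 2, 2, 2),
  event_team = c("A", "A", "B", "B", "A", "A", "B", "B"),
  event      = "shot",
  x_location = c(60, 70, -65, -75, 60, -70, -65, 75)
)

# Mean x_location of shots for each team in each game
mean_x <- toy_shots |>
  filter(event == "shot") |>
  group_by(game_id,
           event_team) |>
  summarize(mean_x = mean(x_location,
                          na.rm = TRUE),
            .groups = "drop")

# Flag team-games whose mean is close to center ice (arbitrary 25 ft threshold)
suspicious <- mean_x |>
  filter(abs(mean_x) < 25)
```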
Warning: this next section is a grind to get through.
I’ll plot the suspicious results, game-by-game. I’ll also plot my “fix” for any potential errors. Generally speaking, I decided how to fix the results by looking for suspicious patterns in the underlying data and by looking at the results posted on the PWHL website. I have not reviewed video of each game. That would be the best way to audit the location data but would also be hugely time-consuming.
Start with game_id 3. Note that the shot locations are suspiciously close to the middle of the ice. I’ll return to this issue below. For now, here’s my fix: flip the coordinates (both x and y) for periods 2 and 4.
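Flipping both coordinates amounts to negating x_location and y_location, since center ice sits at 0,0 after the conversion. A minimal sketch on invented rows follows; it assumes the play-by-play data have a `period` column, which is a guess about the column name.

```r
library(dplyr)

# Invented rows; `period` is an assumed column name
toy_pbp <- tibble(
  game_id    = c(3, 3, 3, 4),
  period     = c(1, 2, 4, 2),
  x_location = c(60, -55, -70, 50),
  y_location = c(10, -5, 20, 15)
)

flip_periods <- c(2, 4)

# Negate both coordinates for the chosen periods of game_id 3 only
fixed_pbp <- toy_pbp |>
  mutate(flip = game_id == 3 & period %in% flip_periods,
         x_location = if_else(flip, -x_location, x_location),
         y_location = if_else(flip, -y_location, y_location)) |>
  select(-flip)
```

Rows from period 1 of game_id 3 and from other games are left untouched.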
That looks a little better. Now for game_id 9.
My fix: flip the coordinates (both x and y) for period 1.
That looks a little better (the long-distance goal is an empty net goal). Now for game_id 23.
This is probably the most suspicious case of the shots being too close to the middle of the ice - I’ll come back to this issue below.
My fix: flip the coordinates (both x and y) for periods 1 and 4.
That looks a little better except for the long-distance goal. I’ll return to that later. Now for game_id 35.
My fix: flip the coordinates (both x and y) for all long-distance shots (which appear in odd clumps in the data).
That looks a little better. Now for game_id 38.
My fix: flip the coordinates (both x and y) for all long-distance shots.
That looks a little better. Now for game_id 40.
My fix: flip the coordinates (both x and y) for periods 2 and 4.
That looks a little better. Now for game_id 43.
My fix: flip the coordinates (both x and y) for long-distance shots in period 1.
That looks a little better. Now for game_id 71.
My fix: flip the coordinates (both x and y) for period 2, plus a subset of shots by OTT in period 3.
That looks a little better.
That was painful. How does the mean x_location plot look now?
This looks much better but there could be more errors in the data. To find other potential errors, replace the distance data (using the new x | y locations) and then summarize long-distance shots that are marked as “quality” in the PWHL’s data.
game_id | Errors |
---|---|
3 | 1 |
9 | 2 |
10 | 1 |
11 | 2 |
12 | 2 |
13 | 1 |
33 | 1 |
37 | 2 |
40 | 1 |
44 | 1 |
50 | 1 |
53 | 2 |
There are some potential errors (and data from early in the season seems especially suspicious). I’ll flip the coordinates for these shots and then reset the shot distance variable in the clean data.
Repeat the potential errors summary used above.
game_id | location_errors |
---|---|
The potential errors no longer appear in the data.
Max | Min y_location
Check for anomalies in the y_location data by looking at the maximum and minimum y_location for the shots in each game.
There are some suspicious results here, especially the first few games of the season (but also look at game_id 23). As noted above, the y_location for some games seems too close to the middle of the ice. The trend line shows the max and min y_location values are narrower at the start of the season.
Any errors in the y_location data will obviously affect the xG model (decreasing the angle and distance of the shots). I’m not going to apply an arbitrary adjustment, and I’m not going to watch video for the most suspicious games. I’ll leave the y_location data “as-is” but I definitely have concerns about it.
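The per-game range check described in this section could be summarized roughly as follows. The data are invented, and this is a sketch rather than the code behind the original plot. Game 1 mimics the narrow spread seen early in the season, while game 2 uses most of the 85 ft width.

```r
library(dplyr)

# Invented shots for two games
toy_shots <- tibble(
  game_id    = c(1, 1, 1, 2, 2, 2),
  event      = "shot",
  y_location = c(-12, 3, 15, -38, 0, 40)
)

# Minimum and maximum y_location of shots in each game
y_range <- toy_shots |>
  filter(event == "shot") |>
  group_by(game_id) |>
  summarize(min_y = min(y_location,
                        na.rm = TRUE),
            max_y = max(y_location,
                        na.rm = TRUE),
            .groups = "drop")
```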
Mean Shot Distance
Plot the average shot distance by each team for every game_id as a further check for location errors.
No massive outliers here. While I expect there are still some errors in the data I’ve probably caught a good chunk of them.
Location Of Goals
Plot the location of all goals (excluding empty net goals and penalty shots).
There are two suspicious goals here: one in the neutral zone and one in the defensive zone (the yellow point).
I watched both goals on YouTube. The defensive zone goal was not a long-distance shot. The x_location needs to be flipped for this goal. The neutral zone goal was from long-distance - the puck took a funny bounce off the boards and went in the net when the goalie went out to play it.
I’ll fix the defensive zone goal by flipping the x_location and recomputing the distance variable.
Update the plot to make sure that fix worked.
That looks about right.
Explore Shot Variables
The above plots showed that the distance variable seems to be working OK. Now add the angle and is_rebound variables to the play-by-play data and visualize them.
Repeat the above goals plot but show the angle of the shots using colour.