Simon Chi

# HOW TO ESTIMATE WIN PROBABILITY DURING A GAME

It would be interesting and fun to be able to provide a win probability during a game. For example, if there are 5 minutes left and the home team is trailing by 2 points, how likely are they to win? Even more importantly: imagine the home team is awarded a penalty just outside the opposition 22, about halfway between the touchlines. We might then want to know how their win probability differs between a kick at goal and a kick to the corner for an attacking lineout.

In this article, I’ll explore how we could build (and how we have built) a model that lets us make such predictions.

**The Structure of the Base Model**

The structure of the base model is very simple: given the current points difference (home team score minus the away team score) and the time remaining in the game, we wish to produce a probability that the home team will win. This may seem too simplistic, because it does not take into consideration which team is in possession of the ball, or where they are on the field. But we can refine the prediction afterward using this sort of information, as described below.

So, we want to create a function of the form:

*Pwin = f(score_add, mins_remaining)*

where

*Pwin* = probability of the home team winning (a number from 0 to 1)

*score_add* = score advantage: home team score minus the away team score

*mins_remaining* = minutes remaining in the game.

Note that *mins_remaining* would typically be between 80 and 0. But it is calculated as 80 minus the time on the game clock, so it can actually go negative when extra time is being played.

If we can come up with a simple, closed-form (easy to calculate) formula *f()*, we will be able to provide a baseline win probability at any point of a game, either when reviewing a past game, or, in real time, during a game.

**Every Good Model Needs (lots of) Good Data**

We could make some sensible guesses about how *Pwin* could be calculated. But it would be even better to use lots of real game data to build this model. So, the first step was to gather quite a lot of data, drawn from more than 7000 professional matches: internationals as well as top-tier club competitions. For each game, the beginning of each possession within the game was used as a game-state. For each game-state, we note the *mins_remaining*, the *score_add* (home team score minus the away team score), and whether the home team eventually won the game or not. This yielded more than 600,000 game-states upon which the model could be built.

**What didn’t work well: Binning the Data**

We could then gather the data into bins. Since the input data has two dimensions, each bin is defined by a *mins_remaining* value and a *score_add* value. For each bin (let’s say, for example, all game-states where there are 5 minutes remaining and the home team has 2 points fewer than the away team), we will ideally have many records of game-states. We can then simply count, for that bin, how many times the home team won, divide it by the number of game-states in that bin, and that is our *Pwin* value for that bin. If we do this for every possible bin (every possible combination of *score_add* and *mins_remaining* values), we then have a big look-up table: a grid of win probabilities for every possible game-state.
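The counting described above can be sketched in a few lines of Python. This is only an illustration: the `(mins_remaining, score_add, home_won)` tuple layout, the function name `build_lookup_table`, and the sample records are assumptions made for this example, not the actual data format used.

```python
import math
from collections import defaultdict


def build_lookup_table(game_states):
    """Build a {(mins_remaining, score_add): Pwin} look-up table.

    game_states: iterable of (mins_remaining, score_add, home_won) tuples,
    a hypothetical record layout chosen for this sketch.
    """
    wins = defaultdict(int)
    counts = defaultdict(int)
    for mins_remaining, score_add, home_won in game_states:
        # One bin per whole minute remaining and per points difference.
        key = (math.floor(mins_remaining), score_add)
        counts[key] += 1
        if home_won:
            wins[key] += 1
    # Pwin for a bin = home wins in the bin / game-states in the bin.
    return {key: wins[key] / counts[key] for key in counts}


# Made-up game-states, for illustration only.
sample_states = [
    (5.4, -2, False),
    (5.1, -2, True),
    (5.9, -2, False),
    (5.0, -2, False),
    (40.2, 7, True),
]
table = build_lookup_table(sample_states)
print(table[(5, -2)])  # 0.25: one home win out of four game-states
```

Bins that never occur in the data simply have no entry in the table.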

In concept, that is all there is to it. In practice, we need to be a bit cleverer than that. More than 600,000 game-states seems like a lot. But it isn’t really enough to populate every possible bin with many game-states, because the game-states are not uniformly distributed across the bins. (For example, there will be very few games where, 5 minutes into the game, the home team is up by 25 points!) So, when there are few game-states in a bin, there can be (and there is) a lot of noise in the win probability data. It can jump around a lot.

To help put more game-states in any given bin (to reduce the noise), the bin for any given number of minutes remaining could be ‘stretched’ to include any game-states within, say, 3 minutes of that number of minutes remaining. This greatly increases the number of game-states in each bin, thereby reducing the noise in the per-bin *Pwin* values. But it has the potential to make the model less sensitive to time remaining.
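A minimal sketch of such a ‘stretched’ bin follows. The function name `stretched_bin_pwin`, the `half_width` parameter, and the sample records are hypothetical, chosen only to illustrate the idea of a 3-minute window either side of the target time.

```python
def stretched_bin_pwin(game_states, mins_remaining, score_add, half_width=3):
    """Estimate Pwin from all game-states with the given score_add whose
    time remaining is within half_width minutes of mins_remaining.

    Returns (pwin, number_of_game_states); pwin is None for an empty bin.
    """
    wins = 0
    total = 0
    for state_mins, state_score, home_won in game_states:
        in_window = abs(state_mins - mins_remaining) <= half_width
        if in_window and state_score == score_add:
            total += 1
            if home_won:
                wins += 1
    pwin = wins / total if total else None
    return pwin, total


# Made-up game-states: (mins_remaining, score_add, home_won).
sample_states = [
    (2, -2, False),
    (4, -2, True),
    (5, -2, False),
    (8, -2, True),
    (9, -2, True),  # more than 3 minutes away from 5: excluded
    (5, 3, True),   # different score_add: excluded
]
pwin, n_states = stretched_bin_pwin(sample_states, mins_remaining=5, score_add=-2)
print(pwin, n_states)  # 0.5 4
```

Returning the count alongside the estimate makes it easy to see which bins are still too thin to trust.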

Below is a figure that shows what this bin by bin win probability data looks like, roughly.

**Figure 1:** raw win probability data. The horizontal axis is the game minute (80 – *mins_remaining*). Each colored line is a different value of points difference (*score_add*). The vertical axis is the win probability. 0: the away team will win; 1: the home team will win. One can see that this data is relatively messy.

**Trying to Fit Curves to the Data**

Even with the extended bins, the win probability data gathered from the more than 600,000 game states is messy. But we know that, as the game progresses, minute by minute, win probabilities, based on score difference and time remaining, should vary smoothly. So, we could clean up this messy data by fitting curves to this data, instead of simply using a lookup-table.

One could, for example, fit a polynomial to the *Pwin* data as a function of *mins_remaining*, for each value of *score_add*. And I tried that. But it didn’t go particularly well. The set of curves was messy: the various curves, which represent the win probabilities at each value of score difference, ended up crossing one another. But just as we know that *Pwin* must change smoothly and continuously with changes in *mins_remaining*, we also know that as *score_add* goes up, *Pwin* must go up. That means that the curves of *Pwin* as a function of *mins_remaining* at different levels of *score_add* must never cross one another.

So, simply fitting polynomials to these curves didn’t work. But consider what was just discussed here: things that we know must hold true about the *Pwin = f(score_add, mins_remaining)* function were used to reject one way of modeling the data (with polynomials). We can use other things we know about how this function must behave to select different curves to fit the data. That is, we will select basis functions that better describe our phenomenon, and then use the gathered data to determine the undetermined coefficients of those basis functions.

**A Cleverer Approach**

When considering all of the things we know to be true about how *Pwin* should behave, it seems that one could calculate win probability (*Pwin*) based on a given game state (*score_advantage*, called *score_add* above, and *mins_remaining*) in a different way. For every game state, one could calculate the change in score advantage from the current time to the end of the game (simply *score_advantage_final – score_advantage*). We can call this *delta_score_advantage*. And we can make the simplifying assumption that, for a given *mins_remaining*, the distribution of *delta_score_advantage* is the same for any game, *regardless of the current score_advantage*. (It is worth thinking hard about what this assumption means, and whether it is absolutely true, but that discussion could take up many pages.)
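As a small illustration of this quantity, the sketch below computes *delta_score_advantage* for every game state of a single game. The dictionary layout and the sample game are made up for the example.

```python
def delta_score_advantages(game):
    """Return (mins_remaining, delta_score_advantage) for each game state.

    game: dict with a "final_score_advantage" value and a "states" list of
    (mins_remaining, score_advantage) pairs; a hypothetical layout.
    """
    final = game["final_score_advantage"]
    # delta = final score advantage minus the score advantage at that state.
    return [(mins, final - advantage) for mins, advantage in game["states"]]


# A made-up game that the home team wins by 6 points.
sample_game = {
    "final_score_advantage": 6,
    "states": [(80, 0), (55, 3), (30, -4), (5, 6)],
}
deltas = delta_score_advantages(sample_game)
print(deltas)  # [(80, 6), (55, 3), (30, 10), (5, 0)]
```

Pooling these pairs across all games, and grouping them by *mins_remaining*, gives the samples whose distribution is discussed next.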

Below is an example of how the *delta_score_advantage* is distributed, for a given narrow range of times remaining in a game.

**Figure 2:** *delta_score_advantage* at a narrow range of values for *mins_remaining*. The blue histogram is the actual data. The red line is a fitted Laplace distribution function.

If we can fit a curve to this distribution, then, given the current *score_advantage*, it is an easy matter to determine the probability that the final *score_advantage* is positive. That is, it is easy to calculate *Pwin*: it is the area under the curve over the range where *delta_score_advantage* is greater than minus the current *score_advantage*, so that the home team’s final score is higher than the away team’s.
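Taking the fitted curve to be the Laplace density shown in Figure 2, that area has a closed form through the Laplace cumulative distribution function. The sketch below is one way to write it; the function names are chosen for this example, and because the distribution is treated as continuous, draws are ignored.

```python
import math


def laplace_cdf(x, location, scale):
    """Cumulative distribution function of a Laplace(location, scale)."""
    z = (x - location) / scale
    if z < 0:
        return 0.5 * math.exp(z)
    return 1.0 - 0.5 * math.exp(-z)


def pwin_from_laplace(score_advantage, location, scale):
    """Probability that the final score advantage is positive.

    The home team wins when delta_score_advantage > -score_advantage,
    which is the area under the density to the right of -score_advantage.
    """
    return 1.0 - laplace_cdf(-score_advantage, location, scale)


# With a distribution centered on zero, a tied game is a coin toss.
print(pwin_from_laplace(0, location=0.0, scale=5.0))  # 0.5
```

Because the cumulative distribution function is increasing, *Pwin* computed this way always rises with *score_advantage*, so curves for different score advantages cannot cross.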

This method has a great advantage. We can avoid trying to impose complex constraints on our basis functions to make them fit the ‘shape’ of the data. Instead, we can do a bit of digging, and observe that this distribution seems to be well represented by a Laplace probability density function.

You might ask what a Laplace distribution is, and why it was selected. A bit of poking around on the internet, looking for suitable curves that seem to fit the data, may turn up the Laplace distribution. Visually, it turns out that the distribution of *delta_score_advantage* at any selected *mins_remaining* value can be fit quite well with a Laplace probability density function. But more than that, from Wikipedia: “The difference between two independent identically distributed exponential random variables is governed by a Laplace distribution.” If we model the points scored by each team in the time remaining in the game as two such independent identically distributed exponential random variables (are they?), then this suits very well.
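The quoted property is easy to check numerically. The simulation below is only a sanity check under the stated (and questionable) assumption that each team’s remaining points are exponentially distributed; the mean of 10 points is an arbitrary illustrative value. For a Laplace distribution centered on zero with scale *b*, the average absolute value is *b*, so the simulated differences should have mean near 0 and mean absolute value near 10.

```python
import random


def simulate_score_change(mean_points, n_samples, seed=0):
    """Simulate (home points - away points) for the rest of a game, with
    each team's points drawn from the same exponential distribution."""
    rng = random.Random(seed)
    rate = 1.0 / mean_points
    return [rng.expovariate(rate) - rng.expovariate(rate)
            for _ in range(n_samples)]


samples = simulate_score_change(mean_points=10.0, n_samples=100_000)
mean_change = sum(samples) / len(samples)
mean_abs_change = sum(abs(x) for x in samples) / len(samples)
print(round(mean_change, 2), round(mean_abs_change, 2))
```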

The Laplace distribution has two undetermined coefficients: the *location* (the center of the distribution) and the *scale* (how wide the distribution is). All we need to do is determine how the *location* and the *scale* vary as a function of the time remaining.
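One simple way to estimate these two coefficients from a set of samples is maximum likelihood, which for a Laplace distribution gives the sample median as the *location* and the mean absolute deviation from that median as the *scale*. The sketch below assumes that approach; the fitting routine actually used for the model may differ, and the sample values are made up.

```python
import statistics


def fit_laplace(samples):
    """Maximum-likelihood Laplace fit: returns (location, scale)."""
    location = statistics.median(samples)
    scale = sum(abs(x - location) for x in samples) / len(samples)
    return location, scale


# Made-up delta_score_advantage values for one mins_remaining bin.
sample_deltas = [-3, -1, 0, 1, 8]
location, scale = fit_laplace(sample_deltas)
print(location, scale)  # 0 2.6
```

Repeating this fit for each *mins_remaining* bin yields one *location* and one *scale* value per bin.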

When we fit Laplace distributions to the set of *delta_score_advantage* values for each *mins_remaining* value, we find that the *scale* increases smoothly as *mins_remaining* increases. This makes sense: the amount that the score is likely to change in the remainder of the game is going to be greater when there is more time remaining. Also, we find that the *location* is about zero when there is no time remaining, and about 4.5 when there are 80 minutes remaining. But why is this? Well, when the game is nearly done, each team has very little opportunity to score more points. So the center of the distribution of *delta_score_advantage* must be zero when there is no time remaining. But why is the center of the distribution 4.5 points at the beginning of the game? Well, that is the home team advantage, showing up in the data! It turns out that, from the opening kickoff, the average change in the score by the end of the game is 4.5 points in favor of the home team.

The variations of the Laplace *location* and *scale* parameters as a function of *mins_remaining* were quite smooth and continuous. Simple curves were fit to the value of *location* as a function of *mins_remaining* (a straight line), and to the value of *scale* as a function of *mins_remaining* (a third-order polynomial).

So in the end, the resulting model is very simple. All we need are the parameters of the fitted line (to predict the home-team advantage remaining in the game) and of the fitted third-order polynomial (to predict the range of possible change in score in the time remaining) to determine the Laplace distribution for a given *mins_remaining*. And then, with that distribution function, it is easy to calculate the *Pwin* value given the current score.
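Put together, the whole model fits in a handful of lines. The sketch below shows the structure only. The location line is assumed to run from 0 at full time to 4.5 at kickoff (the approximate endpoints quoted above), and the cubic coefficients for the scale are made-up placeholders, not the fitted values. The floor on the scale is also an assumption, added so the placeholder cubic stays positive when *mins_remaining* goes slightly negative.

```python
import math

# Straight line for the location: assumed to pass through 0 at full time
# and 4.5 at 80 minutes remaining.
LOCATION_SLOPE = 4.5 / 80.0
# Cubic c0 + c1*m + c2*m^2 + c3*m^3 for the scale.
# PLACEHOLDER coefficients for illustration, not the fitted values.
SCALE_COEFFS = (1.0, 0.3, -0.002, 0.00001)
MIN_SCALE = 0.05  # assumed floor to keep the scale positive


def laplace_location(mins_remaining):
    """Expected remaining home-team advantage, in points."""
    return LOCATION_SLOPE * mins_remaining


def laplace_scale(mins_remaining):
    """Spread of the possible change in score advantage."""
    c0, c1, c2, c3 = SCALE_COEFFS
    m = mins_remaining
    return max(c0 + c1 * m + c2 * m**2 + c3 * m**3, MIN_SCALE)


def pwin(score_advantage, mins_remaining):
    """Baseline probability that the home team wins."""
    location = laplace_location(mins_remaining)
    scale = laplace_scale(mins_remaining)
    # Home wins if delta_score_advantage > -score_advantage.
    z = (-score_advantage - location) / scale
    cdf = 0.5 * math.exp(z) if z < 0 else 1.0 - 0.5 * math.exp(-z)
    return 1.0 - cdf


for advantage in (-7, 0, 7):
    print(advantage, round(pwin(advantage, mins_remaining=20), 3))
```

With real fitted coefficients in place of the placeholders, `pwin` could be evaluated for any game state, whether in a post-match review or live.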

We can plot the model-predicted win probability curves against the underlying data (that we gathered from our noisy bin values, as described above).

**Figure 3:** Modeled *Pwin* values (curves) vs. bin-by-bin distribution of wins (points) as a function of *mins_remaining*. Each color is a different (range) of values of *score_advantage* (points advantage for the home team).

When we compare the fitted curves of *Pwin* vs. *mins_remaining* for various values of *score_advantage*, we see that the fitted curves seem to fit the data reasonably well. In the plot above, only selected values of *score_advantage* were included; otherwise the graph is just far too cluttered. The “real” data (the dotted lines) are not very smooth, and occasionally cross (which we know actual probability curves should not do). This shows us the challenges of the ‘noisy’ real data.

**Figure 4:** Modeled *Pwin* values (curves) vs. bin-by-bin distribution of wins (points) as a function of *score_advantage*. Each color is a different (range) of values of *mins_remaining*.

When we compare the fitted curves of *Pwin* vs. *score_advantage* for various values of *mins_remaining*, we see that the fitted curves seem to fit the data reasonably well. In the plot above, only selected values of *mins_remaining* were included; otherwise the graph is just far too cluttered. Here, we see that at the extreme left of the graph, when the home team is way behind on the scoreboard, they have very little chance of winning. But their chance of winning goes up slightly when there is more time left in the game. When the score is close (near the middle of the graph from left to right), the win probability is far less certain, one way or the other. And we see that the more time there is left in the game, the less certain things are (the *Pwin* is closer to 0.5). If the score is close, and there is lots of time left, the game is a toss-up. Yep. Of course. The model seems to make sense!

So now, we have what we wanted: a relatively simple bit of math to calculate win probability given the current score-difference and time left in the game. And this math depends upon just a handful of coefficients that were fitted to the data from thousands of top-flight games!

**The Expected Points Model for the Win!**

The model we have described is nice. But how useful is it? If the game is close, and there are just a few minutes left, we might find that the chances of the home team winning (*Pwin*) might be about 0.5. That isn’t rocket science. Any punter could tell you that. But if we combine the use of this model, with what we know about the expected points model, then this can be more powerful.

An __expected points model__ has previously been described. Basically, for a team’s possession at a given location on the field, we can calculate the probability of that team scoring in that possession. Then, for any given action (change in location on the field, or change in possession), we know how the probability of scoring has changed. This change in probability of scoring can be expressed as a points value. For example, from a given location on the field (close to the opposition try-line), with a put-in to the scrum, a team might have a 20% chance of scoring a try. With a try worth 5 points, the expected points of the possession is 1 point. But if the opposition have the put-in to the scrum at that same location, then the (attacking) team may have close to a 0% chance of scoring. This means that knocking the ball on at the base of the scrum (resulting in that change in possession) has a change of expected points of -1 point. Not desirable.
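The arithmetic of that scrum example can be written out as a tiny sketch. It counts only the 5 points for the try itself (conversions, penalties and drop goals are ignored here for simplicity), and the function name is hypothetical.

```python
TRY_POINTS = 5  # points for a try; the conversion is ignored in this sketch


def expected_points(p_try, try_points=TRY_POINTS):
    """Expected points of a possession from the chance of scoring a try."""
    return p_try * try_points


ep_own_put_in = expected_points(0.20)         # attacking team feeds the scrum
ep_opposition_put_in = expected_points(0.0)   # possession lost to a knock-on
delta_expected_points = ep_opposition_put_in - ep_own_put_in
print(delta_expected_points)  # -1.0
```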

Now, if your team is 25 points up, and there are 10 minutes left in the game, then a change of (expected) points of -1 point is no big deal. We can calculate the probability of winning when 25 points up, with 10 minutes left in the game, and the probability of winning when 24 points up with 10 minutes left, and we will find that the change in *Pwin* is tiny. It’s disappointing, but in the big picture, it doesn’t matter.

But if the score is tied, and there is 1 minute left in the game… that is a different matter. The probability of winning when tied with 1 minute left is about 0.5. But the probability of winning if you are down 1 point with 1 minute left is considerably less than 0.5! Now, that mistake is a big deal. And our win probability model can quantify just how impactful that mistake probably was.

So this win probability model, when combined with the expected points model, can be used to further evaluate the *game impact* of any given action during a game.
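A minimal sketch of that combination: treat the change in expected points as a change in the score advantage, and take the difference in *Pwin* before and after. The Laplace *location* and *scale* values passed in below are illustrative guesses for 10 minutes and 1 minute remaining, not the fitted model, and the function names are chosen for this example.

```python
import math


def pwin(score_advantage, location, scale):
    """Home win probability from a Laplace(location, scale) distribution
    of the remaining change in score advantage."""
    z = (-score_advantage - location) / scale
    cdf = 0.5 * math.exp(z) if z < 0 else 1.0 - 0.5 * math.exp(-z)
    return 1.0 - cdf


def win_probability_impact(score_advantage, delta_expected_points,
                           location, scale):
    """Change in Pwin caused by an action worth delta_expected_points."""
    before = pwin(score_advantage, location, scale)
    after = pwin(score_advantage + delta_expected_points, location, scale)
    return after - before


# Illustrative (not fitted) parameters for the two situations in the text.
blowout = win_probability_impact(25, -1.0, location=0.56, scale=4.0)
close = win_probability_impact(0, -1.0, location=0.06, scale=1.3)
print(f"25 up, 10 min left: {blowout:+.4f}")
print(f"tied, 1 min left:   {close:+.4f}")
```

The same -1 expected point barely moves the win probability in the blowout, but shifts it by a large amount in the tied game.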

Math is fun, right?