Forecasting elections is great fun right up until the results come in. As the results of the U.K. general election are reported in the early morning of May 8, we will learn a lot about whether our forecasts were close. But how will we know if we did a good job?
This is not like a U.S. presidential election, where there’s enough polling and few enough states that it’s possible to correctly predict the winner in every state, as several forecasters did in 2012. Last year, FiveThirtyEight published a guide to assessing 2014 Senate forecasts, and here we’re going to assess our own U.K. general election predictions. As there are 632 constituencies in Great Britain,1 a large number of parties, and not that many polls at the constituency level, we’ll have to set our sights a bit lower than perfection. What should we be looking at to evaluate the forecasts? How far off should we expect them to be?
Perhaps the most fundamental result of the election, and certainly the most high-profile one, is whether the Conservatives or the Labour Party wins the most seats. While this may not fully determine who forms the next government, especially since more of the smaller parties are willing to support a Labour government than a Conservative one, it will be the result touted in all the headlines the next morning. But even though the overall seat tally is one of the most important outcomes of the election, it’s also a bad test of a forecasting model. The problem is that there are really only two possibilities. If your forecasting model gives a substantial probability of either Labour or the Conservatives having the most seats (as ours has throughout the campaign), you can neither be very wrong nor very right. It is as if you pulled a die out of your pocket, confidently announced that you believed there was only a one in six chance that you would roll a six, and predicted that the next roll would not be a six. This is a sensible forecast regardless of what actually happens when you roll the die.
In this example, to properly evaluate the key part of the forecast — that there is a one in six chance of rolling a six — you would need to repeat the exercise many times. Unfortunately, we only get one election result, at least if we focus on the national outcome. This means that if we’re going to evaluate whether a forecast was good, we need to dig a bit deeper into the results.
There are a few ways we can do this kind of evaluation. One is to look at the forecast number of seats for each party. After the last election in 2010, the site Political Betting put together a ranking of forecasts based on the sum of the absolute values of the differences between the forecast seats and the actual seats for the Conservatives, the Labour Party and the Liberal Democrats. This absolute seat error is not the only such measure of error we could use, but it’s a reasonable one, so we’ll stick with it for now. Of the six published pre-election forecasts that Political Betting compared in 2010, the best had a total seat error of 51 and the worst — none other than FiveThirtyEight — had a total seat error of 105.2
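The absolute seat error is simple enough to sketch in a few lines of code. The forecast numbers below are hypothetical (they are not any forecaster's actual 2010 predictions); the actual seat counts are the real 2010 results for the three parties.

```python
# Absolute seat error: sum of |forecast seats - actual seats| across parties.
# Forecast numbers are made up for illustration; actuals are the 2010 results.
forecast = {"CON": 312, "LAB": 204, "LD": 103}  # hypothetical forecast
actual   = {"CON": 306, "LAB": 258, "LD": 57}   # 2010 general election results

seat_error = sum(abs(forecast[p] - actual[p]) for p in forecast)
print(seat_error)  # 6 + 54 + 46 = 106
```

Note that the measure weights every seat equally, whichever party it belongs to, and doesn't care about the direction of the miss.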
We can also use this measure to calculate how wrong we expect to be on the day of the election: Because our forecasting method creates lots and lots of simulated elections, we can work out how far each simulated election is from our “best guess” seat forecasts.3 When we were preparing our forecasting model, we applied it (as best we could given the differences in available data sources) to the 2010 election. For that “retrocast,” our three-party absolute seat error was 19, compared to our expected error of 57. However, there was a lot of uncertainty in that retrocast because there was so much less data available in 2010—errors ranged from 12 up to 121 seats in the 90 percent prediction interval. Our retrocast was a bit lucky, in that 19 was at the low end of the 90 percent prediction interval.4
Given the rise of the Scottish National Party, it doesn’t make sense to look only at errors in the traditional top three parties, so we’ll consider all parties in Great Britain instead. If we then look at the distribution of the absolute seat errors that result from comparing each of those simulated elections to our current forecast (as of Sunday, May 3), we get a somewhat narrower expected range than for the 2010 retrocast because we have more data. Ninety percent of the errors are between 10 and 82 seats, and the median absolute seat error is 34 seats. That means that even if our model is working exactly as intended, we still have only about a 50-50 chance of having an absolute seat error less than 34 seats across all parties. If we do much better than that, it either means our forecast was good and we were a little lucky, or our forecast was bad and we were very lucky. Similarly, if we do much worse than 34, either our forecast was good and we were unlucky or our forecast was not very good.
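The expected-error calculation works by scoring each simulated election against the best-guess forecast and then reading off percentiles of the resulting error distribution. Here is a toy version of that summary step, using a synthetic error distribution in place of the model's actual simulations:

```python
import random
import statistics

random.seed(1)

# Synthetic stand-in for the model's output: each draw plays the role of the
# absolute seat error of one simulated election against the best-guess
# forecast. The distribution here is invented, not the model's.
sim_errors = [abs(random.gauss(0, 28)) for _ in range(10_000)]

# statistics.quantiles with n=20 returns the 5%, 10%, ..., 95% cut points.
cuts = statistics.quantiles(sim_errors, n=20)
lo, median, hi = cuts[0], cuts[9], cuts[18]
print(f"median error: {median:.0f} seats; 90% interval: {lo:.0f} to {hi:.0f}")
```

The real model's 10-to-82-seat interval comes from exactly this kind of percentile summary, just computed over its own simulated elections rather than synthetic draws.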
The seat totals are very important politically, but when it comes to evaluating forecasts, there’s an advantage to digging deeper and looking at specific seats. To see why, imagine a two-party election in which you predict both parties get the same number of seats, and come election day, both parties do in fact get the same number of seats, but you got every individual seat prediction wrong. If the errors exactly balance out like this, you would get what looks like good results (at least on a macro level) from what is clearly a terrible forecast model.
So another measure we might want to look at is how many individual seats we incorrectly predict. We can again use our simulations to see how well we ought to do on this measure: We find that we are most likely to predict 30 individual seats incorrectly, with a 90 percent chance that we will predict between 20 and 50 seats incorrectly. We’d be very happy to end up in that range, since even if we got 50 seats wrong, we would have correctly predicted 92 percent of seats in Great Britain.
One limitation of this kind of analysis is that it doesn’t take into account the probabilities we assign to each outcome, so it treats a weak prediction exactly the same as a strong one. For example, if we give Conservatives a 51 percent chance of winning one seat and a 99 percent chance of winning another, the model just marks each of those seats as a predicted win for the Conservatives.
One way to address this is by calculating a multi-category Brier score. The Brier score for each seat is based on the square of the difference between what actually happens and the probabilities that we predicted for each possible outcome. With this kind of “proper scoring rule,” stronger predictions make better (lower) scores possible when predictions are correct, but overly confident wrong predictions yield much worse (higher) scores. One way to think of the Brier score in this context is that it is basically still the number of seats incorrectly predicted, but adjusted for the strength of the predictions, so picking the wrong winner for a toss-up seat doesn’t affect the total too much. The best possible score is 0, which would happen if we made 100 percent predictions for every seat and got them all right, and the worst possible score is 632, which is what we would get if we made 100 percent predictions for every seat and got them all wrong. Our median Brier score across all our simulated elections is 43 “seats,” and the 90 percent prediction interval is 31 to 71 “seats.”
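Concretely, the per-seat score sums the squared gaps between the predicted probability for each party and the actual outcome (1 for the winner, 0 for everyone else), then halves the result so each seat scores between 0 and 1, matching the 0-to-632 “seats” scale described above. The probabilities below are hypothetical examples, not our model's actual numbers:

```python
# Multi-category Brier score for one seat: sum over parties of
# (predicted probability - actual outcome)^2, halved so a seat scores
# 0 when confidently right and 1 when confidently wrong.
def seat_brier(probs, winner):
    return sum((p - (party == winner)) ** 2 for party, p in probs.items()) / 2

# Hypothetical seat-level probabilities, not real forecasts.
toss_up = {"CON": 0.51, "LAB": 0.49}
safe    = {"CON": 0.99, "LAB": 0.01}

print(seat_brier(toss_up, "LAB"))  # wrong toss-up call: 0.2601
print(seat_brier(safe, "LAB"))     # confidently wrong:  0.9801
print(seat_brier(safe, "CON"))     # confidently right:  0.0001
```

This shows the “proper scoring rule” property in action: calling the toss-up wrong costs barely more than calling it right, while the confident miss costs nearly a full seat.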
We’ll return to all of these metrics after the election to see how we did compared to how the model expected us to do. Even if our model is working really well, we won’t get every seat right, and the model doesn’t claim we will. And if our model is wrong in some important way (e.g. underestimating the incumbency advantage or failing to account for tactical voting), we could wind up with a worse record than predicted here. But if we’re within the 90 percent ranges for all these statistics, and particularly if we match or outperform the medians from our simulations, we’ll be happy with how we’ve done.
We think it’s important to set out expectations because these kinds of metrics will be used to decide which forecasters did the best in this election. These simulations illustrate that we can’t learn that much by comparing the seat totals from each forecaster, especially when the various forecasts for this election aren’t that far apart. That means that even if our model is perfectly calibrated, we can’t be confident that our best-guess seat total prediction would end up being the closest.
If we go back to the absolute seat error measure, we can calculate the fraction of our simulated elections in which our forecast “beats” other current forecasts, assuming that our model is perfectly calibrated. For example, our forecast performs better than that of Elections Etc (as of May 3) in 63 percent of simulated elections. This is pretty close to the 50 percent you’d get by chance because our uncertainty about the election outcome is substantially larger than the disagreement between our forecast and Elections Etc’s.5
But Elections Etc is far from the only forecast out there, so the chances that our seat forecast is closer to the actual seat totals than all the other forecasts, given that most of the predictions are clustered around similar numbers, is not very high. Individual seat predictions and Brier scores are a better test, but they can still be affected by luck, and any given forecast might be unlucky on May 7.
We’re not just making excuses! This logic cuts the other way as well. If we are closer than other forecasters in one (or even all) of these measures, that will not necessarily imply our model is better. The trouble with election forecasting is that elections are rare, and each one is unique. We can only learn so much about the performance of our forecasting method from a single election. Unfortunately, this also means that we learn less than we would like about how to improve our approach in the future.