Why are we making probabilistic election forecasts? (and why don’t we put so much effort into them?)

Statistical Modeling, Causal Inference, and Social Science 2024-08-30

Political scientists Justin Grimmer, Dean Knox, and Sean Westwood released a paper that begins:

Probabilistic election forecasts dominate public debate, drive obsessive media discussion, and influence campaign strategy. But in recent presidential elections, apparent predictive failures and growing evidence of harm have led to increasing criticism of forecasts and horse-race campaign coverage. Regardless of their underlying ability to predict the future, we show that society simply lacks sufficient data to evaluate forecasts empirically. Presidential elections are rare events, meaning there is little evidence to support claims of forecasting prowess. Moreover, we show that the seemingly large number of state-level results provide little additional leverage for assessment, because determining winners requires the weighted aggregation of individual state winners and because of substantial within-year correlation.

I agree with all of that. There’s too much horse-race coverage—as we wrote many years ago, “perhaps journalists and then the public will understand that the [general election polls for president] are not worthy of as much attention as they get.” I also agree about the small sample limiting what we can learn about election forecast accuracy from data alone.

Grimmer et al. then continue:

We demonstrate that scientists and voters are decades to millennia away from assessing whether probabilistic forecasting provides reliable insights into election outcomes.

I think that’s an overstatement, and I’ll explain in the context of going through the Grimmer et al. paper. I don’t think this is a bad paper, I just have a disagreement with how they frame part of the question.

Right at the beginning, they write about “recent failures” of election forecasts. I guess that 2016 represents a failure in that some prominent forecasts in the news media were giving Clinton a 90% chance of winning the electoral vote, and she lost. I wouldn’t say that 2020 was such a failure: forecasts correctly predicted the winner, and the error in the predicted vote share (about 2 percentage points of the vote) was within forecast uncertainties. Julia Azari and I wrote about 2016 in our paper, 19 things we learned from the 2016 election, and I wrote about 2020 in the paper, Failure and success in political polling and election forecasting (which involved a nightmare experience with a journal—not the one that ultimately published it). I’m not pointing to those articles because they need to be cited—Grimmer et al. already generously cite me elsewhere!—but just to give a sense that the phrase “recent failures” could be misinterpreted.

Also if you’re talking about forecasting, I strongly recommend Rosenstone’s classic book on forecasting elections, which in many ways was the basis of my 1993 paper with King. See in particular table 1 of that 1993 paper, which is relevant to the issue of fundamentals-based forecasts outperforming poll-based forecasts. I agree completely with the larger point of Grimmer et al. that this table is just N=3; it shows that those elections were consistent with forecasts but it can’t allow us to make any strong claims about the future. That said, the fact that elections can be predicted to within a couple percentage points of the popular vote given information available before the beginning of the campaign . . . that’s an important stylized fact about U.S. general elections for president, not something that’s true in all countries or all elections (see for example here).

Grimmer et al. are correct to point out that forecasts are not just polls! As I wrote the other day, state polls are fine, but this whole obsessing-over-the-state-polls is getting out of control. They’re part of a balanced forecasting approach (see also here) which allows for many sources of uncertainty. Along these lines, fundamentals-based forecasts are also probabilistic, as we point out in our 1993 paper (and is implicitly already there in the error term of any serious fundamentals-based model).

Getting back to Rosenstone, they write that scientists are decades away from knowing if “probabilistic forecasting is more accurate than uninformed pundits guessing at random.” Really??? At first reading I was not quite sure what they meant by “guessing at random.” Here are two possibilities: (1) the pundits literally say that the two candidates are equally likely to win, or (2) the pundits read the newspapers, watch TV, and pull guesses out of their ass in some non-algorithmic way.

If Grimmer et al. are talking about option 1, I think we already know that probabilistic forecasting is more accurate than a coin flip: read Rosenstone and then consider the followup elections of 1984, 1988, 1992, and 1996, none of which were coin flips and all of which were correctly predicted by fundamentals (and, for that matter, by late polls). From 2000 on, all the elections have been close (except 2008, and by historical standards that was pretty close too), so, sure, in 2000, 2004, 2012, 2016, and 2020, the coin-flip forecast wasn’t so bad. But then their argument is leaning very strongly on the current condition of elections being nearly tied.

If they’re talking about option 2, then they have to consider that, nowadays, even uninformed pundits are aware of fundamentals-based ideas of the economy and incumbency, and of course they’re aware of the polls. So, in that sense, sure, given that respected forecasts exist and pundits know about them, pundits can do about as well as forecasts.

OK, now I see in section 3 of their paper that by “guessing at random,” Grimmer et al. really are talking about flipping a coin. I disagree with the method they are using in section 3—or, I should say, I’m not disagreeing with their math; rather, I think the problem is that they’re evaluating each election outcome as binary. But some elections have more information than others. Predicting 1984 or 1996 as a coin flip would be ridiculous. The key here is that forecasters predict the popular and electoral vote margins, not just the winner (see here).

I also don’t see why they are demanding the forecast have a 95% chance of having a higher score, but I guess that’s a minor point compared to the larger issue that forecasts should be evaluated using continuous outcomes.
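To make that concrete, here’s a toy simulation of the difference between scoring on win/loss and scoring on vote share. This is just my own sketch with made-up numbers (not the procedure in their paper, and not what any real forecast does), but it shows why the continuous outcome carries so much more information:

```python
# A minimal sketch (not Grimmer et al.'s procedure): comparing a forecast and a
# "coin flip" baseline on vote margin rather than on win/loss alone.
# All numbers here are made up for illustration.
import numpy as np

rng = np.random.default_rng(1993)
n_elections = 12                          # roughly 48 years of presidential races

# Hypothetical true Democratic two-party vote shares, some landslides included
true_share = rng.normal(0.50, 0.04, size=n_elections)

# An informative forecast: right on average, off by about 2 points per election
forecast_share = true_share + rng.normal(0, 0.02, size=n_elections)
baseline_share = np.full(n_elections, 0.50)       # the "coin flip" pundit

# Binary scoring: Brier score on the win call (forecast win probability crudely
# set to 0.9 or 0.1 depending on which side it predicts; baseline sits at 0.5)
win = (true_share > 0.5).astype(float)
forecast_p = np.where(forecast_share > 0.5, 0.9, 0.1)
brier_forecast = np.mean((forecast_p - win) ** 2)
brier_baseline = np.mean((0.5 - win) ** 2)        # always 0.25

# Continuous scoring: mean absolute error on the vote share itself
mae_forecast = np.mean(np.abs(forecast_share - true_share))
mae_baseline = np.mean(np.abs(baseline_share - true_share))

print(f"Brier score: forecast {brier_forecast:.3f} vs. coin flip {brier_baseline:.3f}")
print(f"Vote-share MAE: forecast {mae_forecast:.3f} vs. coin flip {mae_baseline:.3f}")
```

In repeated runs of this toy example, the margin-based comparison separates the informative forecast from the 50/50 baseline much more quickly than the win/loss comparison does, which is the whole point: throwing away the vote margin throws away most of the information.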

Finally, at the end they ask, why are we making probabilistic forecasts? I have some answers, other than “the lucrative marketing of statistical expertise.” First, political science. Rosenstone made a probabilistic forecasting model back in 1983, and we used an improved version of that model for our 1993 paper. The fact that U.S. general elections for president are predictable, within a few percentage points, helps us understand American politics. Second, recall baseball analyst Bill James’s remark that the alternative to good statistics is not “no statistics,” it’s “bad statistics.” Political professionals, journalists, and gamblers are going to make probabilistic forecasts one way or another; fundamentals-based models exist, polls exist . . . given that this information is going to be combined in some way, I don’t think there’s any shame in trying to do it well.

In summary, I agree with much of what Grimmer et al. have to say. We can use empirical data to shoot down some really bad forecasting models, such as those that were giving Hillary Clinton a 99% chance of winning in 2016 (something that can happen from a lack of appreciation for non-sampling error in polls, a topic that has been studied quantitatively for a long time; see for example this review by Ansolabehere and Belin from 1993), and other times we can see mathematical or theoretical problems even before the election data come in (for example this from October 2020). But once we narrow the field to reasonable forecasts, it’s pretty much impossible to choose between them on empirical grounds. This is a point I made here and here; again, my point in giving all these links is to avoid having to restate what I’ve already written, not to ask them to cite all these things.

I sent the above to Grimmer et al., who responded:

To your point about elections from 30-40 years ago—sure, the forecasts from then look reasonable in retrospect. But again, as our calculations show, we need more information to distinguish those forecasts from other plausible forecasts. Suppose a forecaster like Rosenstone is accurate in his election predictions 85% of the time. It would take 48 years to distinguish his forecast from the 50% accuracy pundit (on average). This would mean that, on average, if Rosenstone started out of sample forecasting in 1980, then in 2028 we’d finally be able to distinguish from the 50% correct pundit. If we built a baseline pundit with more accuracy (say, accounting for obvious elections like you suggest) it would take even longer to determine whether Rosenstone is more accurate than the pundit.

I agree regarding the comparison to the baseline pundit, as this pundit can pay attention to the polls and prediction markets, read election forecasts, etc. The pundit can be as good as a forecast simply by repeating the forecast itself! But my point about Rosenstone is not that his model predicts the winner 85% of the time; it’s that his model (or improved versions of it) predicts the vote margin to a degree of accuracy that allows us to say that the Republicans were heavily favored in 1984 and 1988, the Democrats were favored in 1992 and 1996, etc. Not to mention the forecasts for individual states. Coin flipping only looks like a win if you collapse election forecasting to binary outcomes.

So I disagree with Grimmer et al.’s claim that you would need many decades of future data to learn that a good probabilistic forecast is better than a coin flip. But since nobody’s flipping coins in elections that are not anticipated to be close, this is a theoretical disagreement without practical import.
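Their arithmetic itself is easy to check. Here’s a rough back-of-the-envelope version of the binomial calculation (my own sketch, with my own choices of error rates and power, not their exact procedure):

```python
# Rough power calculation behind the "48 years" figure quoted above: how many
# elections before an 85%-accurate forecaster can be distinguished from a 50%
# coin-flip pundit on win/loss calls alone? My own back-of-the-envelope version.
from scipy.stats import binom

alpha, target_power = 0.05, 0.8
p_forecaster, p_pundit = 0.85, 0.5

for n in range(4, 40):
    # smallest number of correct calls that rejects "coin flip" at level alpha
    critical = binom.ppf(1 - alpha, n, p_pundit) + 1
    power = binom.sf(critical - 1, n, p_forecaster)
    if power >= target_power:
        print(f"{n} elections (about {4 * n} years), power {power:.2f}")
        break
```

Run as is, this lands at around a dozen elections, roughly half a century, in the same ballpark as the figure they quote; the exact number depends on the assumed accuracy and the test you pick. So I’m not disputing the binomial math; my objection is to evaluating forecasts on win/loss calls in the first place.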

So, after all that, my summary is that, by restricting themselves to evaluating forecasts in a binary way, they’ve overstated their case, and I disagree with their negative attitude about forecasting. But, yeah, we’re not gonna ever have the data to compare different reasonable forecasting methods on straight empirical grounds—and that doesn’t even get into the issue that the forecasts keep changing! I’m working with the Economist team and we’re doing our best, given the limited resources we’ve allocated to the problem, but I wouldn’t claim that our models, or anyone else’s, “are the best”; there’s just no way to know.

The other thing that’s relevant here is that not much effort is being put into these forecasts! These are small teams! Ben Goodrich and I have helped out the Economist as a little side gig, and the Economist hasn’t devoted a lot of internal resources to this either. I expect the same is true of Fivethirtyeight and other forecasts. Orders of magnitude more time and money are spent on polling (not to mention campaigns’ private polls and focus groups) than on statistical analysis, poll aggregation, fundamentals models, and the rest. Given that the information is out there, and it’s gonna be combined in some way, it makes sense that a small amount of total effort is put into forecasting.

In that sense, I think we already are at the endgame that Grimmer et al. would like: some version of probabilistic forecasting is inevitable, there’s a demand for it, so a small amount of total resources is spent on it. I get the sense that they think probabilistic forecasts are being taken too seriously, but given that these forecasts currently show a lot of uncertainty (for example, the Economist forecast currently has the race at 60/40), I’d argue that they’re doing their job in informing people about uncertainty.

Prediction markets

I sent the above discussion to Rajiv Sethi, an economist who studies prediction markets. Sethi points to this recent paper, which begins:

Any forecasting model can be represented by a virtual trader in a prediction market, endowed with a budget, risk preferences, and beliefs inherited from the model. We propose and implement a profitability test for the evaluation of forecasting models based on this idea. The virtual trader enters a position and adjusts its portfolio over time in response to changes in the model forecast and market prices, and its profitability can be used as a measure of model accuracy. We implement this test using probabilistic forecasts for competitive states in the 2020 US presidential election and congressional elections in 2020 and 2022, using data from three sources: model-based forecasts published by The Economist and FiveThirtyEight, and prices from the PredictIt exchange. The proposed approach can be applied more generally to any forecasting activity as long as models and markets referencing the same events exist.

Sethi writes:

I suspect that the coin flip forecaster would lose very substantial sums.

Joyce Berg and colleagues have been looking at forecasting accuracy of prediction markets for decades, including the IEM vote share markets. This survey paper is now a bit dated but has vote share performance relative to polls for elections in many countries.

They are looking at markets rather than models but the idea that we don’t have enough data to judge would seem to apply to both.

I think these comparisons need to deal with prediction markets; the implicit suggestion in the paper (I think) is that we don’t know (and will never know) whether they can be beaten by coin flippers, and I think we do know.

Yes, as discussed above, I think the problem is that Grimmer et al. were only using binary win/loss outcomes in the analysis where they compared forecasts to coin flips. Throwing away the information on vote margin is going to make it much much harder to distinguish an informative forecast from noise.
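The virtual-trader idea from the paper Sethi points to can be sketched in a few lines. Here’s my own toy version (hypothetical prices and probabilities, and a much cruder trading rule than what the paper actually implements):

```python
# Toy version of a "virtual trader" profitability test: a forecasting model is
# represented by a trader who buys a $1-if-it-happens contract whenever its
# probability exceeds the market price and shorts it when the price is higher.
# The real test in the paper uses a budget, risk preferences, and portfolio
# adjustment over time; the prices and probabilities below are made up.

def virtual_trader_profit(model_probs, market_prices, outcome, stake=1.0):
    """Profit from trading one contract per period and holding to expiration."""
    payoff = 1.0 if outcome else 0.0
    profit = 0.0
    for p_model, price in zip(model_probs, market_prices):
        if p_model > price:        # model says the contract is cheap: buy
            profit += stake * (payoff - price)
        elif p_model < price:      # model says it is overpriced: short
            profit += stake * (price - payoff)
    return profit

# Hypothetical example: model at around 60%, market drifting between 45% and
# 55%, and the event ends up happening.
print(virtual_trader_profit([0.60, 0.62, 0.58], [0.45, 0.50, 0.55], outcome=True))
```

A coin-flip forecaster dropped into this kind of test would be trading essentially at random against the market, which is the sense in which Sethi expects it to lose substantial sums.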

Commercial election forecasting

I sent the above discussion to Fivethirtyeight’s Elliott Morris, who wrote:

It’s interesting to see how academics answer the questions we ask ourselves all the time in model development and forecast evaluation. Whether we are better than dart-throwing is a very important question.

2. I’m reasonably confident we are better. As a particularly salient example, a monkey does not know that Wyoming is (practically speaking) always going to be red and CT always blue in 2024. Getting one of those states wrong would certainly erase any gains in accuracy (take the Brier score) from reverting probabilities towards 50-50 in competitive states.

3. Following from that, it seems a better benchmark (what you might call the “smarter pundit” model) would be how the state voted in the last election—or even better, p(win | previous win + some noise). That might still not replicate what pundits are doing but I’d find losing to that hypothetical pundit more troubling for the industry.

4. Could you propose a method that grades the forecasters in terms of distance from the result on vote share grounds? This is closer to how we think of things (we do not think of ourselves as calling elections) and adds some resolution to the problem and I imagine we’d see separation between forecasters and random guessing (centered around previous vote, maybe), much sooner (if not practically immediately).

5. Back on the subject of how we grade different forecasts, we calculate the LOOIC of candidate models on out-of-sample data. Why not create the dumb pundit model in Stan and compare information criterion in a Bayesian way? I think this would augment the simulation exercise nicely.

6. Bigger picture, I’m not sure what you’re doing is really grading the forecasters on their forecasting skill. Our baseline accuracy is set by the pollsters, and it is hard to impossible to overcome bias in measurement. So one question would be whether pollsters beat random guessing. Helpfully you have a lot more empirical data there to test the question. Then, if polls beat the alternative in long-range performance (maybe assign binary wins/losses for surveys outside the MOE?), and pollsters don’t, that is a strong indictment.

7. An alternative benchmark would be the markets. Rajiv’s work finds traders profited off the markets last year if they followed the models and closed contracts before expiration. Taking this metaphor: If you remove assignment risk from your calculations, how would a new grading methodology work? Would we need a hundred years of forecasts or just a couple cycles of beating the CW?

Morris’s “smarter pundit” model in his point #3 is similar to the fundamentals-based models that combine national and state predictors, including past election results. This is what we did in our 1993 paper (we said that elections were predictable given information available ahead of time, so we felt the duty to make such a prediction ourselves) and what is done in a much improved way to create the fundamentals-based forecast for the Economist, Fivethirtyeight, etc.
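For what it’s worth, a crude version of Morris’s “smarter pundit” baseline is easy to write down. Here’s a toy sketch (the state margins, outcomes, and noise levels are invented for illustration; this is not the Fivethirtyeight or Economist method, and not anything Morris has endorsed):

```python
# Toy "smarter pundit" baseline in the spirit of Morris's point 3: predict each
# state from its previous margin plus noise, then score with the Brier score.
# The margins, outcomes, and noise scales below are invented for illustration.
import numpy as np

rng = np.random.default_rng(538)

# Hypothetical previous-cycle Democratic margins (percentage points)
prev_margin = {"WY": -43.0, "CT": 20.0, "PA": 1.2, "AZ": 0.3, "WI": 0.6}
# Hypothetical current-cycle outcomes (1 = Democratic win)
outcome = {"WY": 0, "CT": 1, "PA": 1, "AZ": 0, "WI": 1}

def smarter_pundit_prob(margin, swing_sd=5.0, n_sims=10_000):
    """P(Dem win) if this cycle's margin were last cycle's margin plus noise."""
    sims = margin + rng.normal(0, swing_sd, n_sims)
    return float((sims > 0).mean())

brier_pundit = np.mean([(smarter_pundit_prob(prev_margin[s]) - outcome[s]) ** 2
                        for s in outcome])
brier_coin = np.mean([(0.5 - outcome[s]) ** 2 for s in outcome])   # always 0.25

print(f"smarter-pundit Brier score: {brier_pundit:.3f}, coin flip: {brier_coin:.3f}")
```

Even this crude baseline beats the coin flip, mostly because it doesn’t treat Wyoming and Connecticut as tossups, which is Morris’s point. The interesting comparisons start from a baseline at least this good, not from 50/50 guessing.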

Political science

Above I wrote about the relevance of effective forecasting to our understanding of elections. Following up on Sethi’s mention of Berg et al.’s research on prediction markets and polls, political scientist Chris Wlezien points us to two papers with Bob Erikson.

Are prediction markets really superior to polls as election predictors?:

We argue that it is inappropriate to naively compare market forecasts of an election outcome with exact poll results on the day prices are recorded, that is, market prices reflect forecasts of what will happen on Election Day whereas trial-heat polls register preferences on the day of the poll. We then show that when poll leads are properly discounted, poll-based forecasts outperform vote-share market prices. Moreover, we show that win projections based on the polls dominate prices from winner-take-all markets.

Markets vs. polls as election predictors: An historical assessment:

When we have both market prices and polls, prices add nothing to election prediction beyond polls. To be sure, early election markets were (surprisingly) good at extracting campaign information without scientific polling to guide them. For more recent markets, candidate prices largely follow the polls.

This relates to my point above that one reason people aren’t always impressed by poll-based forecasts is that, from the polls, they already have a sense of what they expect will happen.

I sent the above discussion to economist David Rothschild, who added that, even beyond whatever predictive value they give,

Polling data (and prediction market data) are valuable for political scientists to understand the trajectory and impact of various events to get to the outcome. Prediction markets (and high frequency polling) in particular allow for event studies.

Good point.

Rothschild adds:

Duncan Watts and I have written extensively on the massive imbalance of horse-race coverage to policy coverage in Mainstream Media election coverage. Depending on how you count it, no more than 5-10% of campaign coverage, even at the New York Times, covers policy in any remotely informative way. There are a myriad of reasons to be concerned about the proliferation of horse-race coverage, and how it is used to distract or misinform news consumers. But, to me, that seems like a separate question from making the best forecasts possible from the available data (how much horse-race coverage should we have), rather than reverting to earlier norms of focusing on individual polls without context (conditional on horse-race coverage, should we make it as accurate and contextualized as possible).

Summary

I disagree with Grimmer et al. that we can’t distinguish probabilistic election forecasts from coin flips. Election forecasts, at the state and national level, are much better than coin flips, as long as you include non-close elections, such as many states nowadays and most national elections before 2000. If all future elections are as close in the electoral college as 2016 and 2020, then, sure, the national forecasts aren’t much better than coin flips, but then their conclusion is leaning very strongly on that condition. In talking about evaluation of forecasting accuracy, I’m not offering a specific alternative here—my main point is that the evaluation should use the vote margin, not just win/loss. When comparing to coin flipping, Grimmer et al. only look at predicting the winner of the national election, but when comparing forecasts, they also look at electoral vote totals.

I agree with Grimmer et al. that it is essentially impossible from forecasting accuracy alone to choose between reasonable probabilistic forecasts (such as those from the Economist and Fivethirtyeight in 2020 and 2024, or from prediction markets, or from fundamentals-based models in the Rosenstone/Hibbs/Campbell/etc. tradition). N is just too small; also, the models themselves, along with the underlying conditions, change from election to election, so it’s not even like there are stable methods to make such a comparison.

Doing better than coin flipping is not hard. Once you get to a serious forecast using national and state-level information and appropriate levels of uncertainty, there are lots of ways to go, and there are reasons to choose one forecast over another based on your take on the election, but you’re not gonna be able to empirically rate them based on forecast accuracy, a point that Grimmer et al. make clearly in their Table 2.

Grimmer et al. conclude:

We think that political science forecasts are interesting and useful. We agree that the relatively persistent relationship between those models and vote share does teach us something about politics. In fact when one of us (Justin) teaches introduction to political science, his first lecture focuses on these fundamental only forecasts. We also agree it can be useful to average polls to avoid the even worse tendency to focus on one or two outlier polls and overinterpret random variation as systematic changes.

It is a leap to go from the usefulness of these models for academic work or poll averaging to justifying the probabilities that come from these models. If we can never evaluate the output of the models, then there is really no way to know if these probabilities correspond to any sort of empirical reality. And what’s worse, there is no way to know that the fluctuations in probability in these models are any more “real” than the kind of random musing from pundits on television.

OK, I basically agree (even if I think “there is really no way to know if these probabilities correspond to any sort of empirical reality” is a slight overstatement).

Grimmer et al. are making a fair point. My continuation of their point is to say that this sort of poll averaging is gonna be done, one way or another, so it makes sense to me that news organizations will try to do it well. Which in turn should allow the pundits on television to be more reasonable. I vividly recall 1988, when Dukakis was ahead in the polls but my political scientist told me that Bush was favored because of the state of the economy (I don’t recall hearing the term “fundamentals” before our 1993 paper came out). The pundits can do better now, but conditions have changed, and national elections are much closer.

All this discussion is minor compared to horrors such as election denial (Grimmer wrote a paper about that too), and I’ll again say that the total resources spent on probabilistic forecasting are low.

One thing I think we can all agree on is that there are better uses of resources than endless swing-state and national horserace polls, and that there are better things for political observers to focus on than election forecasts. Ideally, probabilistic forecasts should help for both these things, first by making it clear how tiny the marginal benefit is from each new poll, and second by providing wide enough uncertainties that people can recognize that the election is up in the air and it’s time to talk about what the candidates might do if they win. Unfortunately, poll averaging does not seem to have reduced the attention being paid to polls, and indeed the existence of competing forecasts just adds drama to the situation. Which perhaps I’m contributing to, even while writing a post saying that there are too many polls and that poll aggregation isn’t all that.

Let me give the last word to Sean Westwood (the third author of the above-discussed paper), who writes:

Americans are confused by polls and even more confused by forecasts. A significant point in our work is that without an objective assessment of performance, it is unclear how Americans should evaluate these forecasts. Is being “right” in a previous election a sufficient reason to trust a forecaster or model? I do not believe this can be the standard. Lichtman claims past accuracy across many elections, and people evaluated FiveThirtyEight in 2016 with deference because of their performance in 2008 and 2012. While there is value in past accuracy, there is no empirical reason to assume it is a reliable indicator of overall quality in future cycles. We might think it is, but at best this is a subjective assessment.

Agreed.