What are the odds, II: the Venezuelan presidential election

What's new 2024-08-02

In a previous blog post, I discussed how, from a Bayesian perspective, learning about some new information {E} can update one’s perceived odds {{\mathbb P}(H_1) / {\mathbb P}(H_0)} about how likely an “alternative hypothesis” {H_1} is, compared to a “null hypothesis” {H_0}. The mathematical formula here is

\displaystyle  \frac{{\mathbb P}(H_1|E)}{{\mathbb P}(H_0|E)} = \frac{{\mathbb P}(E|H_1)}{{\mathbb P}(E|H_0)} \frac{{\mathbb P}(H_1)}{{\mathbb P}(H_0)}. \ \ \ \ \ (1)

Thus, provided one has
  • (i) A precise formulation of the null hypothesis {H_0} and the alternative hypothesis {H_1}, and the new information {E};
  • (ii) A reasonable estimate of the prior odds {{\mathbb P}(H_1) / {\mathbb P}(H_0)} of the alternative hypothesis {H_1} being true (compared to the null hypothesis {H_0});
  • (iii) A reasonable estimate of the probability {{\mathbb P}(E|H_0)} that the event {E} would occur under the null hypothesis {H_0}; and
  • (iv) A reasonable estimate of the probability {{\mathbb P}(E|H_1)} that the event {E} would occur under the alternative hypothesis {H_1},
one can then obtain an estimate of the posterior odds {{\mathbb P}(H_1|E) / {\mathbb P}(H_0|E)} of the alternative hypothesis being true after observing the event {E}. Ingredient (iii) is usually the easiest to compute, but is only one of the key inputs required; ingredients (ii) and (iv) are far trickier, involve subjective non-mathematical considerations, and as discussed in the previous post, depend rather crucially on ingredient (i). For instance, selectively reporting some relevant information for {E}, and withholding other key information from {E}, can make the final mathematical calculation {{\mathbb P}(H_1|E) / {\mathbb P}(H_0|E)} misleading with regard to the “true” odds of {H_1} being true compared to {H_0} based on all available information, even if the calculation was technically accurate with regard to the partial information {E} provided. Also, the computation of the relative odds of two competing hypotheses {H_0} and {H_1} can become somewhat moot if there is a third hypothesis {H_2} that ends up being more plausible than either of these two hypotheses. Nevertheless, the formula (1) can still lead to useful conclusions, albeit ones that are qualified by the particular assumptions and estimates made in the analysis.

At a qualitative level, the Bayesian identity (1) is telling us the following: if an alternative hypothesis {H_1} was already somewhat plausible (so that the prior odds {{\mathbb P}(H_1) / {\mathbb P}(H_0)} were not vanishingly small), and the observed event {E} was significantly more likely to occur under hypothesis {H_1} than under {H_0}, then the hypothesis {H_1} becomes significantly more plausible (in that the posterior odds {{\mathbb P}(H_1|E) / {\mathbb P}(H_0|E)} become quite elevated). This is quite intuitive, but as discussed in the previous post, a lot hinges on how one defines the alternative hypothesis {H_1}.
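
To make the identity (1) concrete, here is a minimal Python sketch of the odds update. The function name `posterior_odds` and the numerical inputs are purely illustrative assumptions, not estimates for any real situation; exact rational arithmetic is used only to keep the example free of rounding noise.

```python
from fractions import Fraction

def posterior_odds(prior_odds, p_e_given_h1, p_e_given_h0):
    """Identity (1): posterior odds = likelihood ratio times prior odds."""
    likelihood_ratio = Fraction(p_e_given_h1) / Fraction(p_e_given_h0)
    return likelihood_ratio * Fraction(prior_odds)

# Hypothetical inputs, chosen only for illustration:
# prior odds of 1 to 100 for H_1, and E ten times likelier under H_1 than H_0.
odds = posterior_odds(Fraction(1, 100), Fraction(1, 10), Fraction(1, 100))
print(odds)  # 1/10
```

In this toy example the evidence multiplies the odds by ten, but the posterior odds are still only 1 to 10, because the prior odds were small to begin with.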

In the previous blog post, this calculation was initially illustrated with the following choices of {H_0}, {H_1}, and {E} (thus fulfilling ingredient (i)):

  • {E} was the event that the October 1, 2022 PCSO Grand Lotto in the Philippines drew the numbers {9, 18, 27, 36, 45, 54} (that is to say, consecutive multiples of {9}), though not necessarily in that order;
  • {H_0} was the null hypothesis that the lottery was fair and the numbers were drawn uniformly at random (without replacement) from the set {\{1,\dots,55\}}; and
  • {H_1} was the alternative hypothesis that the lottery was rigged by some corrupt officials for their personal gain.
In this case, ingredient (iii) can be computed mathematically and precisely:

\displaystyle  {\mathbb P}(E|H_0) = \frac{1}{\binom{55}{6}} = \frac{1}{28,989,675}. \ \ \ \ \ (2)
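
The binomial coefficient in (2) is easy to check numerically; the short sketch below uses Python's standard `math.comb` (the variable name `outcomes` is just for illustration).

```python
from math import comb

# Number of ways to choose 6 numbers out of 55, ignoring order.
outcomes = comb(55, 6)
print(outcomes)      # 28989675

# Probability of any one specific unordered draw under the fair-lottery model H_0.
print(1 / outcomes)
```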

And, with a not inconceivable level of cynicism about the integrity of the lottery, the prior odds (ingredient (ii)) can be argued to be non-negligible. However, ingredient (iv) was nearly impossible to estimate: indeed, as argued in that post, there is no reason to suspect that {{\mathbb P}(E|H_1)} is much larger than the tiny probability (2), and in fact it could well be smaller (since it would likely be in the interest of corrupt lottery officials to not draw attention to their activities). So, despite the very small numerical value of the probability (2), this did not lead to any significant increase in the odds of the alternative hypothesis. In the previous blog post, several other variants {H'_1}, {H''_1}, {H'''_1} of the alternative hypothesis {H_1} were also discussed; the conclusion was that while some choices of alternative hypothesis could lead to elevated probabilities for ingredient (iv), they came at the cost of significantly reducing the prior odds in ingredient (ii), and so no alternative hypothesis was located which ended up being significantly more plausible than the null hypothesis {H_0} after observing the event {E}.

In this post, I would like to run the same analysis on a numerical anomaly in the recent Venezuelan presidential election of July 28, 2024. Here are the officially reported vote totals for the two main candidates, incumbent president Nicolás Maduro and opposition candidate Edmundo Gonzáles, in the election:

  • Maduro: 5,150,092 votes
  • Gonzáles: 4,445,978 votes
  • Other: 462,704 votes
  • Total: 10,058,774 votes.
The numerical anomaly is that if one multiplies the total number of voters {10,058,774} by the round percentages {51.2\%}, {44.2\%}, {4.6\%}, one recovers exactly the above vote counts after rounding to the nearest integer:

\displaystyle  51.2\% \times 10,058,774 = 5,150,092.288

\displaystyle  44.2\% \times 10,058,774 = 4,445,978.108

\displaystyle  4.6\% \times 10,058,774 = 462,703.604.
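
These three products can be verified with a few lines of Python. The sketch below is only a sanity check of the arithmetic; it uses exact fractions rather than floating point so that the rounding step is unambiguous, and the helper `nearest_integer` is a name introduced here for illustration.

```python
from fractions import Fraction

N = 10_058_774  # reported total number of votes
reported = {"Maduro": 5_150_092, "Gonzáles": 4_445_978, "Other": 462_704}
# Round percentages, expressed in units of 0.1%.
tenths_of_percent = {"Maduro": 512, "Gonzáles": 442, "Other": 46}

def nearest_integer(x):
    """Round a non-negative Fraction to the nearest integer (halves round up)."""
    return (2 * x.numerator + x.denominator) // (2 * x.denominator)

matches = {}
for name, k in tenths_of_percent.items():
    exact = Fraction(k * N, 1000)  # exact value of (k / 1000) * N
    matches[name] = nearest_integer(exact) == reported[name]
    print(name, float(exact), matches[name])

# The three reported counts also add up to the reported total.
print(sum(reported.values()) == N)
```

All three comparisons come out true, confirming that each reported count is the nearest integer to the corresponding round percentage of the total.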

Let us try to apply the above Bayesian framework to this situation, bearing in mind the caveat that this analysis is only as strong as the inputs supplied and assumptions made (for instance, to simplify the discussion, we will not also discuss information from exit polling, which in this case gave significantly different predictions from the percentages above).

The first step (ingredient (i)) is to formulate the null hypothesis {H_0}, the alternative hypothesis {H_1}, and the event {E}. Here is one possible choice:

  • {E} is the event that the reported vote totals for Maduro, Gonzáles, and Other are all equal to the nearest integer to the total number of voters multiplied by a round percentage with one decimal place (i.e., an integer multiple of {0.1\%}).
  • {H_0} is the null hypothesis that the vote totals were reported accurately (or with only inconsequential inaccuracies).
  • {H_1} is the alternative hypothesis that the vote totals were manipulated by officials from the incumbent administration.

Ingredient (ii) – the prior odds that {H_1} is true over {H_0} – is highly subjective, and an individual’s estimation of (ii) would likely depend on, or at least be correlated with, their opinion of the current Venezuelan administration. Discussion of this ingredient is therefore more political than mathematical, and I will not attempt to quantify it further here.

Now we turn to (iii), the estimation of the probability {{\mathbb P}(E|H_0)} that {E} occurs given the hypothesis {H_0}. This cannot be computed exactly without a precise probabilistic model of the voting electorate, but let us make a rough order of magnitude calculation as follows. One can focus on the anomaly just for the number of votes received by Maduro and Gonzáles, since if both of these counts were the nearest integer to a round percentage, then by simple subtraction the number of votes for “other” would also be forced to be the nearest integer to a round percentage, possibly plus or minus one due to carries; so up to a factor of two or so we can ignore the latter anomaly. As a simple model, suppose that the voting percentages for Maduro and Gonzáles were distributed more or less uniformly in some square {[p-\varepsilon,p+\varepsilon] \times [q-\varepsilon,q+\varepsilon]}, where {p, q} are some proportions not too close to either {0} or {1}, and {\varepsilon} is some reasonably large margin of error (the exact values of these parameters will end up not being too important, nor will the specific shape of the distribution; indeed, the shape and size of the square here only impacts the analysis through the area {(2\varepsilon)^2} of the square, and even this quantity cancels itself out in the end). Thus, the number of votes for Maduro is distributed in an interval of length about {2\varepsilon N}, where {N = 10,058,774} is the number of voters, and similarly for Gonzáles, so the total number of different outcomes here is about {(2\varepsilon N)^2}, and by our model we have a uniform distribution amongst all these outcomes. On the other hand, the total number of attainable round percentages for Maduro is about {(2\varepsilon) / 0.1\% = 1000 \times 2\varepsilon}, and similarly for Gonzáles, so our estimate for {{\mathbb P}(E|H_0)} is

\displaystyle {\mathbb P}(E|H_0) \approx \frac{(1000 \times 2\varepsilon)^2}{(2\varepsilon N)^2} = (1000/N)^2 \approx 10^{-8}.
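
As a sanity check on this heuristic, one can count the round-percentage outcomes directly under the uniform model. The sketch below is only an illustration: the windows of vote shares (40% to 60% for one candidate, 35% to 55% for the other, i.e. {\varepsilon = 0.1}) are arbitrary assumptions, as are the function names.

```python
N = 10_058_774  # reported total number of votes

def round_percentage_counts(n_voters):
    """All vote counts that are the nearest integer to (k/1000) * n_voters."""
    return {(2 * k * n_voters + 1000) // 2000 for k in range(1001)}

def prob_round(n_voters, lo, hi):
    """Probability that a count uniform on the integers lo..hi-1 is 'round'."""
    special = round_percentage_counts(n_voters)
    hits = sum(1 for c in special if lo <= c < hi)
    return hits / (hi - lo)

# Assumed windows of vote counts for the two candidates.
p_first = prob_round(N, 40 * N // 100, 60 * N // 100)
p_second = prob_round(N, 35 * N // 100, 55 * N // 100)

# Under the uniform-on-a-square model the two counts are independent.
p_joint = p_first * p_second
print(p_joint, (1000 / N) ** 2)
```

Both printed numbers are close to {10^{-8}}, and changing the windows barely affects the result, in line with the cancellation of {\varepsilon} noted above.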

This looks quite unlikely! But we are not done yet, because we also need to estimate {{\mathbb P}(E|H_1)}, the probability that the event {E} would occur under the alternative hypothesis {H_1}. Here one has to be careful, because while it could happen under hypothesis {H_1} that the vote counts were manipulated to be exactly the nearest integer to a round percentage, this is not the only outcome under this hypothesis, and indeed one could argue that it would not be in the interest of an administration to generate such a striking numerical anomaly. But one can create a reasonable chain of events with which to estimate (from below) this probability by a kind of “Drake equation”. Consider the following variants of {H_1}:
  • {H'_1}: {H_1} is true, and the administration directs election officials to report vote outcomes with some explicitly preferred (round) percentages, regardless of the actual election results.
  • {H''_1}: {H'_1} is true, and the election officials dutifully generate a report by multiplying these preferred percentages by the total number {N} of voters, and rounding to the nearest integer, without any attempt to disguise their actions.
By the chain rule for conditional probability, one has a lower bound

\displaystyle  {\mathbb P}(E|H_1) \geq {\mathbb P}(E, H''_1, H'_1|H_1) = {\mathbb P}(E|H''_1) {\mathbb P}(H''_1|H'_1) {\mathbb P}(H'_1|H_1).

Inserting this into (1), we obtain our final lower bound:

\displaystyle  \frac{{\mathbb P}(H_1|E)}{{\mathbb P}(H_0|E)} \gtrapprox 10^8 \times {\mathbb P}(E|H''_1) {\mathbb P}(H''_1|H'_1) {\mathbb P}(H'_1|H_1) \frac{{\mathbb P}(H_1)}{{\mathbb P}(H_0)}.

This is about as far as one can get purely with mathematical analysis. Beyond this, one has to make some largely subjective estimations for each of the remaining probabilities and odds in this formula. As mentioned, the prior odds {{\mathbb P}(H_1)/{\mathbb P}(H_0)} will likely depend on the individual making this calculation, and will not be discussed further here. The remaining question then is how large the probabilities {{\mathbb P}(H'_1|H_1)}, {{\mathbb P}(H''_1|H'_1)}, and {{\mathbb P}(E|H''_1)} are. In other words:
  • If one assumes that the administration wishes to manipulate the vote totals, how likely is it a priori (i.e., without being aware of the anomaly {E}) that they would do so by explicitly selecting preferred round percentages and then requesting that election officials report these percentages?
  • If one assumes that election officials are being ordered to report vote totals to reflect a preferred round percentage, how likely is it a priori that they would follow the orders without question, and perform simple rounding instead of any more sophisticated numerical manipulation?
  • If one assumes that election officials did indeed follow the orders as above, how likely is it a priori that the report would be published as is without any concerns raised by other officials or observers?
If one’s estimate of the product of these three probabilities is significantly greater than {10^{-8}}, then we can conclude that the event {E} has indeed significantly raised the odds of some sort of voting manipulation being present. The scenario described above is somewhat plausible, especially in light of the anomaly {E}, and so certainly the posterior probabilities {{\mathbb P}(H'_1|H_1,E)}, {{\mathbb P}(H''_1|H'_1,E)} seem quite large (and the posterior probability {{\mathbb P}(E|H''_1,E)} is of course equal to {1}). But it is important here to avoid confirmation bias and work only with a priori probabilities – roughly speaking, the probabilities that one would assign to such events on July 27, 2024, before knowledge of the anomaly {E} came to light. Nevertheless, even after accounting for confirmation bias, I think it is plausible that the above product of a priori probabilities is indeed significantly larger than {10^{-8}} (for instance, to assign probabilities somewhat randomly, this would be the case if each of the conditional probabilities {{\mathbb P}(H'_1|H_1)}, {{\mathbb P}(H''_1|H'_1)} and {{\mathbb P}(E|H''_1)} exceeds {1\%}), giving credence to the theory of the election report being manipulated (though it is possible that the anomaly could instead be explained by a third hypothesis {H_2} not covered by the original two hypotheses, such as a software glitch). If one adds in additional information beyond the purely numerical anomaly {E}, such as the fact that the reported totals were not broken down further by voting district (which would be less likely under hypothesis {H_0} than hypothesis {H_1}), and that exit polls gave significantly different results from the reported totals (which is again less likely under hypothesis {H_0} than hypothesis {H_1}), the evidence for voting irregularities becomes quite significant.
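
To illustrate how the lower bound behaves, the sketch below plugs the somewhat arbitrary {1\%} figures mentioned above into the bound on the likelihood ratio {{\mathbb P}(E|H_1)/{\mathbb P}(E|H_0)}. The function name and the inputs are illustrative assumptions only; the prior odds are deliberately left out, since they are not estimated in this post.

```python
N = 10_058_774                  # reported total number of votes
p_e_given_h0 = (1000 / N) ** 2  # rough estimate of P(E|H_0), about 1e-8

def likelihood_ratio_lower_bound(p_h1p_given_h1, p_h1pp_given_h1p, p_e_given_h1pp):
    """Lower bound on P(E|H_1) / P(E|H_0) via the chain H_1 -> H'_1 -> H''_1 -> E."""
    p_e_given_h1_lower = p_e_given_h1pp * p_h1pp_given_h1p * p_h1p_given_h1
    return p_e_given_h1_lower / p_e_given_h0

# Illustrative a priori guesses of 1% for each link of the chain.
bound = likelihood_ratio_lower_bound(0.01, 0.01, 0.01)
print(bound)
```

With these inputs the bound is roughly {100}, so under these assumed values the anomaly would multiply whatever prior odds one holds by a factor of about a hundred or more.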

One can contrast this analysis with that of the Philippine lottery in the original post. In both cases the probability {{\mathbb P}(E|H_0)} of the observed event under the null hypothesis was extremely small. However, in the case of the Venezuelan election, there is a plausible causal chain {H_1 \implies H'_1 \implies H''_1 \implies E} that leads to an elevated probability {{\mathbb P}(E|H_1)} of the observed event under the alternative hypothesis, whereas in the case of the lottery, only extremely implausible chains could be constructed that would lead to the specific outcome of a multiples-of-9 lottery draw for that specific lottery on that specific date.