Intransitive dice VI: sketch proof of the main conjecture for the balanced-sequences model

Gowers's Weblog 2018-03-10

I have now completed a draft of a write-up of a proof of the following statement. Recall that a random n-sided die (in the balanced-sequences model) is a sequence of length n of integers between 1 and n that add up to n(n+1)/2, chosen uniformly from all such sequences. A die A=(a_1,\dots,a_n) beats a die B=(b_1,\dots,b_n) if the number of pairs (i,j) such that a_i>b_j exceeds the number of pairs (i,j) such that a_i<b_j. If the two numbers are the same, we say that A ties with B.
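To make the definitions concrete, here is a minimal Python sketch of the balanced-sequences model. The function names, and the use of rejection sampling to draw a die, are illustrative assumptions rather than anything taken from the write-up.

```python
import random

def random_balanced_die(n, rng):
    """Draw uniformly from the sequences in [n]^n with sum n(n+1)/2.

    Rejection sampling (chosen here only for simplicity): keep drawing
    uniform sequences until one has the right sum.
    """
    target = n * (n + 1) // 2
    while True:
        die = [rng.randint(1, n) for _ in range(n)]
        if sum(die) == target:
            return die

def compare(a, b):
    """Return 1 if a beats b, -1 if b beats a, and 0 for a tie."""
    wins = sum(1 for x in a for y in b if x > y)
    losses = sum(1 for x in a for y in b if x < y)
    return (wins > losses) - (wins < losses)

# (1,3,3,3) against (2,2,2,4): 9 winning pairs and 7 losing pairs.
print(compare([1, 3, 3, 3], [2, 2, 2, 4]))  # prints 1
```

Rejection sampling needs of order n^{3/2} attempts on average, so it is only sensible for small n.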

Theorem. Let A,B and C be random n-sided dice. Then the probability that A beats C given that A beats B and B beats C is \frac 12+o(1).

In this post I want to give a fairly detailed sketch of the proof, which will I hope make it clearer what is going on in the write-up.

The first step is to show that the theorem is equivalent to the following statement.

Theorem. Let A be a random n-sided die. Then with probability 1-o(1), the proportion of n-sided dice that A beats is \frac 12+o(1).

We had two proofs of this statement in earlier posts and comments on this blog. In the write-up I have used a very nice short proof supplied by Luke Pebody. There is no need to repeat it here, since there isn’t much to say that will make it any easier to understand than it already is. I will, however, mention once again an example that illustrates quite well what this statement does and doesn’t say. The example is of a tournament (that is, complete graph where every edge is given a direction) where every vertex beats half the other vertices (meaning that half the edges at the vertex go in and half go out) but the tournament does not look at all random. One just takes an odd integer n and, for each x, puts an arrow from x to x+y mod n for every y\in\{1,2,\dots,(n-1)/2\}, and an arrow from x+y mod n to x for every y\in\{(n+1)/2,\dots,n-1\}. It is not hard to check that the probability that there is an arrow from x to z given that there are arrows from x to y and y to z is approximately 1/2, and this turns out to be a general phenomenon.
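The cyclic example can be checked by direct counting. The sketch below (the function name is hypothetical) fixes x=0, which loses nothing by symmetry, and counts the pairs (y,z).

```python
def cyclic_tournament_check(n):
    """Tournament on Z/n (n odd) with an arrow x -> x+y for y in 1..(n-1)/2.

    Returns the out-degree of vertex 0 and the fraction of directed
    paths 0 -> y -> z for which there is also an arrow 0 -> z.
    """
    half = (n - 1) // 2

    def arrow(x, z):
        return 1 <= (z - x) % n <= half

    out_degree = sum(arrow(0, y) for y in range(n))
    paths = transitive = 0
    for y in range(n):
        for z in range(n):
            if arrow(0, y) and arrow(y, z):
                paths += 1
                transitive += arrow(0, z)
    return out_degree, transitive / paths

print(cyclic_tournament_check(101))
```

Writing h=(n-1)/2, there are h^2 paths and h(h-1)/2 of them are transitive, so the fraction is (h-1)/(2h), which is 0.49 when n=101.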

So how do we prove that almost all n-sided dice beat approximately half the other n-sided dice?

The first step is to recast the problem as one about sums of independent random variables. Let [n] stand for \{1,2,\dots,n\} as usual. Given a sequence A=(a_1,\dots,a_n)\in[n]^n we define a function f_A on [n] by setting f_A(j) to be the number of i such that a_i<j plus half the number of i such that a_i=j (so f_A takes values in \{0,\frac 12,1,\dots,n\}). We also define g_A(j) to be f_A(j)-(j-1/2). It is not hard to verify that A beats B if \sum_jg_A(b_j)<0, ties with B if \sum_jg_A(b_j)=0, and loses to B if \sum_jg_A(b_j)>0.
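The verification is short: \sum_jf_A(b_j) counts the pairs with a_i<b_j plus half the pairs with a_i=b_j, and \sum_j(b_j-1/2)=n^2/2 when B has sum n(n+1)/2. As a sanity check, the following sketch (function names are illustrative) compares the sign criterion with the direct definition for every pair of 4-sided dice, using exact rational arithmetic.

```python
from fractions import Fraction
from itertools import product

def g(a, j):
    """g_A(j) = f_A(j) - (j - 1/2), computed exactly."""
    below = sum(1 for x in a if x < j)
    equal = sum(1 for x in a if x == j)
    return below + Fraction(equal, 2) - (j - Fraction(1, 2))

def compare(a, b):
    """Direct definition: 1 if a beats b, -1 if b beats a, 0 for a tie."""
    wins = sum(1 for x in a for y in b if x > y)
    losses = sum(1 for x in a for y in b if x < y)
    return (wins > losses) - (wins < losses)

def compare_via_g(a, b):
    """Same outcome, read off from the sign of sum_j g_A(b_j)."""
    s = sum(g(a, y) for y in b)
    return (s < 0) - (s > 0)

n = 4
dice = [d for d in product(range(1, n + 1), repeat=n)
        if sum(d) == n * (n + 1) // 2]
assert all(compare(a, b) == compare_via_g(a, b) for a in dice for b in dice)
```

Only the sum of B is used in the argument, so the criterion remains valid when A is an arbitrary element of [n]^n.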

So our question now becomes the following. Suppose we choose a random sequence (b_1,\dots,b_n) with the property that \sum_jb_j=n(n+1)/2. What is the probability that \sum_jg_A(b_j)>0? (Of course, the answer depends on A, and most of the work of the proof comes in showing that a “typical” A has properties that ensure that the probability is about 1/2.)

It is convenient to rephrase the problem slightly, replacing b_j by b_j-(n+1)/2. We can then ask it as follows. Suppose we choose a sequence (V_1,\dots,V_n) of n elements of the set \{-(n-1)/2,-(n-1)/2+1,\dots,(n-1)/2\}, where the terms of the sequence are independent and uniformly distributed. For each j let U_j=g_A(V_j+(n+1)/2), that is, g_A evaluated at the corresponding b_j. What is the probability that \sum_jU_j>0 given that \sum_jV_j=0?

This is a question about the distribution of \sum_j(U_j,V_j), where the (U_j,V_j) are i.i.d. random variables taking values in \mathbb Z^2 (at least if n is odd — a small modification is needed if n is even). Everything we know about probability would lead us to expect that this distribution is approximately Gaussian, and since it has mean (0,0), it ought to be the case that if we sum up the probabilities that \sum_j(U_j,V_j)=(x,0) over positive x, we should get roughly the same as if we sum them up over negative x. Also, it is highly plausible that the probability of getting (0,0) will be a lot smaller than either of these two sums.
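For small odd n the reformulation can be checked exactly. The sketch below (all names are illustrative) builds the law of \sum_j(U_j,V_j) as an n-step walk by repeated convolution, working with 2U_j so that everything stays an integer, and compares the resulting conditional probability with a brute-force enumeration of the balanced sequences.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def g2(a, j):
    """Twice g_A(j), which is always an integer."""
    below = sum(1 for x in a if x < j)
    equal = sum(1 for x in a if x == j)
    return 2 * below + equal - (2 * j - 1)

def conditional_law(a):
    """Counts of sum_j 2*U_j over the sequences with sum_j V_j = 0."""
    n = len(a)
    assert n % 2 == 1, "this sketch covers odd n only"
    half = (n - 1) // 2
    steps = [(g2(a, v + half + 1), v) for v in range(-half, half + 1)]
    walk = Counter({(0, 0): 1})
    for _ in range(n):
        nxt = Counter()
        for (u, v), ways in walk.items():
            for du, dv in steps:
                nxt[(u + du, v + dv)] += ways
        walk = nxt
    return Counter({u: ways for (u, v), ways in walk.items() if v == 0})

def prob_positive(a):
    """P[sum_j U_j > 0 | sum_j V_j = 0] as an exact fraction."""
    law = conditional_law(a)
    return Fraction(sum(w for u, w in law.items() if u > 0), sum(law.values()))

def brute_force_prob(a):
    """The same probability by enumerating every balanced sequence b."""
    n = len(a)
    target = n * (n + 1) // 2
    balanced = [b for b in product(range(1, n + 1), repeat=n) if sum(b) == target]
    positive = sum(1 for b in balanced if sum(g2(a, y) for y in b) > 0)
    return Fraction(positive, len(balanced))
```

For example, `prob_positive([1, 3, 3, 3, 5])` and `brute_force_prob([1, 3, 3, 3, 5])` return the same fraction, which is the proportion of 5-sided dice that beat (1,3,3,3,5).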

So there we have a heuristic argument for why the second theorem, and hence the first, ought to be true.

There are several theorems in the literature that initially seemed as though they should be helpful. And indeed they were helpful, but we were unable to apply them directly, and had instead to develop our own modifications of their proofs.

The obvious theorem to mention is the central limit theorem. But this is not strong enough for two reasons. The first is that it tells you about the probability that a sum of random variables will lie in some rectangular region of \mathbb R^2 of size comparable to the standard deviation. It will not tell you the probability of belonging to some subset of the y-axis (even for discrete random variables). Another problem is that the central limit theorem on its own does not give information about the rate of convergence to a Gaussian, whereas here we require one.

The second problem is dealt with for many applications by the Berry-Esseen theorem, but not the first.

The first problem is dealt with for many applications by local central limit theorems, about which Terence Tao has blogged in the past. These tell you not just about the probability of landing in a region, but about the probability of actually equalling some given value, with estimates that are precise enough to give, in many situations, the kind of information that we seek here.

What we did not find, however, was precisely the theorem we were looking for: a statement that would be local and 2-dimensional and would give information about the rate of convergence that was sufficiently strong that we would be able to obtain good enough convergence after only n steps. (I use the word “step” here because we can think of a sum of n independent copies of a 2D random variable as an n-step random walk.) It was not even clear in advance what such a theorem should say, since we did not know what properties we would be able to prove about the random variables (U_i,V_i) when A was “typical”. That is, we knew that not every A worked, so the structure of the proof (probably) had to be as follows.

1. Prove that A has certain properties with probability 1-o(1).

2. Using these properties, deduce that the sum \sum_{i=1}^n(U_i,V_i) converges very well after n steps to a Gaussian.

3. Conclude that the heuristic argument is indeed correct.

The key properties that A needed to have were the following two. First, there needed to be a bound on the higher moments of U. This we achieved in a slightly wasteful way — but the cost was a log factor that we could afford — by arguing that with high probability no value of g_A(j) has magnitude greater than 6\sqrt{n\log n}. To prove this the steps were as follows.

  1. Let A be a random element of [n]^n. Then the probability that there exists j with |g_A(j)|\geq 6\sqrt{n\log n} is at most n^{-k} (for some k such as 10).
  2. The probability that \sum_ia_i=n(n+1)/2 is at least cn^{-3/2} for some absolute constant c>0.
  3. It follows that if A is a random n-sided die, then with probability 1-o(1) we have |g_A(j)|\leq 6\sqrt{n\log n} for every j.

The proofs of the first two statements are standard probabilistic estimates about sums of independent random variables.
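One standard route to the first statement, assumed here for illustration since the write-up may argue differently, is Hoeffding's inequality. For a uniform A in [n]^n, f_A(j) is a sum of n independent terms in [0,1] with mean j-1/2, so the probability that |g_A(j)|\geq t is at most 2\exp(-2t^2/n). With t=6\sqrt{n\log n} this is 2n^{-72}, and a union bound over j leaves 2n^{-71}. The sketch below (names are illustrative) simply computes \max_j|g_A(j)| for one random sequence and compares it with the threshold.

```python
import math
import random

def g_values(a):
    """Return [g_A(1), ..., g_A(n)] in O(n) time using value counts."""
    n = len(a)
    counts = [0] * (n + 2)
    for x in a:
        counts[x] += 1
    values, below = [], 0
    for j in range(1, n + 1):
        # f_A(j) = #{a_i < j} + (1/2) #{a_i = j}
        values.append(below + counts[j] / 2 - (j - 0.5))
        below += counts[j]
    return values

def max_abs_g(n, rng):
    """max_j |g_A(j)| for a uniform random A in [n]^n (sum not conditioned)."""
    a = [rng.randint(1, n) for _ in range(n)]
    return max(abs(v) for v in g_values(a))

n = 2000
print(max_abs_g(n, random.Random(1)), 6 * math.sqrt(n * math.log(n)))
```

Note that 6\sqrt{n\log n} exceeds n until n is in the low hundreds, so the bound only carries information for larger n.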

The second property that A needed to have is more difficult to obtain. There is a standard Fourier-analytic approach to proving central limit theorems, and in order to get good convergence it turns out that what one wants is for a certain Fourier transform to be sufficiently well bounded away from 1. More precisely, we define the characteristic function of the random variable (U,V) to be

\hat f(\alpha,\beta)=\mathbb E e(\alpha U+\beta V)=\sum_{x,y}f(x,y)e(\alpha x+\beta y),

where e(x) is shorthand for \exp(2\pi ix), f(x,y)=\mathbb P[(U,V)=(x,y)], and \alpha and \beta range over \mathbb T=\mathbb R/\mathbb Z.

I’ll come later to why it is good for \hat f(\alpha,\beta) not to be too close to 1. But for now I want to concentrate on how one proves a statement like this, since that is perhaps the least standard part of the argument.

To get an idea, let us first think what it would take for |\hat f(\alpha,\beta)| to be very close to 1. This condition basically tells us that \alpha U+\beta V is highly concentrated mod 1: indeed, if \alpha U+\beta V is highly concentrated, then e(\alpha U+\beta V) takes approximately the same value almost all the time, so the average is roughly equal to that value, which has modulus 1; conversely, if \alpha U+\beta V is not highly concentrated mod 1, then there is plenty of cancellation between the different values of e(\alpha U+\beta V) and the result is that the average has modulus appreciably smaller than 1.

So the task is to prove that the values of \alpha U+\beta V are reasonably well spread about mod 1. Note that this is saying that the values of \alpha g_A(j)+\beta(j-(n+1)/2) are reasonably spread about.

The way we prove this is roughly as follows. Let \alpha>0, let m be of order of magnitude \alpha^{-2}, and consider the values of g_A at the four points j, j+m, j+2m and j+3m. Then a typical order of magnitude of g_A(j)-g_A(j+m) is around \sqrt m, and one can prove without too much trouble (here the Berry-Esseen theorem was helpful to keep the proof short) that the probability that

|g_A(j)-g_A(j+m)-g_A(j+2m)+g_A(j+3m)|\geq c\sqrt m

is at least c, for some positive absolute constant c. It follows by Markov’s inequality that with positive probability one has the above inequality for many values of j.

That’s not quite good enough, since we want a probability that’s very close to 1. This we obtain by chopping up [n] into intervals of length 4m and applying the above argument in each interval. (While writing this I’m coming to think that I could just as easily have gone for progressions of length 3, not that it matters much.) Then in each interval there is a reasonable probability of getting the above inequality to hold many times, from which one can prove that with very high probability it holds many times.

But since m is of order \alpha^{-2}, \alpha\sqrt m is of order 1, which gives that the four values e(\alpha g_A(j')+\beta(j'-(n+1)/2)), for j'\in\{j,j+m,j+2m,j+3m\}, are far from constant whenever the above inequality holds (the terms involving \beta cancel in the alternating combination, since they are linear in j'). So by averaging we end up with a good upper bound for |\hat f(\alpha,\beta)|.
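The point of taking an alternating combination of four equally spaced points is that anything linear in j cancels, so \beta plays no role. Here is a small exact check of that cancellation for the phase \alpha g_A(j)+\beta(j-(n+1)/2); the function names and the sample sequence are illustrative only.

```python
from fractions import Fraction

def g(a, j):
    """g_A(j) = f_A(j) - (j - 1/2), computed exactly."""
    below = sum(1 for x in a if x < j)
    equal = sum(1 for x in a if x == j)
    return below + Fraction(equal, 2) - (j - Fraction(1, 2))

def phase(a, alpha, beta, j):
    """alpha * g_A(j) + beta * (j - (n+1)/2)."""
    n = len(a)
    return alpha * g(a, j) + beta * (j - Fraction(n + 1, 2))

def alternating(h, j, m):
    """h(j) - h(j+m) - h(j+2m) + h(j+3m) for a function h."""
    return h(j) - h(j + m) - h(j + 2 * m) + h(j + 3 * m)

a = [2, 2, 5, 7, 7, 7, 9, 10, 10, 1]
alpha, beta = Fraction(1, 7), Fraction(3, 11)
lhs = alternating(lambda j: phase(a, alpha, beta, j), 1, 3)
rhs = alpha * alternating(lambda j: g(a, j), 1, 3)
assert lhs == rhs  # the beta terms have cancelled
```

If all four phases were within \delta of a common value mod 1, the alternating combination would be within 4\delta of an integer, so a combination that is far from every integer forces the four values of e(\cdot) apart.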

The alert reader will have noticed that if \alpha\ll n^{-1/2}, then the above argument doesn’t work, because we can’t choose m to be bigger than n. In that case, however, we just do the best we can: we choose m to be of order n/\log n, the logarithmic factor being there because we need to operate in many different intervals in order to get the probability to be high. We will get many quadruples where

\alpha|g_A(j)-g_A(j+m)-g_A(j+2m)+g_A(j+3m)|\geq c\alpha\sqrt m=c'\alpha\sqrt{n/\log n},

and this translates into a lower bound for 1-|\hat f(\alpha,\beta)| of order \alpha^2n/\log n, basically because 1-\cos\theta has order \theta^2 for small \theta. This is a good bound for us as long as we can use it to prove that |\hat f(\alpha,\beta)|^n is bounded above by a large negative power of n. For that we need \alpha^2n/\log n to be at least C\log n/n (since (1-C\log n/n)^n is about \exp(-C\log n)=n^{-C}), so we are in good shape provided that \alpha\gg\log n/n.
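The arithmetic in the last step is easy to check numerically. The sketch below (function names and the sample values of n and C are arbitrary choices for illustration) compares (1-C\log n/n)^n with n^{-C} and computes the value of \alpha at which \alpha^2n/\log n equals C\log n/n.

```python
import math

def power_ratio(n, C):
    """(1 - C log n / n)^n divided by n^(-C), computed via logarithms."""
    x = C * math.log(n) / n
    return math.exp(n * math.log1p(-x) + C * math.log(n))

def alpha_threshold(n, C):
    """The alpha > 0 at which alpha^2 * n / log n equals C * log n / n."""
    return math.sqrt(C) * math.log(n) / n

n, C = 10**6, 3
print(power_ratio(n, C))      # slightly below 1
print(alpha_threshold(n, C))  # sqrt(C) * log n / n
```

The ratio is always a little below 1, because \log(1-x)<-x, so n^{-C} is in fact an upper bound for (1-C\log n/n)^n.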

The alert reader will also have noticed that the probabilities for different intervals are not independent: for example, if some f_A(j) is equal to n, then beyond that g_A(j) depends linearly on j. However, except when j is very large, this is extremely unlikely, and it is basically the only thing that can go wrong. To make this rigorous we formulated a concentration inequality that states, roughly speaking, that if you have a bunch of k events, and almost always (that is, always, unless some very unlikely event occurs) the probability that the ith event holds given that all the previous events hold is at least c, then the probability that fewer than ck/2 of the events hold is exponentially small in k. The proof of the concentration inequality is a standard exponential-moment argument, with a small extra step to show that the low-probability events don’t mess things up too much.

Incidentally, the idea of splitting up the interval in this way came from an answer by Serguei Popov to a Mathoverflow question I asked, when I got slightly stuck trying to prove a lower bound for the second moment of U. I eventually didn’t use that bound, but the interval-splitting idea helped for the bound for the Fourier coefficient as well.

So in this way we prove that |\hat f(\alpha,\beta)|^n is very small if |\alpha|\gg\log n/n. A simpler argument of a similar flavour shows that |\hat f(\alpha,\beta)|^n is also very small if |\alpha| is smaller than this and |\beta|\gg n^{-3/2}.

Now let us return to the question of why we might like |\hat f(\alpha,\beta)|^n to be small. The answer comes from the inversion and convolution formulae in Fourier analysis. The convolution formula tells us that the characteristic function of the sum of the (U_i,V_i) (which are independent and each have characteristic function \hat f) is (\hat f)^n. And then the inversion formula tells us that

\mathbb P[(\sum_iU_i,\sum_iV_i)=(x,y)]=\int_{(\alpha,\beta)\in\mathbb T^2}\hat f(\alpha,\beta)^ne(-\alpha x-\beta y)\mathop{d\alpha}\mathop{d\beta}

What we have proved can be used to show that the contribution to the integral on the right-hand side from those pairs (\alpha,\beta) that lie outside a small rectangle (of width n^{-1} in the \alpha direction and n^{-3/2} in the \beta direction, up to log factors) is negligible.
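The convolution and inversion formulae can be checked exactly on a small example. Because (2U,V) takes values in a bounded part of \mathbb Z^2 when n is odd, the integral over \mathbb T^2 can be replaced by an average over a fine enough grid without any error. That discretisation, the doubling of U (which keeps the values integral), and all the function names are choices made for this sketch.

```python
import cmath
import math
from collections import Counter

def steps_for(a):
    """Pairs (2*g_A(j), j - (n+1)/2) for j = 1..n (n odd), each with probability 1/n."""
    n = len(a)
    assert n % 2 == 1, "this sketch covers odd n only"
    out = []
    for j in range(1, n + 1):
        below = sum(1 for x in a if x < j)
        equal = sum(1 for x in a if x == j)
        out.append((2 * below + equal - (2 * j - 1), j - (n + 1) // 2))
    return out

def e(t):
    """e(t) = exp(2*pi*i*t)."""
    return cmath.exp(complex(0.0, 2 * math.pi * t))

def char_fn(steps, alpha, beta):
    """Characteristic function of a single step (2U, V)."""
    return sum(e(alpha * u + beta * v) for u, v in steps) / len(steps)

def prob_by_inversion(steps, x, y):
    """P[(sum 2U_i, sum V_i) = (x, y)] from the n-th power of char_fn.

    Exact (up to rounding) for |x| <= n*max|2U| and |y| <= n*max|V|.
    """
    n = len(steps)
    nu = 2 * n * max(abs(u) for u, _ in steps) + 1
    nv = 2 * n * max(abs(v) for _, v in steps) + 1
    total = 0
    for ku in range(nu):
        for kv in range(nv):
            alpha, beta = ku / nu, kv / nv
            total += char_fn(steps, alpha, beta) ** n * e(-alpha * x - beta * y)
    return (total / (nu * nv)).real

def prob_by_convolution(steps, x, y):
    """The same probability by counting n-step walks directly."""
    n = len(steps)
    walk = Counter({(0, 0): 1})
    for _ in range(n):
        nxt = Counter()
        for (u, v), ways in walk.items():
            for du, dv in steps:
                nxt[(u + du, v + dv)] += ways
        walk = nxt
    return walk[(x, y)] / n ** n

steps = steps_for([1, 3, 3, 3, 5])
print(prob_by_inversion(steps, 0, 0), prob_by_convolution(steps, 0, 0))
```

The two printed numbers agree up to rounding error, which is the inversion formula in a case where it can be evaluated with no analytic estimates at all.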

All the above is true provided the random n-sided die A satisfies two properties (the bound on \|U\|_\infty and the bound on |\hat f(\alpha,\beta)|), which it does with probability 1-o(1).

We now take a die A with these properties and turn our attention to what happens inside this box. First, it is a standard fact about characteristic functions that their derivatives tell us about moments. Indeed,

\frac{\partial^{r+s}}{\partial\alpha^r\partial\beta^s}\mathbb E e(\alpha U+\beta V)=(2\pi i)^{r+s}\mathbb E U^rV^s e(\alpha U+\beta V),

and when \alpha=\beta=0 this is (2\pi i)^{r+s}\mathbb E U^rV^s. Since \mathbb EU=\mathbb EV=0, the first-order terms vanish, and it therefore follows from the two-dimensional version of Taylor’s theorem that

\hat f(\alpha,\beta)=1-2\pi^2(\alpha^2\mathbb EU^2+2\alpha\beta\mathbb EUV+\beta^2\mathbb EV^2)

plus a remainder term R(\alpha,\beta) that can be bounded above by a constant times (|\alpha|\|U\|_\infty+|\beta|\|V\|_\infty)^3.
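The expansion can be tested numerically for a particular die. The explicit constant 4\pi^3/3 used below is one admissible choice, assumed for this sketch: it comes from the inequality |e^{i\theta}-1-i\theta+\theta^2/2|\leq|\theta|^3/6 applied with \theta=2\pi(\alpha U+\beta V), together with \mathbb EU=\mathbb EV=0, which holds when A has sum n(n+1)/2. The function names are illustrative.

```python
import cmath
import math

def uv_pairs(a):
    """(U, V) = (g_A(j), j - (n+1)/2) for j = 1..n, each with probability 1/n."""
    n = len(a)
    pairs = []
    for j in range(1, n + 1):
        below = sum(1 for x in a if x < j)
        equal = sum(1 for x in a if x == j)
        pairs.append((below + equal / 2 - (j - 0.5), j - (n + 1) / 2))
    return pairs

def char_fn(pairs, alpha, beta):
    """E e(alpha U + beta V)."""
    return sum(cmath.exp(complex(0.0, 2 * math.pi * (alpha * u + beta * v)))
               for u, v in pairs) / len(pairs)

def quadratic_form(pairs, alpha, beta):
    """Q(alpha, beta) = 2 pi^2 E (alpha U + beta V)^2."""
    return 2 * math.pi ** 2 * sum((alpha * u + beta * v) ** 2
                                  for u, v in pairs) / len(pairs)

def remainder_bound(pairs, alpha, beta):
    """(4 pi^3 / 3) (|alpha| max|U| + |beta| max|V|)^3."""
    u_max = max(abs(u) for u, _ in pairs)
    v_max = max(abs(v) for _, v in pairs)
    return 4 * math.pi ** 3 / 3 * (abs(alpha) * u_max + abs(beta) * v_max) ** 3

pairs = uv_pairs([1, 1, 4, 4, 5, 6])  # a 6-sided die with sum 21
alpha, beta = 0.03, -0.01
error = abs(char_fn(pairs, alpha, beta) - (1 - quadratic_form(pairs, alpha, beta)))
print(error, remainder_bound(pairs, alpha, beta))
```

The first printed number is the actual size of the remainder and the second is the cubic bound, so the first should never exceed the second.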

Writing Q(\alpha,\beta) for 2\pi^2(\alpha^2\mathbb EU^2+2\alpha\beta\mathbb EUV+\beta^2\mathbb EV^2) we have that Q is a positive semidefinite quadratic form in \alpha and \beta. (In fact, it turns out to be positive definite.) Provided R(\alpha,\beta) is small enough, replacing it by zero does not have much effect on \hat f(\alpha,\beta)^n, and provided nQ(\alpha,\beta)^2 is small enough, (1-Q(\alpha,\beta))^n is well approximated by \exp(-nQ(\alpha,\beta)).

It turns out, crucially, that the approximations just described are valid in a box that is much bigger than the box inside which \hat f(\alpha,\beta)^n has a chance of not being small. That implies that the Gaussian decays quickly (and is why we know that Q is positive definite).

There is a bit of back-of-envelope calculation needed to check this, but the upshot is that the probability that (\sum_iU_i,\sum_iV_i)=(x,y) is very well approximated, at least when x and y aren’t too big, by a formula of the form

G(x,y)=\int\exp(-nQ(\alpha,\beta))e(-\alpha x-\beta y)\mathop{d\alpha}\mathop{d\beta}.

But this is the formula for the Fourier transform of a Gaussian (at least if we let (\alpha,\beta) range over \mathbb R^2, which makes very little difference to the integral because the Gaussian decays so quickly), so it is the restriction to \mathbb Z^2 of a Gaussian, just as we wanted.

When we sum over infinitely many values of x and y, uniform estimates are not good enough, but we can deal with that very directly by using simple measure concentration estimates to prove that the probability that (\sum_iU_i,\sum_iV_i)=(x,y) is very small outside a not too large box.

That completes the sketch of the main ideas that go into showing that the heuristic argument is indeed correct.

Any comments about the current draft would be very welcome, and if anyone feels like working on it directly rather than through me, that is certainly a possibility — just let me know. I will try to post soon on the following questions, since it would be very nice to be able to add answers to them.

1. Is the more general quasirandomness conjecture false, as the experimental evidence suggests? (It is equivalent to the statement that if A and B are two random n-sided dice, then with probability 1-o(1), the four possibilities for whether another die beats A and whether it beats B each have probability \frac 14+o(1).)

2. What happens in the multiset model? Can the above method of proof be adapted to this case?

3. The experimental evidence suggests that transitivity almost always occurs if we pick purely random sequences from [n]^n. Can we prove this rigorously? (I think I basically have a proof of this, by showing that whether or not A beats B almost always depends on whether A has a bigger sum than B. I’ll try to find time reasonably soon to add this to the draft.)

Of course, other suggestions for follow-up questions will be very welcome, as will ideas about the first two questions above.