A toy model for the proof of Shannon’s Noisy Coding Theorem for the BSC

Wildon's Weblog 2019-09-06

In the Binary Symmetric Channel (BSC) with symbol error probability p, an input bit flips with probability p < 1/2. A binary code of length n can be used to communicate on the BSC by sending each bit of a codeword u, collecting the n received bits as a word v still of length n, and then decoding v as the nearest codeword u' to v with respect to Hamming distance. For instance, using the repetition code of length n = 2e+1, a decoding error occurs if and only if e+1 or more bits get flipped in the channel. This event has probability

\sum_{k=e+1}^n \binom{n}{k} p^k(1-p)^{n-k} \le p^{e+1}(1-p)^e \sum_{k=e+1}^n \binom{n}{k} \le p^e (1-p)^e 2^{n-1}  \le \rho^e

where \rho = 4p(1-p) < 1. Hence we can communicate with decoding error probability < \epsilon provided that n \ge 1 + 2 \log \epsilon / \log \rho. For example, if p = 1/4 then \rho = 3/4 and \rho^e < 5\% provided n \ge 23. With this solution we send 23 bits through the channel for each bit of the message. The information rate is therefore 1/23. Can we do better?
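These numbers are easy to check computationally. Here is a minimal Python sketch, taking p = 1/4 and a 5% target as in the example above, which compares the exact decoding error probability of the repetition code with the cruder bound \rho^e; the function name repetition_error_prob and the loop finding e are mine, not part of the original argument.

from math import comb

def repetition_error_prob(e, p):
    # Exact probability of a decoding error for the repetition code of
    # length n = 2e + 1: at least e + 1 of the n bits are flipped.
    n = 2 * e + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(e + 1, n + 1))

p = 0.25
rho = 4 * p * (1 - p)      # rho = 3/4 when p = 1/4
eps = 0.05

# Smallest e with rho^e < eps; then n = 2e + 1 satisfies n >= 1 + 2 log(eps)/log(rho).
e = 0
while rho**e >= eps:
    e += 1

print(e, 2 * e + 1)                  # 11, 23
print(rho**e)                        # the bound: about 0.042
print(repetition_error_prob(e, p))   # the exact error probability is far smaller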

Shannon’s Noisy Coding Theorem

The capacity of the BSC is 1-h(p), where

h(p) = -p \log_2 p -(1-p) \log_2(1-p)

is Shannon’s entropy function. Capacity has a slightly technical definition in terms of mutual information. But by Shannon’s Noisy Coding Theorem it has the following interpretation: given any \epsilon > 0 and any \eta > 0, for all sufficiently large n there exists a binary code C of size > 2^{n (1- h(p) - \eta)} such that the probability of a decoding error when C is used to communicate on the BSC is < \epsilon. For example, if p = 1/4 then

0.188 < 1-h(p) \approx 0.189.

Therefore Shannon's Noisy Coding Theorem promises that, by taking n sufficiently large, we can communicate with negligible decoding error probability and information rate 0.188.
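For concreteness, the capacity for p = 1/4 can be computed in a couple of lines of Python; the function h below is just the entropy function defined above.

from math import log2

def h(p):
    # Shannon's entropy function h(p) = -p log2(p) - (1-p) log2(1-p)
    return -p * log2(p) - (1 - p) * log2(1 - p)

print(1 - h(0.25))   # capacity of the BSC with p = 1/4: about 0.1887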

As far as I know, the only proofs of Shannon's Noisy Coding Theorem use the probabilistic method.

Outline proof

Let C = \{u(1),\ldots, u(N)\} be a binary code of length n and size about 2^{n (1- h(p) - \eta/2)}, obtained by choosing each codeword u(i) uniformly and independently at random from \mathbb{F}_2^n. Let p_{i}(C) be the probability (in the BSC model) that if u(i) is sent then the received word is wrongly decoded as some u(j) with j\not=i. An expectation argument shows that, provided n is sufficiently large,

\mathbb{E}_{\mathrm{code}}[p_{i}(C)] < \epsilon/2.

Note that here the expectation is taken with respect to the random choice of C and p_{i} is a random variable (that just happens to be a probability in the BSC model). We'll see shortly why \epsilon/2 is the right upper bound to take. It follows by averaging over all codewords that

\frac{1}{N}\sum_{i=1}^N\mathbb{E}_\mathrm{code}[p_i(C)] < \epsilon/2.

By the probabilistic method, there exists a particular code C such that

\frac{1}{N}\sum_{i=1}^N p_i(C) < \epsilon/2.

Does this mean that we can use C to communicate with decoding error probability < \epsilon/2? Unfortunately not, because the previous equation only says that on average, we meet this bound. There could be some codewords u(i), perhaps those unusually near to many other codewords, for which the decoding error probability is higher. But since the average is \epsilon/2, at most half the codewords have p_i(C) \ge \epsilon. Therefore by removing at most half the codewords, we get a code C' of size N' \ge N/2 with decoding error probability < \epsilon. It might seem bad that, at worst, |C'| = |C|/2 = N/2, but since N = 2^{(1-h(p)-\eta/2)n}, we have

\frac{N}{2} = 2^{(1-h(p) -\eta/2 - 1/n)n}

and provided n is sufficiently large, \eta/2 + 1/n < \eta, so N/2 > 2^{(1-h(p)-\eta)n}. This randomly constructed C' proves Shannon's Noisy Coding Theorem for the parameters \epsilon and \eta.
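Here is a minimal Monte Carlo sketch of the two steps of the argument: choose a random code, estimate the decoding error probabilities p_i(C) in the channel model, and then expurgate the worse half of the codewords. The parameters n = 20 and N = 16 are toy choices of mine (far too small for the asymptotics the theorem needs), and all the function names are mine; the sketch is meant only to make the structure concrete, not to exhibit a capacity-achieving code.

import random

def random_code(N, n):
    # Choose N codewords uniformly and independently from F_2^n.
    return [tuple(random.randint(0, 1) for _ in range(n)) for _ in range(N)]

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

def decode(v, code):
    # Nearest-codeword decoding; ties are broken by smallest index.
    return min(range(len(code)), key=lambda i: hamming(v, code[i]))

def error_prob(i, code, p, trials=2000):
    # Monte Carlo estimate of p_i(C): send u(i) through the BSC and count
    # how often the received word is decoded as some u(j) with j != i.
    u = code[i]
    errors = 0
    for _ in range(trials):
        v = tuple(b ^ (random.random() < p) for b in u)
        if decode(v, code) != i:
            errors += 1
    return errors / trials

random.seed(0)
p, n, N = 0.25, 20, 16            # toy parameters: the theorem needs n large
C = random_code(N, n)
probs = [error_prob(i, C, p) for i in range(N)]
print(sum(probs) / N)             # average decoding error probability over C

# Expurgation: keep the better half of the codewords.  Removing codewords
# can only decrease each survivor's decoding error probability.
keep = sorted(range(N), key=lambda i: probs[i])[: N // 2]
C_good = [C[i] for i in keep]
print(max(probs[i] for i in keep))   # worst estimated error probability among survivors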

Remarks

  1. The final expurgation step, and the need to work with \epsilon/2 rather than \epsilon, can be avoided by choosing a random linear code. Then the multiset of Hamming distances \{ d(u(i),u(j)) : j\not=i \}, which determines the probability p_i(C) of a decoding error on sending u(i), is the same for all i. Therefore all the p_i(C) are equal. Some linear codes will be uniformly terrible, but by the probabilistic method, some linear codes are uniformly good.
  2. Care is needed to turn the outline above into a complete proof. For example, in the first step bounding the p_i(C), one needs Chernoff's bound or the Central Limit Theorem to justify that when a codeword u is sent, the received word v is, with probability tending to 1 as n \to \infty, at distance between (p-\delta)n and (p+\delta)n from u, for any fixed \delta > 0. (There is a numerical illustration of this concentration after this list.)
  3. One subtle feature of the proof is that there are two different probability models in play. I tried to emphasise this above, for example by using \epsilon for the error bounds in the channel probability space, and \eta for the error bounds in the code probability space. A related feature is that we have to work with expectations, rather than probabilities, in the code’s probability space. For instance, it would be nice to have

    \mathbb{P}_\mathrm{code}\bigl[\mathbb{P}_\mathrm{channel}[A_i] > \epsilon\bigr] < \eta

    for sufficiently large n, where A_i is the event that a sent u(i) is wrongly decoded as some u(j) with j\not= i. But these probabilities seem hard to reason about directly.
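To illustrate the concentration mentioned in the second remark, here is a short Python sketch that computes the exact binomial probability that the number of flipped bits lies between (p-\delta)n and (p+\delta)n, rather than appealing to Chernoff's bound; the parameters p = 1/4 and \delta = 0.05 are arbitrary choices of mine.

from math import comb, log, exp, ceil, floor

def prob_flips_in_band(n, p, delta):
    # P[(p - delta)n <= X <= (p + delta)n] where X ~ Bin(n, p) counts flipped bits.
    lo, hi = ceil((p - delta) * n), floor((p + delta) * n)
    total = 0.0
    for k in range(lo, hi + 1):
        # work in log space to avoid floating-point underflow for large n
        total += exp(log(comb(n, k)) + k * log(p) + (n - k) * log(1 - p))
    return total

p, delta = 0.25, 0.05
for n in (100, 1000, 10000):
    print(n, prob_flips_in_band(n, p, delta))   # tends to 1 as n grows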

A toy model

Here is a toy model that has the same split into two probability spaces, but is, I think, much easier to understand. To play Match! using a supplied deck of N cards, each bearing a number between 1 and n, the two players each draw a card from the deck. If they find they have matching numbers, they both eat some cake. Let M be this ‘in game’ event. Otherwise they go to bed hungry.

We suppose that the manufacturer constructs decks by choosing each card uniformly and independently at random. Let f_r be the frequency of card r in the deck, for each r \in \{1,\ldots, n\}. We have

\mathbb{P}_\mathrm{in game}[M] = \sum_{r=1}^n \frac{f_r(f_r-1)}{N(N-1)}.
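This in-game probability is easy to compute for any given deck. In the following sketch a deck is represented as a list of card numbers; the function name match_probability is mine.

from collections import Counter

def match_probability(deck):
    # In-game probability of M: the two players draw distinct cards from
    # the deck (without replacement) and the numbers on them agree.
    N = len(deck)
    counts = Counter(deck)                  # counts[r] = f_r
    return sum(f * (f - 1) for f in counts.values()) / (N * (N - 1))

print(match_probability([1, 1, 2, 3]))      # f_1 = 2, so 2/(4*3) = 1/6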

In the random process used by the manufacturer to construct decks, f_r has the binomial distribution B(N, 1/n). Hence \mathbb{E}_\mathrm{deck}[f_r] = N/n and

\mathbb{E}_\mathrm{deck}[f_r^2] = \mathrm{Var}_\mathrm{deck}[f_r] + \mathbb{E}_\mathrm{deck}[f_r]^2 = \frac{N(n-1)}{n^2} + \frac{N^2}{n^2}.

It follows that

\begin{aligned}\mathbb{E}_\mathrm{deck}\bigl[\mathbb{P}_\mathrm{in game}[M]\bigr] &= \frac{1}{N(N-1)}\sum_{r=1}^n \bigl( \frac{N(n-1)}{n^2} + \frac{N^2}{n^2} - \frac{N}{n} \bigr) \\ &= \frac{n}{N(N-1)n^2} \bigl( Nn- N + N^2 - Nn \bigr) \\ &= \frac{1}{n}. \end{aligned}

This can also be seen without calculation: mixing up the two probability models, we can imagine that the manufacturer builds a two card deck, and issues the cards to the players. The probability the two cards are equal is clearly 1/n.
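As a sanity check on the 1/n answer, here is a small simulation over random decks; the parameters N = 10, n = 6 and the number of trials are arbitrary choices of mine.

import random
from collections import Counter

def random_deck(N, n):
    # Manufacturer's model: each of the N cards is chosen uniformly
    # and independently from {1, ..., n}.
    return [random.randint(1, n) for _ in range(N)]

def match_probability(deck):
    N = len(deck)
    counts = Counter(deck)
    return sum(f * (f - 1) for f in counts.values()) / (N * (N - 1))

random.seed(1)
N, n, trials = 10, 6, 100000
estimate = sum(match_probability(random_deck(N, n)) for _ in range(trials)) / trials
print(estimate, 1 / n)    # the Monte Carlo estimate should be close to 1/n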

We can also reason about probabilities in the deck model. For example, \mathbb{P}_\mathrm{deck}\bigl[ \mathbb{P}_\mathrm{in game}[M] = 0 \bigr] is the probability that all cards in the deck are different, namely

\frac{n(n-1) \ldots (n-N+1)}{n^N} = \frac{N!}{n^N} \binom{n}{N}.
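This formula is also easy to verify numerically. In the sketch below N = 5 and n = 20 are arbitrary choices of mine, and the exact value is compared with a simulation over random decks.

import random
from math import comb, factorial

def prob_all_different(N, n):
    # Exact probability that a deck of N independently uniform cards
    # from {1, ..., n} has all its numbers distinct.
    return factorial(N) * comb(n, N) / n**N

N, n = 5, 20
print(prob_all_different(N, n))          # 20*19*18*17*16 / 20^5 = 0.5814

random.seed(2)
trials = 100000
hits = sum(len(set(random.randint(1, n) for _ in range(N))) == N
           for _ in range(trials))
print(hits / trials)                     # should be close to the exact value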