A toy model for the proof of Shannon’s Noisy Coding Theorem for the BSC

Wildon's Weblog 2019-09-06

In the Binary Symmetric Channel (BSC) with symbol error probability p, an input bit flips with probability p < 1/2. A binary code of length n can be used to communicate on the BSC by sending each bit of a codeword u, collecting the n received bits as a word v still of length n, and then decoding v as the nearest codeword u' to v with respect to Hamming distance. For instance, using the repetition code of length n = 2e+1, a decoding error occurs if and only if e+1 or more bits get flipped in the channel. This event has probability

\sum_{k=e+1}^n \binom{n}{k} p^k(1-p)^{n-k} \le p^{e+1}(1-p)^e \sum_{k=e+1}^n \binom{n}{k} \le p^e (1-p)^e 2^{n-1}  \le \rho^e

where \rho = 4p(1-p) < 1. Hence we can communicate with decoding error probability < \epsilon provided that n \ge 1 + 2 \log \epsilon / \log \rho. For example, if p = 1/4 then \rho = 3/4 and \rho^e < 5\% provided n \ge 23. With this solution we send 23 bits through the channel for each bit of the message. The information rate is therefore 1/23. Can we do better?
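These numbers are easy to check computationally. Here is a minimal Python sketch, taking p = 1/4 and a 5% target as in the example above, which compares the exact decoding error probability of the repetition code with the cruder bound \rho^e; the function name repetition_error_prob and the loop finding e are mine, not part of the original argument.

from math import comb

def repetition_error_prob(e, p):
    # Exact probability of a decoding error for the repetition code of
    # length n = 2e + 1: at least e + 1 of the n bits are flipped.
    n = 2 * e + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(e + 1, n + 1))

p = 0.25
rho = 4 * p * (1 - p)      # rho = 3/4 when p = 1/4
eps = 0.05

# Smallest e with rho^e < eps; then n = 2e + 1 satisfies n >= 1 + 2 log(eps)/log(rho).
e = 0
while rho**e >= eps:
    e += 1

print(e, 2 * e + 1)                  # 11, 23
print(rho**e)                        # the bound: about 0.042
print(repetition_error_prob(e, p))   # the exact error probability is far smaller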

Shannon’s Noisy Coding Theorem

The capacity of the BSC is 1-h(p), where

h(p) = -p \log_2 p -(1-p) \log_2(1-p)

is Shannon’s entropy function. Capacity has a slightly technical definition in terms of mutual information. But by Shannon’s Noisy Coding Theorem it has the following interpretation: given any \epsilon > 0 and any \eta > 0, for all sufficiently large n there exists a binary code C of size > 2^{n (1- h(p) - \eta)} such that the probability of a decoding error when C is used to communicate on the BSC is < \epsilon. For example, if p = 1/4 then

0.188 < 1-h(p) \approx 0.189.

Therefore Shannon's Noisy Coding Theorem promises that, by taking n sufficiently large, we can communicate with negligible decoding error probability and information rate 0.188.
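For concreteness, the capacity for p = 1/4 can be computed in a couple of lines of Python; the function h below is just the entropy function defined above.

from math import log2

def h(p):
    # Shannon's entropy function h(p) = -p log2(p) - (1-p) log2(1-p)
    return -p * log2(p) - (1 - p) * log2(1 - p)

print(1 - h(0.25))   # capacity of the BSC with p = 1/4: about 0.1887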

As far as I know, the only proofs of Shannon's Noisy Coding Theorem use the probabilistic method.

Outline proof

Let C = \{u(1),\ldots, u(N)\} be a binary code of length n and size about 2^{n (1- h(p) - \eta/2)}, obtained by choosing each codeword u(i) uniformly and independently at random from \mathbb{F}_2^n. Let p_{i}(C) be the probability (in the BSC model) that if u(i) is sent then the received word is wrongly decoded as some u(j) with j\not=i. An expectation argument shows that, provided n is sufficiently large,

\mathbb{E}_{\mathrm{code}}[p_{i}(C)] < \epsilon/2.

Note that here the expectation is taken with respect to the random choice of C and p_{i} is a random variable (that just happens to be a probability in the BSC model). We'll see shortly why \epsilon/2 is the right upper bound to take. It follows by averaging over all codewords that

\frac{1}{N}\sum_{i=1}^N\mathbb{E}_\mathrm{code}[p_i(C)] < \epsilon/2.

By the probabilistic method, there exists a particular code C such that

\frac{1}{N}\sum_{i=1}^N p_i(C) < \epsilon/2.

Does this mean that we can use C to communicate with decoding error probability < \epsilon/2? Unfortunately not, because the previous equation only says that on average, we meet this bound. There could be some codewords u(i), perhaps those unusually near to many other codewords, for which the decoding error probability is higher. But since the average is \epsilon/2, at most half the codewords have p_i(C) \ge \epsilon. Therefore by removing at most half the codewords, we get a code C' of size N' \ge N/2 with decoding error probability < \epsilon. It might seem bad that, at worst, |C'| = |C|/2 = N/2, but since N = 2^{(1-h(p)-\eta/2)n}, we have

\frac{N}{2} = 2^{(1-h(p) -\eta/2 - 1/n)n}

and provided n is sufficiently large, \eta/2 + 1/n < \eta, so N/2 > 2^{(1-h(p)-\eta)n}. This randomly constructed C' proves Shannon's Noisy Coding Theorem for the parameters \epsilon and \eta.
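Here is a minimal Monte Carlo sketch of the two steps of the argument: choose a random code, estimate the decoding error probabilities p_i(C) in the channel model, and then expurgate the worse half of the codewords. The parameters n = 20 and N = 16 are toy choices of mine (far too small for the asymptotics the theorem needs), and all the function names are mine; the sketch is meant only to make the structure concrete, not to exhibit a capacity-achieving code.

import random

def random_code(N, n):
    # Choose N codewords uniformly and independently from F_2^n.
    return [tuple(random.randint(0, 1) for _ in range(n)) for _ in range(N)]

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

def decode(v, code):
    # Nearest-codeword decoding; ties are broken by smallest index.
    return min(range(len(code)), key=lambda i: hamming(v, code[i]))

def error_prob(i, code, p, trials=2000):
    # Monte Carlo estimate of p_i(C): send u(i) through the BSC and count
    # how often the received word is decoded as some u(j) with j != i.
    u = code[i]
    errors = 0
    for _ in range(trials):
        v = tuple(b ^ (random.random() < p) for b in u)
        if decode(v, code) != i:
            errors += 1
    return errors / trials

random.seed(0)
p, n, N = 0.25, 20, 16            # toy parameters: the theorem needs n large
C = random_code(N, n)
probs = [error_prob(i, C, p) for i in range(N)]
print(sum(probs) / N)             # average decoding error probability over C

# Expurgation: keep the better half of the codewords.  Removing codewords
# can only decrease each survivor's decoding error probability.
keep = sorted(range(N), key=lambda i: probs[i])[: N // 2]
C_good = [C[i] for i in keep]
print(max(probs[i] for i in keep))   # worst estimated error probability among survivors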

Remarks

  1. The final expurgation step, and the need to work with \epsilon/2 rather than \epsilon, can be avoided by choosing a random linear code. Then the multiset of Hamming distances \{ d(u(i),u(j)) : j\not=i \}, which determines the probability p_i(C) of a decoding error on sending u(i), is the same for all i. Therefore all the p_i(C) are equal. Some linear codes will be uniformly terrible, but by the probabilistic method, some linear codes are uniformly good.
  2. Care is needed to turn the outline above into a complete proof. For example, in the first step bounding the p_i(C), one needs Chernoff's bound or the Central Limit Theorem to justify that when a codeword u is sent, the received word v is, with probability tending to 1 as n \to \infty, at distance between (p-\delta)n and (p+\delta)n from u, for any fixed \delta > 0. (There is a numerical illustration of this concentration after this list.)
  3. One subtle feature of the proof is that there are two different probability models in play. I tried to emphasise this above, for example by using \epsilon for the error bounds in the channel probability space, and \eta for the error bounds in the code probability space. A related feature is that we have to work with expectations, rather than probabilities, in the code’s probability space. For instance, it would be nice to have

    \mathbb{P}_\mathrm{code}\bigl[\mathbb{P}_\mathrm{channel}[A_i] > \epsilon\bigr] < \eta

    for sufficiently large n, where A_i is the event that a sent u(i) is wrongly decoded as some u(j) with j\not= i. But these probabilities seem hard to reason about directly.
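To illustrate the concentration mentioned in the second remark, here is a short Python sketch that computes the exact binomial probability that the number of flipped bits lies between (p-\delta)n and (p+\delta)n, rather than appealing to Chernoff's bound; the parameters p = 1/4 and \delta = 0.05 are arbitrary choices of mine.

from math import comb, log, exp, ceil, floor

def prob_flips_in_band(n, p, delta):
    # P[(p - delta)n <= X <= (p + delta)n] where X ~ Bin(n, p) counts flipped bits.
    lo, hi = ceil((p - delta) * n), floor((p + delta) * n)
    total = 0.0
    for k in range(lo, hi + 1):
        # work in log space to avoid floating-point underflow for large n
        total += exp(log(comb(n, k)) + k * log(p) + (n - k) * log(1 - p))
    return total

p, delta = 0.25, 0.05
for n in (100, 1000, 10000):
    print(n, prob_flips_in_band(n, p, delta))   # tends to 1 as n grows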

A toy model

Here is a toy model that has the same split into two probability spaces, but is, I think, much easier to understand. To play Match! using a supplied deck of N cards, each bearing a number between 1 and n, the two players each draw a card from the deck. If they find they have matching numbers, they both eat some cake. Let M be this ‘in game’ event. Otherwise they go to bed hungry.

We suppose that the manufacturer constructs decks by choosing each card uniformly and independently at random. Let f_r be the frequency of card r in the deck, for each r \in \{1,\ldots, n\}. We have

\mathbb{P}_\mathrm{in game}[M] = \sum_{r=1}^n \frac{f_r(f_r-1)}{N(N-1)}.
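This in-game probability is easy to compute for any given deck. In the following sketch a deck is represented as a list of card numbers; the function name match_probability is mine.

from collections import Counter

def match_probability(deck):
    # In-game probability of M: the two players draw distinct cards from
    # the deck (without replacement) and the numbers on them agree.
    N = len(deck)
    counts = Counter(deck)                  # counts[r] = f_r
    return sum(f * (f - 1) for f in counts.values()) / (N * (N - 1))

print(match_probability([1, 1, 2, 3]))      # f_1 = 2, so 2/(4*3) = 1/6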

In the random process used by the manufacturer to construct decks, f_r has the binomial distribution B(N, 1/n). Hence \mathbb{E}_\mathrm{deck}[f_r] = N/n and

\mathbb{E}_\mathrm{deck}[f_r^2] = \mathrm{Var}_\mathrm{deck}[f_r] + \mathbb{E}_\mathrm{deck}[f_r]^2 = \frac{N(n-1)}{n^2} + \frac{N^2}{n^2}.

It follows that

\begin{aligned}\mathbb{E}_\mathrm{deck}\bigl[\mathbb{P}_\mathrm{in game}[M]\bigr] &= \frac{1}{N(N-1)}\sum_{r=1}^n \bigl( \frac{N(n-1)}{n^2} + \frac{N^2}{n^2} - \frac{N}{n} \bigr) \\ &= \frac{n}{N(N-1)n^2} \bigl( Nn- N + N^2 - Nn \bigr) \\ &= \frac{1}{n}. \end{aligned}

This can also be seen without calculation: mixing up the two probability models, we can imagine that the manufacturer builds a two card deck, and issues the cards to the players. The probability the two cards are equal is clearly 1/n.
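As a sanity check on the 1/n answer, here is a small simulation over random decks; the parameters N = 10, n = 6 and the number of trials are arbitrary choices of mine.

import random
from collections import Counter

def random_deck(N, n):
    # Manufacturer's model: each of the N cards is chosen uniformly
    # and independently from {1, ..., n}.
    return [random.randint(1, n) for _ in range(N)]

def match_probability(deck):
    N = len(deck)
    counts = Counter(deck)
    return sum(f * (f - 1) for f in counts.values()) / (N * (N - 1))

random.seed(1)
N, n, trials = 10, 6, 100000
estimate = sum(match_probability(random_deck(N, n)) for _ in range(trials)) / trials
print(estimate, 1 / n)    # the Monte Carlo estimate should be close to 1/n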

We can also reason about probabilities in the deck model. For example, \mathbb{P}_\mathrm{deck}\bigl[ \mathbb{P}_\mathrm{in game}[M] = 0 \bigr] is the probability that all cards in the deck are different, namely

\frac{n(n-1) \ldots (n-N+1)}{n^N} = \frac{N!}{n^N} \binom{n}{N}.
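This formula is also easy to verify numerically. In the sketch below N = 5 and n = 20 are arbitrary choices of mine, and the exact value is compared with a simulation over random decks.

import random
from math import comb, factorial

def prob_all_different(N, n):
    # Exact probability that a deck of N independently uniform cards
    # from {1, ..., n} has all its numbers distinct.
    return factorial(N) * comb(n, N) / n**N

N, n = 5, 20
print(prob_all_different(N, n))          # 20*19*18*17*16 / 20^5 = 0.5814

random.seed(2)
trials = 100000
hits = sum(len(set(random.randint(1, n) for _ in range(N))) == N
           for _ in range(trials))
print(hits / trials)                     # should be close to the exact value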