A toy model for the proof of Shannon’s Noisy Coding Theorem for the BSC
Wildon's Weblog 2019-09-06
In the Binary Symmetric Channel (BSC) with symbol error probability $p < 1/2$, an input bit flips with probability $p$. A binary code $C$ of length $n$ can be used to communicate on the BSC by sending each bit of a codeword $u \in C$, collecting the received bits as a word $v$, still of length $n$, and then decoding $v$ as the nearest codeword to $v$ with respect to Hamming distance. For instance, using the repetition code $\{00\ldots 0, 11\ldots 1\}$ of length $2t+1$, a decoding error occurs if and only if $t+1$ or more bits get flipped in the channel. This event has probability

$$\sum_{j=t+1}^{2t+1} \binom{2t+1}{j} p^j q^{2t+1-j}$$
where $q = 1 - p$. Hence, by taking $t$ sufficiently large, we can communicate with decoding error probability $< \epsilon$. For example, if $p = 1/4$ then the probability above is at most $2^{2t} p^{t+1} q^t = \tfrac{1}{4}(3/4)^t$, and so is $< \epsilon$ provided $(3/4)^t < 4\epsilon$. With this solution we send $2t+1$ bits through the channel for each bit of the message. The information rate is therefore $1/(2t+1)$. Can we do better?
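As a sanity check on this calculation, here is a small Python sketch, not from the original post, with illustrative values of $p$ and $t$: it computes the decoding error probability of the length $2t+1$ repetition code exactly and compares it with a direct simulation of the BSC.

```python
import math
import random

def repetition_error_prob(p: float, t: int) -> float:
    """Exact probability that a length 2t+1 repetition codeword is mis-decoded
    on a BSC with bit-flip probability p: at least t+1 of the 2t+1 bits flip."""
    n = 2 * t + 1
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j) for j in range(t + 1, n + 1))

def simulate(p: float, t: int, trials: int = 100_000) -> float:
    """Monte Carlo estimate of the same probability by flipping bits at random."""
    n = 2 * t + 1
    errors = sum(1 for _ in range(trials)
                 if sum(random.random() < p for _ in range(n)) >= t + 1)
    return errors / trials

p, t = 0.25, 5                        # illustrative values: length 11 repetition code
print(repetition_error_prob(p, t))    # about 0.034
print(simulate(p, t))                 # should be close to the exact value
```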
Shannon’s Noisy Coding Theorem
The capacity of the BSC is $1 - H(p)$, where

$$H(p) = -p \log_2 p - (1-p)\log_2 (1-p)$$
is Shannon’s entropy function. Capacity has a slightly technical definition in terms of mutual information. But by Shannon’s Noisy Coding Theorem it has the following interpretation: given any $\epsilon > 0$ and any rate $R < 1 - H(p)$, for all sufficiently large $n$ there exists a binary code $C$ of length $n$ and size at least $2^{Rn}$ such that the probability of a decoding error when $C$ is used to communicate on the BSC is less than $\epsilon$. For example, if $p = 1/4$ then

$$1 - H(1/4) = 1 - \tfrac{1}{4}\log_2 4 - \tfrac{3}{4}\log_2 \tfrac{4}{3} \approx 0.1887.$$
Therefore Shannon's Noisy Coding Theorem promises that, by taking $n$ sufficiently large, we can communicate with negligible decoding error probability and information rate about $0.19$.
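For concreteness, here is a minimal Python sketch, not part of the original post, of the entropy function and the resulting capacity.

```python
import math

def binary_entropy(p: float) -> float:
    """Shannon's entropy function H(p) = -p log2 p - (1-p) log2 (1-p)."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p: float) -> float:
    """Capacity of the binary symmetric channel with flip probability p."""
    return 1 - binary_entropy(p)

print(bsc_capacity(0.25))  # about 0.1887
```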
As far as I know, the only proofs of Shannon's Noisy Coding Theorem use the probabilistic method.
Outline proof
Let $C$ be a binary code of length $n$ and size about $2^{Rn}$, obtained by choosing each codeword uniformly and independently at random from $\mathbb{F}_2^n$. For $u \in C$, let $P_u$ be the probability (in the BSC model) that if $u$ is sent then the received word is wrongly decoded as some $u' \in C$ with $u' \neq u$. An expectation argument shows that, provided $n$ is sufficiently large,

$$\mathbb{E}[P_u] \le \epsilon/2.$$
Note that here the expectation is taken with respect to the random choice of $C$, and $P_u$ is a random variable (that just happens to be a probability in the BSC model). We'll see shortly why $\epsilon/2$ is the right upper bound to take. It follows by averaging over all codewords that

$$\mathbb{E}\Bigl[ \frac{1}{|C|} \sum_{u \in C} P_u \Bigr] \le \epsilon/2.$$
By the probabilistic method, there exists a particular code $C$ such that

$$\frac{1}{|C|} \sum_{u \in C} P_u \le \epsilon/2.$$
Does this mean that we can use $C$ to communicate with decoding error probability $\le \epsilon/2$? Unfortunately not, because the previous equation only says that, on average, we meet this bound. There could be some codewords $u$, perhaps those unusually near to many other codewords, for which the decoding error probability $P_u$ is higher. But since the average is $\le \epsilon/2$, at most half the codewords have $P_u \ge \epsilon$. Therefore by removing at most half the codewords, we get a code $C'$ of size at least $|C|/2$ with decoding error probability $< \epsilon$ for every codeword. It might seem bad that, at worst, $|C'| = |C|/2$, but since $|C| \approx 2^{Rn}$, we have
$$|C'| \ge 2^{Rn - 1} = 2^{(R - \frac{1}{n})n},$$

and provided $n$ is sufficiently large, the rate $R - \frac{1}{n}$ exceeds any fixed $R' < R$; since $R$ may be any rate below the capacity $1 - H(p)$, so may $R'$. This randomly constructed $C'$ proves Shannon's Noisy Coding Theorem for the parameters $\epsilon$ and $R'$.
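The outline can be watched in action with a small Monte Carlo experiment. The sketch below is not from the original post; the parameters $n$, $R$ and $p$ are illustrative, and each $P_u$ is only estimated by simulation rather than bounded by the expectation argument. It draws a random code, decodes received words to the nearest codeword, and then performs the expurgation step by discarding the worse half of the codewords.

```python
import random

def hamming(u: int, v: int) -> int:
    """Hamming distance between two length-n words stored as Python ints."""
    return bin(u ^ v).count("1")

def estimate_Pu(code: list[int], u: int, n: int, p: float, trials: int = 2000) -> float:
    """Monte Carlo estimate of P_u: send u through the BSC with flip probability p
    and decode to the nearest codeword (ties resolved arbitrarily)."""
    errors = 0
    for _ in range(trials):
        noise = sum(1 << i for i in range(n) if random.random() < p)
        received = u ^ noise
        best = min(code, key=lambda c: hamming(c, received))
        if best != u:
            errors += 1
    return errors / trials

# Illustrative parameters, not from the post: length n, rate R, flip probability p.
n, R, p = 30, 0.1, 0.05
size = 2 ** round(R * n)                              # |C| is about 2^(Rn)
code = [random.getrandbits(n) for _ in range(size)]   # the random code C

P = {u: estimate_Pu(code, u, n, p) for u in code}
print("average P_u over C:", sum(P.values()) / len(P))

# Expurgation: keep the half of the codewords with the smallest P_u.
expurgated = sorted(P, key=P.get)[: size // 2]
print("worst P_u after expurgation:", max(P[u] for u in expurgated))
```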
Remarks
- The final expurgation step, and the need to work with $\epsilon/2$ rather than $\epsilon$, can be avoided by choosing a random linear code. Then the multiset of Hamming distances $\{ d(u, u') : u' \in C \}$, which determines the probability of a decoding error on sending $u$, is the same for all $u \in C$. Therefore all the $P_u$ are equal. Some linear codes will be uniformly terrible, but by the probabilistic method, some linear codes are uniformly good. (See the linear-code sketch after this list.)
- Care is needed to turn the outline above into a complete proof. For example, in the first step bounding the $\mathbb{E}[P_u]$, one needs Chernoff’s bound or the Central Limit Theorem to justify that when a codeword $u$ is sent, the received word is, with probability tending to $1$ as $n \to \infty$, at distance between $(p - \eta)n$ and $(p + \eta)n$ from $u$, for any fixed $\eta > 0$. (See the bit-flip concentration sketch after this list.)
- One subtle feature of the proof is that there are two different probability models in play. I tried to emphasise this above, for example by computing the error bounds $P_u$ in the channel probability space, and the expectations $\mathbb{E}[P_u]$ in the code probability space. A related feature is that we have to work with expectations, rather than probabilities, in the code’s probability space. For instance, it would be nice to have

$$\mathbb{P}[B_u] \le \epsilon$$
for sufficiently large $n$, where $B_u$ is the event that a sent $u \in C$ is wrongly decoded as some $u' \in C$ with $u' \neq u$. But these probabilities seem hard to reason about directly.
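To illustrate the first remark, here is a small Python check, not from the original post and with illustrative parameters $k$ and $n$: it builds a random binary linear code from $k$ random generator rows and verifies that the multiset of Hamming distances from a codeword to the rest of the code is the same for every codeword (it is the weight distribution of the nonzero codewords).

```python
import random

def random_linear_code(k: int, n: int) -> list[int]:
    """The 2^k words spanned over F_2 by k random rows of length n
    (listed with repetition if the rows happen to be dependent)."""
    rows = [random.getrandbits(n) for _ in range(k)]
    code = []
    for mask in range(2 ** k):
        word = 0
        for i in range(k):
            if (mask >> i) & 1:
                word ^= rows[i]              # addition over F_2 is XOR
        code.append(word)
    return code

def distance_multiset(code: list[int], u: int) -> tuple:
    """Sorted multiset of Hamming distances from u to the other codewords."""
    return tuple(sorted(bin(u ^ v).count("1") for v in code if v != u))

k, n = 4, 12                                 # illustrative parameters
code = random_linear_code(k, n)
print(len({distance_multiset(code, u) for u in code}) == 1)   # True
```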
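To illustrate the concentration claim in the second remark, this sketch, again not from the post and with illustrative parameters, estimates the proportion of BSC transmissions in which the number of flipped bits lies between $(p - \eta)n$ and $(p + \eta)n$; as Chernoff's bound predicts, the proportion tends to $1$ as $n$ grows.

```python
import random

def flips_in_range(n: int, p: float, eta: float, trials: int = 5000) -> float:
    """Proportion of length-n BSC transmissions in which the number of
    flipped bits lies between (p - eta) * n and (p + eta) * n."""
    lo, hi = (p - eta) * n, (p + eta) * n
    hits = sum(1 for _ in range(trials)
               if lo <= sum(random.random() < p for _ in range(n)) <= hi)
    return hits / trials

p, eta = 0.25, 0.05                          # illustrative parameters
for n in (50, 200, 800):
    print(n, flips_in_range(n, p, eta))      # tends to 1 as n grows
```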
A toy model
Here is a toy model that has the same split into two probability spaces, but is, I think, much easier to understand. To play Match! using a supplied deck of $k$ cards, each numbered from $1$ up to $m$, the two players each draw a card from the deck. If they find they have matching numbers, they both eat some cake. Let $M$ be this ‘in game’ event. Otherwise they go to bed hungry.
We suppose that the manufacturer constructs decks by choosing each of the $k$ cards uniformly and independently at random. Let $N_i$ be the frequency of card $i$, for each $i \in \{1, \ldots, m\}$. Since the two players draw distinct cards from the deck, we have

$$\mathbb{P}[M] = \frac{\sum_{i=1}^m N_i(N_i - 1)}{k(k-1)}.$$
In the random process used by the manufacturer to construct decks, $N_i$ has the binomial distribution $\mathrm{Bin}(k, 1/m)$. Hence $\mathbb{E}[N_i] = k/m$ and

$$\mathbb{E}[N_i(N_i-1)] = \mathrm{Var}(N_i) + \mathbb{E}[N_i]^2 - \mathbb{E}[N_i] = \frac{k}{m}\Bigl(1 - \frac{1}{m}\Bigr) + \frac{k^2}{m^2} - \frac{k}{m} = \frac{k(k-1)}{m^2}.$$
It follows that

$$\mathbb{E}\bigl[\mathbb{P}[M]\bigr] = \frac{\sum_{i=1}^m \mathbb{E}[N_i(N_i-1)]}{k(k-1)} = \frac{m \cdot k(k-1)/m^2}{k(k-1)} = \frac{1}{m}.$$
This can also be seen without calculation: mixing up the two probability models, we can imagine that the manufacturer builds a two-card deck and issues the cards to the players. The probability the cards are equal is clearly $1/m$.
We can also reason about probabilities in the deck model. For example, $\mathbb{P}[N_i \le 1 \text{ for all } i]$ is the probability that all cards are different, namely

$$\frac{m(m-1)\cdots(m-k+1)}{m^k}.$$
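Finally, here is a short simulation, not from the original post, with illustrative choices for the deck size $k$, the range of card values $m$ and the number of decks: the outer loop is the manufacturer's deck model and the inner calculation is the in-game model. The empirical average of $\mathbb{P}[M]$ should be close to $1/m$, and the proportion of decks with all cards different should be close to the product formula above.

```python
import random
from math import prod

def match_probability(deck: list[int]) -> float:
    """In-game model: exact probability that two cards drawn (without
    replacement) from this fixed deck show the same number."""
    k = len(deck)
    return sum(deck.count(i) * (deck.count(i) - 1) for i in set(deck)) / (k * (k - 1))

k, m, decks = 5, 10, 20_000                  # illustrative parameters
total_match, all_different = 0.0, 0
for _ in range(decks):                       # deck model: manufacturer's random choice
    deck = [random.randint(1, m) for _ in range(k)]
    total_match += match_probability(deck)
    all_different += len(set(deck)) == k

print(total_match / decks)                   # close to 1/m = 0.1
print(all_different / decks)                 # close to m(m-1)...(m-k+1)/m^k
print(prod(range(m - k + 1, m + 1)) / m**k)  # = 0.3024 for k = 5, m = 10
```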