Shannon’s 1949 paper
Wildon's Weblog 2018-03-12
In 1949 Claude Shannon published Communication theory of secrecy systems, Bell System Technical Journal 28 656–715. The members of Bell Labs around this time included Richard Hamming, the three inventors of the transistor, and Harry Nyquist, of the Nyquist bound. Its technical journal published many seminal papers, including Shannon’s 1948 paper A mathematical theory of communication, defining entropy, and Hamming’s 1950 paper Error detecting and error correcting codes, defining Hamming distance and essentially inventing the modern ‘adversarial’ setting for coding theory.
Incidentally, the story goes that Shannon asked Von Neumann what he should call his new measure of information content and Von Neumann replied
‘You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, no one really knows what entropy really is, so in a debate you will always have the advantage.’
Like Hamming’s paper, Shannon’s two papers still amply repay reading today. One idea introduced in the 1949 paper is ‘perfect secrecy’.
Perfect secrecy
Consider a cryptosystem with plaintexts $\mathcal{P}$ and ciphertexts $\mathcal{C}$, with encryption functions $e_k \colon \mathcal{P} \rightarrow \mathcal{C}$ parametrised by a set $\mathcal{K}$ of keys.

Shannon supposes that there is a probability distribution on the plaintexts, assigning an a priori probability $p_x$ to each $x \in \mathcal{P}$. There is also an independent probability distribution on the keys. These interact in a rich way. For instance, writing $P$, $C$ and $K$ for the random plaintext, ciphertext and key, $\mathbb{P}[C = y \mid P = x]$ is a sum of key probabilities, while $\mathbb{P}[C = y \mid K = k]$ is either zero, or the probability attached to the unique plaintext $x$ such that $e_k(x) = y$. Other probabilities, such as $\mathbb{P}[P = x \mid C = y]$, can be computed from these using Bayes’ Theorem type arguments.

Suppose we observe a ciphertext $y$: what, if anything, do we learn about the corresponding plaintext $x$? Shannon defines the a posteriori probability $\mathbb{P}[P = x \mid C = y]$ to be the conditional probability of the plaintext $x$ given we observed the ciphertext $y$; the system then has perfect secrecy if, for any a priori probability distribution, we have $\mathbb{P}[P = x \mid C = y] = p_x$ for all $x \in \mathcal{P}$, and all $y \in \mathcal{C}$. (This assumes implicitly that $\mathbb{P}[C = y] > 0$ for every $y \in \mathcal{C}$.)

Shannon proves in his Theorem 6 that a necessary and sufficient condition for perfect secrecy is that $\mathbb{P}[C = y \mid P = x] = \mathbb{P}[C = y]$ for all $x \in \mathcal{P}$ and $y \in \mathcal{C}$.

The proof is a short application of Bayes’ Theorem: since $\mathbb{P}[P = x \mid C = y] = \mathbb{P}[C = y \mid P = x]\, p_x / \mathbb{P}[C = y]$, and since we may choose $p_x > 0$ (which is necessary anyway for the conditional probability $\mathbb{P}[C = y \mid P = x]$ to be well-defined), we have $\mathbb{P}[P = x \mid C = y] = p_x$ if and only if $\mathbb{P}[C = y \mid P = x] = \mathbb{P}[C = y]$.
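Shannon’s criterion is easy to check by machine for a small system. Here is a minimal Python sketch (my own illustration, not from the paper): it verifies that $\mathbb{P}[C = y \mid P = x]$ does not depend on $x$, using a one-time pad on $\mathbb{Z}/3\mathbb{Z}$ with equiprobable keys as the example.

```python
# A sketch (not from the paper): check Shannon's Theorem 6 condition, namely
# that P[C = y | P = x] does not depend on x, for a small cryptosystem.
# The example here is a one-time pad on Z/3Z with equiprobable keys.
from fractions import Fraction

n = 3
plaintexts = ciphertexts = keys = range(n)
key_prob = {k: Fraction(1, n) for k in keys}

def enc(k, x):
    return (x + k) % n                       # encryption map e_k

def cond_prob(y, x):
    """P[C = y | P = x]: total probability of the keys with e_k(x) = y."""
    return sum(key_prob[k] for k in keys if enc(k, x) == y)

perfect = all(len({cond_prob(y, x) for x in plaintexts}) == 1 for y in ciphertexts)
print("Theorem 6 condition holds:", perfect)  # True for the one-time pad on Z/3Z
```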
Corollary. In a system with perfect secrecy, $|\mathcal{K}| \ge |\mathcal{C}|$. Moreover, if equality holds then every key must be used with equal probability and for each $x \in \mathcal{P}$ and $y \in \mathcal{C}$ there exists a unique $k \in \mathcal{K}$ such that $e_k(x) = y$.

Proof. Fix a plaintext $x$. We claim that for each $y \in \mathcal{C}$ there exists a key $k$ such that $e_k(x) = y$. Indeed, if $y$ is never an encryption of $x$ then, for any choice of a priori probabilities that gives some probability to $x$, we have $\mathbb{P}[C = y \mid P = x] = 0$, so by Shannon’s Theorem 6, $\mathbb{P}[C = y] = 0$. But, by the implicit assumption, there is a non-zero chance of observing $y$, a contradiction.
Since $x$ has at most $|\mathcal{K}|$ different encryptions, the claim implies that $|\mathcal{K}| \ge |\mathcal{C}|$. Moreover if equality holds then for every $y \in \mathcal{C}$ there exists a unique $k \in \mathcal{K}$ such that $e_k(x) = y$. Since $\mathbb{P}[C = y \mid P = x] = \mathbb{P}[K = k]$, where $k$ is this unique key, the conclusion of Theorem 6, that $\mathbb{P}[C = y \mid P = x]$ is constant for $y \in \mathcal{C}$, then implies that each key is equiprobable.
The ‘only if’ direction of Theorem 2.4 in Cryptography: theory and practice (3rd edition) by Douglas Stinson is the corollary above, but, according to my reading of pages 48 and 50, interpreted with a different definition of perfect secrecy, in which the a priori distribution is fixed, as part of the cryptosystem. (This makes some sense: for example, in a cryptosystem encrypting English plaintexts, nonsensical strings should always have zero probability.) Unfortunately, Stinson’s change makes the result false. The diagram below shows a toy cryptosystem with two keys and two plaintexts.
Take and . Then
and
so the system has perfect secrecy, no matter what probabilities we assign to the keys. (Incidentally, setting and gives a cryptosystem where we always send and observe the ciphertext ; the a posteriori probability of is therefore the same as the a priori probability, so the system has perfect secrecy, even though the keys are not equiprobable. This shows that Shannon’s implicit assumption is not just a technicality required to make the conditional probabilities well-defined.)
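Since the diagram is not reproduced here, the following Python sketch uses my own choice of a two-key, two-plaintext system and a fixed a priori distribution concentrated on one plaintext; it illustrates the same point, that with a fixed distribution the a posteriori probabilities can equal the a priori ones however the (non-zero) key probabilities are chosen, so perfect secrecy in the fixed-distribution sense does not force equiprobable keys.

```python
# Hedged illustration only: the encryption maps and probabilities below are my
# own choice, not necessarily those in the post's diagram.  With p_{x1} = 1 and
# p_{x2} = 0, the a posteriori probabilities equal the a priori ones whatever
# non-zero probabilities the two keys receive.
from fractions import Fraction

plaintexts = ["x1", "x2"]
ciphertexts = ["y1", "y2"]
enc = {("k1", "x1"): "y1", ("k1", "x2"): "y2",   # key k1: identity
       ("k2", "x1"): "y2", ("k2", "x2"): "y1"}   # key k2: swap
p = {"x1": Fraction(1), "x2": Fraction(0)}       # fixed a priori distribution
r = {"k1": Fraction(2, 3), "k2": Fraction(1, 3)} # keys NOT equiprobable

for y in ciphertexts:
    prob_y = sum(p[x] * r[k] for k in r for x in plaintexts if enc[(k, x)] == y)
    for x in plaintexts:
        joint = sum(p[x] * r[k] for k in r if enc[(k, x)] == y)
        posterior = joint / prob_y               # P[P = x | C = y]
        print(f"P[P={x} | C={y}] = {posterior}  (a priori {p[x]})")
```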
The error in the proof of Theorem 2.4 comes in the application of Bayes’ Law, where it is implicitly assumed that $p_x > 0$. This shows that the extra layer of quantification in Shannon’s paper is not a mere technicality. Given the difficulties students have with nested quantifiers, I’m inclined to keep Stinson’s definition, and fix the problem by assuming $p_x > 0$ for each $x \in \mathcal{P}$. One can get a more general, but somewhat technical, result by taking an arbitrary cryptosystem and applying Shannon’s Theorem to the subsystem where the plaintext set is the set of plaintexts with positive probability, and $\mathcal{K}$ and $\mathcal{C}$ are unchanged. (To be fair to Stinson, he observes before the theorem that plaintexts $x$ such that $p_x = 0$ are never an obstacle to perfect secrecy, so clearly he was aware of the issue. He also assumes, as we have done, that $\mathbb{P}[C = y] > 0$ for each $y \in \mathcal{C}$, but this is something else.)
Incidentally, there is a subtle English trap in Shannon’s paper: he says, quite correctly, that when , ‘it is possible to obtain perfect secrecy with only this number of keys’. Here ‘with only’ does not mean the same as ‘only with’.
Example from permutation groups
Given a finite group $G$ acting transitively on a set $\Omega$ we obtain a cryptosystem with plaintexts and ciphertexts $\Omega$ and encryption maps $\alpha \mapsto g\alpha$ indexed by the elements $g$ of $G$. For which probability distributions on $G$ does this cryptosystem have perfect secrecy?

Let $G_\alpha$ denote the point stabiliser of $\alpha \in \Omega$. Since

$$\mathbb{P}[C = \beta \mid P = \alpha] = \sum_{g \,:\, g\alpha = \beta} \mathbb{P}[K = g] = \sum_{g \in hG_\alpha} \mathbb{P}[K = g],$$

where $h$ is any element such that $h\alpha = \beta$, the cryptosystem has perfect secrecy if and only if, for each $\beta \in \Omega$,

$$\sum_{g \in hG_\alpha} \mathbb{P}[K = g]$$

is constant as $\alpha$ varies over $\Omega$. Call this condition $(\star)$.
Lemma 2. Suppose that $G$ has a regular normal subgroup $N$. Any probability distribution on $G$ constant on the cosets of $N$ satisfies $(\star)$.

Proof. Since $N$ is regular, $N \cap G_\alpha = 1$ and $G = N G_\alpha$, so the subgroup $G_\alpha$ meets each coset of $N$ in a unique element. The same holds for each coset $hG_\alpha$. Therefore

$$\sum_{g \in hG_\alpha} \mathbb{P}[K = g]$$

is the sum over the cosets of $N$ of the common value of the distribution on each coset, and so is independent of $\alpha$.
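Lemma 2 is easy to check numerically. The sketch below (my own, using the affine group $\mathrm{AGL}_1(\mathbb{F}_5)$, which has the translation subgroup as a regular normal subgroup) draws a random distribution constant on the cosets of the translations and confirms condition $(\star)$.

```python
# Numerical check of Lemma 2 (a sketch, using AGL_1(F_5)): a distribution on
# the affine maps x -> ax + b that is constant on each coset of the
# translation subgroup T = {x -> x + b} satisfies condition (*): for each
# beta, the sum of the key probabilities over {g : g(alpha) = beta} does not
# depend on alpha.
from fractions import Fraction
import random

random.seed(0)
p = 5
group = [(a, b) for a in range(1, p) for b in range(p)]   # g = (a, b): x -> ax + b mod p
act = lambda g, x: (g[0] * x + g[1]) % p

# a random distribution constant on the cosets of T: the coset of (a, b) is
# determined by a, so the weight depends only on a
weights = {a: random.randint(1, 10) for a in range(1, p)}
total = p * sum(weights.values())                         # each coset contains p elements
r = {g: Fraction(weights[g[0]], total) for g in group}

def star_condition(r):
    """For each beta, the sums over alpha must all coincide."""
    for beta in range(p):
        sums = {sum(r[g] for g in group if act(g, alpha) == beta) for alpha in range(p)}
        if len(sums) != 1:
            return False
    return True

print("condition (*) holds:", star_condition(r))          # expected: True
```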
In the case of the affine cipher, the converse also holds.
Theorem 3. Let $p$ be a prime and let $G = \mathrm{AGL}_1(\mathbb{F}_p)$ be the affine group on $\mathbb{F}_p$ with normal translation subgroup $T$. A probability distribution on $G$ satisfies $(\star)$ if and only if it is constant on each of the cosets of $T$.

Proof. Let $a$ generate the multiplicative group of integers modulo $p$. Then $G = T \rtimes H$ is a semidirect product where $T$ is the regular subgroup generated by translation by $1$ and the point stabiliser $H = G_0$ is generated by multiplication by $a$.
A probability distribution on $G$ can be regarded as an element of the group algebra $\mathbb{C}G$. The probability distributions satisfying $(\star)$ correspond to those $v = \sum_{g \in G} v_g g \in \mathbb{C}G$ such that, for each $\beta \in \mathbb{F}_p$,

$$\sum_{g \in hG_\alpha} v_g$$

is constant as $\alpha$ varies over $\mathbb{F}_p$. Thus if is the corresponding subspace of , then is invariant under left-multiplication by and right-conjugation by . But since for , taken together, these conditions are equivalent to invariance under left- and right-multiplication by . It is therefore natural to ask for the decomposition of as a representation of , where acts by
Since is abelian, for all . Hence . The calculation behind this observation is
where in . This shows that is stabilised by and so , regarded as a representation of , is induced from the subgroup of generated by . By Frobenius Reciprocity, decomposes as a direct sum of a trivial representation of , spanned by
and a further non-trivial representations, each with kernel . Critically all of these representations are non-isomorphic.
By Lemma 2, contains all trivial representations. Writing for their direct sum, we have for a unique complement that decomposes uniquely as a sum of the non-trivial representations in the previous paragraph. Therefore if properly contains then contains a non-trivial summand of some , spanned by some of the form
.
The support of is , which meets each point stabiliser in a unique element. Hence all the must be equal, a contradiction.
This should generalize to an arbitrary Frobenius group with abelian kernel, and probably to other permutation groups having a regular normal subgroup.
The key equivocation given a ciphertext
In practice, we might care more about what an observed ciphertext tells us about the key. This question is considered in Part B of Shannon’s paper. Let $P$, $C$ and $K$ be random variables representing the plaintext, ciphertext and key, respectively. We assume that $P$ and $K$ are independent. (This is easily motivated: for instance, it holds if the key is chosen before the plaintext, and the plaintext is not a message about the key.) The conditional entropy $H(K \mid C)$, defined by

$$H(K \mid C) = -\sum_{y \in \mathcal{C}} \mathbb{P}[C = y] \sum_{k \in \mathcal{K}} \mathbb{P}[K = k \mid C = y] \log \mathbb{P}[K = k \mid C = y],$$

represents our expected uncertainty, or ‘equivocation’ to use Shannon’s term, in the key, after we have observed a ciphertext. (Throughout $\log$ denotes logarithm in base $2$.)

It is intuitive that $H(K \mid C) \ge H(P \mid C)$: since a given plaintext can be encrypted to a given ciphertext by several different keys, given a ciphertext, we are at least as uncertain about the key as we are about the plaintext. The proof is yet another exercise in Bayes’ Theorem, which I’ll include because it sheds some light on the probability model.

Lemma 4. $H(K \mid C) \ge H(P \mid C)$.
Proof. Fix $y \in \mathcal{C}$. Let $\mathcal{P}_y$ be the set of plaintexts that encrypt, under some key, to $y$. (Thus $\mathcal{P}_y$ is the union of the inverse images of the singleton set $\{y\}$ under all the encryption maps.) For $x \in \mathcal{P}$, let $\mathcal{K}_{xy}$ be the set (possibly empty) of keys $k$ such that $e_k(x) = y$. By Bayes’ Theorem,

$$\mathbb{P}[K = k \mid C = y] = \frac{\mathbb{P}[C = y \mid K = k]\,\mathbb{P}[K = k]}{\mathbb{P}[C = y]}.$$

Similarly

$$\mathbb{P}[C = y \mid K = k] = p_x$$

if $e_k(x) = y$ for some (necessarily unique) $x$, and otherwise this probability is $0$. Thus

$$\mathbb{P}[P = x \mid C = y] = \frac{p_x \sum_{k \in \mathcal{K}_{xy}} \mathbb{P}[K = k]}{\mathbb{P}[C = y]} = \sum_{k \in \mathcal{K}_{xy}} \mathbb{P}[K = k \mid C = y]$$

and so the probability distribution on $\mathcal{P}_y$ (conditioned on $C = y$) is a coarsening of the probability distribution on $\mathcal{K}$ (conditioned on $C = y$). Therefore $H(P \mid C = y) \le H(K \mid C = y)$ and the lemma now follows by summing over $y \in \mathcal{C}$, using the definition of conditional entropy above.
A related formula is now a textbook staple. If we know the key then we know the plaintext if and only if we know the ciphertext. Therefore the joint entropies $H(K, P)$ and $H(K, C)$ agree and

$$H(K, C) = H(K, P) = H(K) + H(P)$$

by the independence assumption. Hence

$$H(K \mid C) = H(K, C) - H(C) = H(K) + H(P) - H(C).$$
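Both facts are easy to confirm numerically. The following sketch (my own toy example, not from the paper) builds a small cryptosystem from random permutations, computes the joint distribution of $(K, P, C)$, and checks that $H(K \mid C) = H(K) + H(P) - H(C)$ and that $H(K \mid C) \ge H(P \mid C)$.

```python
# Sanity check of H(K|C) = H(K) + H(P) - H(C) (with P, K independent) and of
# Lemma 4, on a toy cryptosystem whose encryption maps are random permutations.
import math, random
from collections import defaultdict

random.seed(0)
plaintexts = list(range(4))
keys = [random.sample(plaintexts, len(plaintexts)) for _ in range(3)]  # e_k(x) = keys[k][x]

def normalise(w):
    s = sum(w)
    return [v / s for v in w]

p = normalise([random.random() for _ in plaintexts])   # a priori plaintext distribution
r = normalise([random.random() for _ in keys])         # key distribution, independent of P

joint = defaultdict(float)                             # P[K = k, P = x, C = y]
for k, perm in enumerate(keys):
    for x in plaintexts:
        joint[(k, x, perm[x])] += r[k] * p[x]

def entropy(dist):
    return -sum(q * math.log2(q) for q in dist.values() if q > 0)

def marginal(select):
    out = defaultdict(float)
    for kxy, q in joint.items():
        out[select(kxy)] += q
    return out

HK = entropy(marginal(lambda t: t[0]))
HP = entropy(marginal(lambda t: t[1]))
HC = entropy(marginal(lambda t: t[2]))
HKC = entropy(marginal(lambda t: (t[0], t[2]))) - HC   # H(K | C) = H(K, C) - H(C)
HPC = entropy(marginal(lambda t: (t[1], t[2]))) - HC   # H(P | C)
print(f"H(K|C) = {HKC:.4f}   H(K)+H(P)-H(C) = {HK + HP - HC:.4f}   H(P|C) = {HPC:.4f}")
```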
Shannon considers a toy cryptosystem with keys $k_0$ and $k_1$, in which a plaintext $x \in \{0,1\}^n$ is encrypted bit-by-bit, as itself if the key is $k_0$ and flipping each bit if the key is $k_1$. Suppose that the keys are equiprobable, and that each plaintext bit is independently $0$ with probability $p$, so the probability of a plaintext having exactly $s$ zeros is $\binom{n}{s} p^s q^{n-s}$. Denoting this quantity by $p_s$, we have $\sum_{s=0}^{n} p_s = 1$. The probability that the ciphertext has exactly $s$ zeros is $\frac{1}{2}(p_s + p_{n-s})$. Therefore

$$H(C) = -\sum_{s=0}^{n} \binom{n}{s}\, \frac{p^s q^{n-s} + p^{n-s} q^s}{2}\, \log \frac{p^s q^{n-s} + p^{n-s} q^s}{2}$$

and so

$$H(K \mid C) = 1 + H(P) - H(C) = \sum_{s=0}^{n} \binom{n}{s} p^s q^{n-s} \log\Bigl(1 + \bigl(\tfrac{q}{p}\bigr)^{2s-n}\Bigr),$$

where $q = 1 - p$.
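As a quick sanity check on the formula as reconstructed above, the following Python snippet (my own) computes $H(K \mid C)$ both directly, via $1 + H(P) - H(C)$, and from the summand-by-summand expression; the two agree to floating-point accuracy.

```python
# Check that 1 + H(P) - H(C) agrees with the summand-by-summand formula for
# the two-key bit-flip cipher (plaintext bits independently 0 with prob. p).
from math import comb, log2

def equivocation_direct(n, p):
    """H(K | C) = H(K) + H(P) - H(C), with H(K) = 1 for two equiprobable keys."""
    q = 1 - p
    HP = n * (-(p * log2(p) + q * log2(q)))
    HC = 0.0
    for s in range(n + 1):
        prob = (p**s * q**(n - s) + p**(n - s) * q**s) / 2   # one ciphertext string with s zeros
        HC += -comb(n, s) * prob * log2(prob)
    return 1 + HP - HC

def equivocation_formula(n, p):
    """The same quantity, summand by summand."""
    q = 1 - p
    return sum(comb(n, s) * p**s * q**(n - s) * log2(1 + (q / p)**(2 * s - n))
               for s in range(n + 1))

for n in (5, 20, 50):
    print(n, round(equivocation_direct(n, 0.7), 6), round(equivocation_formula(n, 0.7), 6))
```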
The graph on page 690 of Shannon’s paper shows $H(K \mid C)$ against $n$ for two values of $p$. He does not attempt to analyse the formula any further, remarking ‘yet already the formulas are so involved as to be nearly useless’. I’m not convinced this is the case, but the analysis is certainly fiddly.
When $p = \frac{1}{2}$ Shannon’s formula gives $H(K \mid C) = 1$, as expected. We may therefore reduce to the case when $p > \frac{1}{2}$ (and so $q < \frac{1}{2}$). Take . The summand for is
Since , the ratio of the summands for and is
.
When we can use the inequalities
for to bound the ratio below by
Therefore the summands in Shannon’s formula get exponentially small, at a rate approximately as increases from . Their contribution can be bounded by summing a geometric series and is of the same order as the middle term. (This does not require the contribution from the binomial coefficient, which is initially small, but eventually dominates.)
Going the other way, the summands for and is
which can be rewritten as
It is useful to set . If the final fraction were then the maximum would occur when , i.e. when . Numerical tests suggest this gives about the right location of the maximum when large (and so the first fraction can be neglected) and near to . For of the same order of the first fraction is dominant and gives exponential decay. It is therefore not so surprising that the middle term gives a reasonable lower bound for the entire series, namely
for some constant . The graph below shows and the middle summand for with colours red, green, blue, black.
It seems possible that the lower bound is, up to a multiplicative constant, also an upper bound. However the argument above will show at most that .
Upper bound
A slightly weaker upper bound follows from tail estimates for binomial probabilities. (Update. There might be a stronger result using the Central Limit Theorem.) Let $X$ be distributed as $\mathrm{Bin}(n, p)$, so $\mathbb{E}[X] = np$. By Hoeffding’s inequality,

$$\mathbb{P}[X \le np - t] \le \mathrm{e}^{-2t^2/n}.$$
The argument by exponential decay shows that the contribution to from the summands for is of the same order as the middle term. Using we get
as an upper bound for the remaining terms. By a standard trick, related to the formula for the expectation of a random variable taking values in , we have
Take in the version of Hoeffding’s inequality to get
Thus the upper bounds for become exponentially smaller as we decrease from by as much as , for any . By summing a geometric series, as before, we get
for some constant . The neglected contributions are of the order of the middle term, so bounded above by . Therefore
for some further constant .
Random ciphers and unicity distance
A selection effect
Imagine a hypothetical world where families have either no children, an only child, or two twins, with probabilities $\frac{1}{4}$, $\frac{1}{2}$, $\frac{1}{4}$. The mean number of children per family is therefore $1$. In an attempt to confirm this empirically, a researcher goes to a primary school and asks each child to state his or her number of siblings. Twin-families are half as frequent as only-families, but send two representatives to the school rather than one: these effects cancel, so the researcher observes equal numbers of children reporting $0$ and $1$ siblings. (Families with no children are, of course, never sampled.) The estimate for the mean number of children is therefore the inflated $\frac{3}{2}$.
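A quick simulation (assuming, as above, the family probabilities $\frac{1}{4}$, $\frac{1}{2}$, $\frac{1}{4}$) shows the selection effect directly.

```python
# Selection effect: sampling families gives mean about 1, but sampling children
# at school (one report per child) inflates the estimate to about 3/2.
import random

random.seed(1)
families = random.choices([0, 1, 2], weights=[1, 2, 1], k=100_000)
print("mean children per family:", sum(families) / len(families))            # about 1.0

# the survey sees each family once per child: size-2 families twice, size-0 never
children_view = [size for size in families for _ in range(size)]
print("school-survey estimate:  ", sum(children_view) / len(children_view))  # about 1.5
```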
The random cipher
Shannon’s proof of the Noisy Coding Theorem for the memoryless binary channel is a mathematical gem; his chief insight was that a random binary code of suitable size can (with high probability) be used as part of a coding scheme that achieves the channel capacity. In his 1949 paper he considers the analogous random cipher.
Let $A$ be the Roman alphabet. Let $\mathcal{P} = \mathcal{C} = A^n$ and let $P$ and $C$ be the random variables recording the plaintext and ciphertext, respectively. We suppose that the message is chosen uniformly at random from those plaintexts that make good sense in English (once spaces are inserted). In yet another fascinating paper Shannon estimated that the per-character redundancy of English, $D$ say, is between and bits. Thus $H(P) = n(\log 26 - D)$ and the number of plausible plaintexts is $2^{n(\log 26 - D)}$. Let $s = 2^{n(\log 26 - D)}$ denote this number.

Fix $k \in \mathbb{N}$. We aim to construct a random cipher with exactly $k$ keys. Each key will be chosen with equal probability. A quick reading of (3) on page 691 of Shannon’s paper may suggest a too naive construction: for each ciphertext, we choose $k$ plaintexts uniformly at random from $A^n$ to be its decryptions. (Some plaintexts may be chosen multiple times, meaning two or more keys decrypt the ciphertext to the same plaintext.) However if we pick the same first plaintext for different ciphertexts $y$ and $y'$, then it is ambiguous whether to encrypt it as $y$ or as $y'$ under the first key. Moreover some plaintexts may never be chosen, giving them no encryption under the first key. Instead, it seems to work best to choose, for each of the $k$ keys, a random bijection $A^n \rightarrow A^n$ to be the encryption function for that key. This also arrives at the situation required by Shannon in (3).

Let $K_y$ be the number of keys giving plausible decryptions of the ciphertext $y$. (The number of keys may be more than the number of plausible decryptions if there are two keys decrypting $y$ to the same plaintext.) Thus $K_y$ is a random quantity, where the randomness comes from the choices made in the construction of the random cipher. If $y$ is chosen uniformly at random from $A^n$ then, for each key, there is a probability $\pi = s/26^n$ that the decrypt of $y$ under this key is plausible. Therefore

$$\mathbb{E}[K_y] = k\pi$$

and $K_y$ is distributed binomially as $\mathrm{Bin}(k, \pi)$.

However this is not the distribution of $K_C$: since $C$ is the encryption of a plausible plaintext, ciphertexts $y$ with a high $K_y$ are more frequent, while, as in the family example, ciphertexts $y$ such that $K_y = 0$ are never seen at all.
Lemma 5. $\mathbb{P}[K_C = m] = \dfrac{m\, \mathbb{P}[K_y = m]}{k\pi}$ for each $m \ge 1$, where $y$ denotes a ciphertext chosen uniformly at random from $A^n$.

Proof. Let $D_y$ be the multiset of the plausible plaintexts that are decryptions of $y$. (Alternatively, and maybe more clearly, we could define $D_y$ to be the set of pairs $(l, x)$ where $l$ is a key decrypting $y$ to a plausible plaintext $x$.) Conditioning on the event that $K_y = m$, each of the $m$ pairs in $D_y$ is the pair of key and plaintext actually used with probability $\frac{1}{ks}$, so $\mathbb{P}[C = y \mid K_y = m] = \frac{m}{ks}$. Since, by symmetry, $\mathbb{P}[K_y = m]$ does not depend on $y$, we have

$$\mathbb{P}[K_C = m] = \sum_{y \in A^n} \mathbb{P}[K_y = m]\,\frac{m}{ks} = 26^n\, \mathbb{P}[K_y = m]\,\frac{m}{ks} = \frac{m\, \mathbb{P}[K_y = m]}{k\pi}$$

as required.
Corollary 6. The random variable $K_C - 1$ is distributed as $\mathrm{Bin}(k - 1, \pi)$.

Proof. By Lemma 5 and the identity $\frac{m}{k}\binom{k}{m} = \binom{k-1}{m-1}$ we have

$$\mathbb{P}[K_C = m] = \frac{m}{k\pi}\binom{k}{m}\pi^m(1-\pi)^{k-m} = \binom{k-1}{m-1}\pi^{m-1}(1-\pi)^{(k-1)-(m-1)}.$$

Hence $K_C - 1$ is distributed as $\mathrm{Bin}(k-1, \pi)$.
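Lemma 5 and Corollary 6 can also be checked by simulation. The sketch below (my own toy parameters, with a far smaller message space than $26^n$) estimates $\mathbb{E}[K_y]$ for a uniformly random ciphertext and $\mathbb{E}[K_C]$ for an actual ciphertext, and compares them with $k\pi$ and $1 + (k-1)\pi$.

```python
# Monte Carlo check: for a random cipher, K_y ~ Bin(k, pi) for a uniform
# ciphertext, while for an actual ciphertext C we expect K_C - 1 ~ Bin(k-1, pi),
# so E[K_C] = 1 + (k - 1) * pi.
import random

random.seed(2)
N, s, k = 200, 40, 10                  # |message space|, number of plausible messages, keys
pi = s / N
plausible = set(range(s))              # take the plausible plaintexts to be 0, ..., s - 1

def trial():
    ciphers = [random.sample(range(N), N) for _ in range(k)]    # ciphers[l][x] = e_l(x)
    dec = [{y: x for x, y in enumerate(c)} for c in ciphers]    # decryption maps
    y_uniform = random.randrange(N)                             # a uniformly random ciphertext
    K_y = sum(dec[l][y_uniform] in plausible for l in range(k))
    x = random.choice(list(plausible))                          # an actual, plausible message
    key = random.randrange(k)
    y_actual = ciphers[key][x]
    K_C = sum(dec[l][y_actual] in plausible for l in range(k))
    return K_y, K_C

trials = [trial() for _ in range(5_000)]
print("E[K_y] approx", sum(t[0] for t in trials) / len(trials), " vs k*pi =", k * pi)
print("E[K_C] approx", sum(t[1] for t in trials) / len(trials), " vs 1+(k-1)*pi =", 1 + (k - 1) * pi)
```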
It feels like there should be a quick direct proof of the corollary, along the lines ‘we know $C$ has one plausible decrypt; each of the remaining $k - 1$ decrypts is plausible with probability $\pi$, hence …’. But this seems dangerously close to ‘we know two fair coin flips gave at least one head; the other flip is a head with probability $\frac{1}{2}$, hence …’, which gives the wrong answer. The difference is that the plausible decrypt of $C$ comes with a known key, whereas the ‘at least one head’ could be either flip. Given the subtle nature of selection effects and conditional probability, I prefer the calculation in the lemma.
Shannon’s paper replaces the lemma with the comment [with one notation change] ‘The probability of such a cryptogram [our event $K_C = m$] is , since it can be produced by $m$ keys from high probability messages [our plausible plaintexts] each with probability .’ I cannot follow this: in particular cannot be a probability, since it is far greater than $1$.
Given that $K_C = m$, the entropy in the key is $\log m$. Therefore, going back to the lemma, we have

$$H(K \mid C) = \sum_{m=1}^{k} \mathbb{P}[K_C = m] \log m = \sum_{m=1}^{k} \frac{m\,\mathbb{P}[K_y = m]}{k\pi} \log m.$$

Shannon argues that if $k\pi$ is large compared to $1$ then $\log m$ is almost constant for $m$ near the mean $k\pi$ of $K_y$, and so the expected value can be approximated by

$$H(K \mid C) \approx \log k\pi.$$

Observe that $\pi = s/26^n = 2^{-Dn}$, where $D$ is the per-character redundancy of English. Therefore Shannon’s approximation becomes

$$H(K \mid C) \approx \log k - Dn.$$
When $Dn$ is large compared to $\log k$ (so $k\pi$ is small), Shannon uses a Poisson approximation. As an alternative, we argue from Corollary 6. Let $X$ be distributed as $\mathrm{Bin}(k-1, \pi)$. We have

$$H(K \mid C) = \mathbb{E}[\log(1 + X)] \le \log\bigl(1 + \mathbb{E}[X]\bigr) = \log\bigl(1 + (k-1)\pi\bigr) \le \frac{(k-1)\,2^{-Dn}}{\ln 2}.$$
The graph of $H(K \mid C)$ against $n$ is therefore as sketched below.

The quantity $\log k / D$ is known as the unicity distance of the cipher. Roughly one expects that, after observing $\log k / D$ characters of the ciphertext, the key will be substantially known.
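For a concrete, standard back-of-envelope example (my own numbers, with the redundancy $D$ an assumption): a simple substitution cipher has $26!$ keys, so $H(K) = \log 26! \approx 88.4$ bits, and taking $D = 3.2$ bits per character gives a unicity distance of roughly $28$ characters.

```python
# Unicity distance sketch: H(K | C) ~ H(K) - D n hits zero at n = H(K) / D.
# Example: simple substitution cipher (26! keys) with an assumed D = 3.2.
import math

H_K = math.log2(math.factorial(26))   # key entropy, about 88.4 bits
D = 3.2                               # assumed redundancy of English, bits per character
print("unicity distance ~", H_K / D, "characters")
```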
A more general random cipher?
It would be interesting to generalize the random cipher to the case when $|\mathcal{C}| > |\mathcal{P}|$ and we pick random injections.