Shannon’s 1949 paper

Wildon's Weblog 2018-03-12

In 1949 Claude Shannon published Communication theory of secrecy systems, Bell System Technical Journal 28, 656–715. The members of Bell Labs around this time included Richard Hamming, the three inventors of the transistor, and Harry Nyquist of the Nyquist bound. Its technical journal published many seminal papers, including Shannon’s 1948 paper A mathematical theory of communication, defining entropy, and Hamming’s 1950 paper Error detecting and error correcting codes, defining Hamming distance and essentially inventing the modern `adversarial’ setting for coding theory.

Incidentally, the story goes that Shannon asked Von Neumann what he should call his new measure of information content and Von Neumann replied

You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, no one really knows what entropy really is, so in a debate you will always have the advantage.

Like Hamming’s paper, Shannon’s two papers still amply repay reading today. One idea introduced in the 1949 paper is ‘perfect secrecy’.

Perfect secrecy

Consider a cryptosystem with plaintexts \mathcal{P} and ciphertexts \mathcal{C}, with encryption functions e_k : \mathcal{P} \rightarrow \mathcal{C} parametrised by a set \mathcal{K} of keys.

Shannon supposes that there is a probability distribution on the plaintexts, assigning an a priori probability p_x to each x \in \mathcal{P}. There is also an independent probability distribution on the keys. These interact in a rich way. For instance, \mathbb{P}[Y=y|X=x] = \mathbb{P}[e_K(x) = y] is a sum of key probabilities, while \mathbb{P}[Y=y|K=k] = \mathbb{P}[e_k(X) = y] is either zero, or the probability attached to the unique plaintext p_{e_k^{-1}(y)}. Other probabilities, such as \mathbb{P}[X=x|Y=y] can be computed from these using Bayes’ Theorem type arguments.

Suppose we observe a ciphertext y \in \mathcal{C}: what, if anything, do we learn about the corresponding plaintext x \in \mathcal{P}? Shannon defines the a posteriori probability \mathbb{P}[X = x | Y = y] to be the conditional probability of the plaintext x \in \mathcal{P} given that we observed the ciphertext y \in \mathcal{C}; the system then has perfect secrecy if, for any a priori probability distribution p_x, we have \mathbb{P}[X = x | Y = y] = p_x for all x \in \mathcal{P}, and all y \in \mathcal{C}. (This assumes implicitly that \mathbb{P}[Y = y] > 0 for every y \in \mathcal{C}.)

Shannon proves in his Theorem 6 that a necessary and sufficient condition for perfect secrecy is that \mathbb{P}[Y = y|X = x] = \mathbb{P}[Y = y] for all x \in \mathcal{P} and y \in \mathcal{C}.

The proof is a short application of Bayes’ Theorem: since \mathbb{P}[X = x | Y = y] = \mathbb{P}[Y = y|X = x]p_x/\mathbb{P}[Y = y], and since we may choose p_x \not=0 (which is necessary anyway for the conditional probability to be well-defined), we have p_x = \mathbb{P}[X = x | Y = y] if and only if \mathbb{P}[Y = y|X = x] = \mathbb{P}[Y = y].

Corollary. In a system with perfect secrecy, |\mathcal{K}| \ge |\mathcal{C}|. Moreover, if equality holds then every key must be used with equal probability 1/|\mathcal{K}| and for each x \in \mathcal{P} and y \in \mathcal{C} there exists a unique k \in \mathcal{K} such that e_k(x) = y.

Proof. Fix a plaintext x \in \mathcal{P}. We claim that for each y \in \mathcal{C} there exists a key k such that e_k(x) = y. Indeed, if y_{\mathrm{bad}} \in \mathcal{C} is never an encryption of x then, for any choice of a priori probabilities that gives some probability to x, we have \mathbb{P}[Y = y_{\mathrm{bad}}|X = x] = 0, so by Shannon’s Theorem 6, \mathbb{P}[Y = y_\mathrm{bad}] = 0. But, by the implicit assumption, there is a non-zero chance of observing y_\mathrm{bad}, a contradiction.

Since x has at most |\mathcal{K}| different encryptions, the claim implies that |\mathcal{K}| \ge |\mathcal{C}|. Moreover if equality holds then for every y \in \mathcal{C} there exists a unique k \in \mathcal{K} such that e_k(x) = y. Since \mathbb{P}[Y=y | X =x ] = \mathbb{P}[K = k], where k is this unique key, the conclusion of Theorem 6, that \mathbb{P}[Y = y|X = x] is constant for y \in \mathcal{C}, then implies that each key is equiprobable. \Box
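
To make this concrete, here is a minimal Python sketch of my own (not from the paper) checking Shannon’s criterion and the equality case of the corollary for the shift cipher on a single letter of a 26-letter alphabet: with equiprobable keys, the a posteriori distribution agrees with an arbitrary a priori distribution.

```python
from collections import defaultdict
from fractions import Fraction
import random

# Shift cipher on one letter: P = C = K = Z_26 and e_k(x) = x + k mod 26.
# With equiprobable keys this realises the equality case |K| = |C| above.
N = 26
random.seed(0)
weights = [random.randint(1, 10) for _ in range(N)]   # an arbitrary a priori distribution
p_x = {x: Fraction(w, sum(weights)) for x, w in enumerate(weights)}
p_k = {k: Fraction(1, N) for k in range(N)}           # equiprobable keys

p_xy = defaultdict(Fraction)                          # joint law of (X, Y)
for x in range(N):
    for k in range(N):
        p_xy[(x, (x + k) % N)] += p_x[x] * p_k[k]
p_y = defaultdict(Fraction)
for (x, y), pr in p_xy.items():
    p_y[y] += pr

for x in range(N):
    for y in range(N):
        assert p_xy[(x, y)] / p_x[x] == p_y[y]        # Theorem 6: P[Y=y|X=x] = P[Y=y]
        assert p_xy[(x, y)] / p_y[y] == p_x[x]        # perfect secrecy: P[X=x|Y=y] = p_x
print("shift cipher with equiprobable keys has perfect secrecy")
```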

The ‘only if’ direction of Theorem 2.4 in Cryptography: theory and practice (3rd edition) by Douglas Stinson is the corollary above, but, according to my reading of pages 48 and 50, interpreted with a different definition of perfect secrecy, in which the a priori distribution is fixed as part of the cryptosystem. (This makes some sense: for example, in a cryptosystem encrypting English plaintexts, nonsensical strings should always have zero probability.) Unfortunately, Stinson’s change makes the result false. The diagram below shows a toy cryptosystem with two keys and two plaintexts.

Take p_w = 0 and p_x = 1. Then

\mathbb{P}[X = x | Y = y] = \mathbb{P}[X = x | Y = y'] = p_x = 1

and

\mathbb{P}[X = w | Y = y] = \mathbb{P}[X = w | Y = y'] = p_w = 0,

so the system has perfect secrecy, no matter what probabilities we assign to the keys. (Incidentally, setting p_{x} = 1 and \mathbb{P}[K = k] = 1 gives a cryptosystem where we always send x and observe the ciphertext y; the a posteriori probability of x is therefore the same as the a priori probability, so the system has perfect secrecy, even though the keys are not equiprobable. This shows that Shannon’s implicit assumption is not just a technicality required to make the conditional probabilities well-defined.)
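
This is easy to check by machine; the sketch below assumes the encryption table is the natural one suggested by the remarks above (an assumption on my part), namely e_k(x) = y, e_k(w) = y', e_{k'}(x) = y' and e_{k'}(w) = y.

```python
from fractions import Fraction

# Assumed encryption table: two keys, each a bijection, with e_k(x) = y.
encrypt = {('k', 'w'): "y'", ('k', 'x'): 'y',
           ("k'", 'w'): 'y', ("k'", 'x'): "y'"}
plaintexts, ciphertexts = ['w', 'x'], ['y', "y'"]

def a_posteriori(p_x, p_k):
    """P[X=x | Y=y] for every x and every y with P[Y=y] > 0."""
    joint = {(x, y): Fraction(0) for x in plaintexts for y in ciphertexts}
    p_y = {y: Fraction(0) for y in ciphertexts}
    for (k, x), y in encrypt.items():
        joint[(x, y)] += p_k[k] * p_x[x]
        p_y[y] += p_k[k] * p_x[x]
    return {(x, y): joint[(x, y)] / p_y[y]
            for x in plaintexts for y in ciphertexts if p_y[y] > 0}

# Stinson-style fixed a priori distribution with p_w = 0, p_x = 1,
# and keys that are far from equiprobable.
p_x = {'w': Fraction(0), 'x': Fraction(1)}
p_k = {'k': Fraction(9, 10), "k'": Fraction(1, 10)}
assert all(pr == p_x[x] for (x, y), pr in a_posteriori(p_x, p_k).items())
print("a posteriori = a priori for every observable ciphertext")
```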

The error in the proof of Theorem 2.4 comes in the application of Bayes’ Law, where it is implicitly assumed that p_x \not= 0. This shows that the extra layer of quantification in Shannon’s paper is not a mere technicality. Given the difficulties students have with nested quantifiers, I’m inclined to keep Stinson’s definition, and fix the problem by assuming p_x\not =0 for each x \in \mathcal{P}. One can get a more general, but somewhat technical result, by taking an arbitrary cryptosystem and applying Shannon’s Theorem to the subsystem where \mathcal{P} is the set of plaintexts with positive probability, and \mathcal{C} and \mathcal{K} are unchanged. (To be fair to Stinson, he observes before the theorem that plaintexts x such that p_x = 0 are never an obstacle to perfect secrecy, so clearly he was aware of the issue. He also assumes, as we have done, that \mathbb{P}[Y = y] \not=0 for each y \in \mathcal{C}, but this is something else.)

Incidentally, there is a subtle English trap in Shannon’s paper: he says, quite correctly, that when |\mathcal{K}| = |\mathcal{C}|, ‘it is possible to obtain perfect secrecy with only this number of keys’. Here `with only’ does not mean the same as ‘only with’.

Example from permutation groups

Given a finite group G acting transitively on a set \Omega, we obtain a cryptosystem with plaintexts and ciphertexts \Omega and encryption maps \alpha \mapsto \alpha g indexed by the elements of G. For which probability distributions on G does this cryptosystem have perfect secrecy?

Let H_\beta denote the point stabiliser of \beta \in \Omega. Since

\mathbb{P}[Y = \beta | X = \alpha] = \mathbb{P}[K \in g_{\alpha \beta} H_\beta],

where g_{\alpha \beta} is any element such that \alpha g_{\alpha\beta}= \beta, the cryptosystem has perfect secrecy if and only if, for each \beta \in \Omega,

\mathbb{P}[K \in g_{\alpha\beta} H_\beta]

is constant as \alpha varies over \Omega. Call this condition (\star).

Lemma 2. Suppose that G has a regular normal subgroup N. Any probability distribution constant on the cosets of N satisfies (\star).

Proof. Since G = N \rtimes H_\beta, the subgroup H_\beta meets each coset of N in a unique element. The same holds for each left coset gH_\beta of H_\beta, and in particular for g_{\alpha\beta} H_\beta. Therefore

\sum_{k \in g_{\alpha \beta} H_\beta} \mathbb{P}[k] = \sum_{k \in H_\beta} \mathbb{P}[k]

is independent of \alpha. \Box

In the case of the affine cipher, the converse also holds.

Theorem 3. Let p be a prime and let G be the affine group on \mathbb{F}_p with normal translation subgroup N. A probability distribution on G satisfies (\star) if and only if it is constant on each of the cosets of N.

Proof. Let r generate the multiplicative group of integers modulo p. Then G is a semidirect product N \rtimes H where N = \langle t \rangle is the regular subgroup generated by translation by +1 and the point stabiliser H = \langle h \rangle is generated by multiplication by r.

A probability distribution on G can be regarded as an element of the group algebra \mathbb{C}G. The probability distributions satisfying (\star) correspond to those \sum_{k \in G} c_k k \in \mathbb{C}G such that, for each \beta \in \mathbb{F}_p,

\sum_{k \in t^{-\alpha + \beta} H^{t^\beta}} c_k

is constant as \alpha varies over \mathbb{F}_p. Thus if U is the corresponding subspace of \mathbb{C}G, then U is invariant under left-multiplication by N and right-conjugation by N. But since kt = tk^t for all k \in G, right-multiplication by t is the composition of left-multiplication by t and right-conjugation by t; taken together, these conditions are therefore equivalent to invariance under left- and right-multiplication by N. It is therefore natural to ask for the decomposition of \mathbb{C}G as a representation of N \times N, where N \times N acts by

k \cdot (g,g') = g^{-1}k g'.

Since G/N is abelian, (Nh^\beta)t = (Nh^\beta)^t = Nh^\beta for all \beta. Hence \mathbb{C}G = \bigoplus_{\beta = 0}^{p-2} Nh^\beta, identifying each coset Nh^\beta with its span in \mathbb{C}G. The calculation behind this observation is

h^\beta t = h^\beta t h^{-\beta} h^\beta = t^{s^\beta} h^\beta

where s = r^{-1} in \mathbb{F}_p. This shows that h^\beta \in Nh^\beta is stabilised by (t^{s^\beta}, t) \in N \times N and so Nh^\beta, regarded as a representation of N \times N, is induced from the subgroup of N \times N generated by (t^{s^\beta}, t). By Frobenius Reciprocity, Nh^\beta decomposes as a direct sum of a trivial representation of N \times N, spanned by

h^\beta + th^\beta + \cdots + t^{p-1}h^\beta,

and a further p-1 non-trivial representations, each with kernel \langle (t^{s^\beta}, t) \rangle. Critically, all (p-1)^2 of these representations are non-isomorphic.

By Lemma 2, U contains all p-1 trivial representations. Writing T for their direct sum, we have \mathbb{C}G = T \oplus C for a unique complement C that decomposes uniquely as a sum of the non-trivial representations in the previous paragraph. Therefore if U properly contains T then U contains a non-trivial summand of some Nh^\beta, spanned by some u \in \mathbb{C}G of the form

b_0 h^\beta + b_1 t h^\beta + \cdots + b_{p-1} t^{p-1} h^\beta .

The support of u is the coset Nh^\beta, which meets each coset of each point stabiliser in a unique element. The sums appearing in condition (\star) therefore pick out the individual coefficients b_i, and as \alpha varies they run over all of them. Hence all the b_i must be equal, a contradiction. \Box

This should generalize to an arbitrary Frobenius group with abelian kernel, and probably to other permutation groups having a regular normal subgroup.

The key equivocation given a ciphertext

In practice, we might care more about what an observed ciphertext tells us about the key. This question is considered in Part B of Shannon’s paper. Let X, Y and K be random variables representing the plaintext, ciphertext and key, respectively. We assume that X and K are independent. (This is easily motivated: for instance, it holds if the key is chosen before the plaintext, and the plaintext is not a message about the key.) The conditional entropy H(K | Y), defined by

\begin{aligned}H(K{}&{} | Y) \\ &= \sum_{y \in \mathcal{C}} \mathbb{P}[Y=y] H(K | Y = y) \\ &= \sum_{y \in \mathcal{C}}  \mathbb{P}[Y=y] \sum_{k \in \mathcal{K}} \mathbb{P}[K = k | Y = y] \log \frac{1}{\mathbb{P}[K = k | Y = y]} \end{aligned}

represents our expected uncertainty, or `equivocation’ to use Shannon’s term, in the key, after we have observed a ciphertext. (Throughout \log denotes logarithm in base 2.)

It is intuitive that H(K | Y) \ge H(X | Y): since a given plaintext x can be encrypted to a given ciphertext y by several different keys, given a ciphertext, we are at least as uncertain about the key as we are about the plaintext. The proof is yet another exercise in Bayes’ Theorem, which I’ll include because it sheds some light on the probability model.

Lemma 4. H(K | Y) \ge H(X | Y).

Proof. Fix y \in \mathcal{C}. Let \mathcal{P}_y be the set of plaintexts that encrypt, under some key, to y. (Thus \mathcal{P}_y is the union of the inverse images of the singleton set \{y\} under all encryption maps.) For x \in \mathcal{P}, let \mathcal{K}_{xy} be the set (possibly empty) of keys k such that e_k(x) = y. By Bayes’ Theorem,

\mathbb{P}[X=x|Y=y] = \displaystyle \frac{\mathbb{P}[K \in \mathcal{K}_{xy}] p_x}{\mathbb{P}[Y=y]}.

Similarly

\mathbb{P}[K=k|Y=y] = \displaystyle \frac{\mathbb{P}[K=k] p_x}{\mathbb{P}[Y=y]}

if y = e_k(x) for some (necessarily unique) x \in \mathcal{P} and otherwise this probability is 0. Thus

\mathbb{P}[X=x|Y=y] = \sum_{k \in \mathcal{K}_{xy}} \mathbb{P}[K=k|Y=y]

and so the probability distribution on X (conditioned on Y=y) is a coarsening of the probability distribution on K (conditioned on Y=y). Therefore H(K | Y=y) \ge H(X | Y = y) and the lemma now follows by summing over y, using the definition of conditional entropy above. \Box

A related formula is now a textbook staple. If we know the key K then we know X if and only if we know Y. Therefore the joint entropies H(K, X) and H(K, Y) agree and

H(K | Y) + H(Y) = H(K,Y) = H(K,X) = H(K) + H(X)

by the independence assumption. Hence

H(K | Y) = H(K) + H(X) - H(Y).
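
This identity is easily confirmed numerically; here is a quick sketch of my own for a small cryptosystem with randomly chosen encryption bijections and distributions.

```python
import math
import random

random.seed(1)
P = list(range(5))
keys = [random.sample(P, len(P)) for _ in range(3)]          # e_k(x) = keys[k][x], a bijection

def normalise(ws):
    total = sum(ws)
    return [w / total for w in ws]

def H(dist):
    return -sum(q * math.log2(q) for q in dist if q > 0)

p_x = normalise([random.random() for _ in P])                # X and K independent
p_k = normalise([random.random() for _ in keys])

p_y = [0.0] * len(P)
p_ky = {}
for k, e_k in enumerate(keys):
    for x in P:
        pr, y = p_k[k] * p_x[x], e_k[x]
        p_y[y] += pr
        p_ky[(k, y)] = p_ky.get((k, y), 0.0) + pr

H_K_given_Y = sum(pr * math.log2(p_y[y] / pr) for (k, y), pr in p_ky.items())
print(H_K_given_Y, H(p_k) + H(p_x) - H(p_y))                 # the two values agree
```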

Shannon considers a toy cryptosystem with keys \mathcal{K} = \{k,k'\} and \mathcal{P} = \mathcal{C} = \{0,1\}^n, in which a plaintext is encrypted bit-by-bit, as itself if K = k and flipping each bit if K = k'. Suppose that the keys are equiprobable, and that each plaintext bit is 0 independently with probability p, so the probability of a plaintext having exactly s zeros is \binom{n}{s} p^s (1-p)^{n-s}. Denoting this quantity by p_s we have H(X) = -\sum_{s=0}^n p_s \log p_s. The probability that the ciphertext Y has exactly s zeros is (p_s + p_{n-s})/2. Therefore

\begin{aligned} H(Y) &= -\sum_{s=0}^n \frac{p_s + p_{n-s}}{2} \log \frac{p_s + p_{n-s}}{2} \\ &= -\sum_{s=0}^n p_s \log \frac{p_s + p_{n-s}}{2}. \end{aligned}

and so

\begin{aligned} H(K{}&{} |Y) \\ &= 1 - \sum_{s=0}^n p_s \log \frac{p_s}{(p_s+p_{n-s})/2} \\ &= -\sum_{s=0}^n p_s \log \frac{p_s}{p_s + p_{n-s}} \\ &= -\sum_{s=0}^n \binom{n}{s} p^s (1-p)^{n-s} \log \frac{p^s (1-p)^{n-s}}{p^s (1-p)^{n-s} + p^{n-s}(1-p)^s} \\ &= \sum_{s=0}^n \binom{n}{s} p^s (1-p)^{n-s} \log \bigl(1 + p^{n-2s}(1-p)^{2s-n} \bigr) \\ &= \sum_{s=0}^n \binom{n}{s} p^s (1-p)^{n-s} \log \bigl(1 + q^{2s-n} \bigr). \end{aligned}

where q = (1-p)/p.
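
It is reassuring to check this formula against a brute-force computation of H(K|Y) over \{0,1\}^n; a sketch of my own, for a few small values of n and p:

```python
import math
from itertools import product
from math import comb

def H_formula(n, p):
    """The closed form above: sum_s C(n,s) p^s (1-p)^(n-s) log2(1 + q^(2s-n))."""
    q = (1 - p) / p
    return sum(comb(n, s) * p**s * (1 - p)**(n - s) * math.log2(1 + q**(2 * s - n))
               for s in range(n + 1))

def H_brute_force(n, p):
    """H(K|Y) computed directly from the joint law of (K, Y)."""
    total = 0.0
    for y in product([0, 1], repeat=n):
        # K = k leaves the plaintext alone, K = k' flips every bit; each bit of
        # the plaintext is 0 with probability p and the two keys are equiprobable.
        pr_k = 0.5 * math.prod(p if b == 0 else 1 - p for b in y)
        pr_k2 = 0.5 * math.prod(p if b == 1 else 1 - p for b in y)
        p_y = pr_k + pr_k2
        total += sum(pr * math.log2(p_y / pr) for pr in (pr_k, pr_k2) if pr > 0)
    return total

for n, p in [(4, 0.7), (6, 0.55), (8, 0.9)]:
    print(n, p, round(H_formula(n, p), 10), round(H_brute_force(n, p), 10))
```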

The graph on page 690 of Shannon’s paper shows H(K | Y) against n for two values of p. He does not attempt to analyse the formula any further, remarking `yet already the formulas are so involved as to be nearly useless’. I’m not convinced this is the case, but the analysis is certainly fiddly.

When p = \frac{1}{2} Shannon’s formula gives H(K | Y) = 1, as expected. We may therefore reduce to the case when p > 1/2, and so q < 1. Take n = 2m. The summand for m+u is

\binom{2m}{m+u} p^{m+u}(1-p)^{m-u} \log (1 + q^{2u}).

Since \binom{2m}{m+u}(m-u) = \binom{2m}{m+u+1}(m+u+1), the ratio of the summands for m+u and m+u+1 is

\displaystyle \frac{m+u+1}{m-u} \frac{1-p}{p} \frac{\log (1 + q^{2u})}{\log (1 + q^{2u+2})}.

When u \ge 0 we can use the inequalities

x \ge \ln (1+x) \ge x - x^2/2

for x \in [0,1] (stated for the natural logarithm; only the ratio of the two logarithms is needed, and that is independent of the base) to bound the ratio below by

\displaystyle \frac{m+u+1}{m-u} \Bigl( \frac{1}{q} -  q^{2u-1}/2 \Bigr).

Therefore the summands in Shannon’s formula get exponentially small, at a rate approximately 1/q > 1, as s increases from n/2. Their contribution can be bounded by summing a geometric series, and is of the same order as the middle term. (This does not require the help of the binomial coefficient, whose effect is initially small but eventually dominates.)

Going the other way, the ratio of the summands for m-t and m-t-1 is

\displaystyle \frac{m+t+1}{m-t} \frac{p}{1-p} \frac{\log (1 + q^{-2t})}{\log (1 + q^{-2t-2})}

which can be rewritten as

\displaystyle \frac{m+t+1}{m-t} \frac{1}{q} \frac{2t \log \frac{1}{q} + \log (1+q^{2t})}{(2t+2)\log \frac{1}{q} + \log (1+q^{2t+2})}.

It is useful to set p = 1/2 + \rho. If the final fraction were t/(t+1) then the maximum would occur when t/(t+1) = q, i.e. when t = q/(1-q) = (1-p)/(2p-1) = (1/2 -\rho)/2\rho. Numerical tests suggest this gives about the right location of the maximum when m is large (and so the first fraction can be neglected) and p is near to 1/2. For t of the same order as m the first fraction is dominant and gives exponential decay. It is therefore not so surprising that the middle term gives a reasonable lower bound for the entire series, namely

H(K | Y) \ge \frac{A}{\sqrt{n}} \exp (-2 \rho^2 n )

for some constant A. The graph below shows H(K|Y) and the middle summand for p = 1/2 + 1/20, 3/5, 2/3, 4/5 with colours red, green, blue, black.
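
The comparison between H(K|Y) and the middle summand is easy to redo numerically; a sketch, with my own choice of sample values of p and n:

```python
import math
from math import comb

def H_K_given_Y(n, p):
    q = (1 - p) / p
    return sum(comb(n, s) * p**s * (1 - p)**(n - s) * math.log2(1 + q**(2 * s - n))
               for s in range(n + 1))

def middle_summand(n, p):
    m = n // 2
    return comb(n, m) * (p * (1 - p))**m         # log2(1 + q^0) = 1

for p in [0.55, 0.6, 2 / 3, 0.8]:
    for n in [20, 40, 80]:
        exact, mid = H_K_given_Y(n, p), middle_summand(n, p)
        print(f"p={p:.3f} n={n:3d}  H(K|Y)={exact:.3e}  middle={mid:.3e}  ratio={exact / mid:.2f}")
```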

It seems possible that the lower bound is, up to a multiplicative constant, also an upper bound. However the argument above will show at most that H(K | Y) \le A'\sqrt{n} \exp(-2 \rho^2 n).

Upper bound

A slightly weaker upper bound follows from tail estimates for binomial probabilities. (Update. There might be a stronger result using the Central Limit Theorem.) Let S be distributed as \mathrm{Bin}(p,2m), so \mathbb{P}[S= s] = \binom{2m}{s}p^s (1-p)^{2m-s}. By Hoeffding’s inequality,

\mathbb{P}[S - 2pm \le -2\epsilon m] \le \mathrm{e}^{-4 \epsilon^2 m}.

The argument by exponential decay shows that the contribution to H(K | Y) from the summands for s > m is of the same order as the middle term. Using \log(1+ q^{2s-n}) \le 1 + (n-2s) \log (1/q) we get

\mathbb{P}[S \le n/2] + 2\log (1/q) \sum_{s=0}^m (m-s) \mathbb{P}[S = s]

as an upper bound for the remaining terms. By a standard trick, related to the formula \mathbb{E}[X] = \mathbb{P}[X \ge 1] + \mathbb{P}[X \ge 2] + \cdots for the expectation of a random variable taking values in \mathbb{N}_0, we have

\sum_{s=0}^m (m-s)\mathbb{P}[S=s] = \sum_{s=0}^{m-1} \mathbb{P}[S\le s].

Take \epsilon = p - 1/2 + \alpha/2 in the version of Hoeffding’s inequality to get

\begin{aligned} \mathbb{P}[S \le m(1-\alpha)] &\le \mathrm{e}^{-4 (p-1/2+\alpha/2)^2 m} \\ &= \mathrm{e}^{-4 (p-1/2)^2 m} \mathrm{e}^{-4 (p-1/2) \alpha m - \alpha^2 m} \\ &\le \mathrm{e}^{-4 \rho^2 m} \mathrm{e}^{-4 \rho \alpha m}. \end{aligned}

Thus the upper bounds for \mathbb{P}[S \le s] become exponentially smaller as we decrease s from m by as much as \alpha m, for any \alpha > 0. By summing a geometric series, as before, we get

\sum_{s=0}^{m-1} P[S\le s] \le Bm \mathrm{e}^{-4 \rho^2 m}

for some constant B. The neglected contributions are of the order of the middle term, so bounded above by \mathrm{e}^{-4 \rho^2 m}. Therefore

H(K | Y) \le Cm \mathrm{e}^{-4 \rho^2 m}

for some further constant C.
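
As a rough check on the order of this bound, one can tabulate the ratio of H(K | Y) to m\mathrm{e}^{-4\rho^2 m}; if the bound has the right order the ratio should remain bounded as m grows. A sketch of my own, for the sample value p = 0.6:

```python
import math
from math import comb

def H_K_given_Y(n, p):
    q = (1 - p) / p
    return sum(comb(n, s) * p**s * (1 - p)**(n - s) * math.log2(1 + q**(2 * s - n))
               for s in range(n + 1))

p = 0.6
rho = p - 0.5
for m in [10, 20, 40, 80, 160]:
    value, shape = H_K_given_Y(2 * m, p), m * math.exp(-4 * rho**2 * m)
    print(f"m={m:3d}  H(K|Y)={value:.3e}  m*exp(-4 rho^2 m)={shape:.3e}  ratio={value / shape:.4f}")
```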

Random ciphers and unicity distance

A selection effect

Imagine a hypothetical world where families have either no children, an only child, or two twins, with probabilities 1/4, 1/2, 1/4. The mean number of children per family is therefore 1. In an attempt to confirm this empirically, a researcher goes to a primary school and asks each child to state his or her number of siblings. Twin-families are half as frequent as only-families, but send two representatives to the school rather than one: these effects cancel, so the researcher observes equal numbers of children reporting 0 and 1 siblings. (Families with no children are, of course, never sampled.) The estimate for the mean number of children is therefore the inflated 3/2.
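
A short simulation (a sketch of my own) makes the bias vivid:

```python
import random

random.seed(0)
# Families have 0, 1 or 2 children with probabilities 1/4, 1/2, 1/4.
children = random.choices([0, 1, 2], weights=[1, 2, 1], k=100_000)
print("mean children per family:", sum(children) / len(children))      # about 1

# The researcher samples children, not families: a family with c children is
# heard from c times, and childless families are never heard from at all.
reports = [c - 1 for c in children for _ in range(c)]
mean_siblings = sum(reports) / len(reports)                             # about 1/2
print("researcher's estimate of the mean:", 1 + mean_siblings)          # about 3/2
```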

The random cipher

Shannon’s proof of the Noisy Coding Theorem for the memoryless binary channel is a mathematical gem; his chief insight was that a random binary code of suitable size can (with high probability) be used as part of a coding scheme that achieves the channel capacity. In his 1949 paper he considers the analogous random cipher.

Let \mathcal{A} = \{\mathrm{a}, \ldots, \mathrm{z}\} be the Roman alphabet. Let \mathcal{P}_n = \mathcal{C}_n = \mathcal{A}^n and let X_n and Y_n be the random variables recording the plaintext and ciphertext, respectively. We suppose that the message is chosen uniformly at random from those plaintexts that make good sense in English (once spaces are inserted). In yet another fascinating paper Shannon estimated that the per-character redundancy of English, R say, is between 3.2 and 3.7 bits. Thus H(X_n) = (\log_2 26 - R)n and the number of plausible plaintexts is S_n = 2^{(\log_2 26 - R)n}. Let T_n = |\mathcal{P}_n| = 26^n.

Fix t \in \mathbb{N}. We aim to construct a random cipher with exactly t keys. Each key will be chosen with equal probability. A quick reading of (3) on page 691 of Shannon’s paper may suggest a too naive construction: for each ciphertext, we choose t plaintexts uniformly at random from \mathcal{P}_n to be its decryptions. (Some plaintexts may be chosen multiple times, meaning two or more keys decrypt the ciphertext to the same plaintext.) However, if we pick the same x \in \mathcal{P}_n first for different ciphertexts y and y', then it is ambiguous whether to encrypt x as y or as y' under the first key. Moreover some plaintexts may never be chosen, giving them no encryption under the first key. Instead, it seems to work best to choose, for each of the t keys, a random bijection \mathcal{P}_n \rightarrow \mathcal{C}_n to be the encryption function for that key. This also arrives at the situation required by Shannon in (3).

Let g(y) be the number of keys giving plausible decryptions of the ciphertext y \in \mathcal{C}_n. (The number of keys may be more than the number of plausible decryptions if there are two keys decrypting y to the same plaintext.) Thus g : \mathcal{C}_n \rightarrow \mathbb{N}_0 is a random quantity, where the randomness comes from the choices made in the construction of the random cipher. If Z_n is chosen uniformly at random from \mathcal{C}_n then, for each key, there is a probability S_n/T_n that the decrypt of Z_n under this key is plausible. Therefore

\displaystyle \mathbb{P}[g(Z_n) = m] = \binom{t}{m} \bigl( \frac{S_n}{T_n} \bigr)^m \bigl( 1 - \frac{S_n}{T_n} \bigr)^{t-m}

and g(Z_n) is distributed binomially as \mathrm{Bin}(S_n/T_n, t).

However this is not the distribution of g(Y_n): since Y_n is the encryption of a plausible plaintext, ciphertexts y with a high g(y) are more frequent, while, as in the family example, ciphertexts y such that g(y) = 0 are never seen at all.

Lemma 5. \mathbb{P}[g(Y_n) = m] = \displaystyle \frac{T_n m}{S_n t} \mathbb{P}[g(Z_n) = m].

Proof. Let \mathcal{P}_n(y) be the multiset of the g(y) plausible plaintexts that are decryptions of y \in \mathcal{C}_n. (Alternatively, and maybe more clearly, we could define \mathcal{P}_n(y) to be the set of pairs (k,x) where k is a key decrypting y to a plausible plaintext x.) Summing over x \in \mathcal{P}_n(y) and conditioning on the event X = x, we have

\begin{aligned} \mathbb{P}[g(Y_n) = m] &= \sum_{y \in \mathcal{C}_n \atop g(y) = m} \mathbb{P}[Y=y] \\ &= \sum_{y \in \mathcal{C}_n \atop g(y) = m} \sum_{x \in \mathcal{P}_n(y)} \mathbb{P}[Y=y|X=x]\mathbb{P}[X=x] \\  &= \sum_{y \in \mathcal{C}_n \atop g(y) = m} \sum_{x \in \mathcal{P}_n(y)} \frac{1}{t} \frac{1}{S_n} \\ &= \sum_{y \in \mathcal{C}_n \atop g(y) = m} \frac{m}{tS_n} \\ &= T_n \mathbb{P}[g(Z_n) = m] \frac{m}{S_n t} \\ &= \frac{T_n m}{S_n t} \mathbb{P}[g(Z_n) = m] \end{aligned}

as required. \Box

Corollary 6. The random variable g(Y_n) is distributed as 1+ \mathrm{Bin}(S_n/T_n, t-1).

Proof. By Lemma 5 and the identity m \binom{t}{m} = t \binom{t-1}{m-1} we have

\begin{aligned} \mathbb{P}[g(Y_n) = m] &= \frac{T_n m}{S_n t}\binom{t}{m} \bigl( \frac{S_n}{T_n} \bigr)^m \bigl( 1 - \frac{S_n}{T_n} \bigr)^{t-m}\\ &= \binom{t-1}{m-1}\bigl( \frac{S_n}{T_n} \bigr)^{m-1} \bigl( 1 - \frac{S_n}{T_n} \bigr)^{t-m}. \end{aligned}

Hence g(Y_n)-1 is distributed as \mathrm{Bin}(S_n/T_n, t-1). \Box
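
Corollary 6 is easy to test by simulating a scaled-down random cipher. In the sketch below the parameters are my own: an alphabet of size 4, blocks of length 3 (so T_n = 64), S_n = 10 plausible plaintexts and t = 8 keys.

```python
import random
from collections import Counter
from math import comb

random.seed(0)
T, S, t = 4**3, 10, 8                               # |C_n|, number of plausible plaintexts, keys
plaintexts = list(range(T))
plausible = set(random.sample(plaintexts, S))

def one_observation():
    """Build a fresh random cipher, encrypt a random plausible plaintext with a
    random key, and return g(y) for the ciphertext y that results."""
    keys = [random.sample(plaintexts, T) for _ in range(t)]     # t random bijections
    x = random.choice(list(plausible))
    y = keys[random.randrange(t)][x]
    return sum(key.index(y) in plausible for key in keys)       # keys with plausible decrypt

trials = 10_000
observed = Counter(one_observation() for _ in range(trials))
print(" m   observed   1 + Bin(S/T, t-1)")
for m in range(1, t + 1):
    predicted = comb(t - 1, m - 1) * (S / T)**(m - 1) * (1 - S / T)**(t - m)
    print(f"{m:2d}   {observed[m] / trials:8.4f}   {predicted:8.4f}")
```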

It feels like there should be a quick direct proof of the corollary, along the lines `we know Y has one plausible decrypt; each of the remaining t-1 is plausible with probability S_n/T_n, hence … ‘. But this seems dangerously close to `we know two fair coin flips gave at least one head; the other flip is a head with probability 1/2, hence …’, which gives the wrong answer. The difference is that the plausible decrypt of Y comes with a known key, whereas the `at least one head’ could be either flip. Given the subtle nature of selection effects and conditional probability, I prefer the calculation in the lemma.

Shannon’s paper replaces the lemma with the comment [with one notation change] `The probability of such a cryptogram [our event g(Y) = m] is mT/St, since it can be produced by m keys from high probability messages [our plausible plaintexts] each with probability T/S.’ I cannot follow this: in particular T/S cannot be a probability, since it is far greater than 1.

Given that Y = y \in \mathcal{C}_n, each of the g(y) keys giving a plausible decryption of y is equally likely, so the entropy in the key is H(K|Y = y) = \log g(y). Therefore, going back to the lemma, we have

\begin{aligned} H(K|Y_n) &= \sum_{m=1}^t \mathbb{P}[g(Y_n) = m] \log m  \\ &= \sum_{m=1}^t \frac{T_n}{S_n t} \mathbb{P}[g(Z_n) = m] m \log m.\end{aligned}

Shannon argues that if t is large compared to 2^{nR} then \log m is almost constant for m near the mean t S_n/T_n of g(Z_n), and so the expected value can be approximated by

\frac{T_n}{S_n t} \frac{t S_n}{T_n} \log \frac{t S_n}{T_n} = \log t + \log S_n - \log T_n.

Observe that \log T_n - \log S_n = nR, where R is the per-character redundancy of English. Therefore Shannon’s approximation becomes

H(K|Y_n) = \log t - nR.

When 2^{nR} is large compared to t, Shannon uses a Poisson approximation. As an alternative, we argue from Corollary 6. Let Z^- be distributed as \mathrm{Bin}(S_n/T_n, t-1). We have

\begin{aligned} H(K|Y_n) &= \mathbb{E}[\log (1 + Z^-)]  \\ &\approx \log \bigl( 1+\mathbb{E} [Z^-] \bigr) \\ &= \log \bigl( 1 + (t-1)S_n/T_n \bigr) \\ &\approx (t-1)S_n/T_n  \\ &\approx t2^{-nR}.\end{aligned}

The graph of H(K|Y_n) is therefore as sketched below.

The quantity H(K) / R = \log t / R is known as the unicity distance of the cipher. Roughly speaking, one expects that after observing this many characters of the ciphertext, the key will be substantially determined.
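
For example (my own figures, taking Shannon’s lower estimate R = 3.2), a simple substitution cipher with 26! keys has unicity distance \log 26!/3.2 \approx 28 characters. A sketch:

```python
import math

R = 3.2                                            # Shannon's lower estimate, bits per character

def unicity_distance(t):
    return math.log2(t) / R                        # H(K)/R = log t / R

print("simple substitution, t = 26!:", round(unicity_distance(math.factorial(26)), 1))
print("t = 10**6 keys:", round(unicity_distance(10**6), 1))

# The two regimes of H(K | Y_n) sketched above, for t = 10**6 keys.
t = 10**6
for n in range(0, 13, 2):
    print(f"n={n:2d}   log t - nR = {math.log2(t) - n * R:7.2f}   "
          f"t 2^(-nR) = {t * 2**(-n * R):10.4f}")
```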

A more general random cipher?

It would be interesting to generalize the random cipher to the case when |\mathcal{P}_n| < |\mathcal{C}_n| and we pick random injections.