Shannon’s 1949 paper
Wildon's Weblog 2018-03-12
In 1949 Claude Shannon published Communication theory of secrecy systems, Bell System Technical Journal 28 656–715. The members of Bell Labs around this time included Richard Hamming, the three inventors of the transistor, and Harry Nyquist, of the Nyquist bound. Its technical journal published many seminal papers, including Shannon’s 1948 paper A mathematical theory of communication, defining entropy, and Hamming’s 1950 paper Error detecting and error correcting codes, defining Hamming distance and essentially inventing the modern ‘adversarial’ setting for coding theory.
Incidentally, the story goes that Shannon asked Von Neumann what he should call his new measure of information content and Von Neumann replied
‘You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, no one really knows what entropy really is, so in a debate you will always have the advantage.’
Like Hamming’s paper, Shannon’s two papers still amply repay reading today. One idea introduced in the 1949 paper is ‘perfect secrecy’.
Perfect secrecy
Consider a cryptosystem with plaintexts $\mathcal{P}$ and ciphertexts $\mathcal{C}$, with encryption functions $e_k : \mathcal{P} \rightarrow \mathcal{C}$ parametrised by a set $\mathcal{K}$ of keys.
Shannon supposes that there is a probability distribution on the plaintexts, assigning an a priori probability $\mathbb{P}[x]$ to each $x \in \mathcal{P}$. There is also an independent probability distribution on the keys. These interact in a rich way. For instance, $\mathbb{P}[y \mid x] = \sum_{k : e_k(x) = y}\mathbb{P}[k]$ is a sum of key probabilities, while $\mathbb{P}[y \mid k]$ is either zero, or the probability attached to the unique plaintext $e_k^{-1}(y)$. Other probabilities, such as $\mathbb{P}[x \mid y]$, can be computed from these using Bayes’ Theorem type arguments.
Suppose we observe a ciphertext $y$: what, if anything, do we learn about the corresponding plaintext $x$? Shannon defines the a posteriori probability $\mathbb{P}[x \mid y]$ to be the conditional probability of the plaintext $x$ given we observed the ciphertext $y$; the system then has perfect secrecy if, for any a priori probability distribution on the plaintexts, we have $\mathbb{P}[x \mid y] = \mathbb{P}[x]$ for all $x \in \mathcal{P}$, and all $y \in \mathcal{C}$. (This assumes implicitly that $\mathbb{P}[y] \neq 0$ for every $y \in \mathcal{C}$.)
Shannon proves in his Theorem 6 that a necessary and sufficient condition for perfect secrecy is that $\mathbb{P}[y \mid x] = \mathbb{P}[y]$ for all $x \in \mathcal{P}$ and $y \in \mathcal{C}$.
The proof is a short application of Bayes’ Theorem: since $\mathbb{P}[x \mid y]\,\mathbb{P}[y] = \mathbb{P}[y \mid x]\,\mathbb{P}[x]$, and since we may choose the a priori distribution so that $\mathbb{P}[x] \neq 0$ (which is necessary anyway for the conditional probability to be well-defined), we have $\mathbb{P}[x \mid y] = \mathbb{P}[x]$ if and only if $\mathbb{P}[y \mid x] = \mathbb{P}[y]$.
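As a quick sanity check of Theorem 6 (not in Shannon’s paper, with toy numbers of my own choosing), here is a sketch for the shift cipher on $\mathbb{Z}_5$: with uniform keys the criterion $\mathbb{P}[y \mid x] = \mathbb{P}[y]$ holds (checked here for one arbitrary a priori distribution), and with biased keys it fails.

```python
# Sketch: check Shannon's Theorem 6 criterion P[y|x] = P[y] for the shift cipher
# x -> x + k (mod 5), first with uniform keys (perfect secrecy), then biased keys.
from fractions import Fraction as F

n = 5
def check(key_probs, plain_probs):
    # P[y | x] = P[k = y - x]; P[y] = sum_x P[x] P[y|x]
    ok = True
    for y in range(n):
        py = sum(plain_probs[x] * key_probs[(y - x) % n] for x in range(n))
        for x in range(n):
            ok = ok and key_probs[(y - x) % n] == py
    return ok

uniform = [F(1, 5)] * 5
biased  = [F(1, 2), F(1, 8), F(1, 8), F(1, 8), F(1, 8)]
plain   = [F(1, 2), F(1, 4), F(1, 8), F(1, 16), F(1, 16)]
print(check(uniform, plain))   # True: the criterion holds
print(check(biased, plain))    # False
```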
Corollary. In a system with perfect secrecy, $|\mathcal{K}| \ge |\mathcal{C}|$. Moreover, if equality holds then every key must be used with equal probability $1/|\mathcal{K}|$, and for each $x \in \mathcal{P}$ and $y \in \mathcal{C}$ there exists a unique $k \in \mathcal{K}$ such that $e_k(x) = y$.
Proof. Fix a plaintext $x$. We claim that for each $y \in \mathcal{C}$ there exists a key $k$ such that $e_k(x) = y$. Indeed, if $y$ is never an encryption of $x$ then, for any choice of a priori probabilities that gives some probability to $x$, we have $\mathbb{P}[y \mid x] = 0$, so by Shannon’s Theorem 6, $\mathbb{P}[y] = 0$. But, by the implicit assumption, there is a non-zero chance of observing $y$, a contradiction.
Since $x$ has at most $|\mathcal{K}|$ different encryptions, the claim implies that $|\mathcal{K}| \ge |\mathcal{C}|$. Moreover, if equality holds then for every $y \in \mathcal{C}$ there exists a unique $k \in \mathcal{K}$ such that $e_k(x) = y$. Since $\mathbb{P}[y \mid x] = \mathbb{P}[k]$, where $k$ is this unique key, the conclusion of Theorem 6, that $\mathbb{P}[y \mid x] = \mathbb{P}[y]$ is constant as $x$ varies over $\mathcal{P}$, then implies that each key is equiprobable.
The ‘only if’ direction of Theorem 2.4 in Cryptography: theory and practice (3rd edition) by Douglas Stinson, is the corollary above, but, according to my reading of pages 48 and 50, interpreted with a different definition of perfect secrecy, in which the a priori distribution is fixed, as part of the cryptosystem. (This makes some sense, for example, in a cryptosystem encrypting English plaintexts nonsensical strings should always have zero probability.) Unfortunately, Stinson’s change makes the result false. Consider a toy cryptosystem with two keys $k_1, k_2$ and two plaintexts $x_1, x_2$, in which $e_{k_1}(x_1) = y_1$, $e_{k_1}(x_2) = y_2$, $e_{k_2}(x_1) = y_2$ and $e_{k_2}(x_2) = y_1$.
Take $\mathbb{P}[x_1] = 1$ and $\mathbb{P}[x_2] = 0$. Then $\mathbb{P}[x_1 \mid y_1] = 1 = \mathbb{P}[x_1]$ and $\mathbb{P}[x_1 \mid y_2] = 1 = \mathbb{P}[x_1]$, so the system has perfect secrecy, no matter what probabilities we assign to the keys. (Incidentally, setting $\mathbb{P}[k_1] = 1$ and $\mathbb{P}[k_2] = 0$ gives a cryptosystem where we always send $x_1$ and observe the ciphertext $y_1$; the a posteriori probability of $x_1$ is therefore the same as the a priori probability, so the system has perfect secrecy, even though the keys are not equiprobable. This shows that Shannon’s implicit assumption is not just a technicality required to make the conditional probabilities well-defined.)
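Here is a short check of the toy system above (a sketch of mine, with arbitrarily chosen, non-equiprobable key probabilities): every a posteriori probability equals the corresponding a priori probability.

```python
# Sketch of the toy counterexample: with P[x1] = 1, P[x2] = 0 the a posteriori
# probabilities equal the a priori ones whatever the key probabilities.
from fractions import Fraction as F

# e_{k1}(x1) = y1, e_{k1}(x2) = y2, e_{k2}(x1) = y2, e_{k2}(x2) = y1
enc = {('k1', 'x1'): 'y1', ('k1', 'x2'): 'y2',
       ('k2', 'x1'): 'y2', ('k2', 'x2'): 'y1'}
px = {'x1': F(1), 'x2': F(0)}
pk = {'k1': F(3, 4), 'k2': F(1, 4)}        # deliberately not equiprobable

py = {y: sum(px[x] * pk[k] for (k, x), yy in enc.items() if yy == y)
      for y in ('y1', 'y2')}
for x in px:
    for y in py:
        joint = sum(px[x] * pk[k] for k in ('k1', 'k2') if enc[(k, x)] == y)
        if py[y] > 0:
            print(x, y, joint / py[y] == px[x])   # a posteriori == a priori
```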
The error in the proof of Theorem 2.4 comes in the application of Bayes’ Law, where it is implicitly assumed that $\mathbb{P}[x] \neq 0$. This shows that the extra layer of quantification in Shannon’s paper is not a mere technicality. Given the difficulties students have with nested quantifiers, I’m inclined to keep Stinson’s definition, and fix the problem by assuming $\mathbb{P}[x] \neq 0$ for each $x \in \mathcal{P}$. One can get a more general, but somewhat technical, result by taking an arbitrary cryptosystem and applying Shannon’s Theorem to the subsystem in which $\mathcal{P}$ is replaced by the set of plaintexts with positive probability, and $\mathcal{C}$ and $\mathcal{K}$ are unchanged. (To be fair to Stinson, he observes before the theorem that plaintexts $x$ such that $\mathbb{P}[x] = 0$ are never an obstacle to perfect secrecy, so clearly he was aware of the issue. He also assumes, as we have done, that $\mathbb{P}[y] \neq 0$ for each $y \in \mathcal{C}$, but this is something else.)
Incidentally, there is a subtle English trap in Shannon’s paper: he says, quite correctly, that when the number of keys equals the number of messages, ‘it is possible to obtain perfect secrecy with only this number of keys’. Here ‘with only’ does not mean the same as ‘only with’.
Example from permutation groups
Given a finite group $G$ acting transitively on a set $\Omega$ we obtain a cryptosystem with plaintexts and ciphertexts $\Omega$ and encryption maps $e_g : \omega \mapsto \omega^g$ indexed by the elements $g$ of $G$. For which probability distributions on $G$ does this cryptosystem have perfect secrecy?
Let $G_\alpha$ denote the point stabiliser of $\alpha \in \Omega$. Since
$$\mathbb{P}[\beta \mid \alpha] = \mathbb{P}\bigl[\{g \in G : \alpha^g = \beta\}\bigr] = \mathbb{P}[G_\alpha g],$$
where $g$ is any element such that $\alpha^g = \beta$, the cryptosystem has perfect secrecy if and only if, for each $\beta \in \Omega$, $\mathbb{P}[G_\alpha g]$ is constant as $\alpha$ varies over $\Omega$. Call this condition $(\star)$.
Lemma 2. Suppose that $G$ has a regular normal subgroup $N$. Any probability distribution constant on the cosets of $N$ satisfies $(\star)$.

Proof. Since $G = G_\alpha N$ and $G_\alpha \cap N = 1$, the subgroup $G_\alpha$ meets each coset of $N$ in a unique element. The same holds for each coset $G_\alpha g$. Therefore $\mathbb{P}[G_\alpha g]$ is independent of $\alpha$ and $g$.
In the case of the affine cipher, the converse also holds.
Theorem 3. Let $p$ be a prime and let $G$ be the affine group on $\mathbb{F}_p$ with normal translation subgroup $T$. A probability distribution on $G$ satisfies $(\star)$ if and only if it is constant on each of the cosets of $T$.
Proof. Let $a$ generate the multiplicative group of integers modulo $p$. Then $G$ is a semidirect product $T \rtimes H$ where $T$ is the regular subgroup generated by translation by $1$ and the point stabiliser $H$ of $0$ is generated by multiplication by $a$.
A probability distribution on $G$ can be regarded as an element of the group algebra $\mathbb{C}G$. The probability distributions satisfying $(\star)$ correspond to those $v \in \mathbb{C}G$, with non-negative coefficients summing to $1$, such that, for each $\beta \in \mathbb{F}_p$, the coefficient sum of $v$ over $\{g \in G : \alpha^g = \beta\}$ is constant as $\alpha$ varies over $\mathbb{F}_p$. Thus if $V$ is the corresponding subspace of $\mathbb{C}G$ defined by these linear conditions, then $V$ is invariant under left-multiplication by $T$ and right-conjugation by $T$. But since $gt = t(t^{-1}gt)$ for $g \in G$ and $t \in T$, taken together, these conditions are equivalent to invariance under left- and right-multiplication by $T$. It is therefore natural to ask for the decomposition of $\mathbb{C}G$ as a representation of $T \times T$, where $(t, t')$ acts by $g \mapsto t^{-1}gt'$.
Since $T$ is abelian and normal in $G$, we have $t^{-1}(Th)t' = Th$ for all $t, t' \in T$ and $h \in H$. Hence $\mathbb{C}G = \bigoplus_{h \in H}\mathbb{C}[Th]$ as a representation of $T \times T$. The calculation behind this observation is $t^{-1}(t_1h)t' = \bigl(t^{-1}t_1(ht'h^{-1})\bigr)h$ where $ht'h^{-1}$ is in $T$. This shows that each coset $Th$ is stabilised by $T \times T$, and so $\mathbb{C}[Th]$, regarded as a representation of $T \times T$, is induced from the subgroup of $T \times T$ generated by the elements $(t, h^{-1}th)$ for $t \in T$, namely the stabiliser of the basis vector $h$. By Frobenius Reciprocity, $\mathbb{C}[Th]$ decomposes as a direct sum of a trivial representation of $T \times T$, spanned by $\sum_{t \in T}th$, and a further $p-1$ non-trivial one-dimensional representations, each with kernel $\{(t, h^{-1}th) : t \in T\}$, spanned by the vectors $\sum_{t \in T}\chi(t)\,th$ for the non-trivial characters $\chi$ of $T$. Critically all $(p-1)^2$ of these non-trivial representations, taken over all the cosets $Th$, are non-isomorphic.
By Lemma 2, $V$ contains all $p-1$ of the trivial representations. Writing $W$ for their direct sum, we have $\mathbb{C}G = W \oplus U$ for a unique $T \times T$-invariant complement $U$ that decomposes uniquely as a sum of the non-trivial representations in the previous paragraph. Therefore if $V$ properly contains $W$ then $V$ contains a non-trivial summand of some $\mathbb{C}[Th]$, spanned by some $v$ of the form $\sum_{t \in T}\chi(t)\,th$ with $\chi$ a non-trivial character of $T$.

The support of $v$ is the coset $Th$, which meets each coset of each point stabiliser in a unique element. Hence each sum in the conditions defining $V$ picks out a single coefficient of $v$, and, as $\alpha$ varies with $\beta$ fixed, every coefficient is picked out; so all the values $\chi(t)$ must be equal, a contradiction.
This should generalize to an arbitrary Frobenius group with abelian kernel, and probably to other permutation groups having a regular normal subgroup.
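As a sanity check on Theorem 3 (my own, not part of the argument), the following sketch sets up the linear conditions $(\star)$ for a small prime, with the affine maps written as $x \mapsto ax + b$, and verifies that the solution space has dimension $p - 1$, matching the span of the indicator vectors of the cosets of $T$.

```python
# Numerical check of Theorem 3: for a small prime p, the space of v in CG
# satisfying the linear conditions (*) should have dimension p - 1.
import numpy as np

p = 7
# Elements of the affine group: pairs (a, b) with a in 1..p-1, b in 0..p-1,
# acting on F_p by x -> a*x + b.
G = [(a, b) for a in range(1, p) for b in range(p)]

def act(x, g):
    a, b = g
    return (a * x + b) % p

# Condition (*): for each beta, the sum of v over {g : alpha^g = beta} is the
# same for every alpha.  Each equation "sum for alpha - sum for 0 = 0" is a row.
rows = []
for beta in range(p):
    base = np.array([1.0 if act(0, g) == beta else 0.0 for g in G])
    for alpha in range(1, p):
        row = np.array([1.0 if act(alpha, g) == beta else 0.0 for g in G])
        rows.append(row - base)
A = np.array(rows)
kernel_dim = len(G) - np.linalg.matrix_rank(A)
print(kernel_dim, p - 1)   # both should equal p - 1
```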
The key equivocation given a ciphertext
In practice, we might care more about what an observed ciphertext tells us about the key. This question is considered in Part B of Shannon’s paper. Let $X$, $Y$ and $K$ be random variables representing the plaintext, ciphertext and key, respectively. We assume that $X$ and $K$ are independent. (This is easily motivated: for instance, it holds if the key is chosen before the plaintext, and the plaintext is not a message about the key.) The conditional entropy $H(K \mid Y)$, defined by
$$H(K \mid Y) = \sum_{y}\mathbb{P}[Y = y]\,H(K \mid Y = y) = -\sum_{y}\mathbb{P}[Y = y]\sum_{k}\mathbb{P}[K = k \mid Y = y]\log\mathbb{P}[K = k \mid Y = y],$$
represents our expected uncertainty, or ‘equivocation’ to use Shannon’s term, in the key, after we have observed a ciphertext. (Throughout $\log$ denotes logarithm in base $2$.)
It is intuitive that $H(K \mid Y) \ge H(X \mid Y)$: since a given plaintext $x$ can be encrypted to a given ciphertext $y$ by several different keys, given a ciphertext, we are at least as uncertain about the key as we are about the plaintext. The proof is yet another exercise in Bayes’ Theorem, which I’ll include because it sheds some light on the probability model.

Lemma 4. $H(K \mid Y) \ge H(X \mid Y)$.
Proof. Fix $y \in \mathcal{C}$. Let $\mathcal{P}_y$ be the set of plaintexts that encrypt, under some key, to $y$. (Thus $\mathcal{P}_y$ is the union of the inverse images of the singleton set $\{y\}$ under all encryption maps.) For $x \in \mathcal{P}$, let $\mathcal{K}_x$ be the set (possibly empty) of keys $k$ such that $e_k(x) = y$. By Bayes’ Theorem,
$$\mathbb{P}[X = x \mid Y = y] = \frac{\mathbb{P}[Y = y \mid X = x]\,\mathbb{P}[X = x]}{\mathbb{P}[Y = y]} = \sum_{k \in \mathcal{K}_x}\frac{\mathbb{P}[K = k]\,\mathbb{P}[X = x]}{\mathbb{P}[Y = y]}.$$
Similarly
$$\mathbb{P}[K = k \mid Y = y] = \frac{\mathbb{P}[K = k]\,\mathbb{P}[X = x]}{\mathbb{P}[Y = y]}$$
if $e_k(x) = y$ for some (necessarily unique) $x \in \mathcal{P}$, and otherwise this probability is $0$. Thus
$$\mathbb{P}[X = x \mid Y = y] = \sum_{k \in \mathcal{K}_x}\mathbb{P}[K = k \mid Y = y]$$
and so the probability distribution on $\mathcal{P}_y$ (conditioned on $Y = y$) is a coarsening of the probability distribution on $\mathcal{K}$ (conditioned on $Y = y$). Therefore $H(X \mid Y = y) \le H(K \mid Y = y)$, and the lemma now follows by summing over $y$, using the definition of conditional entropy above.
A related formula is now a textbook staple. If we know the key then we know $X$ if and only if we know $Y$. Therefore the joint entropies $H(X, K)$ and $H(Y, K)$ agree, and $H(X, K) = H(X) + H(K)$ by the independence assumption. Hence
$$H(K \mid Y) = H(Y, K) - H(Y) = H(X, K) - H(Y) = H(X) + H(K) - H(Y).$$
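Here is a short numerical sketch (mine, not from the post or the paper) checking both Lemma 4 and this identity for a shift cipher on $\mathbb{Z}_5$ with arbitrary plaintext and key distributions.

```python
# Check Lemma 4 and H(K|Y) = H(X) + H(K) - H(Y) for a shift cipher on Z_5.
import numpy as np

n = 5
px = np.array([0.4, 0.3, 0.1, 0.1, 0.1])   # a priori plaintext distribution
pk = np.array([0.5, 0.2, 0.1, 0.1, 0.1])   # key distribution

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Joint distribution of (x, k); the ciphertext is y = x + k mod n.
joint_xk = np.outer(px, pk)
py = np.zeros(n)
joint_ky = np.zeros((n, n))
joint_xy = np.zeros((n, n))
for x in range(n):
    for k in range(n):
        y = (x + k) % n
        py[y] += joint_xk[x, k]
        joint_ky[k, y] += joint_xk[x, k]
        joint_xy[x, y] += joint_xk[x, k]

H_K_given_Y = H(joint_ky.flatten()) - H(py)
H_X_given_Y = H(joint_xy.flatten()) - H(py)
print(H_K_given_Y, H(px) + H(pk) - H(py))   # equal, by the identity
print(H_K_given_Y >= H_X_given_Y)           # True (equal here, since distinct
                                            # keys give distinct plaintexts)
```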
Shannon considers a toy cryptosystem with keys $k_1$ and $k_2$, in which a plaintext $x \in \{0,1\}^n$ is encrypted bit-by-bit, as itself if the key is $k_1$ and flipping each bit if the key is $k_2$. Suppose that the keys are equiprobable, and that each plaintext bit is $0$ independently with probability $p$, so the probability of a plaintext having exactly $s$ zeros is $\binom{n}{s}p^sq^{n-s}$, where $q = 1-p$. Denoting this quantity by $r_s$ we have $H(X) = -\sum_{s=0}^{n} r_s \log(p^sq^{n-s})$. The probability that the ciphertext has exactly $s$ zeros is $(r_s + r_{n-s})/2$. Therefore
$$H(Y) = -\sum_{s=0}^{n}\binom{n}{s}\frac{p^sq^{n-s} + p^{n-s}q^{s}}{2}\log\frac{p^sq^{n-s} + p^{n-s}q^{s}}{2}$$
and so
$$H(K \mid Y) = 1 + H(X) - H(Y) = \sum_{s=0}^{n}\binom{n}{s}p^sq^{n-s}\log\bigl(1 + t^{2s-n}\bigr)$$
where $t = q/p$.
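The following sketch (mine) checks the displayed sum against a direct computation of $H(K \mid Y)$ for this cipher; the parameters $n = 20$ and $p = 0.6$ are illustrative.

```python
# Check that the displayed sum agrees with a direct computation of H(K|Y)
# for Shannon's two-key (identity / complement) cipher.
import math

def H_KY_formula(n, p):
    q = 1.0 - p
    t = q / p
    return sum(math.comb(n, s) * p**s * q**(n - s)
               * math.log2(1.0 + t**(2 * s - n)) for s in range(n + 1))

def H_KY_direct(n, p):
    # Sum P[y] * H(K | Y = y) over ciphertexts y, grouped by number of zeros s.
    q = 1.0 - p
    total = 0.0
    for s in range(n + 1):
        a = p**s * q**(n - s)        # P[x = y] for the identity key
        b = p**(n - s) * q**s        # P[x = complement of y] for the flip key
        py = (a + b) / 2.0           # probability of one ciphertext with s zeros
        u = a / (a + b)              # P[K = identity | Y = y]
        h = 0.0 if u in (0.0, 1.0) else -u * math.log2(u) - (1 - u) * math.log2(1 - u)
        total += math.comb(n, s) * py * h
    return total

print(H_KY_formula(20, 0.6), H_KY_direct(20, 0.6))  # should agree
print(H_KY_formula(20, 0.5))                         # should be 1.0
```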
The graph on page 690 of Shannon’s paper shows $H(K \mid Y)$ against $n$ for two values of $p$. He does not attempt to analyse the formula any further, remarking ‘yet already the formulas are so involved as to be nearly useless’. I’m not convinced this is the case, but the analysis is certainly fiddly.
When $p = 1/2$ Shannon’s formula gives $H(K \mid Y) = 1$, as expected. We may therefore reduce to the case when $p > 1/2$ (and so $t = q/p < 1$). Take $s \ge n/2$. The summand for $s$ is
$$\binom{n}{s}p^sq^{n-s}\log\bigl(1 + t^{2s-n}\bigr).$$
Since $\binom{n}{s+1} = \binom{n}{s}\tfrac{n-s}{s+1}$, the ratio of the summands for $s$ and $s+1$ is
$$\frac{s+1}{n-s}\cdot\frac{q}{p}\cdot\frac{\log(1 + t^{2s-n})}{\log(1 + t^{2s+2-n})}.$$
When $s \ge n/2$ we can use the inequalities $x \le \log(1+x) \le x/\ln 2$ for $0 \le x \le 1$ to bound the ratio below by
$$\frac{s+1}{n-s}\cdot\frac{q}{p}\cdot\frac{\ln 2}{t^{2}} \ge \frac{p}{q}\ln 2.$$
Therefore the summands in Shannon’s formula get exponentially small, at a rate approximately $q/p$, as $s$ increases from $n/2$. Their contribution can be bounded by summing a geometric series, and is of the same order as the middle term. (This does not require the contribution from the binomial coefficient, which is initially small, but eventually dominates.)
Going the other way, the summand for $s \le n/2$ is
$$\binom{n}{s}p^sq^{n-s}\log\bigl(1 + (p/q)^{n-2s}\bigr)$$
which can be rewritten as
$$\frac{\binom{n}{s}}{\binom{n}{n/2}}\cdot\binom{n}{n/2}(pq)^{n/2}\cdot(q/p)^{n/2-s}\log\bigl(1 + (p/q)^{n-2s}\bigr).$$
It is useful to set $j = n/2 - s$. If the final factor were exactly $2j\log(p/q)\,(q/p)^{j}$ then the maximum would occur when $\tfrac{\mathrm{d}}{\mathrm{d}j}\bigl(j(q/p)^j\bigr) = 0$, i.e. when $j = 1/\ln(p/q)$. Numerical tests suggest this gives about the right location of the maximum when $n$ is large (and so the first fraction can be neglected) and $p$ is near to $1/2$. For $j$ of the same order as $n$ the first fraction is dominant and gives exponential decay. It is therefore not so surprising that the middle term gives a reasonable lower bound for the entire series, namely
$$H(K \mid Y) \ge \frac{(4pq)^{n/2}}{c\sqrt{n}}$$
for some constant $c$. The graph below shows $H(K \mid Y)$ and the middle summand for two values of $p$, with colours red, green, blue, black.

It seems possible that the lower bound is, up to a multiplicative constant, also an upper bound. However the argument above will show at most that $H(K \mid Y) \le C(4pq)^{n/2}$ for some constant $C$ depending on $p$.
Upper bound
A slightly weaker upper bound follows from tail estimates for binomial probabilities. (Update. There might be a stronger result using the Central Limit Theorem.) Let $S$ be distributed as $\mathrm{Bin}(n, p)$, so $r_s = \mathbb{P}[S = s]$. By Hoeffding’s inequality,
$$\mathbb{P}[S \le n/2] \le \mathrm{e}^{-2n(p-1/2)^2}.$$
The argument by exponential decay shows that the contribution to $H(K \mid Y)$ from the summands for $s \ge n/2$ is of the same order as the middle term. Using $\log(1+x) \le 1 + \log x$ for $x \ge 1$ we get
$$\sum_{s < n/2}\binom{n}{s}p^sq^{n-s}\bigl(1 + (n-2s)\log(p/q)\bigr)$$
as an upper bound for the remaining terms. By a standard trick, related to the formula for the expectation of a random variable taking values in $\mathbb{N}$, we have
$$\sum_{s < n/2}(n-2s)\binom{n}{s}p^sq^{n-s} = 2\sum_{i \ge 1}\mathbb{P}[S \le n/2 - i].$$
Take the deviation $(p - 1/2) + i/n$ in the version of Hoeffding’s inequality to get
$$\mathbb{P}[S \le n/2 - i] \le \mathrm{e}^{-2n(p-1/2)^2}\,\mathrm{e}^{-4i(p-1/2)}.$$
Thus the upper bounds for $\mathbb{P}[S \le n/2 - i]$ become exponentially smaller as we decrease the cutoff from $n/2$ by as much as $i$, for any $i \ge 1$. By summing a geometric series, as before, we get
$$\sum_{i \ge 1}\mathbb{P}[S \le n/2 - i] \le c\,\mathrm{e}^{-2n(p-1/2)^2}$$
for some constant $c$. The neglected contributions are of the order of the middle term, so bounded above by $(4pq)^{n/2}$, which is smaller still. Therefore
$$H(K \mid Y) \le c'\,\mathrm{e}^{-2n(p-1/2)^2}$$
for some further constant $c'$.
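A quick numerical sketch (mine) of the tail bound used above, for the illustrative values $n = 100$ and $p = 0.6$:

```python
# Check the Hoeffding-style tail bound
# P[S <= n/2 - i] <= exp(-2n(p-1/2)^2) * exp(-4i(p-1/2)) for S ~ Bin(n, p).
import math

def binom_cdf(k, n, p):
    return sum(math.comb(n, s) * p**s * (1 - p)**(n - s) for s in range(k + 1))

n, p = 100, 0.6
for i in range(0, 21, 5):
    lhs = binom_cdf(n // 2 - i, n, p)
    rhs = math.exp(-2 * n * (p - 0.5)**2) * math.exp(-4 * i * (p - 0.5))
    print(i, lhs <= rhs, lhs, rhs)
```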
Random ciphers and unicity distance
A selection effect
Imagine a hypothetical world where families have either no children, an only child, or two twins, with probabilities $1/2$, $1/3$, $1/6$ respectively. The mean number of children per family is therefore $1/3 + 2\cdot 1/6 = 2/3$. In an attempt to confirm this empirically, a researcher goes to a primary school and asks each child to state his or her number of siblings. Twin-families are half as frequent as only-families, but send two representatives to the school rather than one: these effects cancel, so the researcher observes equal numbers of children reporting $0$ and $1$ siblings. (Families with no children are, of course, never sampled.) The estimate for the mean number of children is therefore the inflated $(1+2)/2 = 3/2$.
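A small simulation (mine) of the survey makes the selection effect vivid:

```python
# Simulate the school survey: the size-biased estimate 3/2 versus the true mean 2/3.
import random

random.seed(0)
families = random.choices([0, 1, 2], weights=[1/2, 1/3, 1/6], k=100_000)
true_mean = sum(families) / len(families)

# A family with c children contributes c copies of the value c to the school sample.
school_sample = [c for c in families for _ in range(c)]
observed_mean = sum(school_sample) / len(school_sample)

print(true_mean, observed_mean)   # approximately 2/3 and 3/2
```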
The random cipher
Shannon’s proof of the Noisy Coding Theorem for the memoryless binary channel is a mathematical gem; his chief insight was that a random binary code of suitable size can (with high probability) be used as part of a coding scheme that achieves the channel capacity. In his 1949 paper he considers the analogous random cipher.
Let $A$ be the Roman alphabet. Let $\mathcal{P} = \mathcal{C} = A^n$ and let $X$ and $Y$ be the random variables recording the plaintext and ciphertext, respectively. We suppose that the message is chosen uniformly at random from those plaintexts that make good sense in English (once spaces are inserted). In yet another fascinating paper Shannon estimated that the per-character redundancy of English, $R$ say, is between about $3.4$ and $4.1$ bits. Thus the per-character entropy of English is $\log_2 26 - R$, and the number of plausible plaintexts is about $2^{n(\log_2 26 - R)}$. Let $M = 2^{n(\log_2 26 - R)}$.
Fix $h \in \mathbb{N}$. We aim to construct a random cipher with exactly $2^h$ keys. Each key will be chosen with equal probability. A quick reading of (3) on page 691 of Shannon’s paper may suggest a too naive construction: for each ciphertext, we choose $2^h$ plaintexts uniformly at random from $A^n$ to be its decryptions. (Some plaintexts may be chosen multiple times, meaning two or more keys decrypt the ciphertext to the same plaintext.) However if we pick the same plaintext $x$ first for different ciphertexts $y$ and $y'$, then it is ambiguous whether to encrypt $x$ as $y$ or as $y'$ under the first key. Moreover some plaintexts may never be chosen, giving them no encryption under the first key. Instead, it seems to work best to choose, for each of the $2^h$ keys, a random bijection $A^n \to A^n$ to be the encryption function for that key. This also arrives at the situation required by Shannon in (3).
Let $N(y)$ be the number of keys giving plausible decryptions of the ciphertext $y$. (The number of keys may be more than the number of plausible decryptions if there are two keys decrypting $y$ to the same plaintext.) Thus $N(y)$ is a random quantity, where the randomness comes from the choices made in the construction of the random cipher. If $y$ is chosen uniformly at random from $A^n$ then, for each key, there is a probability $M/26^n = 2^{-nR}$ that the decrypt of $y$ under this key is plausible. Therefore $\mathbb{E}[N(y)] = 2^{h-nR}$ and $N(y)$ is distributed binomially as $\mathrm{Bin}(2^h, 2^{-nR})$.
However this is not the distribution of $N(Y)$ for the observed ciphertext $Y$: since $Y$ is the encryption of a plausible plaintext, ciphertexts $y$ with a high $N(y)$ are more frequent, while, as in the family example, ciphertexts $y$ such that $N(y) = 0$ are never seen at all.
Lemma 5. $\mathbb{P}[Y = y] = N(y)/(2^h M)$.
Proof. Let $D(y)$ be the multiset of the $N(y)$ plausible plaintexts that are decryptions of $y$. (Alternatively, and maybe more clearly, we could define $D(y)$ to be the set of pairs $(k, x)$ where $k$ is a key decrypting $y$ to a plausible plaintext $x$.) Conditioning on the event that the key is $k$, the ciphertext is $y$ exactly when the plaintext is the decryption of $y$ under $k$, which has probability $1/M$ if that decryption is plausible, and $0$ otherwise. Therefore we have
$$\mathbb{P}[Y = y] = \sum_{(k, x) \in D(y)}\frac{1}{2^h}\cdot\frac{1}{M} = \frac{N(y)}{2^h M}$$
as required.
Corollary 6. The random variable $N(Y) - 1$ is distributed as $\mathrm{Bin}(2^h - 1, 2^{-nR})$.

Proof. By Lemma 5 and the identity $r\binom{m}{r} = m\binom{m-1}{r-1}$ we have
$$\mathbb{P}[N(Y) = r] = \frac{r\,\mathbb{P}[N(y) = r]}{\mathbb{E}[N(y)]} = \frac{r\binom{2^h}{r}\pi^{r}(1-\pi)^{2^h - r}}{2^h\pi} = \binom{2^h - 1}{r-1}\pi^{r-1}(1-\pi)^{(2^h-1)-(r-1)},$$
where $\pi = 2^{-nR}$ and $y$ denotes a uniformly chosen ciphertext. Hence $N(Y) - 1$ is distributed as $\mathrm{Bin}(2^h - 1, 2^{-nR})$.
It feels like there should be a quick direct proof of the corollary, along the lines ‘we know $Y$ has one plausible decrypt; each of the remaining $2^h - 1$ decrypts is plausible with probability $2^{-nR}$, hence … ’. But this seems dangerously close to ‘we know two fair coin flips gave at least one head; the other flip is a head with probability $1/2$, hence …’, which gives the wrong answer. The difference is that the plausible decrypt of $Y$ comes with a known key, whereas the ‘at least one head’ could be either flip. Given the subtle nature of selection effects and conditional probability, I prefer the calculation in the lemma.
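Here is a toy simulation (mine) of the random cipher, at parameters far smaller than English text, checking the mean of $N(Y)$ against Corollary 6 rather than the naive binomial:

```python
# Simulate the random cipher on a toy scale and compare the observed mean of
# N(Y) with Corollary 6: N(Y) - 1 ~ Bin(#keys - 1, M / #ciphertexts).
# Toy parameters (64 plaintexts, 8 plausible, 16 keys) are my own choices.
import random

random.seed(1)
SPACE = 64            # total number of plaintexts = ciphertexts
M = 8                 # number of "plausible" plaintexts
NKEYS = 16
plausible = set(range(M))          # take the first M plaintexts as plausible

trials = 20_000
total = 0
for _ in range(trials):
    # A fresh random cipher: one random bijection (permutation list) per key.
    ciphers = [random.sample(range(SPACE), SPACE) for _ in range(NKEYS)]
    x = random.randrange(M)                  # plausible plaintext, uniform
    k = random.randrange(NKEYS)              # key, uniform
    y = ciphers[k][x]
    # N(y): keys whose decryption of y is plausible.
    n_y = sum(1 for c in ciphers if c.index(y) in plausible)
    total += n_y

print(total / trials)                         # observed E[N(Y)]
print(1 + (NKEYS - 1) * M / SPACE)            # Corollary 6 prediction
print(NKEYS * M / SPACE)                      # naive Bin(#keys, M/#ciphertexts) mean
```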
Shannon’s paper replaces the lemma with the comment [with one notation change] ‘The probability of such a cryptogram [our event $Y = y$] is …, since it can be produced by … keys from high probability messages [our plausible plaintexts] each with probability …’. I cannot follow this: in particular the quantity Shannon gives cannot be a probability, since it is far greater than $1$.
Given that $N(Y) = r$, the entropy in the key is $\log_2 r$ (the $r$ keys giving plausible decryptions of the observed ciphertext are equally likely). Therefore, going back to the lemma, we have
$$H(K \mid Y) = \sum_{y}\mathbb{P}[Y = y]\log_2 N(y) = \sum_{y}\frac{N(y)\log_2 N(y)}{2^h M}.$$
Shannon argues that if $\mathbb{E}[N(y)]$ is large compared to $1$ then $\log_2 N(y)$ is almost constant for $N(y)$ near the mean $2^{h-nR}$ of $N(y)$, and so the expected value can be approximated by
$$\log_2\mathbb{E}[N(y)].$$
Observe that $\mathbb{E}[N(y)] = 2^{H(K) - nR}$, where $R$ is the per-character redundancy of English. Therefore Shannon’s approximation becomes
$$H(K \mid Y) \approx H(K) - nR.$$
When $nR$ is large compared to $h$, Shannon uses a Poisson approximation. As an alternative, we argue from Corollary 6. Let $B$ be distributed as $\mathrm{Bin}(2^h - 1, 2^{-nR})$. We have
$$H(K \mid Y) = \mathbb{E}\bigl[\log_2(1 + B)\bigr] \le \log_2\bigl(1 + (2^h - 1)2^{-nR}\bigr),$$
by Jensen’s inequality. The graph of $H(K \mid Y)$ against $n$ is therefore as sketched below.

The quantity $H(K)/R$ is known as the unicity distance of the cipher. Roughly one expects that, after observing $H(K)/R$ characters of the ciphertext, the key will be substantially known.
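As an illustration (with a key entropy of $56$ bits and $R = 3.7$, both my choices rather than Shannon’s), the two approximations above give:

```python
# The two approximations for H(K|Y) from the text, at illustrative parameters.
import math

h_K, R = 56, 3.7                              # key entropy in bits, redundancy per character
print("unicity distance ~", h_K / R)          # about 15 characters
for n in range(0, 31, 5):
    linear = h_K - n * R                      # Shannon's approximation, for small n
    jensen = math.log2(1 + (2**h_K - 1) * 2**(-n * R))  # bound from Corollary 6
    print(n, round(linear, 2), round(jensen, 2))
```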
A more general random cipher?
It would be interesting to generalize the random cipher to the case when $|\mathcal{C}| > |\mathcal{P}|$ and we pick random injections $\mathcal{P} \to \mathcal{C}$.