Hill climbing on substitution ciphers

Wildon's Weblog 2018-08-02

On Saturday the BSHM held a fascinating meeting The history of cryptography and codes. In the second talk, Solving Historical Ciphers with Modern Means, Klaus Schmeh mentioned an attack on the old-school Turning Grille cipher by the thoroughly new-school method of hill climbing. This got me thinking about whether hill climbing could be an effective attack on the monoalphabetic substitution cipher.

As a running example, we use the ciphertext below; it is the encryption, by a randomly chosen substitution cipher, of the first two sentences in Section 1.1 of Stinson’s highly recommended book Crpytography: Theory and Practice, punctuation and spaces deleted.

kqxwjzruhxzkuygtoxskpixgwsmbfkgvmufqbplkgxzutyxkdgfxgfyxjljuyybmxwxmmxr kguluypsxuzrtgtkgsghhjzpsukxgixmuzpzlxsjmxsquzzxypzljsqudubkqukuzgffgzx zkglsumsuzzgkjzrxmlkuzrdqukpltxpzvluprkqplsquzzxysgjyrtxukxyxfqgzxypzxg msghfjkxmzxkdgmewgmxcuhfy

Frequency analysis shows that the most common letters are xzugk with 30, 23, 23, 21 and 19 occurrences, respectively. We therefore guess that x is the encryption of e, that z and u are the encryptions of t and a (in some order), and so on. Decrypting by the key zpxyjcofvuidsqkrlmnwabgeht implied by the frequency counts gives

ilegutmafetiahowkenirveognspciobsaclprdioetawheiyoceocheuduahhpsegessem ioadahrneatmwowionoffutrnaieovesatrtdenusenlattehrtdunlayapilaiatoccote tiodnasnattoiutmesdiatmylairdwertbdarmilrdnlattehnouhmweaieheclotehrteo snofcuiesteiyosjgosexafche

From here I think it is feasible—but certainly non-trivial—to complete the decryption by hand. A good place to start is the final seven letters exafche, on the assumption that the (common) e and a are correct, and the (rare) x is also correct.

For a hill climb, we need

a place to start;
a way to step;
a scoring function.

The obvious starting point is the key guessed by frequency analysis. Repeated steps must allow the climb to reach all $26!$ permutations of the Roman alphabet. A natural generating set for the symmetric group on the alphabet is the $26 \times 25 /2 = 325$ transpositions. Thus in each step we interchange two letters of the decryption key. For instance, in the first step in the example below, zpxyj … becomes zyxpk… swapping the decrypts of b and d.

A good scoring function is far less obvious. In my first try, I scored a putative plaintext of length $n$ by $\sum_{i=1}^n s_i$ , where $s_i$ is the length of the longest English word starting in position $i$ . For example, exampletext gives scores (7,0,1,0,0,3,0,4,0,0,0), taking words from a small dictionary containing let but not ample. After 25 steps, the score increases from 115 to 187, and the hill climb then aborts, since no step improves the score. Unfortunately the putative plaintext is by this point

dmeyitcanetdaboxkesduweoysfprdojfarmpuhdoetaxbedgoreorbeihiabbpfeyeffec doahabuseatcxoxdosonnitusadeowefatuthesifesmattebuthismagapdmadatorrote tdohsafsattoditcefhdatcgmaduhxeutjhaucdmuhsmattebsoibcxeadebermotebuteo fsonrideftedgoflyofevanrbe

which, while a local maximum of the hill climb, seems barely more ‘Englishy’ than the first guess. Permitting two transpositions to be applied in each single step, the hill climb continued for 40 steps (the final 15 steps each taking a minute or so), and aborted with

thefilkaneltabomcestupeofsrywtojrawhyudtoelambetgoweowbeidiabbyreferrek toadabusealkmomtosonnilusateoperaluldesireshallebuldishagaythatalowwole ltodsarsallotilkerdtalkghatudmeuljdaukthudshallebsoibkmeatebewholebuleo rsonwiterletgorxforevanwbe

of score 219. While closer to the truth, this seems a poor return on so much computational time. Increasing the dictionary size, and trying variant statistics, for example, the sum squared of the subword lengths, gave equally discouraging results.

At this point I searched for prior art, and found Cryptanalysis of the Simple Substitution Cipher, which recommends a score based on quadgram frequencies. For instance the most common quadgram is tion, with probability $p_{\mathrm{tion}} = 0.00312$ , followed by nthe and ther. The suggested score is $\sum_q \log p_q$ where the sum is over all quadgrams $q$ in the putative plaintext. Using transpositions as steps, the hill climb took 31 steps (and about 10 seconds, most of which is spent building a look-up table of quadgram frequencies) to improve the score from $-3188.91$ to $-2278.94$ . Half-way through the putative plaintext is

rlegutfametrahowvecrikeogcsyprobsaplyidroetawhernopeopheuduahhysegessef roadahiceatfwowrocommuticareokesatitdecuseclattehitduclanayrlaratoppote trodcascattorutfesdratfnlaridweitbdaifrlidclattehcouhfweareheplotehiteo scompuresternoszgosexamphe

and the final putative plaintext is correct in every character:

thefundamentalobjectiveofcryptographyistoenabletwopeopleusuallyreferred toasaliceandbobtocommunicateoveraninsecurechannelinsuchawaythatanoppone ntoscarcannotunderstandwhatisbeingsaidthischannelcouldbeatelephonelineo rcomputernetworkforexample

For the record, the decryption key is zyxwkpomvutsrqjihdcbagfeln. Similarly impressive results were obtained using trigrams or quintgrams. (Some online code I wrote for the trigram attack is linked below.) Bigrams with transposition steps aborted after 23 steps at the local maximum

rmebuthapetradofjesriveobsnglrownalmgicroetafderyoleoldeucuaddgnebenneh roacadiseathfofrosopputisareovenatitcesunesmatteditcusmayagrmaratollote trocsansattoruthencrathymaricfeitwcaihrmicsmattedsoudhfearedelmotediteo nsoplurenteryonkbonexaplde

but allowing double transpositions the hill climb reaches (after 37 steps and a few minutes)

thebundamentalofjectiveobcryptographyistoenafletwopeopleusuallyreberred toasaliceandfoftocommunicateoveraninsecurechannelinsuchawaythatanoppone ntoscarcannotunderstandwhatisfeingsaidthischannelcouldfeatelephonelineo rcomputernetworkborexample

which is correct up to a single transposition. This transposition is a possible step, but it is rejected by the algorithm since making it worsens the bigram score, from $-1308.83$ to $-1309.56$ . Thus for bigrams with double transposition steps, this plaintext is not even a local maximum of the hill climb. (Despite this theoretical obstruction, there is obviously no problem completing the decryption.)

Some final thoughts.

My first choice of scoring function suffers on the example text by getting stuck at local maxima that have some plausible chunks (for example, consider the extract … operaluldesireshallebuldish, while other parts look completely wrong. The less ‘brittle’ quadgram statistic appears to lead to a smoother, steady improvement.
The score by summing the logs of the probabilities of digrams is suggested by maximum likelihood estimates. Thus the more successful score has the more solid theoretical basis.
These experiments suggest that, at least for the substitution cipher, the choice of scoring function is far more important than the choice of steps. My attempt to salvage a poor scoring function by increasing the number of possible steps was somewhat successful, at the cost of vastly increased running time. Given the exponential growth in the search space, it seems clear that a small menu of steps, directed by a good scoring function, and possibly iterated many times, is the ideal to aim for.
A simplified version of the Haskell code I wrote is online here. Because of limitations of the online environment, it only works with bigrams and a reduced list of 2500 trigrams. Some trials suggest that the reduced trigrams suffice on most ciphertexts longer than a few sentences.

In my Cipher Systems course I had a computer demonstration in most lectures. I used Mathematica at first, in the hope (partially realised), that students would be able to re-use the notebooks for problem sheets, but switched to Haskell for the more computationally expensive attacks later in the course. My worry was that, since I never had time to explained the code beyond the bare mathematical details of the underlying algorithm, this would just deepen the students’ mystification. Feedback suggests, however, that these demonstrations were found surprisingly helpful.