Oh, Wikipedia
Three-Toed Sloth 2017-01-11
Summary:
Wikipedia is a tremendous accomplishment and an invaluable resource. It is also highly unreliable. Since I have just spent a bit of time on the second of these points, let me record it here for posterity.
A reader of my notebook on information theory wanted to know whether I made a mistake there when I said that "self-information" is, in information theory, just an alternative name for the entropy of a random variable. After all, he said, the Wikipedia article on self-information (version of 22 July 2016) says that the self-information of an event (not a random variable) is the negative log probability of that event*. What follows is modified from my reply to my correspondent.
In brief: (1) my usage is the one I learned from my teachers and textbooks; (2) the Wikipedia page is the first time I have ever seen this other usage; and (3) the references given by the Wikipedia page do not actually support the usage it advocates; only one of them even uses the term "self-information", and it supports my usage rather than the page's.
To elaborate on (3), the Wikipedia page cites as references (a) a paper by Meila on comparing clusterings, (b) Cover and Thomas's standard textbook, and (c) Shannon's original paper. (a) is a good paper, but in fact never uses the phrase "self-information" (or "self information", etc.). For (b), the Wikipedia page cites p. 20 of the first edition from 1991, which I no longer have; but in the 2nd edition, "self-information" appears just once, on p. 21, as a synonym for entropy ("This is the reason that entropy is sometimes referred to as self-information"; their italics). As for (c), "self-information" does not appear anywhere in Shannon's paper (nor, more remarkably, does "mutual information"), and in fact Shannon gives no name to the quantity \( -\log{p(x)} \).
There are also three external links on the page: the first ("Examples of surprisal measures") only uses the word "surprisal". The second, " 'Surprisal' entry in a glossary of molecular information theory", again only uses the word "surprisal" (and that glossary has no entry for "self-information"). The third, "Bayesian Theory of Surprise", does not use either word, and in fact defines "surprise" as the KL divergence between a prior and a posterior distribution, not using \( -\log{p(x)} \) at all. The Wikipedia page is right that \( -\log{p(x)} \) is sometimes called "surprisal", though "negative log likelihood" is much more common in statistics, and some more mathematical authors (e.g., R. M. Gray, Entropy and Information Theory [2nd ed., Springer, 2011], p. 176) prefer "entropy density". But, as I said, I have never seen anyone else call it "self-information". I am not sure where this strange usage began, but I suspect it's something some Wikipedian just made up. The error seems to go back to the first version of the page on self-information, from 2004 (which cites no references or sources at all). It has survived all 136 subsequent revisions. None of those revisions, it appears, ever involved checking whether the original claim was right, or indeed even whether the external links and references actually supported it.
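To make the distinction concrete, here is a quick numerical sketch (Python with numpy; the distributions are invented purely for illustration and come from nowhere on the Wikipedia page or its links): the surprisal \( -\log{p(x)} \) is a property of a single outcome under a single distribution, while the KL-divergence "surprise" of that third link compares two whole distributions.

```python
import numpy as np

# A made-up "prior" distribution over three outcomes (illustrative only).
prior = np.array([0.5, 0.3, 0.2])

# Surprisal of each individual outcome under the prior: -log p(x).
# This per-event quantity is also what statisticians call the negative
# log likelihood, and what some authors call the entropy density.
print("surprisal of each outcome (bits):", -np.log2(prior))

# The "surprise" of the third external link is instead a KL divergence
# between a prior and a posterior: a property of two whole distributions,
# not of any single outcome, and it need not equal any -log p(x).
posterior = np.array([0.7, 0.2, 0.1])
kl = np.sum(posterior * np.log2(posterior / prior))
print("KL(posterior || prior) (bits):", kl)
```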
I could, of course, try to fix this myself, but it would involve replacing the page with something about one sentence long, saying "In information theory, 'self-information' is a synonym for the entropy of a random variable; it is the expected value of the 'surprisal' of a random event, but is not the same as the surprisal." Leaving aside the debate about whether a topic which can be summed up in a sentence deserves a page of its own, I am pretty certain that if I didn't waste a lot of time defending the edit, it would swiftly be reverted. I have better things to do with my time.**
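For what it is worth, the content of that one sentence can be checked numerically in a few lines (Python again, with another invented pmf): the entropy is the expected surprisal, and in general it is not the surprisal of any particular outcome.

```python
import numpy as np

# Made-up pmf for a random variable X (illustrative only).
p = np.array([0.5, 0.25, 0.125, 0.125])

surprisal = -np.log2(p)          # surprisal of each individual outcome
entropy = np.sum(p * surprisal)  # entropy = expected surprisal

print("surprisals:", surprisal)  # [1., 2., 3., 3.] bits
print("entropy H[X]:", entropy)  # 1.75 bits
# The entropy (1.75 bits) is a property of the whole distribution;
# no individual outcome here has a surprisal equal to it.
```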
How many other Wikipedia pages are based on similar misunderstandings and inventions, I couldn't begin to say. Nor could I pretend to guess whether Wikipedia has more such errors than traditional encyclopedias.
*: The (Shannon) entropy of a random variable \( X \), with probability mass function \( p(x) \), is of course just \( H[X] \equiv -\sum_{x}{p(x) \log{p(x)}} \). The conditional entropy of one random variable \( Y \) given a particular value of another is just the entropy of the conditional distribution, \( H[Y|X=x] \equiv -\sum_{y}{p(y|x) \log{p(y|x)}} \). The conditional entropy is the average of this, \( H[Y|X] \equiv -\sum_{x,y}{p(x,y) \log{p(y|x)}} = \sum_{x}{p(x) H[Y|X=x]} \).
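A quick numerical check of these definitions, with an invented joint pmf (Python once more; the numbers are purely illustrative):

```python
import numpy as np

# Made-up joint pmf p(x, y) for two binary variables (rows = x, cols = y).
p_xy = np.array([[0.4, 0.1],
                 [0.2, 0.3]])

p_x = p_xy.sum(axis=1)             # marginal p(x)
p_y_given_x = p_xy / p_x[:, None]  # conditional p(y | x)

# Entropy of each conditional distribution, H[Y | X = x].
H_y_given_each_x = -np.sum(p_y_given_x * np.log2(p_y_given_x), axis=1)

# Conditional entropy H[Y | X]: the p(x)-weighted average of the above,
# which equals -sum_{x,y} p(x,y) log p(y|x).
H_y_given_X = np.sum(p_x * H_y_given_each_x)
H_y_given_X_direct = -np.sum(p_xy * np.log2(p_y_given_x))

print("H[Y|X=x] for each x:", H_y_given_each_x)
print("H[Y|X] (averaged):  ", H_y_given_X)
print("H[Y|X] (direct sum):", H_y_given_X_direct)
```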