Aaronson, COLT, Bayesians and Frequentists
Normal Deviate 2013-05-11
I am reading Scott Aaronson’s book “Quantum Computing Since Democritus” which can be found here.
The book is about computational complexity, quantum mechanics, quantum computing and many other things. It’s a great book and I highly recommend it. Much of the material on complexity classes is tough going but you can skim over some of the details and still enjoy the book. (That’s what I am doing.) There are at least 495 different complexity classes: see the complexity zoo. I don’t know how anyone can keep track of this.
Anyway, there is a chapter on computational learning theory that I wanted to comment on. (There is another chapter about probabilistic reasoning and the anthropic principle which I’ll comment on in a future post.) Scott gives a clear introduction to learning theory and he correctly traces the birth of the theory to Leslie Valiant’s 1984 paper that introduced PAC (probably approximately correct) learning. He also contrasts PAC learning with Bayesian learning.
Now I want to put on my statistical curmudgeon hat and complain about computational learning theory. My complaint is this: the discovery of computational learning theory is nothing but the re-discovery of a 100-year-old statistical idea called a “confidence interval.”
Let’s review some basic learning theory. Let $\mathcal{R}$ denote the set of all axis-aligned rectangles in the plane. We can think of each rectangle $R$ as a classifier: predict $Y=1$ if $X \in R$ and predict $Y=0$ if $X \notin R$. Suppose we have data $(X_1,Y_1),\ldots,(X_n,Y_n)$. If we pick the rectangle $\hat{R}$ that makes the fewest classification errors on the data, will we predict well on a new observation? More formally: is the empirical risk minimizer $\hat{R}$ a good predictor?
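Here is a minimal sketch of what that estimator looks like in code. It is my own illustration, not something from the post or the book: it brute-forces the empirical risk minimizer over rectangles whose edges sit at observed coordinates (which is enough to realize any subset of points a rectangle can pick out), and the toy data are made up.

```python
# A minimal sketch (not from the post) of empirical risk minimization over
# axis-aligned rectangles. Candidate rectangles have edges at observed
# coordinates, plus one "always predict 0" rule; the toy data are invented.
import itertools
import numpy as np

def empirical_risk(rect, X, Y):
    """Fraction of points misclassified by the rectangle classifier."""
    x1, x2, y1, y2 = rect
    inside = (X[:, 0] >= x1) & (X[:, 0] <= x2) & (X[:, 1] >= y1) & (X[:, 1] <= y2)
    return np.mean(inside.astype(int) != Y)

def erm_rectangle(X, Y):
    """Return the axis-aligned rectangle with the fewest training errors."""
    xs, ys = np.unique(X[:, 0]), np.unique(X[:, 1])
    candidates = [(x1, x2, y1, y2)
                  for x1, x2 in itertools.combinations_with_replacement(xs, 2)
                  for y1, y2 in itertools.combinations_with_replacement(ys, 2)]
    candidates.append((np.inf, -np.inf, np.inf, -np.inf))  # the empty rectangle
    return min(candidates, key=lambda r: empirical_risk(r, X, Y))

# Toy example: Y = 1 inside the unit square, with 5% label noise.
rng = np.random.default_rng(0)
n = 30  # keep n small: the brute-force search is O(n^4) rectangles
X = rng.uniform(-1.0, 2.0, size=(n, 2))
Y = ((X[:, 0] >= 0) & (X[:, 0] <= 1) & (X[:, 1] >= 0) & (X[:, 1] <= 1)).astype(int)
Y = np.where(rng.random(n) < 0.05, 1 - Y, Y)  # flip 5% of the labels

R_hat = erm_rectangle(X, Y)
print("ERM rectangle:", np.round(R_hat, 2))
print("training error:", empirical_risk(R_hat, X, Y))
```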
The answer is yes, and the reason is simple. Let $L(R) = P(Y \neq I(X \in R))$ be the prediction risk, where $I(\cdot)$ is the indicator function, and let
$$\hat{L}(R) = \frac{1}{n}\sum_{i=1}^n I\bigl(Y_i \neq I(X_i \in R)\bigr)$$
be the empirical estimate of the risk. We would like to claim that $\hat{R}$ is close to the best classifier in $\mathcal{R}$. That is, we would like to show that $L(\hat{R})$ is close to $\inf_{R \in \mathcal{R}} L(R)$, with high probability. This fact follows easily if we can show that
$$\sup_{R \in \mathcal{R}} \bigl|\hat{L}(R) - L(R)\bigr|$$
is small with high probability. And this does hold since the VC dimension of $\mathcal{R}$ is finite. Specifically, the key fact is that, for any distribution $P$ of the data, we have
$$P\Bigl( \sup_{R \in \mathcal{R}} \bigl|\hat{L}(R) - L(R)\bigr| > \epsilon \Bigr) \leq c_1 e^{-n c_2 \epsilon^2}$$
for known constants $c_1$ and $c_2$.
But this is equivalent to saying that a $1-\alpha$ confidence interval for $L(\hat{R})$ is $\hat{L}(\hat{R}) \pm \epsilon_n$, where
$$\epsilon_n = \sqrt{\frac{1}{n c_2}\,\log\Bigl(\frac{c_1}{\alpha}\Bigr)}.$$
That is, for any distribution $P$,
$$P\Bigl( L(\hat{R}) \in \bigl[\hat{L}(\hat{R}) - \epsilon_n,\ \hat{L}(\hat{R}) + \epsilon_n\bigr] \Bigr) \geq 1-\alpha.$$
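To make the step from the inequality to the interval explicit (my own gloss, not text from the post): set the right-hand side of the VC bound equal to $\alpha$ and solve for $\epsilon$, then use the fact that a bound on the supremum applies in particular to the data-dependent rectangle $\hat{R}$:
$$c_1 e^{-n c_2 \epsilon_n^2} = \alpha \;\iff\; \epsilon_n = \sqrt{\frac{1}{n c_2}\,\log\Bigl(\frac{c_1}{\alpha}\Bigr)},$$
and, on the event $\sup_{R\in\mathcal{R}} |\hat{L}(R) - L(R)| \leq \epsilon_n$, which has probability at least $1-\alpha$,
$$\bigl|L(\hat{R}) - \hat{L}(\hat{R})\bigr| \leq \epsilon_n \quad\text{and}\quad L(\hat{R}) \leq \inf_{R\in\mathcal{R}} L(R) + 2\epsilon_n,$$
the second inequality because $\hat{R}$ minimizes $\hat{L}$.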
As Scott points out, what distinguishes this type of reasoning from Bayesian reasoning is that we require the guarantee to hold uniformly and that there are no priors involved. To quote from his book:
This goes against a belief in the Bayesian religion, that if your priors are different then you come to an entirely different conclusion. The Bayesian starts out with a probability distribution over the possible hypotheses. As you get more and more data, you update this distribution using Bayes’s rule.
That’s one way to do it, but computational learning theory tells us that it’s not the only way. You don’t need to start out with any assumptions about a probability distribution over the hypotheses … you’d like to learn any hypothesis in the concept class, for any sample distribution, with high probability over the choice of samples. In other words, you can trade the Bayesians’ probability distribution over hypotheses for a probability distribution over sample data.
(Note: “hypothesis” = classifier, “concept class” = set of classifiers, and “learn” = estimate.)
Now, I agree completely with the above quote. But as I said, it is basically the definition of a frequentist confidence interval.
So my claim is that computational learning theory is just the application of frequentist confidence intervals to classification.
There is nothing bad about that. The people who first developed learning theory were probably not aware of existing statistical theory, so they re-developed it themselves, and they did it right.
But it’s my sense — and correct me if I’m wrong — that many people in computational learning theory are still woefully ignorant about the field of statistics. It would be nice if someone in the field read the statistics literature and said: “Hey, these statistics guys did this 50 years ago!”
Am I being too harsh?