We have really everything in common with machine learning nowadays, except, of course, language.
Statistical Modeling, Causal Inference, and Social Science 2022-05-13
I had an interesting exchange with Bob regarding the differences between statistics and machine learning. If it were just differences in jargon, it would be no big deal—you could just translate back and forth—but it’s trickier than that, because the two subfields also have different priorities and concepts.
It started with this abstract by Satyen Kale in Columbia’s statistical machine learning seminar:
Learning linear predictors with the logistic loss—both in stochastic and online settings—is a fundamental task in machine learning and statistics, with direct connections to classification and boosting. Existing “fast rates” for this setting exhibit exponential dependence on the predictor norm, and Hazan et al. (2014) showed that this is unfortunately unimprovable. Starting with the simple observation that the logistic loss is 1-mixable, we design a new efficient improper learning algorithm for online logistic regression that circumvents the aforementioned lower bound with a regret bound exhibiting a doubly-exponential improvement in dependence on the predictor norm. This provides a positive resolution to a variant of the COLT 2012 open problem of McMahan and Streeter when improper learning is allowed. This improvement is obtained both in the online setting and, with some extra work, in the batch statistical setting with high probability. Leveraging this improved dependency on the predictor norm yields algorithms with tighter regret bounds for online bandit multiclass learning with the logistic loss, and for online multiclass boosting. Finally, we give information-theoretic bounds on the optimal rates for improper logistic regression with general function classes, thereby characterizing the extent to which our improvement for linear classes extends to other parametric and even nonparametric settings. This is joint work with Dylan J. Foster, Haipeng Luo, Mehryar Mohri and Karthik Sridharan.
What struck me was how difficult it was for me to follow what this abstract was saying!
“Learning linear predictors”: does this mean estimating coefficients, or deciding what predictors to include in the model?
“the logistic loss”: I’m confused here. In logistic regression we use the log loss (that is, the contribution to the log-likelihood is log(p)); the logistic comes in through the link function. So I can’t figure this out. If, for example, we were instead doing probit regression, would it be probit loss? There’s something I’m not catching here. I’m guessing that they are using the term “loss” in a slightly different way than we would. (See the quick check after this list of terms.)
“exponential dependence on the predictor norm”: I have no idea what this is.
“1-mixable”: ?
“a regret bound”: I don’t know what that is either!
“the COLT 2012 open problem of McMahan and Streeter”: This is news to me. I’ve never even heard of COLT before.
“improper learning”: What’s that?
“online bandit multiclass learning”: Now I’m confused in a different way. I think of logistic regression for 0/1 data, but multiclass learning, I’m guessing that’s for data with more than 2 categories?
“online multiclass boosting”: I don’t actually know what this is either, but I’ve heard of “boosting” (even though I don’t know exactly what it is).
“improper logistic regression”: What is that? I don’t think it’s the same as improper priors.
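Following up on the “logistic loss” item above: as far as I can tell, the machine-learning “logistic loss” is exactly our negative log-likelihood for logistic regression, just written with labels y in {−1, +1} instead of {0, 1}. Here’s a quick numerical check (a sketch in Python; the function names are mine):

```python
import numpy as np

# ML convention: labels y in {-1, +1}, linear score z = x . beta.
# "Logistic loss" of a prediction: log(1 + exp(-y * z)).
def logistic_loss(y, z):
    return np.log1p(np.exp(-y * z))

# Stats convention: labels t in {0, 1}, p = inverse-logit(z),
# contribution to the negative log-likelihood: -(t log p + (1 - t) log(1 - p)).
def neg_log_lik(t, z):
    p = 1.0 / (1.0 + np.exp(-z))
    return -(t * np.log(p) + (1 - t) * np.log(1 - p))

z = np.linspace(-4, 4, 9)
assert np.allclose(logistic_loss(+1, z), neg_log_lik(1, z))  # y = +1 <-> t = 1
assert np.allclose(logistic_loss(-1, z), neg_log_lik(0, z))  # y = -1 <-> t = 0
```

So the “logistic” in “logistic loss” does come from the link, and a probit model would indeed give a different loss function.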
I know I could google all the above terms, and I’m not faulting the speaker for using jargon, any more than I should be faulted for using Bayesian jargon in my talks. It’s just interesting how difficult the communication is.
I sent the above to Bob, who replied:
Stats seems more concerned with asymptotic consistency and bias of estimators than with convergence rates. And I’ve never seen anyone in stats talk about regret. The whole setup is online (streaming-data inference, which they call “stochastic”). But then I don’t get out much.
I have no idea why they say online logistic regression is a fundamental task. It seems more like an approach to classification than a task itself. But hey, that’s just me being picky about language.
I have no idea what 1-mixable is or what a predictor norm is, and I wasn’t enlightened as to what “improper” means after reading the abstract.
Again, not at all a slam on Satyen Kale. It would be just about as hard to figure out what I’m talking about, based on one of my more technical abstracts. This is just an instance of the general problem of specialized communication: jargon is confusing for outsiders but saves time for insiders.
I suggested to Bob that what we need is some sort of translation, and he responded:
I know some of this stuff. Regret is the expected difference in utility between the strategy you played and the optimal strategy (in simple bandit problems, the optimal strategy is to always play the best arm). Regret bounds come from strategies for pulling arms in an explore/exploit way that provably bound the regret. This is what you’ll need to connect to the ML literature on online A/B testing (subjects assigned one at a time).
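To make this concrete, here’s a minimal simulation (my sketch, not Bob’s, with made-up arm probabilities) of cumulative regret for an epsilon-greedy strategy on a two-armed Bernoulli bandit:

```python
import numpy as np

rng = np.random.default_rng(1)
true_means = np.array([0.4, 0.6])  # arm 1 is the best arm
n_rounds, eps = 10_000, 0.1
counts = np.zeros(2)
est = np.zeros(2)  # running estimate of each arm's mean reward
total_reward = 0.0

for t in range(n_rounds):
    # Explore with probability eps; otherwise exploit the current best estimate.
    arm = rng.integers(2) if rng.random() < eps else int(np.argmax(est))
    reward = float(rng.random() < true_means[arm])
    counts[arm] += 1
    est[arm] += (reward - est[arm]) / counts[arm]  # incremental mean update
    total_reward += reward

# Regret: expected reward from always playing the best arm, minus what we got.
print("cumulative regret:", n_rounds * true_means.max() - total_reward)
```

A regret bound is then a guarantee that this quantity grows slowly (say, logarithmically) in the number of rounds.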
COLT is the Conference on Learning Theory. That’s where they do PAC learning. If you don’t know what PAC learning is, well, google it. . . .
I think “logistic loss” was either a typo or a way to distinguish their use of logistic regression from general 0/1-loss stuff.
Online bandit multiclass: online means one example at a time, or one subject at a time, where you can control assignment to any of k treatments (or a control). Yes, you can use multi-logit for this, as we were discussing the other day. It’s in the regression chapter of the Stan user’s guide.
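Here’s what the “online” part looks like in the simplest case: a sketch (mine, with placeholder streaming data) of one-subject-at-a-time multi-logit fitting by stochastic gradient descent:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

K, D = 3, 5  # K classes (treatments), D features per subject
W = np.zeros((K, D))  # one coefficient vector per class
lr = 0.1  # learning rate

def online_update(W, x, y):
    """See one subject (features x, observed class y) and update W in place."""
    p = softmax(W @ x)  # current predicted class probabilities
    grad = np.outer(p, x)  # gradient of the log loss with respect to W
    grad[y] -= x
    W -= lr * grad

rng = np.random.default_rng(0)
for _ in range(1000):  # the data arrive one subject at a time
    x = rng.normal(size=D)
    y = int(rng.integers(K))  # placeholder for the observed outcome
    online_update(W, x, y)
```

The point is that there’s no fixed dataset: each subject arrives, gets a prediction (and, in the bandit version, a treatment assignment), and updates the coefficients before the next one shows up.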
Boosting is a technique where you train iteratively, at each iteration upweighting the examples that the previous iteration got wrong, then combining the predictors from all the iterations in a weighted vote. Given its heuristic nature, there are a gazillion variants (one is sketched below). It’s usually used with decision “stumps” (shallow decision trees) a la BART.
But I have no clue what “improper” means despite being in the title.
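Since Bob’s description of boosting is compact, here’s the sketch promised above: one concrete variant, AdaBoost with single-feature threshold “stumps” (the algorithm is standard; the code is mine):

```python
import numpy as np

def fit_stump(X, y, w):
    """Best single-feature threshold rule under example weights w; y in {-1, +1}."""
    best = (np.inf, 0, 0.0, 1)  # (weighted error, feature, threshold, sign)
    for j in range(X.shape[1]):
        for thresh in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, j] > thresh, 1, -1)
                err = w[pred != y].sum()
                if err < best[0]:
                    best = (err, j, thresh, sign)
    return best

def adaboost(X, y, n_rounds=20):
    n = len(y)
    w = np.full(n, 1.0 / n)  # start with uniform example weights
    ensemble = []
    for _ in range(n_rounds):
        err, j, thresh, sign = fit_stump(X, y, w)
        err = max(err, 1e-10)  # avoid dividing by zero for a perfect stump
        alpha = 0.5 * np.log((1 - err) / err)  # this round's predictor weight
        pred = sign * np.where(X[:, j] > thresh, 1, -1)
        w *= np.exp(-alpha * y * pred)  # upweight the misclassified examples
        w /= w.sum()
        ensemble.append((alpha, j, thresh, sign))
    return ensemble

def predict(ensemble, X):
    score = sum(a * s * np.where(X[:, j] > t, 1, -1) for a, j, t, s in ensemble)
    return np.sign(score)
```

Both moving parts of Bob’s description are visible here: the example weights w get bumped up wherever the current stump is wrong, and each stump enters the final weighted vote with weight alpha.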
As an aside, I’m annoyed at the linguistic drift by which “classification and regression trees” have become “decision trees.” This seems inaccurate and misleading to me, as no decisions are involved. “Regression trees” or “prediction trees” would seem more accurate. As with other such linguistic discussions, my goal here is not purity or correctness but rather accuracy and minimization of confusion.
Anyway, to continue the main thread: Bob summarized the themes of the above discussion as “us all being finite, academia being siloed, and communication being harder than math.” At a technical level, the key difference seems to be that machine learning focuses on online learning, while statistics focuses on static learning. This is part of the general pattern that computer scientists work on larger problems than statisticians do.