Frictionless reproducibility; methods as proto-algorithms; division of labor as a characteristic of statistical methods; statistics as the science of defaults; statisticians well prepared to think about issues raised by AI; and robustness to adversarial attacks

R-bloggers 2023-10-13

Tian points us to this article by David Donoho, which argues that some of the rapid progress in data science and AI research in recent years has come from “frictionless reproducibility,” which he identifies with “data sharing, code sharing, and competitive challenges.” This makes sense: the flip side of the unreplicable research that has destroyed much of social psychology, policy analysis, and related fields is that when we can replicate an analysis with a press of a button using open-source software, it’s much easier to move forward.

Frictionless reproducibility

Frictionless reproducibility is a useful goal in research. There can be a long gap between the development of a statistical idea and its implementation in a reproducible way, and that’s ok. But it’s good to aim for that stage. The effort it takes to make a research idea reproducible is often worth it, in that getting to reproducibility typically requires a level of care and rigor beyond what is necessary just to get a paper published. One thing I’ve learned from Stan is how much you learn in the process of developing a general tool that will be used by strangers.

I think that statisticians have a special perspective for thinking about these issues, for the following reason:

Methods as proto-algorithms

As statisticians, we’re always working with “methods.” Sometimes we develop new methods or extend existing methods; sometimes we place existing methods into a larger theoretical framework; sometimes we study the properties of methods; sometimes we apply methods. Donoho and I are typical of statistics professors in having done all these things in our work.

A “method” is a sort of proto-algorithm: not quite fully algorithmic (for example, it could require choices of inputs, tuning parameters, or expert input at certain points), but it still follows some series of steps. The essence of a method is that it can be applied by others. In that sense, any method is a bridge between different humans; it’s a sort of communication among groups of people who may never meet or even directly correspond. Fisher invents logistic regression and, decades later, some psychometrician uses it; the method is a sort of message in a bottle.
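To make this concrete, here’s a toy sketch in R (with simulated data, so purely illustrative): the fitting procedure inside glm() is a fixed series of steps, and the user’s side of the bridge is just a handful of choices, the data, the formula, and the link.

```r
# A "method" as a proto-algorithm: the fitting procedure is fixed,
# but the user supplies the inputs -- data, formula, link.
set.seed(123)                                   # made-up illustrative data
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))
dat <- data.frame(x, y)

# The later user never talks to the method's inventor; they just hand
# their own choices to the same procedure.
fit <- glm(y ~ x, family = binomial(link = "logit"), data = dat)
summary(fit)$coefficients
```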

Division of labor as a characteristic of statistical methods

There are different ways to take this perspective. One direction is to recognize that almost all statistical methods involve a division of labor. In Bayes, one agent creates the likelihood model and another agent creates the prior model. With the bootstrap, one agent comes up with the estimator and another agent comes up with the bootstrapping procedure. In classical statistics, one agent creates the measurement protocol, another agent designs the experiment, and a third agent performs the analysis. In machine learning, there’s the division between training and test sets. With public surveys, one group conducts the survey and computes the weights; other groups analyze the data using those weights. Etc. We discussed this general idea a few years ago here.
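Here’s a minimal R sketch of that division of labor, using the boot package and a made-up example: one “agent” writes the estimator (a trimmed mean), and a different “agent,” here the package author, supplies the resampling procedure; the two only ever meet at the function interface.

```r
# Division of labor in the bootstrap: one agent writes the estimator,
# another agent supplies the generic resampling machinery.
library(boot)

set.seed(456)
x <- rexp(100, rate = 1)                        # made-up data

# Agent 1: the estimator (a 10% trimmed mean of the resampled data)
trimmed_mean <- function(data, idx) mean(data[idx], trim = 0.1)

# Agent 2: the generic bootstrap procedure, written by someone who
# never sees this particular estimator
b <- boot(x, statistic = trimmed_mean, R = 2000)
boot.ci(b, type = "perc")                       # percentile interval
```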

But that’s not the direction I want to go here. Instead I want to consider something else, which is the way that a “method” is the establishment of a default; see here and also here.

Statistics as the science of defaults

The relevance to the current discussion is that, to the extent that defaults are a move toward automatic behavior, statisticians are in the business of automating science. That is, our methods are “successes” to the extent that they enable automatic behavior on the part of users. As we have discussed, automatic behavior is not a bad thing! When we make things automatic, users can think at the next level of abstraction. For example, push-button linear regression allows researchers to focus on the model rather than on how to solve a matrix equation, and it can even take them to the next level of abstraction, thinking about prediction without thinking about the model at all. As teachers and users of research, we are then (rightly) concerned that lack of understanding can be a problem, but it’s hard to go back. We might as well complain that the vast majority of people drive their cars with no understanding of how those little explosions inside the engine make the car go.
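A toy R example of what the default hides (simulated data, purely illustrative): the push-button call and the matrix computation it abstracts away give the same coefficients, but the first lets the user think one level up.

```r
# The push-button default: the user thinks about the model, not the
# linear algebra.
set.seed(789)
n <- 50
x <- runif(n)
y <- 2 + 3 * x + rnorm(n)        # made-up data

fit <- lm(y ~ x)
coef(fit)

# What the button hides: the least-squares solution of the normal
# equations, solving (X'X) b = X'y directly.
X <- cbind(1, x)
solve(t(X) %*% X, t(X) %*% y)
```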

Statisticians well prepared to think about issues raised by AI

To get back to the AI issue: I think that we as statisticians are particularly well prepared to think about the issues that AI brings, because the essence of statistics is the development of tools designed to automate human thinking about models and data. Statistical methods are a sort of slow-moving AI, and it’s kind of always been our dream to automate as much of the statistics process as possible, while recognizing that for Cantorian reasons (see section 7 here) we will never fully get there. Given that we’re trying, to a large extent, to turn humans into machines, or to routinize what has traditionally been a human activity requiring care, knowledge, and creativity, we should have some insight into computer programs that do such things.

In some ways, we statisticians are even more qualified to think about this than computer scientists are, in that the paradigmatic action of a computer scientist is to solve a problem, whereas the paradigmatic action of a statistician is to come up with a method that will allow other people to solve their problems.

I sent the above to Jessica, who wrote:

I like the emphasis on frictionless reproducibility as a critical driver of the success in ML. Empirical ML has clearly emphasized methods for ensuring the validity of predictive performance estimates (held-out sets, the common task framework, etc.) compared to fields that use statistical modeling to generate explanations, like the social sciences, and it does seem like that has paid off.

From my perspective, there’s something else that’s been very successful as well: post-2015 or so, there’s been a heavy emphasis on making models robust to adversarial attack. Being able to take an arbitrary evaluation metric and incorporate it into your loss function so you’re explicitly training for it is also likely to improve things fast. We comment on this a bit in a paper we wrote last year reflecting on what, if anything, recent concerns about ML reproducibility and replicability have in common with the so-called replication crisis in social science.

I do think we are about at max hype currently in terms of perceived success of ML, though, and it can be hard to tell sometimes how much the emerging evidence of success from ML research is overfit to the standard benchmarks. Obviously there have been huge improvements on certain test suites, but just this morning, for instance, I saw an ML researcher present a pretty compelling graph showing that the “certified robustness” of the top LLMs (GPT-3.5, GPT-4, Llama 2, etc.), when trained on the common datasets (ImageNet, MNIST, etc.), has not really improved much at all in the past 7-8 years. It was a line graph where each line denoted changes in robustness for a different benchmark (ImageNet, MNIST, etc.) as new methodological advances came along. Each point in a line represented the robustness of a deep net on that particular benchmark given whatever was considered the state of the art in robust ML at that time. The x-axis was related to time, but each tick represented a particular paper that advanced SOTA. It’s still very easy to trick LLMs into generating toxic text, leaking private data they were trained on, or changing their minds based on what should be an inconsequential change to the wording of a prompt, for example.