Random-Feature Matching

Three-Toed Sloth 2021-11-18

\[ \newcommand{\ModelDim}{d} \]
Attention conservation notice: Academic self-promotion.

So I have a new preprint:

CRS, "A Note on Simulation-Based Inference by Matching Random Features", arxiv:2111.09220
We can, and should, do statistical inference on simulation models by adjusting the parameters in the simulation so that the values of randomly chosen functions of the simulation output match the values of those same functions calculated on the data. Results from the "state-space reconstruction" or "geometry from a time series" literature in nonlinear dynamics indicate that just $2\ModelDim+1$ such functions will typically suffice to identify a model with a $\ModelDim$-dimensional parameter space. Results from the "random features" literature in machine learning suggest that using random functions of the data can be an efficient replacement for using optimal functions. In this preliminary, proof-of-concept note, I sketch some of the key results, and present numerical evidence about the new method's properties. A separate, forthcoming manuscript will elaborate on theoretical and numerical details.
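
A minimal sketch of the matching idea, in Python, under loudly stated assumptions. The generative model here is a toy one-parameter Gaussian location model (so $\ModelDim = 1$ and $2\ModelDim+1 = 3$ features), the random functions are random Fourier features, $\cos(wx + b)$ averaged over the sample, and a grid search stands in for a proper optimizer. None of these choices is taken from the preprint, and every name in the code (`simulate`, `draw_features`, `feature_values`, `match_random_features`) is hypothetical.

```python
import math
import random

def simulate(mu, n, rng):
    """Toy generative model (an illustrative assumption): n IID draws from N(mu, 1)."""
    return [rng.gauss(mu, 1.0) for _ in range(n)]

def draw_features(k, rng):
    """Draw k random Fourier features, each a (frequency, phase) pair."""
    return [(rng.gauss(0.0, 1.0), rng.uniform(0.0, 2.0 * math.pi)) for _ in range(k)]

def feature_values(x, features):
    """Average cos(w * x_i + b) over the sample, once per random feature."""
    return [sum(math.cos(w * xi + b) for xi in x) / len(x) for w, b in features]

def match_random_features(data, grid, features, seed=0):
    """Return the grid point whose simulated features are closest to the data's."""
    target = feature_values(data, features)

    def loss(mu):
        # Re-seeding gives every candidate the same noise (common random numbers),
        # so the loss varies smoothly with mu.
        rng = random.Random(seed)
        sim = feature_values(simulate(mu, len(data), rng), features)
        return sum((s - t) ** 2 for s, t in zip(sim, target))

    return min(grid, key=loss)

dim = 1                                   # the toy model has one parameter
k = 2 * dim + 1                           # number of random features to match
features = draw_features(k, random.Random(1))
data = simulate(1.0, 1000, random.Random(2))
grid = [i / 10 for i in range(-20, 41)]   # candidate values from -2.0 to 4.0
mu_hat = match_random_features(data, grid, features)
print(mu_hat)
```

The features are drawn once and then held fixed, so the same random functions are applied to the data and to every simulation. How close `mu_hat` lands to the true value depends on the particular features drawn, which is part of what the numerical evidence in the note is about.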

I've been interested for a long time in methods for simulation-based inference. It's increasingly common to have generative models which are easy (or at least straightforward) to simulate, but where it's completely intractable to optimize the likelihood --- often it's intractable even to calculate it. Sometimes this is because there are lots of latent variables to be integrated over, sometimes due to nonlinearities in the dynamics. The fact that it's easy to simulate suggests that we should be able to estimate the model parameters somehow, but how?

An example: My first Ph.D. student, Linqiao Zhao, wrote her dissertation on a rather complicated model of one aspect of how financial markets work (limit-order book dynamics), and while the likelihood function existed, in some sense, the idea that it could actually be calculated was kind of absurd. What she used to fit the model instead was a very ingenious method which came out of econometrics called "indirect inference". (I learned about it by hearing Stephen Ellner present an ecological application.) I've expounded on this technique in detail elsewhere, but the basic idea is to find a second model, the "auxiliary model", which is mis-specified but easy to estimate. You then adjust the parameters in your simulation until estimates of the auxiliary from the simulation match estimates of the auxiliary from the data. Under some conditions, this actually gives us consistent estimates of the parameters in the simulation model. (Incidentally, the best version of those regularity conditions known to me is still the one Linqiao found for her thesis.)
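
To make the recipe concrete, here is a minimal sketch in Python. Everything specific in it is an assumption chosen for illustration, not anything from Linqiao's thesis: the generative model is a toy MA(1) process, the auxiliary model is an AR(1) fit by least squares, and the names `simulate_ma1`, `fit_ar1` and `indirect_inference` are hypothetical. A grid search stands in for a real optimizer.

```python
import random

def simulate_ma1(theta, n, rng):
    """Toy generative model: x_t = e_t + theta * e_{t-1}, with standard Gaussian e."""
    e = [rng.gauss(0.0, 1.0) for _ in range(n + 1)]
    return [e[t + 1] + theta * e[t] for t in range(n)]

def fit_ar1(x):
    """Auxiliary estimator: least-squares slope of x_t on x_{t-1} (no intercept)."""
    num = sum(a * b for a, b in zip(x[1:], x[:-1]))
    den = sum(b * b for b in x[:-1])
    return num / den

def indirect_inference(data, grid, n_sim=5, seed=0):
    """Pick the theta whose simulated auxiliary estimate best matches the data's."""
    target = fit_ar1(data)

    def gap(theta):
        # Re-seeding gives every candidate the same noise (common random numbers).
        rng = random.Random(seed)
        sims = [fit_ar1(simulate_ma1(theta, len(data), rng)) for _ in range(n_sim)]
        return (sum(sims) / n_sim - target) ** 2

    return min(grid, key=gap)

data = simulate_ma1(0.3, 3000, random.Random(42))   # "observed" series, true theta = 0.3
grid = [i / 10 for i in range(-9, 10)]              # candidate thetas from -0.9 to 0.9
theta_hat = indirect_inference(data, grid)
print(theta_hat)
```

The AR(1) model is mis-specified for MA(1) data, but its coefficient shifts with theta over this range, which is what the method needs from an auxiliary.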

Now the drawback of indirect inference is that you need to pick the auxiliary model, and the quality of the auxiliary model affects the quality of the estimates. The auxiliary needs to have at least as many parameters as the generative model, the parameters of the auxiliary need to shift with the generative parameters, and the more sensitive the auxiliary parameters are to the generative parameters, the better the estimates. There are lots of other techniques for simulation-based inference, but basically all of them turn on this same issue of needing to find some "features", some functions of the data, and tuning the generative model until those features agree between the simulations and the data. This is where people spend a lot of human time, ingenuity and frustration, as well as relying on a lot of tradition, trial-and-error, and insight into the generative model.

What occurred to me in the first week of March 2020 (i.e., just before things got really interesting) is that there might be a short-cut which avoided the need for human insight and understanding. That week I was teaching kernel methods and random features in data mining, and starting to think about how I wanted to revise the material on simulation-based inference for my "data over space and time" course in the fall. The two ideas collided in my head, and I realized that there was a lot of potential for estimating parameters in simulation models by matching random features, i.e.,

Link:

http://bactra.org/weblog/raffle.html
