Generative-model-free likelihood
R-bloggers 2021-01-21
This post is a follow-up to my previous post on the likelihood principle and model evaluation. This time I would like to start from a more practical angle:
Many ML methods are mode-based optimization: we are given an objective function f(theta, y) and we solve for theta at its mode. It is tempting to convert this objective function into some posterior density p(theta | y) = g(f(theta, y)).
To be concrete, think about a “model-free” least-square regression: min_beta sum |y - beta x|^2. We can convert it into an inference-equivalent model y ~ normal(beta x, sigma) by augmenting a scale parameter sigma. But essentially this least-square objective function does not specify a generative model on y: the least-square estimate is still valid if the actual generative model is log p(y | beta, x) = -|y - beta x|^2 + sin(y) - y^10 + C.
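To see the point numerically, here is a minimal sketch (simulated data and function names of my own, not from the post): the extra terms sin(y) - y^10 + C do not involve beta, so both objectives share the same argmax, which is also the closed-form least-square estimate.

```r
set.seed(1)
n <- 50
x <- rnorm(n)
y <- 2 * x + rnorm(n)

# log density under the plain least-square objective (up to a constant)
ls_objective <- function(beta) -sum((y - beta * x)^2)
# log density under the "weird" generative model; the extra terms are free of beta
weird_objective <- function(beta) -sum((y - beta * x)^2) + sum(sin(y) - y^10)

optimize(ls_objective, interval = c(-10, 10), maximum = TRUE)$maximum
optimize(weird_objective, interval = c(-10, 10), maximum = TRUE)$maximum
sum(x * y) / sum(x^2)  # closed-form least-square estimate; all three agree
```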
We often recommend avoiding such ambiguity by always specifying the generative model in the first place (then we can do predictive model checks, etc.). But sometimes the objective function is given and we need to reverse-engineer the generative model.
I ran into this problem when thinking about how to make Bayesian inference for stacking: after fitting K models to n data points, we obtain p_{-i,k}, the leave-one-out predictive density of model k on data point i. Stacking solves an optimization over a simplex-constrained weight vector w:

max_{w in the simplex} sum_{i=1}^n log( sum_{k=1}^K w_k p_{-i,k} ).

This objective is pro forma a log likelihood involving “observed data” p_{-i,k} and parameter w, so we could define a posterior density

p(w | {p_{-i,k}}) = p_prior(w) prod_{i=1}^n ( sum_{k=1}^K w_k p_{-i,k} ),

in which the normalization constant has been omitted.
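As a sketch of what this pseudo-posterior looks like in code (the names lpd_loo, stacking_log_posterior and log_sum_exp are mine, not from the post; lpd_loo is assumed to be an n x K matrix of the log leave-one-out predictive densities log p_{-i,k}, which could come from, e.g., running the loo package on each fitted model; the default prior on w is flat on the simplex):

```r
# numerically stable log(sum(exp(lx)))
log_sum_exp <- function(lx) {
  m <- max(lx)
  m + log(sum(exp(lx - m)))
}

# Unnormalized log pseudo-posterior of the stacking weight w.
# lpd_loo: n x K matrix of log leave-one-out predictive densities log p_{-i,k}.
# log_prior: defaults to a flat prior on the simplex.
stacking_log_posterior <- function(w, lpd_loo, log_prior = function(w) 0) {
  stopifnot(all(w >= 0), abs(sum(w) - 1) < 1e-8)
  # log( sum_k w_k * p_{-i,k} ) for each "data" i, computed on the log scale
  log_lik <- apply(lpd_loo, 1, function(lpd_i) log_sum_exp(lpd_i + log(w)))
  log_prior(w) + sum(log_lik)
}
```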
This black-box posterior is not quite right. In the least-square example, the “true” posterior under a normal generative model is roughly N(y/x, sigma/sqrt(n)), while the “model-free” posterior leads to roughly N(y/x, 1/sqrt(n)). So maybe we can always do some power transformation, p(theta | y) proportional to exp(lambda f(theta, y)) for some lambda > 0. Without further calibration, it yields some inference, but not necessarily a calibrated one.
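For the least-square example the calibration can be worked out explicitly: exp(lambda f(beta, y)) with f = -sum (y - beta x)^2 is a normal density in beta with sd 1 / sqrt(2 lambda sum x^2), so lambda = 1 / (2 sigma^2) recovers the scale of the normal-model posterior while lambda = 1 does not. A small numerical check (simulated x and a known sigma, purely illustrative):

```r
set.seed(1)
n <- 50
x <- rnorm(n)
sigma <- 1  # assumed known for this check

# sd of the pseudo-posterior exp(-lambda * sum((y - beta * x)^2)) as a function of beta
pseudo_posterior_sd <- function(lambda) 1 / sqrt(2 * lambda * sum(x^2))

pseudo_posterior_sd(1)                  # the "model-free" choice lambda = 1
pseudo_posterior_sd(1 / (2 * sigma^2))  # the calibrated choice
sigma / sqrt(sum(x^2))                  # posterior sd under y ~ normal(beta * x, sigma), flat prior
```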
The conceptual barrier is that we are now treating p_{-i,k} as observed data and asking the model to fit it, without having a generative model for it. The likelihood above is pro forma a multinomial distribution, except that p_{-i,k} is typically (almost surely) not an integer. More importantly, p_{-i,k} exists before w. It is not sensible to call any model p(p_{-i,k} | w) “generative”.
Nevertheless, some model evaluation only requires the likelihood. For example, we can do leave-one-out cross-validation on this stacking pseudo-likelihood: run importance sampling to obtain the leave-j-out posterior density of the stacking weights, p(w | {p_{-i,1:K}}_{i != j}), and evaluate this leave-j-out stacking density on the j-th “data”:

log ∫ ( sum_{k=1}^K w_k p_{-j,k} ) p(w | {p_{-i,1:K}}_{i != j}) dw.
This is useful when the method involves tuning parameters (for example, choosing lambda above): otherwise we would have to optimize stacking n times to obtain the cross-validation error, and each stacking run itself involves cross-validation, so it requires n(n-1) runs of optimization in total. Based on the importance sampling approximation, we can solve this double cross-validation with one run of MCMC sampling.
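A sketch of this importance-sampling shortcut (the function stacking_is_loo and the inputs are hypothetical: w_draws is assumed to be an S x K matrix of MCMC draws of w from the pseudo-posterior above with each likelihood factor raised to a power lambda, and lpd_loo the n x K matrix of log p_{-j,k}; in practice the importance ratios should probably be stabilized, e.g. with Pareto smoothing as in loo::psis):

```r
log_sum_exp <- function(lx) { m <- max(lx); m + log(sum(exp(lx - m))) }

# Approximate leave-one-out for the stacking pseudo-posterior by importance sampling.
# Returns, for each j, an estimate of the log leave-j-out stacking density of "data" j.
stacking_is_loo <- function(w_draws, lpd_loo, lambda = 1) {
  n <- nrow(lpd_loo)
  # log l[s, j] = log( sum_k w_k^(s) * p_{-j,k} ): stacking density of "data" j under draw s
  log_l <- sapply(seq_len(n), function(j) {
    apply(w_draws, 1, function(w) log_sum_exp(lpd_loo[j, ] + log(w)))
  })
  # Leaving out j removes its lambda-powered likelihood factor from the posterior,
  # so the importance ratio is r_s = l[s, j]^(-lambda), and the leave-j-out density
  # is sum_s r_s * l[s, j] / sum_s r_s, evaluated on the log scale:
  sapply(seq_len(n), function(j) {
    log_sum_exp((1 - lambda) * log_l[, j]) - log_sum_exp(-lambda * log_l[, j])
  })
}
```

Summing the output over j gives the (approximate) cross-validation criterion for a given lambda from a single set of posterior draws, instead of re-running the stacking optimization n times.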
Is it always a good idea to convert an optimization objective function into a log likelihood, even if it does not correspond to a valid generative model? The advantage is that we can treat the posterior as if it came from a valid Bayesian model, and carry out model evaluation (such as approximate LOO, although we cannot do predictive checks in the absence of a generative model) and model improvement. It also sounds useful to replace a point optimum with a (family of) probability distributions anyway, especially when the objective function is flat at the optimum. I do not know the answer…