ChatGPT o1-preview can code Stan

Statistical Modeling, Causal Inference, and Social Science 2024-10-22

This is Bob.

Yes, but can it Stan?

The first few instantiations of ChatGPT haven’t been so good at Stan. This is perhaps not surprising, because there’s relatively little written about Stan on the web compared to, say, Python, C++, or R.

Impressive, whatever you call it

It’s a whole new ball game with the new o1-preview model rolled out by OpenAI. I was tipped off to this fact by a blog post from Keith Goldfeld (of NYU Population Health) on the ouR data generation blog, Can ChatGPT help construct non-trivial statistical models? An example with Bayesian ‘random’ splines. It’s well worth reading.

A lot of our readers will object to the tagline from OpenAI for o1-preview, “uses advanced reasoning.” Comment season is open. Whether you want to call this “reasoning” or “understanding” or pick your own term, the ability of OpenAI’s new o1-preview model is quite remarkable. It’s getting better at generalizing and working from the data it has. The o1-preview model shows you a running summary of its intermediate steps as it uses what AI researchers call “chain of thought” (yes, I’m teeing up another one for the doubters, to continue with the sports metaphors). You have to sign up and pay, and even then it’s rate limited, because it spends on the order of 15 to 150 seconds answering a query as it goes through iterative refinement of its output. I’m here to tell you that the wait’s worth it, and that’s also what I’m hearing from people around here who lean on it for complicated code refactors.

Bayesian workflow is universal

I am about to do a tutorial tomorrow at Flatiron Institute on Bayesian workflow. I’d like to help all the scientists here understand how to apply Bayesian workflow using SBC, PPC, LOO, etc., no matter how they fit their models (Monte Carlo methods, variational inference, amortized inference, simulation-based inference, etc.).

Galileo’s inclined plane experiment

I thought it would help the presentation to have a concrete science example. I figured Galileo’s inclined plane experiment as a way to estimate gravitational acceleration would be fun. Galileo set up a ball on an inclined plane, then measured how long it took to move a given distance. With that experimental data, we can fast forward to Newton and set up a statistical model where we use Galileo’s data to estimate the acceleration due to gravity (on earth).

ChatGPT o1-preview cracks physics-based Stan

So I asked ChatGPT o1-preview to derive the physics for me, then to write the Stan model. I’d already written the Stan model at that point—it’s one thing I can still do better than GPT. But the one it wrote looks good. I went and checked some online physics tutorials as I can’t remember my high school physics well enough to derive it myself. It decided to use the moment of inertia for a solid sphere rolling without slipping. It got all the algebra right as far as I can tell. This is much better than earlier versions of ChatGPT did at algebra.
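To spell out where the formula in the code below comes from (my reconstruction, not from the transcript): a solid sphere rolling without slipping has moment of inertia I = (2/5) m r^2, so its acceleration along the plane is a = g sin(θ) / (1 + 2/5) = (5/7) g (h/s), and plugging that into s = a t^2 / 2 gives t = sqrt(14 s^2 / (5 g h)). A quick numerical sanity check:

```python
import numpy as np

# Setup matching the simulation below: 5 m run, 2.5 m rise.
g, length, height = 9.81, 5.0, 2.5
s = np.sqrt(length**2 + height**2)   # distance traveled along the plane

# Rolling without slipping: a = g*sin(theta) / (1 + I/(m r^2)), with
# I = (2/5) m r^2 for a solid sphere and sin(theta) = height/s.
a = (5 / 7) * g * (height / s)

t_kinematic = np.sqrt(2 * s / a)                       # from s = a t^2 / 2
t_closed_form = np.sqrt(14 * s**2 / (5 * g * height))  # formula used below
print(t_kinematic, t_closed_form)   # agree to floating-point precision
```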

Show, don’t tell

Here’s the transcript from my session in case you’d like to dive deeper on how I’m prompting it: ChatGPT o1-preview simulates Galileo’s inclined plane.

Keith Goldfeld’s blog post also goes over in detail how he used it to generate code for splines.

Just the code

Here’s the Python simulation, inference, and plotting code that I put together from the GPT output.

import numpy as np

def simulate_times(N, sigma, length, height):
    g = 9.81  # gravitational acceleration in m/s^2
    s = np.sqrt(length**2 + height**2)
    h = height
    t = np.sqrt((14 * s**2) / (5 * g * h))
    t_obs = np.random.lognormal(mean=np.log(t), sigma=sigma, size=N)
    return {
        'length': length, 'height': height,
        'N': N, 't_obs': t_obs
    }

data = simulate_times(N=100, sigma=0.1, length=5, height=2.5)

import cmdstanpy as csp
m = csp.CmdStanModel(stan_file='galileo.stan')
draws = m.sample(data = data)
print(draws.summary())

import pandas as pd
import plotnine as pn

df = pd.DataFrame({'g': draws.stan_variable('g'),
                   'sigma': draws.stan_variable('sigma')})
plot = (pn.ggplot(df, pn.aes(x='g', y='sigma'))
            + pn.geom_point()
            + pn.theme(aspect_ratio=1)
            + pn.geom_hline(yintercept=0.10, color='blue', size=1)
            + pn.geom_vline(xintercept=9.81, color='red', size=1)
            )
plot.show()
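One quick check on the simulator itself (my addition, not from the session): with lognormal noise, the geometric mean of the simulated times should recover the noise-free descent time. A sketch, reusing the simulator from above:

```python
import numpy as np

def simulate_times(N, sigma, length, height):
    # Same simulator as above.
    g = 9.81
    s = np.sqrt(length**2 + height**2)
    t = np.sqrt((14 * s**2) / (5 * g * height))
    t_obs = np.random.lognormal(mean=np.log(t), sigma=sigma, size=N)
    return {'length': length, 'height': height, 'N': N, 't_obs': t_obs}

np.random.seed(5678)
big = simulate_times(N=100_000, sigma=0.1, length=5, height=2.5)

s = np.sqrt(5.0**2 + 2.5**2)
t_true = np.sqrt(14 * s**2 / (5 * 9.81 * 2.5))
geo_mean = np.exp(np.mean(np.log(big['t_obs'])))
print(geo_mean, t_true)  # geometric mean should be close to the true time
```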

Here’s the Stan program that I wrote—GPT’s is pretty darn close, but my priors are tighter. I chose lognormal error just to keep everything positive, not because multiplicative (proportional) error is right for measuring time. I suspect the errors are more naturally additive based on the precision of the measuring device.

data {
  real length, height; 
  int N;  vector[N] t_obs;
}
transformed data {
  real s = sqrt(length^2 + height^2);
  real h = height;
}
parameters {
  real<lower=0> g;      // gravity accel in m/s^2
  real<lower=0> sigma;  // measurement error scale (log scale)
}
model {
  real log_t_true = 0.5 * (log(14) + 2 * log(s) - log(5) - log(g) - log(h));
  t_obs ~ lognormal(log_t_true, sigma);
  sigma ~ exponential(1);
  g ~ lognormal(log(10), 0.25);
}

I threw in tighter priors than GPT—these are weakly informative. A more general measurement error model would allow for lack of calibration (i.e., measurement bias), and use a constant offset as well as an error term.
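Here’s a sketch of what that more general model might look like, with additive normal error and a constant bias term delta. This is my illustration, untested; the priors on delta and sigma are placeholders, and with only one apparatus the bias term may not be well identified without calibration data.

```stan
data {
  real length, height;
  int N;  vector[N] t_obs;
}
transformed data {
  real s = sqrt(length^2 + height^2);
  real h = height;
}
parameters {
  real<lower=0> g;       // gravity accel in m/s^2
  real<lower=0> sigma;   // additive measurement error (seconds)
  real delta;            // constant measurement bias (seconds)
}
model {
  real t_true = sqrt(14 * s^2 / (5 * g * h));
  t_obs ~ normal(t_true + delta, sigma);   // additive rather than multiplicative error
  sigma ~ exponential(1);
  delta ~ normal(0, 0.1);                  // placeholder prior on bias
  g ~ lognormal(log(10), 0.25);
}
```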