Inference and statistical physics

Windows On Theory 2021-04-02

Scribe notes by Franklyn Wang

Previous post: Robustness in train and test time. Next post: TBD. See also all seminar posts and the course webpage.

lecture slides (pdf) | lecture slides (PowerPoint with animation and annotation) | video

Digression: Frequentism vs Bayesianism

Before getting started, we’ll discuss the difference between the two dominant schools of thought in probability: Frequentism and Bayesianism.

Frequentism holds that the probability of an event is its long-run frequency over many repetitions: a coin has 50% probability of heads if, on average, half of its flips come up heads. One consequence of this framework is that a one-off event like Biden winning the election doesn’t have a probability, since by definition it can only be observed once.

Bayesians reject this model of the world. For example, a famous Bayesian, Jaynes, even wrote that “probabilities are frequencies only in an imaginary universe.”

While these branches of thought are different, generally the answers to most questions are the same, so the distinction will not matter for this class. However, these branches of thought inspire different types of methods, which we now discuss.

For example, suppose that we have a probability distribution D from which we can get samples \vec{x}. In statistics, our goal is to compute a hypothesis h \in \mathcal{H} which minimizes \mathcal{L}_{D}(h) (possibly a loss function, but any minimand will do), where we estimate \mathcal{L}_D from our samples \vec{x}.

A frequentist does this by defining a family \mathcal{D} of potential distributions, and looking for a transformation \vec{x} \mapsto h(\vec{x}) which minimizes the cost for all D \in \mathcal{D}:

h = \arg\min_{\vec{x} \mapsto h(\vec{x})} \; \max_{D \in \mathcal{D}} \; \mathbb{E}_{\vec{x} \sim D}\left[\mathcal{L}_{D}(h(\vec{x}))\right].

That is, h minimizes the worst-case loss over all distributions in the family.

By contrast, a Bayesian approaches this task by assuming that D = D_{\vec{w}} where \vec{w} \sim P are latent variables sampled from a prior.

Then, let Q_{\vec{x}}(\vec{w}) = P(\vec{w} \mid \vec{x}) be the posterior distribution of \vec{w} conditioned on the observations \vec{x}. We now minimize the expected loss over this posterior:

h = \arg\min_{h \in \mathcal{H}} \; \mathbb{E}_{\vec{w} \sim Q_{\vec{x}}}\left[\mathcal{L}_{D_{\vec{w}}}(h)\right].

In fact, Bayesian approaches extend beyond this, as if you can sample from a posterior you can do more than just minimize a loss function.

In Bayesian inference, choosing the prior P is often very important. One classical approach is to use a maximum entropy prior, since it is the least informative (hence has the most entropy) and requires the fewest assumptions.

These two minimization problems can sometimes lead to identical results, but in practice they often work out differently once we introduce computational constraints into the picture. In the frequentist approach, we generally constrain the family of mappings \vec{x} \mapsto h to be efficiently computable. In the Bayesian approach, we typically approximate the posterior instead, and either use approximate sampling or restrict the posterior to a hypothesis class from which we can efficiently sample.

Part 1. An introduction to Statistical Physics

Compare water and ice. Water is hotter, and its molecules move around more; in ice, by contrast, the molecules are more stationary. When the temperature increases, the molecules move more quickly, and when it decreases they have less energy and slow down. There are also phase transitions, where certain temperatures cause qualitative discontinuities in behavior, such as solid to liquid or liquid to gas.

See video from here:

An atomic state can be thought of as x \in {0, 1}^{n}. Now, how do we represent the system’s state? The crucial observation that makes statistical physics different from “vanilla physics” is the insight to represent the system state as a probability distribution p supported on {0, 1}^{n}, rather than as a single atomic state.

Each atomic state x has a negative energy W(x), where W: {0, 1}^{n} \rightarrow \mathbb{R} (we will think of W as a utility function). In some sense, the system “wants” to have a high value of \mathbb{E}_{x \sim p}[W(x)]. In addition, when the temperature is high, the system “wants” to have high entropy.

Thus, an axiom of thermodynamics states that to find the true distribution, we need only look at the maximizer of

p = \arg\max_{q} \mathbb{E}_{x \sim q}[W(x)] + \tau \cdot H(q).

That is, it is the probability distribution which maximizes a linear combination of the expected negative energy and the entropy, with the temperature controlling the coefficient of entropy in this combination (the higher the temperature, the more the system prioritizes having high entropy).

The variational principle, which we prove later, states that the maximizing q is given by

p(x) \propto \exp(\tau^{-1} \cdot W(x))

which is known as the Boltzmann Distribution. We often write this with a normalizing constant, so p(x) = \exp(\tau^{-1} \cdot W(x) - A_{\tau}(W)), where A_{\tau}(W) = \log \left(\int \exp(\tau^{-1} \cdot W(x))\right).
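For small n, the Boltzmann form can be checked numerically by brute force. Below is a minimal Python sketch (the quadratic W in the usage comment is an arbitrary illustrative choice, not from the lecture): it computes p(x) \propto \exp(\tau^{-1} \cdot W(x)) over {0, 1}^{n} and verifies that p scores at least as well as a random distribution q on the objective \mathbb{E}_{x \sim q}[W(x)] + \tau \cdot H(q).

```python
import itertools
import numpy as np

def boltzmann_check(W, n, tau=1.0, seed=0):
    """Brute-force check that the Boltzmann distribution maximizes E_q[W] + tau*H(q)."""
    rng = np.random.default_rng(seed)
    xs = np.array(list(itertools.product([0, 1], repeat=n)))   # all 2^n atomic states
    scores = np.array([W(x) for x in xs])
    # Boltzmann distribution: p(x) = exp(W(x)/tau - A_tau(W)), A_tau the log-normalizer
    log_p = scores / tau - np.logaddexp.reduce(scores / tau)
    p = np.exp(log_p)

    def objective(q):
        entropy = -np.sum(q * np.log(q + 1e-12))
        return q @ scores + tau * entropy

    q = rng.random(len(xs)); q /= q.sum()      # an arbitrary comparison distribution
    assert objective(p) >= objective(q) - 1e-9
    return p

# Example usage with an arbitrary quadratic "negative energy":
# p = boltzmann_check(lambda x: float(x @ x), n=6, tau=2.0)
```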

Before proving the variational principle, we will go through some examples of statistical physics.

Example: Ising Model

In the Ising model, we have n magnets that are connected in a square grid. Each atomic state is some x \in {\pm 1}^{n}, representing the “spins”. The value of W(x) is J\sum_{i \sim j} x_i x_j + J'\sum_{i} x_i, where i \sim j denotes that i and j are adjacent magnets. This encourages adjacent magnets to have aligned spins. One important concept, which we will return to later, is that the values of {x_i, x_i x_j} are sufficient to calculate the value of W(x) (they are known as sufficient statistics). Furthermore, if we wanted to calculate \mathbb{E}[W(x)], it would be enough to know the values of \mathbb{E}[x_i] and \mathbb{E}[x_i x_j] and then apply linearity of expectation. An illustration of an Ising model that we cool down slowly can be found here:

and a video can be found here.

This is what a low-temperature Ising model looks like – note that W(x) is high because almost all adjacent spins are aligned (hence the well-defined regions).
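To make the sufficient-statistics remark concrete, here is a minimal Python sketch (the L \times L grid layout and NumPy representation are our own illustrative choices) that computes W(x) using only the single-site values x_i and the pairwise products x_i x_j:

```python
import numpy as np

def ising_energy(x, J=1.0, Jp=0.0):
    """Negative energy W(x) = J * sum_{i ~ j} x_i x_j + J' * sum_i x_i
    for an L x L grid of spins x with entries in {-1, +1}."""
    # Pairwise products over horizontally and vertically adjacent sites:
    pair_sum = np.sum(x[:, :-1] * x[:, 1:]) + np.sum(x[:-1, :] * x[1:, :])
    return J * pair_sum + Jp * np.sum(x)
```

Since only the statistics x_i and x_i x_j enter, \mathbb{E}[W(x)] is determined by \mathbb{E}[x_i] and \mathbb{E}[x_i x_j], exactly as noted above.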

The Sherrington-Kirkpatrick (SK) model is a generalization of the Ising model to a random graph, which represents a disordered mean field model. Here x \in {\pm 1}^{n} still represents the spins, and W(x) = \sum w_{i, j} x_i x_j where w_{i, j} \sim \mathcal{N}(0, 1). The SK model is deeply influential in statistical physics. We say that it is disordered because the utility function W is chosen at random, and that it is a mean field model because, unlike the Ising model, there is no geometry: every pair of variables x_i and x_j is equally likely to interact.

Our third example is the posterior distribution, where x is a hidden variable with a uniform prior. We now make, say, k independent observations O_1, O_2, \ldots, O_k. The probability of x is given by

p(x) \propto p(x\text{ satisfies } O_1) \cdot p(x\text{ satisfies } O_2) \ldots p(x\text{ satisfies } O_k)

so p(x) \propto \exp\left(\sum_i \log \mathbb{P}(x\text{ satisfies } O_i)\right). Note that the right-hand side is easy to calculate, but the normalizing factor (also known as the partition function) can be difficult to compute, and this is often the main barrier: if we could compute the partition function, many of these questions would become far easier.

Proof of the Variational Principle

Now, we prove the variational principle, which states that if p(x) = \exp(\tau^{-1} \cdot W(x) - A_\tau(W)), where A_\tau(W) = \log \int \exp(\tau^{-1} \cdot W(x)) is the normalizing factor, then

p = \arg\max_{q} \mathbb{E}_{x \sim q} W(x) + \tau \cdot H(q)

Before proving the variational principle, we make the following claim: if p is defined as p(x) = \exp(\tau^{-1} \cdot W(x) - A_\tau(W)), then

\mathbb{E}_{x \sim p} W(x) + \tau \cdot H(p) = \tau \cdot A_{\tau}(W).

Proof of claim: Write H(p) = -\int \log p(x) p(x) dx = -\int (\tau^{-1} \cdot W(x) - A_{\tau}(W)) p(x) dx (where we plugged in the definition of p(x) in \log p(x)).

We can now rewrite this as

H(p) = -\tau^{-1} \cdot \mathbb{E}_{x \sim p} W(x) + \mathbb{E}_{x \sim p} A_{\tau}(W)

which means that

H(p) = -\tau^{-1} \cdot \mathbb{E}_{x \sim p} W(x) + A_{\tau}(W)

using the fact that A_\tau(W) is a constant (independent of x). Multiplying by \tau and rearranging, we obtain the claim. \blacksquare

Proof of variational principle: Given the claim, we can now prove the variational principle. Let q be any distribution. We write

\mathbb{E}_{x \sim q} W(x) + \tau \cdot H(q) = \tau \cdot \mathbb{E}_{x \sim q}[\log p(x) + A_{\tau}(W)] + \tau \cdot H(q) = \tau \cdot A_{\tau}(W) - \tau \cdot \mathrm{KL}(q \| p) \le \tau \cdot A_{\tau}(W) = \mathbb{E}_{x \sim p} W(x) + \tau \cdot H(p),

where the first equality uses our expression for \log p(x), the last equality uses the claim, and the inequality is the nonnegativity of KL divergence (with equality iff q = p). This implies that \mathbb{E}_{x \sim q} W(x) + \tau \cdot H(q) is maximized at q = p, as desired. \blacksquare

Remarks:

  • When \tau = 1, we have A(W) = \max_{q} H(q) + \mathbb{E}_{x \sim q} W(x).
  • In particular, for every distribution q, we have that

A(W) \ge \mathbb{E}_{x \sim q} W(x) + H(q),

which we’ve seen before! Note that equality holds when q = p, but to approximate A we can simply take more tractable values of q.

In many situations, we can compute W(x) but can’t compute A_{\tau}(W), which stifles many applications. One upshot, though, is that we can still compute ratios p(x)/p(x') = \exp(\tau^{-1} \cdot (W(x) - W(x'))), which is good enough for some applications, like Markov Chain Monte Carlo.

Markov Chain Monte Carlo (MCMC)

An important task in statistics is to sample from a distribution p, for very complicated p. MCMC does this by constructing a Markov Chain whose stationary distribution is p. The most common instantiation of MCMC is Metropolis-Hastings, which we now describe. First, we assume that there exists an undirected graph on the states x, so that x \sim x' iff x' \sim x, and that for x \sim x' the probabilities of x and x' are similar, in the sense that the ratio p(x)/p(x') = \exp(\tau^{-1} \cdot (W(x) - W(x'))) is neither too small nor too large. Then the Metropolis-Hastings algorithm is as follows.

Metropolis-Hastings Algorithm:

  1. Draw x_0 at random.
  2. For i = 1, 2, \ldots, choose a neighbor x' \sim x_{i-1}, and let x_{i} = x' with probability \min\left(1, p(x')/p(x_{i-1})\right) = \min\left(1, \exp(\tau^{-1} \cdot (W(x') - W(x_{i-1})))\right); otherwise let x_{i} = x_{i-1}.

Then, x_t eventually is distributed as p! To show that this samples from p, we show that the stationary distribution of this Markov chain is p. In this case, this turns out to be easy, since we can check the detailed balance condition p(x' | x)p(x) = \min{(p(x), p(x'))} = p(x | x')p(x'), so p is the stationary distribution of the Markov Chain.

The stationary distribution is often unique, so we’ve shown that this eventually samples from p. In MCMC algorithms, however, an important question is how fast we converge to the stationary distribution. Often this is rather slow, which is especially disappointing because if it were faster, many very difficult problems, like generic optimization, could be solved much more easily. Indeed, there are examples where convergence to the stationary distribution takes exponential time.

Note: One way to make MCMC faster is to let x' be really close to x, because then the likelihood ratio will be closer to 1 and x_i will spend less time stuck at its current location. There’s a tradeoff, however. If x' is too close to x, then the chain will not mix as quickly, where mixing means the convergence of x_t to the stationary distribution: generally, to mix, the value x_t must be able to get close to every point, and making each step smaller makes that take longer.
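Here is a minimal Python sketch of the Metropolis-Hastings loop above, for a Boltzmann distribution p(x) \propto \exp(\tau^{-1} \cdot W(x)) over {0, 1}^{n}, with single-bit flips playing the role of the neighborhood graph (the choice of neighborhood and the user-supplied W are illustrative assumptions):

```python
import numpy as np

def metropolis_hastings(W, n, tau=1.0, steps=10_000, seed=0):
    """Approximately sample from p(x) ∝ exp(W(x)/tau) over {0,1}^n via bit-flip moves."""
    rng = np.random.default_rng(seed)
    x = rng.integers(0, 2, size=n)                 # step 1: draw x_0 at random
    for _ in range(steps):
        x_new = x.copy()
        x_new[rng.integers(n)] ^= 1                # propose a neighbor: flip one bit
        # accept with probability min(1, exp((W(x') - W(x)) / tau)) = min(1, p(x')/p(x))
        if np.log(rng.random()) < (W(x_new) - W(x)) / tau:
            x = x_new
    return x
```

Note that only the ratio p(x')/p(x) appears, so the partition function A_{\tau}(W) is never needed.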

Applications of MCMC: Simulated Annealing

One application of MCMC is simulated annealing. Suppose that we have W: {0, 1}^{n} \rightarrow \mathbb{R}, and we want to find x = \arg \max_{x} W(x). The most direct attempt at solving this problem is to create a Markov Chain that simply samples from x \in \arg\max_{x} W(x). However, this is impractical, at least directly. It is like having a domino that is far away from us, which we cannot knock down. What do we do in this case? We put some dominoes in between!

To this end, we create a sequence of Markov chains over states x whose stationary distributions are p_{\tau}(x) \propto \exp(\tau^{-1} \cdot W(x)), for smaller and smaller values of \tau. This corresponds to cooling a system from a high temperature to a low temperature. Essentially, we want to sample from p_0(x), which is uniform over the maximizers of W.

Simulated annealing lets us begin by sampling from p_{\infty}, which is uniform on the support. Then we slowly reduce \tau from \infty to 0. When cooling a system, it is helpful to think of two alternating stages: a cooling stage, in which the temperature is lowered, and a settling stage, in which the system settles into a more stable state. The settling stage is simulated via MCMC on p_{\tau}(x), so the transition probability from x to x' is \min(1, \exp(\tau^{-1} \cdot (W(x') - W(x)))).

Simulated Annealing Algorithm:

  1. Cooling: Begin with \tau at \infty (or very large), and lower the value of \tau toward zero according to some schedule.
     1a. Settling: At the current temperature, repeatedly move from x to a neighbor x' with probability \min(1, \exp(\tau^{-1} \cdot (W(x') - W(x)))).
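A minimal Python sketch of this schedule, reusing the same bit-flip neighborhood as the Metropolis-Hastings sketch above (the geometric cooling schedule and its parameters are illustrative choices, not prescribed by the lecture):

```python
import numpy as np

def simulated_annealing(W, n, tau_start=10.0, tau_end=0.01,
                        cooling=0.95, settle_steps=200, seed=0):
    """Approximately maximize W over {0,1}^n by slowly cooling a Boltzmann sampler."""
    rng = np.random.default_rng(seed)
    x = rng.integers(0, 2, size=n)
    tau = tau_start
    while tau > tau_end:                           # cooling stage: lower tau on a schedule
        for _ in range(settle_steps):              # settling stage: MCMC at temperature tau
            x_new = x.copy()
            x_new[rng.integers(n)] ^= 1            # flip one bit
            if np.log(rng.random()) < (W(x_new) - W(x)) / tau:
                x = x_new
        tau *= cooling
    return x
```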

Since simulated annealing is inspired by physical intuition, its shortcomings can be interpreted physically too. Namely, when you cool something too quickly it ends up in a glassy state instead of the ground state; correspondingly, simulated annealing often fails by getting stuck in local optima. Note that this is fundamentally a failure mode of MCMC – the chain can’t mix quickly enough.

See this video for an illustration of simulated annealing.

Part 2: Bayesian Analysis

Note that the posterior of \vec{w} conditioned on x_1, \ldots x_n is

p(\vec{w} | x_1, \ldots, x_n) = \frac{p(\vec{w})}{p(\vec{x})} \cdot p(x_1 | \vec{w}) p(x_2 | \vec{w}) \cdots p(x_n | \vec{w}) \propto p(\vec{w}) \cdot \exp\left(\sum_i X_i(\vec{w})\right), \quad \text{where } X_i(\vec{w}) = \log p(x_i \mid \vec{w}).

We now have p_{W}(x) = \exp(W(x) - A(W)), an exponential distribution. Now suppose that W(x) = \langle w, \hat{x} \rangle, where \hat{x} \in \mathbb{R}^{m} are the sufficient statistics of x. For example, the energy function could follow that of a tree, so W(x) = \sum_{(i, j) \in T} w_{i, j} x_i x_j. If you want to find the expected value \mathbb{E}_{x \sim p_{W}}[W(x)], it’s enough to know \mu = \mathbb{E}_{x \sim p_{W}}[\hat{x}]. (By following a tree, we mean that the undirected graphical model of the distribution is a tree.)

Sampling from a Tree-Structured Distribution

Now how do we sample from a tree-structured distribution? The most direct way is by calculating the marginals. While marginalization is generally quite difficult, it is much more tractable on certain types of graphs. For a sense of what this entails, suppose that we have a tree-structured statistical model on six variables x_1, \ldots, x_6, connected by a tree with edge set E, whose joint PDF is p(x) \propto \exp\left(-\sum_{(i, j) \in E} w_{i, j} (x_i - x_j)^2\right) where x \in {0, 1}^{6}. (The unusual notation is chosen so that the relationships between the marginals are clean; the log-density in question is a Laplacian quadratic form.)

Then one can show that if the marginal of x_3 is \mu_3 = \mathbb{P}(x_3 = 1), then we have that

\mu_6 = \mu_3 \cdot \frac{e^{w_{3,6}}}{1 + e^{w_{3, 6}}} + (1 - \mu_3) \cdot \frac{1}{1 + e^{w_{3, 6}}}

This gives six linear equations in six unknowns, which we can solve. Once we can marginalize a variable, say x_6, we can sample from its marginal, and then sample from the conditional distribution given x_6 to sample from the entire distribution. This method is known as belief propagation.
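As a sanity check of the edge update above, here is a minimal Python sketch comparing the formula for \mu_6 with a brute-force computation of \mathbb{P}(x_6 = 1) for the single edge (3, 6) considered in isolation (the particular values of \mu_3 and w_{3,6} in the usage comment are arbitrary):

```python
import numpy as np

def mu6_from_mu3(mu3, w36):
    """The belief-propagation edge update from the text."""
    return mu3 * np.exp(w36) / (1 + np.exp(w36)) + (1 - mu3) / (1 + np.exp(w36))

def mu6_bruteforce(mu3, w36):
    """P(x_6 = 1) computed directly from p(x_6 | x_3) ∝ exp(-w36 * (x_3 - x_6)^2)."""
    total = 0.0
    for x3, p3 in [(0, 1 - mu3), (1, mu3)]:
        weights = np.array([np.exp(-w36 * (x3 - x6) ** 2) for x6 in (0, 1)])
        total += p3 * weights[1] / weights.sum()
    return total

# e.g. np.isclose(mu6_from_mu3(0.3, 1.2), mu6_bruteforce(0.3, 1.2)) -> True
```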

This algorithm (with some modifications) also works for general graphs, but on a general graph what it computes is not the exact marginal; rather, it corresponds to the marginal on an extension of that graph to a tree.

General Exponential Distributions

Now let’s try to develop more theory for exponential distributions. Let the PDF of this distribution be p_W(x) = \exp(\langle w, \hat{x} \rangle - A(w)). A few useful facts about these distributions that often appear in calculations: \nabla A(w) = \mu and \nabla^2 A(w) = \mathrm{Cov}_{p_{W}}(\hat{x}) \succeq 0.

By the variational principle, we have that among all distributions with the same sufficient statistics as a given Boltzmann distribution, the true one maximizes the entropy, so

p_w = \arg\max_{\mathbb{E}_{q} \hat{x} = \mu} H(q).

An analogous idea gives us \mu = \arg \max_{H(\mu) \ge H(p_{w})} \langle w, \mu \rangle,

so p_w is the maximum entropy distribution consistent with observations \mu.

The important consequence of these two facts is that in principle, w determines \mu and vice versa. (Up to fine print of using minimal sufficient statistics, which we ignore here.) We will refer to w as the “canonical parameter space” and \mu as the “mean representation space”. Now note that the first equation gives us a way of going from \mu to w, and the second equation gives us a way to go from w to \mu, at least information-theoretically.

Now, how do we do this algorithmically? Given w, if we are able to sample from p_w (which is generally possible if we can evaluate A(w)), then we can estimate the mean \mu from samples. We can also obtain \mu from A(w) by estimating \nabla A.

On the other hand, if you have \mu, to obtain w you first consider A^{\ast}(\mu) = \sup_{w \in \Omega} \{\langle \mu, w \rangle - A(w)\} and then note that the desired value is w = \arg\max_{w \in \Omega}\{\langle \mu, w \rangle - A(w)\} = \nabla A^{\ast}(\mu), so solving this problem boils down to estimating \nabla A^{\ast} efficiently.

Thus going from w to \mu requires estimating \nabla A, whereas going from \mu to w requires estimating \nabla A^{\ast}.
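As a concrete one-dimensional example of both directions, take a single binary variable with sufficient statistic \hat{x} = x \in \{0, 1\}. Then A(w) = \log(1 + e^{w}), and

\mu = \nabla A(w) = \frac{e^{w}}{1 + e^{w}}, \qquad A^{\ast}(\mu) = \mu \log \mu + (1 - \mu)\log(1 - \mu), \qquad w = \nabla A^{\ast}(\mu) = \log \frac{\mu}{1 - \mu},

so in this toy case both gradients are available in closed form; in high dimensions, computing (or estimating) these gradients is exactly where the computational difficulty lies.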

When p is a posterior distribution, the observed data typically gives us the weights w, and hence the inference problem becomes to use that to sample from the posterior.

Examples of Exponential Distributions

There are many examples of exponential distributions, which we now give.

  • High-Dimensional Normals: x\in \mathbb{R}^{d}, W(x) = -(x - \mu)^{\top} \Sigma^{-1} (x-\mu)
  • Ising Model: x\in {0, 1}^{d}, W(x) = \sum w_i x_i + \sum_{(i, j) \in E} w_{i, j} x_i x_j. A sufficient statistic for these is \hat{x} = (x, xx^{\top}), a fact which we will invoke repeatedly.
  • There are many more, including Gaussian Markov Random Fields, Latent Dirichlet Allocation, and Mixtures of Gaussians.

Going from canonical parameters to mean parameters

Now, we show how to go from w to \mu in a special case, namely under the mean-field approximation. The mean-field approximation approximates a distribution over (x_1, \ldots, x_d) \in {0, 1}^{d} by a product distribution, for which

\mathbb{P}(x_1, \ldots, x_d) = \mathbb{P}(x_1)\mathbb{P}(x_2) \cdots \mathbb{P}(x_d).

Recall that the log-partition function can be computed as

A(w) = \max_{q \in \mathcal{Q}} \langle w, \mathbb{E}_{q} \hat{x} \rangle + H(q).

where \mathcal{Q} is the set of all probability distributions. If we instead restrict to product distributions (parametrized by \mu, where \mu_i is the probability that x_i = 1), we get the lower bound

A(w) \ge \max_{\mu \in [0, 1]^{d}} \left( \sum_i w_i \mu_i + \sum_{(i, j) \in E} w_{i,j} \mu_i \mu_j + \sum_i H(\mu_i) \right),

where H(\mu_i) = -\mu_i \log \mu_i - (1 - \mu_i) \log(1 - \mu_i) is the binary entropy, and it now suffices to maximize the right-hand function over the set K = [0, 1]^{d}.

We can generally maximize concave functions over a convex set such as K. Unfortunately, this function is not concave. However, it is concave in every coordinate. This suggests the following algorithm: fix all but one variable and maximize over that variable, and repeat. This approach is known as Coordinate Ascent Variational Inference (CAVI), and its pseudocode is given below.

CAVI Algorithm

  1. Let \mu = 1/2 \cdot (1, 1, \ldots 1)
  2. Repeatedly choose values of j \in [d] (possibly at random, or in a loop). 2a. Update \mu_{j} = \arg\max_{\mu_j} f(\mu_{-j}, \mu_j), where f is the right-hand function above and \mu_{-j} denotes the non-j coordinates of \mu.
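Here is a minimal Python sketch of CAVI for the mean-field objective above (the representation of the couplings as a symmetric matrix with zero diagonal, and the number of sweeps, are illustrative choices). Setting the derivative of the objective in \mu_j to zero gives the closed-form coordinate update \mu_j = \sigma(w_j + \sum_i w_{i,j} \mu_i), where \sigma is the sigmoid function.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cavi_ising(w_single, w_pair, sweeps=100, seed=0):
    """Mean-field CAVI for W(x) = sum_i w_i x_i + sum_{(i,j) in E} w_{ij} x_i x_j, x in {0,1}^d.

    w_single is the vector (w_i); w_pair is a symmetric matrix with zero diagonal
    whose (i, j) entry is the coupling w_{ij} on edge {i, j}."""
    rng = np.random.default_rng(seed)
    d = len(w_single)
    mu = np.full(d, 0.5)                       # step 1: initialize mu = (1/2, ..., 1/2)
    for _ in range(sweeps):
        for j in rng.permutation(d):           # step 2: sweep over the coordinates
            # closed-form maximizer of the objective in mu_j, other coordinates fixed
            mu[j] = sigmoid(w_single[j] + w_pair[j] @ mu)
    return mu
```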

Part 3. Solution landscapes and the replica method

This part is best explained visually. Suppose that we have a Boltzmann distribution. In the infinite-temperature limit this is the uniform distribution over the entire domain. As you decrease the temperature, the support’s size decreases. (Note: the gifs below are just cartoons; they pretend the probability distribution is always uniform over a subset. Also, in most cases of interest for learning, only higher-order derivatives of the entropy are discontinuous.)

Sometimes there is a discontinuity and the entropy “suddenly drops”. Often the entropy function itself is continuous, but its derivatives are not continuous at that point – a higher-order phase transition.

Sometimes the geometry of the solution space undergoes a phase transition as well, with it “shattering” to several distinct clusters.

The replica method

Note: We will have an extended post on the replica method later on.

If you sampled x_1, \ldots, x_n from p_{w}, where p_w is a high-dimensional distribution, you’d expect the pairwise distances to all be approximately the same. That is, the overlap matrix

\begin{bmatrix} \langle x_1, x_1 \rangle & \cdots & \langle x_1, x_n \rangle \\ \vdots & \ddots & \vdots \\ \langle x_n, x_1 \rangle & \cdots & \langle x_n, x_n \rangle \end{bmatrix}

is well approximated by

\begin{bmatrix} 1 & \cdots & q \\ \vdots & \ddots & \vdots \\ q & \cdots & 1 \end{bmatrix}

where q is some constant.

Suppose that w itself comes from a probability distribution. Then a very common problem is computing

\mathbb{E}_{w}[A(w)] = \mathbb{E}_{w}\left[ \log \int \exp(\langle w, \hat{x} \rangle)\right],

which is the expected free energy. However, it turns out to be much easier to find \log (\int E_{w} \exp(\langle w, \hat{x} \rangle)). So here’s what we do. We find

\mathbb{E}_{w} [A(w)] = \lim_{n \rightarrow 0} \frac{\mathbb{E}_{w} [\exp(n \cdot A(w))] - 1}{n}

This should already smell weird, because n is going to zero, an unusual notational choice. We can now write this as \mathbb{E}_{w}[A(w)] = \lim_{n \rightarrow 0} \frac{\mathbb{E}_{w}\left[\left( \int_{x} \exp(\langle w, \hat{x} \rangle)\right)^{n}\right] - 1}{n}

which we can write as

\lim_{n \rightarrow 0} \frac{\int_{x_1, \ldots x_n} \mathbb{E}_{w} \exp(\langle w, \hat{x}_1 + \cdots + \hat{x}_n \rangle) - 1}{n}

Now these x_i‘s represent the replicas! We’d hope that \psi(n) = \int_{x_1, \ldots, x_n} \mathbb{E}_w \exp(\langle w, \hat{x}_1 + \hat{x}_2 + \cdots + \hat{x}_n \rangle) is an analytic function of n, so that we can compute it for integer n and then evaluate the limit \lim_{n \rightarrow 0} \frac{\psi(n) - 1}{n}.

Generally speaking, \mathbb{E}_{w} [\exp(\langle w, \hat{x}_1 + \cdots + \hat{x}_n \rangle)] depends on the x_i's only through their overlap matrix Q, so we often guess the form of Q and calculate this expectation.
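For instance, if the disorder is Gaussian, say w \sim \mathcal{N}(0, I_m) (the identity covariance is just for illustration), then the Gaussian moment generating function gives

\mathbb{E}_{w}\left[\exp\left(\langle w, \hat{x}_1 + \cdots + \hat{x}_n \rangle\right)\right] = \exp\left(\tfrac{1}{2} \left\| \hat{x}_1 + \cdots + \hat{x}_n \right\|^2\right) = \exp\left(\tfrac{1}{2} \sum_{a, b} \langle \hat{x}_a, \hat{x}_b \rangle\right),

which indeed depends on the replicas only through their overlaps.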

Examples:

We’ll now give an example of how this is useful. Consider a spiked matrix/tensor model, where we observe Y = \lambda S + N, where S is the signal and N is the noise. Here we have

p(S' | Y) \propto \exp(-\beta \| Y - \lambda S' \|^2) \cdot p(S'),

and we want to analyze \mathbb{E}_{S' \sim p(\cdot | Y)} \langle S, S' \rangle^2 as a function of \lambda, which can be done by the replica method and exhibits a phase transition.

Other examples include teacher-student models, where the data is (X, Y) = (X, f_S(X) + N). A recent line of work uses the replica method to calculate the training losses and classification errors of models trained on such data, and these predictions closely match empirical values.

Symmetry breaking

Lastly, we’ll talk about replica symmetry breaking. Sometimes a discontinuity doesn’t just make the support smaller, but actually breaks it into many different parts. For example, in the cartoon, at this transition the circle breaks into the three green circles.

When this happens, we can no longer make our overlap-matrix assumption, as there are now asymmetries between the samples: two samples from the same cluster overlap more than two samples from different clusters. This leads to the overlap matrix having a block structure, in the fashion one would anticipate.

Sometimes, the support actually breaks into infinitely many parts, in a phenomenon called full replica symmetry breaking.

Finally, some historical notes. This “plugging in zero” trick was used by Parisi decades ago, receiving a standing ovation at the ICM. Since then, some of the resulting conjectures have been rigorously established, but many have not. It is still very impressive that Parisi was able to do it.