Some questions and ideas about initialization for ADVI

Statistical Modeling, Causal Inference, and Social Science 2026-01-16

Matthias Müller-Schrader writes:

I’ve recently read your ADVI paper in JMLR with great enjoyment.

However, I stumbled over one important detail:

In the case of the full-rank Gaussian, the Cholesky factorization ensures that Sigma is positive semi-definite for an unconstrained lower-triangular matrix L. But this doesn't guarantee positive definiteness, as one of the diagonal entries (= eigenvalues) of L could be zero, rendering L and Sigma non-invertible. As the inverse of L is needed to compute the gradients (eq. 9), there are implicit constraints on the variational parameters (the diagonal entries of L have to be non-zero), and the optimization might run into problems in the neighborhood of these points.

Could that be a problem, or did you even observe problems related to it? Or is there reason to believe that the optimization will never reach these areas of parameter space if initialized properly?
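To make the concern concrete, here is a minimal numpy sketch (my illustration, not code from the paper): a lower-triangular L with a zero on its diagonal still gives a symmetric positive semi-definite Sigma = L L^T, but Sigma is singular and L cannot be inverted, which is exactly the inverse the gradient in eq. 9 requires.

import numpy as np

# A Cholesky-style factor with one diagonal entry exactly zero.
L = np.array([[1.5, 0.0],
              [0.7, 0.0]])
Sigma = L @ L.T

print(np.linalg.eigvalsh(Sigma))   # one eigenvalue is 0: only positive semi-definite
print(np.linalg.det(L))            # 0.0: L is singular
try:
    np.linalg.inv(L)               # the full-rank gradient needs L^{-1}
except np.linalg.LinAlgError as err:
    print("inverting L fails:", err)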

Alp Kucukelbir, the first author on the ADVI paper (in statistics, the first author is typically the person who writes the paper; we don’t assume a default alphabetical order) replied:

Thank you for your email, which I also read with great enjoyment.

Implementers of statistical algorithms often follow best practices, such as adding small constants to avoid situations like the one you describe. That said, your intuition about the importance of initialization is key; there has been considerable subsequent study of this topic. I suggest the following two manuscripts as follow-ups:

https://www.jmlr.org/papers/volume23/21-0889/21-0889.pdf
https://studenttheses.uu.nl/bitstream/handle/20.500.12932/47140/MSc-Thesis-Gertjan-Brouwer-9386386.pdf
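To make the "small constants" practice concrete, here is a hedged sketch (my reading of the generic trick, not code from the ADVI paper or from Stan): before inverting L, push any diagonal entry that sits at or near zero a small distance away from it.

import numpy as np

def safe_cholesky_factor(L_raw, eps=1e-8):
    """Return a lower-triangular factor whose diagonal entries are at least
    eps away from zero, so that L can be inverted. (Illustrative only.)"""
    L = np.tril(L_raw).copy()
    d = np.diag(L)
    # keep the sign but enforce |diagonal| >= eps; exact zeros are sent to +eps
    d_safe = np.where(np.abs(d) < eps, np.sign(d) * eps + (d == 0) * eps, d)
    np.fill_diagonal(L, d_safe)
    return L

L_raw = np.array([[1.5, 0.0],
                  [0.7, 0.0]])
L = safe_cholesky_factor(L_raw)
print(np.diag(L))          # [1.5e+00  1.0e-08]
print(np.linalg.inv(L))    # now well defined, at the cost of a tiny perturbation

This keeps the gradient computation defined, but as Matthias notes next, the nudging itself is not necessarily free of side effects.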

To which Matthias responded:

Thank you for the follow-ups – I’ll have a look at them.

Could the problem also be avoided by reparametrizing the diagonal entries/eigenvalues of L as \omega_i = log(\sigma_i) (as is done in the mean-field approximation, corresponding to the log-Cholesky parametrization of Pinheiro and Bates, 1995) and leaving the off-diagonal entries unconstrained?

The new expression for the gradient should be straightforward to compute. This would give an unconstrained (and even unique) parametrization of L, which could have two advantages: (1) avoiding the ambiguity could increase stability [situations where the optimization jumps between, e.g., + and - the diagonal entries/rows of L become impossible], and (2) it would be very hard for the optimization algorithm to enter these problematic areas of parameter space, because we shift them to a neighborhood of -infinity. Could that be preferable to adding the small constants, which avoid dividing by zero but could still have side effects?
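Here is a sketch of the log-Cholesky parametrization Matthias is proposing (my illustration following Pinheiro and Bates, not code from the ADVI paper): the diagonal of L is exp(omega) with omega unconstrained, so the diagonal can never hit zero and Sigma = L L^T is positive definite for every parameter value; the problematic region is pushed out toward omega -> -infinity.

import numpy as np

def build_cholesky(omega, offdiag):
    """omega: length-d vector of log-scales; offdiag: the d*(d-1)/2 strictly
    lower-triangular entries. Both are unconstrained real vectors."""
    d = len(omega)
    L = np.zeros((d, d))
    L[np.tril_indices(d, k=-1)] = offdiag   # strictly lower triangle, unconstrained
    L[np.diag_indices(d)] = np.exp(omega)   # diagonal forced to be strictly positive
    return L

omega = np.array([0.4, -4.0])               # even a quite negative omega...
offdiag = np.array([0.7])
L = build_cholesky(omega, offdiag)
Sigma = L @ L.T
print(np.linalg.eigvalsh(Sigma))            # ...gives strictly positive eigenvalues

Because exp(omega) is always positive, the sign ambiguity in the rows of L also disappears, which is the uniqueness Matthias mentions.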

I haven’t thought about initialization for ADVI in a while, but from when I was thinking about it several years ago, I remember that initial values can be important, especially when parameters are far from unit scale (that is, when the mass of the target distribution is orders of magnitude away from 1).
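As a toy illustration of that scale point (my own example, not from the paper): if the variational approximation is initialized at unit scale, q = N(0, 1), but the target's mass sits around 1e4, the reparametrization gradient of the ELBO with respect to the variational mean is itself of order 1e4, so a step size tuned for unit-scale problems will either crawl or overshoot.

import numpy as np

rng = np.random.default_rng(0)
mu_target, sigma_target = 1.0e4, 1.0   # target N(1e4, 1): far from unit scale
mu, log_sigma = 0.0, 0.0               # typical "start at zero, unit scale" initialization

def elbo_grad_mu(mu, log_sigma, n_draws=100):
    """Monte Carlo reparametrization gradient of the ELBO with respect to mu
    for a Gaussian target (the entropy term does not depend on mu)."""
    eps = rng.standard_normal(n_draws)
    z = mu + np.exp(log_sigma) * eps
    return np.mean((mu_target - z) / sigma_target**2)

print(elbo_grad_mu(mu, log_sigma))     # roughly 1e4: a generic step size misbehaves here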