Estimating by Minimizing
Three-Toed Sloth 2013-03-18
Attention conservation notice: > 1800 words on basic statistical theory. Full of equations, and even a graph and computer code, yet mathematically sloppy.
The basic ideas underlying asymptotic estimation theory are very simple; most presentations rather cloud the basics, because they include lots of detailed conditions needed to show rigorously that everything works. In the spirit of the world's simplest ergodic theorem, let's talk about estimation.
We have a statistical model, which tells us, for each sample size \( n \), the probability that the observations \( X_1, X_2, \ldots X_n \equiv X_{1:n} \) will take on any particular value \( x_{1:n} \), or the probability density if the observables are continuous. This model contains some unknown parameters, bundled up into a single object \( \theta \), which we need to calculate those probabilities. That is, the model's probabilities are \( m(x_{1:n};\theta) \), not just \( m(x_{1:n}) \). Because this is just baby stats., we'll say that there are only a finite number of unknown parameters, which don't change with \( n \), so \( \theta \in \mathbb{R}^d \). Finally, we have a loss function, which tells us how badly the model's predictions fail to align with the data: \[ \lambda_n(x_{1:n}, m(\cdot ;\theta)) \] For instance, each \( X_i \) might really be a \( (U_i, V_i) \) pair, and we try to predict \( V_i \) from \( U_i \), with loss being mean-weighted-squared error: \[ \lambda_n = \frac{1}{n}\sum_{i=1}^{n}{\frac{{\left( v_i - \Expectwrt{V_i|U_i=u_i}{\theta}\right)}^2}{\Varwrt{V_i|U_i=u_i}{\theta}}} \] (If I don't subscript expectations \( \Expect{\cdot} \) and variances \( \Var{\cdot} \) with \( \theta \), I mean that they should be taken under the true, data-generating distribution, whatever that might be. With the subscript, calculate assuming that the \( m(\cdot; \theta) \) distribution is right.)
Or we might look at the negative mean log likelihood, i.e., the cross-entropy, \[ \lambda_n = -\frac{1}{n}\sum_{i=1}^{n}{\log{m(x_i|x_{1:i-1};\theta)}} \]
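Neither loss is anything exotic in code. Here is a minimal sketch, assuming a toy Gaussian regression model (purely for illustration, not anything from above) in which \( m(\cdot;\theta) \) says \( V_i \) given \( U_i = u_i \) is Gaussian with mean \( \theta_0 + \theta_1 u_i \) and known variance \( \sigma^2 \):

```python
import numpy as np

SIGMA2 = 1.0  # the (assumed known) conditional variance Var_theta[V | U = u]

def weighted_squared_error(theta, u, v):
    """Mean weighted squared error: average of (v_i - E_theta[V_i|U_i=u_i])^2 / Var_theta[V_i|U_i=u_i]."""
    pred = theta[0] + theta[1] * u               # E_theta[V_i | U_i = u_i] under the toy model
    return np.mean((v - pred) ** 2 / SIGMA2)

def cross_entropy(theta, u, v):
    """Negative mean log-likelihood of the v's given the u's under m(.; theta).
    (The theta-free marginal of U is dropped, which only shifts the loss by a constant.)"""
    pred = theta[0] + theta[1] * u
    log_density = -0.5 * np.log(2 * np.pi * SIGMA2) - (v - pred) ** 2 / (2 * SIGMA2)
    return -np.mean(log_density)
```

For this toy model the two losses differ only by terms free of \( \theta \), so they have the same minimizer; in general they need not.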
Being simple folk, we try to estimate \( \theta \) by minimizing the loss: \[ \widehat{\theta}_n = \argmin_{\theta}{\lambda_n} \]
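Outside of special cases the minimization is numerical. A minimal sketch, reusing the toy Gaussian regression loss above on simulated data (the "true" parameters, the sample size, and the choice of optimizer are all arbitrary):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(42)
n = 500
u = rng.uniform(0, 5, size=n)
v = 2.0 + 0.7 * u + rng.normal(0, 1.0, size=n)   # stand-in "true" data-generating process

def loss(theta):
    """Mean weighted squared error with sigma^2 = 1, as in the sketch above."""
    pred = theta[0] + theta[1] * u
    return np.mean((v - pred) ** 2)

fit = minimize(loss, x0=np.zeros(2))             # hat(theta)_n = argmin_theta lambda_n
print(fit.x)                                     # close to (2.0, 0.7) for this simulation
```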
We would like to know what happens to this estimate as \( n \) grows. To do this, we will make two assumptions, which put us at the mercy of two sets of gods.
The first assumption is about what happens to the loss functions. \( \lambda_n \) depends both on the parameter we plug in and on the data we happen to see. The latter is random, so the loss at any one \( \theta \) is really a random quantity, \( \Lambda_n(\theta) = \lambda_n(X_{1:n},\theta) \). Our first assumption is that these random losses tend towards non-random limits: for each \( \theta \), \[ \Lambda_n(\theta) \rightarrow \ell(\theta) \] where \( \ell \) is a deterministic function of \( \theta \) (and nothing else). It doesn't particularly matter to the argument why this is happening, though we might have our suspicions1, just that it is. This is an appeal to the gods of stochastic convergence.
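A quick simulation makes the assumption concrete (the exponential model here is purely illustrative): if the data really are i.i.d. Exponential(1) and the model is Exponential(\( \theta \)), then \( \Lambda_n(\theta) = \theta \overline{x}_n - \log{\theta} \), which tends to \( \ell(\theta) = \theta - \log{\theta} \) by the law of large numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_grid = np.linspace(0.5, 3.0, 6)

def Lambda_n(theta, x):
    """Negative mean log-likelihood of an Exponential(theta) model: theta * mean(x) - log(theta)."""
    return theta * np.mean(x) - np.log(theta)

def ell(theta):
    """Limiting loss when the data really are Exponential(1), so E[X] = 1."""
    return theta - np.log(theta)

for n in (10, 1000, 100000):
    x = rng.exponential(scale=1.0, size=n)       # the true, data-generating distribution
    gap = np.max(np.abs(Lambda_n(theta_grid, x) - ell(theta_grid)))
    print(n, gap)                                # the gap shrinks as n grows
```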
Our second assumption is that we always have a unique interior minimum with a positive-definite Hessian: with probability 1, \[ \begin{eqnarray*} \nabla {\Lambda_n}(\widehat{\theta}_n) & = & 0\\ \nabla \nabla \Lambda_n (\widehat{\theta}_n) & > & 0 \end{eqnarray*} \] (All gradients and other derivatives will be with respect to \( \theta \); the dimensionality of \( x \) is irrelevant.) Moreover, we assume that the limiting loss function \( \ell \) also has a nice, unique interior minimum at some point \( \theta^* \), the minimizer of the limiting, noise-free loss: \[ \begin{eqnarray*} \theta^* & = & \argmin_{\theta}{\ell}\\ \nabla {\ell}(\theta^*) & = & 0\\ \nabla \nabla \ell (\theta^*) & > & 0 \end{eqnarray*} \] Since the Hessians will be important, I will abbreviate \( \nabla \nabla \Lambda_n (\widehat{\theta}_n) \) by \( \mathbf{H}_n \) (notice that it's a random variable), and \( \nabla \nabla \ell (\theta^*) \) by \( \mathbf{j} \) (notice that it's not random).
These assumptions about the minima, and the derivatives at the minima, place us at the mercy of the gods of optimization.
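For scalar \( \theta \) these conditions are easy to check numerically; here is a sketch using central finite differences on the illustrative exponential-model loss from above, whose minimizer is \( \widehat{\theta}_n = 1/\overline{x}_n \) in closed form:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=1000)
xbar = np.mean(x)

def Lambda_n(theta):
    return theta * xbar - np.log(theta)          # same exponential-model loss as above

theta_hat = 1.0 / xbar                           # its minimizer, in closed form
h = 1e-5
grad = (Lambda_n(theta_hat + h) - Lambda_n(theta_hat - h)) / (2 * h)
hess = (Lambda_n(theta_hat + h) - 2 * Lambda_n(theta_hat) + Lambda_n(theta_hat - h)) / h ** 2

print(grad)   # approximately 0: an interior critical point
print(hess)   # positive: this is H_n; here it equals xbar^2, which tends to j = 1
```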
To see that these assumptions are not empty, here's an example. Suppose that our models are Pareto distributions for \( x \geq 1 \), \( m(x;\theta) = (\theta - 1) x^{-\theta} \). Then \( \lambda_n(\theta) = \theta \overline{\log{x}}_n - \log{(\theta - 1)} \), where \( \overline{\log{x}}_n = n^{-1} \sum_{i=1}^{n}{\log{x_i}} \), the sample mean of the logarithms of the observations. Setting the derivative \( \overline{\log{x}}_n - 1/(\theta - 1) \) to zero gives \( \widehat{\theta}_n = 1 + 1/\overline{\log{x}}_n \), an interior minimum with positive second derivative \( 1/(\widehat{\theta}_n - 1)^2 \).
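A minimal sketch of this calculation (the simulated data, the particular true exponent, and the code are all illustrative): simulate from a Pareto, compute \( \lambda_n \), and check the numerical minimizer against the closed form.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
theta_true = 2.5                                 # arbitrary choice for the simulation
n = 5000
# Inverse-CDF sampling: the density (theta-1) x^(-theta) on [1, inf) has CDF 1 - x^(-(theta-1))
x = rng.uniform(size=n) ** (-1.0 / (theta_true - 1.0))

log_bar = np.mean(np.log(x))                     # \overline{\log x}_n

def lambda_n(theta):
    """theta * mean(log x) - log(theta - 1), the negative mean log-likelihood."""
    return theta * log_bar - np.log(theta - 1.0)

fit = minimize_scalar(lambda_n, bounds=(1.001, 10.0), method="bounded")
print(fit.x)                  # numerical hat(theta)_n
print(1.0 + 1.0 / log_bar)    # closed form from the first-order condition; the two agree
```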