Notes on "Collective Stability in Structured Prediction: Generalization from One Example" (or: Small Pieces, Loosely Joined)
Three-Toed Sloth 2014-06-24
Attention conservation notice: 2700+ words, expounding a mathematical paper on statistical learning theory. Largely written months ago, posted now in default of actual content.
For the CMU statistical learning theory reading group, I decided to present this:
- Ben London, Bert Huang, Benjamin Taskar and Lise Getoor, "Collective Stability in Structured Prediction: Generalization from One Example", in Sanjoy Dasgupta and David McAllester (eds.), Proceedings of the 30th International Conference on Machine Learning [ICML-13] (2013): 828--836
- Abstract: Structured predictors enable joint inference over multiple interdependent output variables. These models are often trained on a small number of examples with large internal structure. Existing distribution-free generalization bounds do not guarantee generalization in this setting, though this contradicts a large body of empirical evidence from computer vision, natural language processing, social networks and other fields. In this paper, we identify a set of natural conditions — weak dependence, hypothesis complexity and a new measure, collective stability — that are sufficient for generalization from even a single example, without imposing an explicit generative model of the data. We then demonstrate that the complexity and stability conditions are satisfied by a broad class of models, including marginal inference in templated graphical models. We thus obtain uniform convergence rates that can decrease significantly faster than previous bounds, particularly when each structured example is sufficiently large and the number of training examples is constant, even one.
The question being grappled with here is how we can learn from one example, really from one realization of a stochastic process. Our usual approach in statistics and machine learning is to assume we have many, independent examples from the same source. It seems very odd to say that if we see a single big, internally-dependent example, we're as much in the dark about the data source and its patterns as if we'd observed a single one-dimensional measurement, but that's really all a lot of our theory can do for us. Since we know that animals and machines often can successfully learn generalizable patterns from single realizations, there needs to be some explanation of how the trick is turned...
This paper is thus relevant to my interests in dependent learning, time series and spatio-temporal data, and networks. I read it when it first came out, but I wasn't at all convinced that I'd really understood it, which was why I volunteered to present it. Even so, I skipped sections 6 and 7, which specialize from the pretty general learning theory to particular kinds of graphical models. It's valuable to show that the assumptions of the general theory can be realized, and by a non-trivial class of models at that, but they're not really my bag.
At a very high level, the strategy used to prove a generalization-error bound here is fairly familiar in learning theory. Start by establishing a deviation inequality for a single well-behaved function. Then prove that the functions are "stable", in the sense that small changes to their inputs can't alter their outputs too much. The combination of point-wise deviation bounds and stability then yields concentration bounds which hold uniformly over all functions. The innovations are in how this is all made to work when we see one realization of a dependent process.
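To fix the shape of that argument in symbols (my gloss, in my own notation, not the paper's): for a single fixed function $g$ of the right scale, one wants a point-wise deviation bound of the form
\[
\Pr\left( \left| g(Z) - \mathbb{E}\left[ g(Z) \right] \right| \geq \epsilon \right) \leq 2\exp\left( -\frac{n\epsilon^2}{K} \right),
\]
where the constant $K$ absorbs both how much $g$ can wiggle and how strongly dependent the coordinates of $Z$ are. Stability then says that nearby functions in the class take nearly the same values, so the class can be covered by finitely many representatives; a union bound over those representatives converts the point-wise bound into one holding uniformly over the whole class, at the price of a complexity factor multiplying the probability.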
Weak Dependence and a Point-wise Deviation Bound
The data here is an $n$-dimensional vector of random variables, $Z = (Z_1, Z_2, \ldots, Z_n)$. N.B., $n$ here is NOT the number of samples, but the dimensionality of our one example; think of the labels of every node in a single social network, or of every pixel in a single image. (I might have preferred something like $p$ here personally.) We do not assume that the $Z_i$ are independent, Markov, exchangeable, stationary, etc., just that $Z$ obeys some stochastic process or other.
We are interested in functions of the whole of $Z$, $g(Z)$. We're going to assume that they have a "bounded difference" property: if $z$ and $z^{\prime}$ are two realizations of $Z$ which differ in only a single coordinate, then $|g(z) - g(z^{\prime})| \leq c/n$ for some constant $c$.
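For calibration (this is the classical benchmark, not anything from the paper): if the $Z_i$ were independent, McDiarmid's bounded-differences inequality would already give, for such a $g$,
\[
\Pr\left( \left| g(Z) - \mathbb{E}\left[ g(Z) \right] \right| \geq \epsilon \right) \leq 2 \exp\left( -\frac{2 n \epsilon^2}{c^2} \right),
\]
so the chance of a large deviation falls off exponentially as our one example grows in dimension. The job of the weak-dependence assumption is to recover a bound of essentially this shape when the $Z_i$ are merely weakly dependent rather than independent.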