Topological Inference

Normal Deviate 2013-03-31

We uploaded a paper called Statistical Inference For Persistent Homology on arXiv. (I posted about topological data analysis earlier here.) The paper is written with Siva Balakrishnan, Brittany Fasy, Fabrizio Lecci, Alessandro Rinaldo and Aarti Singh.

The basic idea is this. We observe data {{\cal D} = \{X_1,\ldots, X_n\}} where {X_1,\ldots, X_n \sim P} and {P} is supported on a set {\mathbb{M}}. We want to infer topological features of {\mathbb{M}}. For example, here are data from a donut:

[Figure: the donut example]

The top left plot is the set {\mathbb{M}}. The top right plot shows a sample from a distribution on {\mathbb{M}}. The bottom left plot is a union of balls around the data. The bottom right is something called a Čech complex, which we can ignore here. The donut has one connected component and one hole. Can we recover this information from the data? Consider again the bottom left plot, which shows

\displaystyle  \hat S_\epsilon = \bigcup_{i=1}^n B(X_i,\epsilon)

where {B(X_i,\epsilon)} is a ball of size {\epsilon} centered at {X_i}. When {\epsilon} is just right, {\hat S_\epsilon} will have one connected component and one hole, just like {\mathbb{M}}. Here are plots of {\hat S_\epsilon = \bigcup_{i=1}^n B(X_i,\epsilon)} for various values of {\epsilon}:

[Figure: the union of balls {\hat S_\epsilon} for various values of {\epsilon}]
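If you want to play with this numerically, here is a small Python sketch (not from the paper; I take {\mathbb{M}} to be the unit circle, and the values of {\epsilon} are arbitrary) that checks which points of a grid fall in {\hat S_\epsilon}:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Illustration only: take M to be the unit circle and sample n points from it.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, size=500)
X = np.column_stack([np.cos(theta), np.sin(theta)])   # the data D = {X_1,...,X_n}

# A point x lies in S_eps = union_i B(X_i, eps) iff min_i ||x - X_i|| <= eps.
xs = np.linspace(-1.5, 1.5, 100)
grid = np.column_stack([g.ravel() for g in np.meshgrid(xs, xs)])
dists = cdist(grid, X)                                # ||x - X_i|| for every grid point x
for eps in (0.05, 0.2, 0.5):
    inside = (dists <= eps).any(axis=1)               # indicator of S_eps on the grid
    print(f"eps={eps:.2f}: fraction of grid covered = {inside.mean():.3f}")
```

For tiny {\epsilon} the union is a scatter of disjoint disks; as {\epsilon} grows the disks merge into a loop, and eventually the hole fills in.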

As we vary {\epsilon}, topological features (connected components, holes, etc.) are born and then die. Each feature can be represented by a pair {z=({\rm birth},{\rm death})} recording its birth time and death time. A persistence diagram {{\cal P}} is a plot of the points {z}. Small features have death times close to their birth times; in other words, small features correspond to points near the diagonal. Here is an example of a persistence diagram:

[Figure: a persistence diagram]
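If you want to compute such a diagram yourself, several packages will do it. The sketch below uses the ripser Python package (my choice of tool, not something the paper relies on); note that ripser builds a Vietoris-Rips filtration rather than the distance-function filtration described here, so the numbers differ in detail, but the birth/death picture is the same in spirit:

```python
import numpy as np
from ripser import ripser   # assumes the ripser package is installed

# Same toy data: points sampled from the unit circle.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, size=400)
X = np.column_stack([np.cos(theta), np.sin(theta)])

# Persistence diagrams for dimensions 0 and 1 (connected components and holes).
dgms = ripser(X, maxdim=1)['dgms']
for dim, dgm in enumerate(dgms):
    pers = dgm[:, 1] - dgm[:, 0]                      # death - birth for each feature
    print(f"H{dim}: {len(dgm)} features, largest persistences {np.sort(pers)[-3:]}")
```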

Informally, topological noise corresponds to points near the diagonal and topological signal corresponds to points far from the diagonal. The goal of our paper is to separate noise from signal, in a statistical sense. Here are some details.

The distance function to {\mathbb{M}} is

\displaystyle  d_{\mathbb{M}}(x) = \inf_{y\in \mathbb{M}} ||x-y||

and the distance function to the data is

\displaystyle  \hat d(x) = \inf_{X_i\in {\cal D}} ||x-X_i||.

The persistence diagram {{\cal P}} of {\mathbb{M}} represents the evolution of the topology of the lower level sets {\{x:\ d_{\mathbb{M}}(x)\leq \epsilon\}} as a function of {\epsilon}. The estimate {\hat{\cal P}} of {{\cal P}} is the persistence diagram based on the lower level sets of {\hat d}, that is, {\{x:\ \hat d(x) \leq \epsilon\}.} But notice that {\{x:\ \hat d(x) \leq \epsilon\}} is precisely equal to {\hat S_\epsilon}:

\displaystyle  \hat S_\epsilon = \{x:\ \hat d(x) \leq \epsilon\}.
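In code, {\hat d} is just a nearest-neighbor distance, and the equivalence above is easy to check numerically. A sketch, again with the unit circle standing in for {\mathbb{M}}:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, size=300)
X = np.column_stack([np.cos(theta), np.sin(theta)])   # the data D

# d_hat(x) = distance from x to the nearest data point, via a KD-tree.
tree = cKDTree(X)
query = rng.uniform(-1.5, 1.5, size=(2000, 2))
d_hat, _ = tree.query(query)

# {x : d_hat(x) <= eps} is exactly the union of balls S_eps.
eps = 0.3
in_sublevel = d_hat <= eps
in_union = (cdist(query, X) <= eps).any(axis=1)
print("sub-level set and union of balls agree:", np.array_equal(in_sublevel, in_union))
```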

To make inferences about {{\cal P}} from {\hat{\cal P}} we need to know how far apart these two diagrams are. The bottleneck distance {W_\infty(\hat{\cal P},{\cal P})} measures the distance between the estimated and true persistence diagrams. Basically, it measures how much we have to move the points in {\hat{\cal P}} to match the points in {{\cal P}}, as in this plot:

[Figure: a matching between two persistence diagrams]
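The bottleneck distance between two finite diagrams can be computed with existing software; the sketch below uses the bottleneck function from the persim package (my choice of tool), on a pair of made-up toy diagrams:

```python
import numpy as np
import persim   # assumes the persim package is installed

# Toy diagrams: each row is a (birth, death) pair.
dgm_true = np.array([[0.0, 1.0]])                     # one prominent hole
dgm_est  = np.array([[0.05, 0.95], [0.10, 0.15]])     # the hole, plus a small noise point

# Points may be matched to each other or to the diagonal; the bottleneck
# distance is the largest displacement in the best matching.
print("W_inf =", persim.bottleneck(dgm_true, dgm_est))
```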

Our paper shows several ways to construct a bound {c_n = c_n(X_1,\ldots, X_n)} so that

\displaystyle  \limsup_{n\rightarrow\infty} \mathbb{P}(W_\infty(\hat{\cal P},{\cal P}) > c_n) \leq \alpha.

Then we add a strip of size {c_n} (actually, of size {\sqrt{2}c_n}) along the diagonal of the persistence diagram. Points in the strip are declared to be “noise”, as in this example:

[Figure: a stylized persistence diagram with a noise strip along the diagonal]
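Concretely, a point {z=(b,d)} lies within (bottleneck) distance {c_n} of the diagonal exactly when {d-b \leq 2c_n}, which is the same as lying in the strip of width {\sqrt{2}c_n}. So, once {c_n} is in hand, separating signal from noise is a one-liner; here is a sketch with a made-up diagram and a made-up value of {c_n}:

```python
import numpy as np

def split_signal_noise(dgm, c_n):
    # A point (b, d) lies within L-infinity distance c_n of the diagonal
    # iff d - b <= 2 * c_n; such points are labelled "noise".
    pers = dgm[:, 1] - dgm[:, 0]
    return dgm[pers > 2 * c_n], dgm[pers <= 2 * c_n]

# Made-up diagram and made-up strip size, purely for illustration.
dgm = np.array([[0.05, 0.95], [0.10, 0.15], [0.20, 0.28]])
signal, noise = split_signal_noise(dgm, c_n=0.06)
print("signal:\n", signal)
print("noise:\n", noise)
```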

To bound {W_\infty(\hat{\cal P},{\cal P})} we use the fact that {W_\infty(\hat{\cal P},{\cal P}) \leq H({\cal D},\mathbb{M})}, where {H} is the Hausdorff distance, defined by

\displaystyle  H(A,B) = \inf \Bigl\{ \epsilon:\ A \subset B \oplus \epsilon\ \ \ {\rm and}\ \ \ B \subset A \oplus \epsilon\Bigr\}

where

\displaystyle  A \oplus \epsilon = \bigcup_{x\in A}B(x,\epsilon)

and {B(x,\epsilon)} is a ball of size {\epsilon} centered at {x}.
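The Hausdorff distance between finite point sets is easy to compute: scipy provides the directed version, and the symmetric distance is the larger of the two directions. A sketch, with a dense set of points on the circle standing in for {\mathbb{M}}:

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, size=500)
D = np.column_stack([np.cos(theta), np.sin(theta)])         # the sample
phi = np.linspace(0, 2 * np.pi, 5000, endpoint=False)
M = np.column_stack([np.cos(phi), np.sin(phi)])             # dense stand-in for M

# H(A, B) = max of the two directed Hausdorff distances.
H = max(directed_hausdorff(D, M)[0], directed_hausdorff(M, D)[0])
print("H(D, M) =", H)
```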

To summarize: to find a confidence interval for {W_\infty(\hat{\cal P},{\cal P})} it suffices to get a confidence interval for {H({\cal D},\mathbb{M})}.

One approach for bounding {H({\cal D},\mathbb{M})} is subsampling. We draw random subsamples, {\Omega_1,\ldots,\Omega_N}, each of size {b}, from the original data. Here {b=b_n} satisfies {b_n = o(n)} and {b_n \rightarrow \infty} as {n\rightarrow\infty}. Then we compute

\displaystyle  T_j = H(\Omega_j,{\cal D}),\ \ \ j=1,\ldots, N.

Now we compute

\displaystyle  L(t) = \frac{1}{N}\sum_{j=1}^N I(T_j > t).

Finally, we take {c_n= 2L^{-1}(\alpha)}, where {L^{-1}(\alpha)} is the upper {\alpha} quantile of the {T_j}'s. (The factor of 2 is for technical reasons.) This gives us what we want. Specifically,

\displaystyle  \mathbb{P}(W_\infty(\hat{\cal P}, {\cal P}) > c_n) \leq \alpha+ O\left(\sqrt{ \frac{b \log n}{n}}\right) + \frac{2^d}{n\log n} \approx \alpha

where {d} is the dimension of the set {\mathbb{M}}.
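Here is a sketch of the subsampling procedure in Python. The particular choices below ({b}, {N}, {\alpha}, drawing the subsamples without replacement, and the circle data) are illustrative only; the theory just requires {b_n = o(n)} and {b_n \rightarrow \infty}:

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def hausdorff(A, B):
    return max(directed_hausdorff(A, B)[0], directed_hausdorff(B, A)[0])

def subsample_cn(D, b, N=100, alpha=0.05, seed=0):
    # Draw N subsamples of size b, compute T_j = H(Omega_j, D), and return
    # c_n = 2 * L^{-1}(alpha), where L^{-1}(alpha) is the upper-alpha quantile
    # of the T_j's (i.e. their (1 - alpha) quantile).
    rng = np.random.default_rng(seed)
    T = np.empty(N)
    for j in range(N):
        Omega = D[rng.choice(len(D), size=b, replace=False)]
        T[j] = hausdorff(Omega, D)
    return 2 * np.quantile(T, 1 - alpha)

rng = np.random.default_rng(1)
theta = rng.uniform(0, 2 * np.pi, size=1000)
D = np.column_stack([np.cos(theta), np.sin(theta)])
print("c_n =", subsample_cn(D, b=100))
```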

The paper contains several other methods for bounding {W_\infty(\hat{\cal P},{\cal P})}. Well, I hope I have piqued your interest. If so, have a look at the paper and let us know if you have any comments.