The Structure of Data and Machine Learning
Computational Complexity 2022-11-10
Terry Tao entitled his 2006 Fields Medal Lecture "The Dichotomy between structure and randomness" and stated the Structure Theorem: every object is a superposition of a structured object and a pseudorandom error. He gave many examples and showed how he used these results to prove (with Ben Green) that the primes contain arbitrarily long arithmetic progressions.
There is a nice Kolmogorov interpretation. Recall that K(x) is the length of the smallest program that produces the string x. For a finite set S, K(S) is the length of the smallest program that recognizes membership in S. For a string x, take the largest set S containing x such that K(x) is close to K(S) + log |S|. S is the structure and x is a random element of S. Vereshchagin and Vitanyi have a nice paper on this topic.
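A toy illustration (my own, just to fix ideas): let x be a string of length n whose first n/2 bits are all zero and whose last n/2 bits are chosen at random, and let S be the set of all length-n strings that start with n/2 zeros. Then K(S) is about log n, log |S| = n/2, and K(x) is about n/2 + O(log n), so K(x) is close to K(S) + log |S|. The set S captures the structure ("the first half is zeros") and the random second half is the pseudorandom error.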
Machine learning aims to discover the set S from a set of examples x. This is why I think of P = NP as giving us an ideal machine learning algorithm--use P = NP to find a circuit that describes S for a time-bounded version of Kolmogorov complexity. Recent tools in machine learning seem to find this S without needing the full power of P = NP.
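To make the two-part idea concrete, here is a minimal sketch (my own toy, not anything from the papers): the hypothesis class is deliberately tiny--patterns over {0,1,*}, where S is the set of strings matching the pattern--so brute force works. It picks the pattern minimizing a two-part cost, bits to write the pattern plus log |S| bits per example. Searching over general circuits or programs this way is exactly the part that looks intractable without P = NP.

```python
import math
from itertools import product

def hypotheses(n):
    """All patterns over {0,1,*} of length n; a pattern describes the set S
    of n-bit strings agreeing with it on every non-wildcard position."""
    for pat in product("01*", repeat=n):
        yield "".join(pat)

def set_size(pattern):
    """|S| for the set described by the pattern: 2^(number of wildcards)."""
    return 2 ** pattern.count("*")

def consistent(pattern, x):
    return all(p == "*" or p == b for p, b in zip(pattern, x))

def description_length(pattern, examples):
    """Two-part cost: bits to write the pattern (one flag bit per position,
    plus one value bit per fixed position) plus log2|S| bits per example."""
    fixed = sum(1 for p in pattern if p != "*")
    model_bits = len(pattern) + fixed
    data_bits = len(examples) * math.log2(set_size(pattern))
    return model_bits + data_bits

def best_structure(examples):
    n = len(examples[0])
    candidates = (p for p in hypotheses(n)
                  if all(consistent(p, x) for x in examples))
    return min(candidates, key=lambda p: description_length(p, examples))

# The examples share the structure "starts with 0101"; their second halves are noise.
examples = ["01011010", "01010101", "01011111"]
print(best_structure(examples))   # prints 0101**** -- the structured set S
```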
Consider languages, with German or Japanese as a random example of a "natural language". Linguistics tries to understand the structure of natural languages. Recent ML translation algorithms use that structure (without understanding it) to translate between pairs of languages.
How about generative AI? Diffusion methods create a set S of all reasonable images by turning images into random noise. To create images they reverse that process, starting with random noise to produce random elements of S. Prompts just add conditions to the process.
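Here is a hedged one-dimensional sketch of that forward/reverse picture (again my own toy, not any particular production model). The "data" sits on just two points, +1 and -1, so the ideal denoiser has a closed form (a tanh); in a real diffusion model a trained neural network plays that role. The forward process turns data into noise; the reverse loop starts from noise and produces random elements of the learned set S.

```python
import numpy as np

T = 300
betas = np.linspace(1e-4, 0.04, T)          # noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def forward(x0, t, rng):
    """Forward process: x_t = sqrt(abar_t)*x0 + sqrt(1-abar_t)*noise."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1 - alpha_bars[t]) * eps

def ideal_noise_prediction(xt, t):
    """Closed-form optimal noise prediction when the data is uniform on {-1,+1};
    a trained network would stand in here for real image data."""
    s = np.sqrt(alpha_bars[t])
    v = 1.0 - alpha_bars[t]
    x0_hat = np.tanh(s * xt / v)            # E[x0 | x_t]
    return (xt - s * x0_hat) / np.sqrt(v)

def sample(n, rng):
    """Reverse process: start from pure noise and denoise step by step."""
    x = rng.standard_normal(n)
    for t in range(T - 1, -1, -1):
        eps = ideal_noise_prediction(x, t)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        noise = rng.standard_normal(n) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x

rng = np.random.default_rng(0)
print(np.round(forward(np.array([1.0, -1.0]), T - 1, rng), 2))  # data pushed to noise
print(np.round(sample(8, rng), 2))          # noise pulled back toward -1 and +1
```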
I read the Vereshchagin-Vitanyi paper back in 2004 and the Kolmogorov structure model goes back to the 1970s. Finding the structure seemed intractable for any interesting application. Now that ML is proving this wrong, the world is a different place.