Reading and writing proofs

Win-Vector Blog 2016-03-01

In my recent article on optimizing set diversity I mentioned that the primary abstraction was “diminishing returns,” formalized by the theory of monotone submodular functions (though I did call out some of my own work which used a different abstraction). A proof that appears again and again in the literature shows that when maximizing a monotone submodular function the greedy algorithm run for k steps picks a set that scores no worse than a (1-1/e) fraction of the unknown optimal pick (that is, it picks up at least about 63% of the possible value). This is significant, because naive optimization may only pick a set of value 1/k of the value of the optimal selection.

The proof that the greedy algorithm does well in maximizing monotone increasing submodular functions is clever and a very good opportunity to teach about reading and writing mathematical proofs. The point is: one needs an active reading style, as most of what is crucial to a proof isn’t written, and that which is written in a proof can’t all be pivotal (else proofs would be a lot more fragile than they actually are).

[Image: “Iceberg” by Uwe Kils]

In this article I am attempting to reproduce some fraction of the insight found in: Polya “How to Solve It” (1945) and Doron Zeilberger “The Method of Undetermined Generalization and Specialization Illustrated with Fred Galvin’s Amazing Proof of the Dinitz Conjecture” (1994).

So I repeat the proof here (with some annotations and commentary).

Introduction

Reading a mathematics paper is a very active task. Proofs are often abridged (requiring the reader to re-invent sections to follow along), long (requiring a reader to check things sequentially), or applications of diffuse networks of lemmas (requiring the reader to run all over the paper and attempt to reconstruct a chain of reasoning). Even for a good mathematician the situation is hopeless without time and scratch paper. There is less difference between reading a mathematical proof and writing a mathematical proof than the non-mathematician would think.

For the material we are discussing the original proofs are very much lemma based (Nemhauser, George L., Laurence A. Wolsey, and Marshall L. Fisher. “An analysis of approximations for maximizing submodular set functions—I.” Mathematical Programming 14.1 (1978): 265-294). Later tellings re-chain the argument into a linear, checkable derivation (for example: “Submodular Function Maximization”, Andreas Krause (ETH Zurich) and Daniel Golovin (Google); and Jeremy Kun, “When Greedy Algorithms are Good Enough: Submodularity and the (1 – 1/e)-Approximation”, the writeup we will follow).

There are a number of categories of mathematical results:

  1. Those for which once stated the proof technique is obvious to the expert. (Note these are not necessarily trivial results. I believe it was Rota who wrote that much of the value of mathematics is in finding definitions that have this property!)
  2. Those for which the proof can be remembered.
  3. Those for which the proof can be checked, but cannot be remembered. Often this is the case when the proof uses domain-agnostic algebra (which carries the facts, but often not the human intuition).
  4. Those for which the proof is utterly unmanageable.

It is only in the first two cases that you really have any feeling for why the theorem may be true.

When reading mathematical papers (which can take days) it is my practice to put the paper aside before finishing it. I then try to guess the outline of the proof. If I can’t do that I then read the proof and try to reproduce the proof from memory. These partial reads help internalize the argument and orient the result into my list above.

In this note I will try to reproduce one common proof in such a way that the reader can see why the greedy algorithm can approximately maximize a monotone increasing submodular set function.

Definitions

We are going to be working with finite sets and a function f() over such sets. f() maps sets to scores (real numbers) and has the following properties:

  1. f({}) = 0 (the empty set scores as zero).
  2. f(Y) ≥ f(X) if X is contained in Y (f() is monotone increasing).
  3. f(X) + f(Y) ≥ f(X union Y) + f(X intersect Y) (f() is submodular).

The submodular property is always the confusing bit. Notice the submodular relation (property 3) would be an equality if f() were the function that counted the number of items in a set (think of such an f() as being “modular” from the Latin modus or measure, though be warned this is not current mathematical use of the word “modular”). A submodular function f() can be informally thought of as a size measuring function with the possible defect that f(X union Y) may be, due to a law of diminishing returns, smaller than one would expect (i.e. possibly smaller than f(X) + f(Y) - f(X intersect Y), which is the content of statement 3).
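To make the definitions concrete, here is a small sketch (my own illustration, not from any of the cited papers) using a coverage function, the standard example of a monotone submodular function: each item covers some “regions” and f(S) counts the regions covered by S. The items and regions are hypothetical example data.

```python
from itertools import combinations

# Hypothetical example data: each item covers some "regions".
coverage = {
    "a": {1, 2, 3},
    "b": {3, 4},
    "c": {4, 5, 6},
    "d": {1, 6},
}

def f(items):
    """Coverage function: number of regions covered by the chosen items."""
    covered = set()
    for item in items:
        covered |= coverage[item]
    return len(covered)

# Spot-check the three defining properties over all pairs of subsets.
universe = list(coverage)
subsets = [set(c) for r in range(len(universe) + 1)
           for c in combinations(universe, r)]

assert f(set()) == 0                                  # property 1
for X in subsets:
    for Y in subsets:
        if X <= Y:
            assert f(Y) >= f(X)                       # property 2 (monotone)
        assert f(X) + f(Y) >= f(X | Y) + f(X & Y)     # property 3 (submodular)
```

For coverage the submodular inequality is just inclusion-exclusion on the underlying region sets, which is why it makes a good mental model.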

We need some additional notation (but please trust we are introducing the minimal amount of notation needed to move forward).

Let {x(1), x(2), ...} be the items picked by a single long run of the so-called greedy algorithm. That is, x(i) is an element such that f({x(1), x(2), ..., x(i)}) = max_x f({x(1), x(2), ..., x(i-1), x}) (assume we are picking elements from a finite set of possibilities so there are no problems of “choice” or in achieving suprema).

Define G(k) = {x(1), x(2), ..., x(k)} as the “greedy pick of size k”.

Define O(k) as a set of cardinality k such that f(O(k)) is maximal (an unknown best pick).
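In code the greedy pick is short to write down (again my own sketch, re-using the f() and universe from the coverage example above, and assuming k is no larger than the number of available items):

```python
def greedy(f, universe, k):
    """Return the greedy pick G(k): repeatedly add the single element
    that maximizes the score of the enlarged set."""
    picked = set()
    for _ in range(k):
        best = max((x for x in universe if x not in picked),
                   key=lambda x: f(picked | {x}))
        picked.add(best)
    return picked

print(greedy(f, universe, 2))  # e.g. {'a', 'c'}, covering all six regions
```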

The argument

We want to show: f(G(k)) ≥ (1-1/e) f(O(k)).

The proof is outlined as follows.

We need a clever idea to start. To be honest we don’t have a good way to compare f(G(k)) to f(O(k)) for k > 1. We probably can work out the structure of G(k) (from our knowledge of how the greedy algorithm works), but it is going to be hard to nail down facts about O(k) (as it is often hard to characterize global optima in discrete or combinatorial settings). It is important to spend time thinking of variations of what you are trying to prove. You can waste a lot of time trying a good technique (like induction) on the wrong invariant (or thing to prove).

The clever idea is as follows: if G(k) and O(k) were identical then (G(k) union O(k)) would in turn be identical to O(k). I have always found the way to prove inequalities is to find some equality you wish were true (which will not be true) and try to translate things that would be derived from the equality. So we are going to look for things that would follow if G(k) = O(k) and see which of these can be used to work forward.

So let’s try to prove f(G(i) union O(k)) can’t be much bigger than f(O(k)) for all i and then check if knowing something like that is enough to complete the proof. Notice we have both specialized (found a new specific thing to try and prove) and generalized (decided to be clear about insisting on all values of i simultaneously, and not just i=k). Deciding what to prove is the “heavy lifting” of the proof, everything else is just technique. Unfortunately while this part is shared by all the common proofs, it is against publication style to always call it out or motivate it (the unfortunate feeling being: if it works it doesn’t need motivation).

To bound f(G(i) union O(k)) in terms of f(O(k)) notice that the set (G(i) union O(k)) can be formed by starting with the set G(i) and adding the elements from O(k). That is: extending the greedy choice with more elements, which is something we can reason about. For any x in O(k) we have: f(G(i) union {x}) ≤ f(G(i+1)). This follows directly from the definition of the greedy algorithm (there can’t be a better next single element choice, else greedy would have picked it). By submodularity, adding x to any superset of G(i) gains no more than adding x to G(i) itself, so each addition along the way gains at most f(G(i+1)) - f(G(i)). The set (G(i) union O(k)) can be reached from G(i) by at most k such additions so we immediately have:

f(G(i) union O(k)) - f(G(i)) ≤ k (f(G(i+1)) - f(G(i))).

(The above statement is also where we use the submodularity of f(): saying the total gain of k additions cannot be any larger than k times the gain of the single best next addition. It is always good to double check if and where your properties are actually being used in a proof.)
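Written out as a chain (my own expansion of the step, listing the elements of O(k) not already in G(i) as y(1), ..., y(m) with m ≤ k, and writing S(j) = G(i) union {y(1), ..., y(j)}):

```latex
\begin{aligned}
f(G(i) \cup O(k)) - f(G(i))
  &= \sum_{j=1}^{m} \left[ f(S(j)) - f(S(j-1)) \right]
     && \text{(telescoping)} \\
  &\le \sum_{j=1}^{m} \left[ f(G(i) \cup \{y(j)\}) - f(G(i)) \right]
     && \text{(submodularity)} \\
  &\le m \left( f(G(i+1)) - f(G(i)) \right)
     && \text{(greedy step)} \\
  &\le k \left( f(G(i+1)) - f(G(i)) \right)
     && \text{(gain nonnegative by monotonicity, and } m \le k \text{)}
\end{aligned}
```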

Now because f() is monotone we know f(O(k)) ≤ f(G(i) union O(k)). So we can replace our above relation with the (possibly weaker, but still true) statement:

f(O(k)) - f(G(i)) ≤ k (f(G(i+1)) - f(G(i))).

As we said: replacing f(G(i) union O(k)) with f(O(k)) was valid (it was on the correct side of the inequality given we know f(O(k)) ≤ f(G(i) union O(k))) but possibly a weaker statement than the original. Following our “imitate the equality” principle we hope not to lose too much to the substitution, as in the fantasy equality case we would have (G(i) union O(k)) = O(k) and there would be no loss in such a substitution at all. The benefit is: we have gotten rid of the artificial set (G(i) union O(k)) (which was introduced only to drive the proof) and now have a statement in terms of only sets we were initially interested in (the O(k) and G(i)).

So we have a relation, but it is hard to tell if we have achieved anything. It turns out the above relation says quite a lot (has in fact proved the theorem), but we can’t see that until we clean it up a bit using algebra.

Re-collect:

f(O(k)) - f(G(i)) ≤ k (f(G(i+1)) - f(G(i)))

as

k f(O(k)) - k f(G(i+1)) ≤ (k-1) f(O(k)) - (k-1) f(G(i)).

The goal is to have all the G(i) and G(i+1) on one side of the relation and to have everything on a given side of the relation have the same multiplier (in this case k or k-1). We are trying to find if our relation means anything by mechanically neatening it up a bit.

We can now assert:

k (f(O(k)) - f(G(i+1))) ≤ (k-1) (f(O(k)) - f(G(i))).
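For the skeptical reader, the rearrangement is pure bookkeeping (my own expansion); abbreviate v = f(O(k)), g = f(G(i)), and g' = f(G(i+1)):

```latex
\begin{aligned}
v - g \le k(g' - g)
  \;&\Longleftrightarrow\; v + (k-1)g - kg' \le 0
     && \text{(move everything to the left)} \\
  \;&\Longleftrightarrow\; kv - kg' \le (k-1)v - (k-1)g
     && \text{(add } (k-1)(v-g) \text{ to both sides)}
\end{aligned}
```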

It is subtle, but at this point we are already done.

Define a(i) = f(O(k)) - f(G(i)) (taking G(0) = {}, so f(G(0)) = 0 by property 1). We have:

  1. a(0) = f(O(k))
  2. a(k) = f(O(k)) - f(G(k))
  3. a(i+1) ≤ (1-1/k) a(i)

Point 3 immediately implies a(k) ≤ (1-1/k)^k a(0) or (substituting in points 1 and 2):

f(O(k)) - f(G(k)) ≤ (1-1/k)^k f(O(k)).
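The unrolling is just k applications of point 3 (my own expansion; note 1-1/k ≥ 0, so the inequalities chain safely):

```latex
a(k) \le \left(1-\tfrac{1}{k}\right) a(k-1)
     \le \left(1-\tfrac{1}{k}\right)^{2} a(k-2)
     \le \cdots
     \le \left(1-\tfrac{1}{k}\right)^{k} a(0)
```

The bound (1-1/k)^k ≤ 1/e used next follows from ln(1-x) ≤ -x: take x = 1/k, multiply by k to get k ln(1-1/k) ≤ -1, and exponentiate.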

Given (1-1/k)^k ≤ 1/e for k ≥ 1 we can say:

f(O(k)) - f(G(k)) ≤ f(O(k))/e.

And we have completed proving the claim.
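As a final sanity check (my code, continuing the hypothetical coverage example and re-using f(), universe, and greedy() from the sketches above) we can compare the greedy pick against a brute-force optimum and confirm the bound on a small instance:

```python
import math
from itertools import combinations

def brute_force_opt(f, universe, k):
    """Exhaustively find O(k): a best-scoring set of cardinality k."""
    return max((set(c) for c in combinations(universe, k)), key=f)

for k in range(1, len(universe) + 1):
    G_k = greedy(f, universe, k)
    O_k = brute_force_opt(f, universe, k)
    # The theorem guarantees f(G(k)) >= (1 - 1/e) * f(O(k)).
    assert f(G_k) >= (1 - 1 / math.e) * f(O_k)
    print(k, f(G_k), f(O_k))
```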

Commentary

The obvious criticism is: the derivation above was perhaps longer than it was worth (or certainly longer than makes sense to publish).

This is in fact why mathematical writing, long as it can be, is considered very dense (and hence “concise” relative to what is actually going on).

The formally published derivations spend their time showing how to derive “f(G(i) union O(k)) - f(G(i)) ≤ k (f(G(i+1)) - f(G(i)))” in a forward fashion (notice we instead claim it and then give backtracking arguments as to why you should trust it). Using more steps (and fewer explanations) makes the proof easier to check, but can obscure the reasoning as it is not common to share the heuristics directing the steps. In my opinion the proof is over once you establish this line as it is mere work to establish “f(G(k)) ≥ (1-1/e) f(O(k))” from that point (only experience and technique are needed, not any new ideas).

An experienced mathematician reading such a proof will tend to annotate it in any number of ways:

  • They may write check-marks to confirm statements follow from each other in forward arguments.
  • They may prove claimed properties by writing new proofs or derivations in the margins.
  • They may underline a small number of key steps. For example underlining only f(G(i) union O(k)) - f(G(i)) ≤ k (f(G(i+1)) - f(G(i))) highlights the idea of introducing the set (G(i) union O(k)) and shows exactly what intermediate theorem needs to be proven.

My study style is as follows: if it is something in my field I shouldn’t have to read the proof. Or at least I shouldn’t have to re-read it too often. To do this I attempt to re-derive the proof in a blank notebook while commuting. I allow myself to skip confirming familiar steps, but I have to be able to “think of” the next steps. Since I am not good at detailed rote memorization it is easiest for me to motivate the steps and then remember the motivation (and re-derive the steps). The payoff is: I may be able to re-use the motivations and habits in later work.

The above article is an expansion of one page of one such notebook.