Formalizing the proof of PFR in Lean4 using Blueprint: a short tour

What's new 2023-11-18

Since the release of my preprint with Tim, Ben, and Freddie proving the Polynomial Freiman-Ruzsa (PFR) conjecture over {\mathbb F}_2, I have started a collaborative project to formalize this argument in the proof assistant language Lean4. It has been less than a week since the project was launched, but it is proceeding quite well, with a significant fraction of the paper already either fully or partially formalized. The project has been greatly assisted by the Blueprint tool, which allows one to write a human-readable “blueprint” of the proof that is linked to the Lean formalization. For the PFR project, the blueprint can be found here. One feature of the blueprint that I find particularly appealing is the dependency graph that is automatically generated from the blueprint, and can provide a rough snapshot of how far along the formalization has advanced. For PFR, the latest state of the dependency graph can be found here. At the current time, the graph looks like this:

The color coding of the various bubbles (for lemmas) and rectangles (for definitions) is explained in the legend to the blueprint, but roughly speaking the green bubbles/rectangles represent lemmas or definitions that have been fully formalized, and the blue ones represent lemmas or definitions which are ready to be formalized (their statements, but not proofs, have already been formalized, as well as those of all prerequisite lemmas and definitions). The goal is to get all the bubbles leading up to the “pfr” bubble at the bottom colored in green.

In this post I would like to give a quick “tour” of the project, to give a sense of how it operates. If one clicks on the “pfr” bubble at the bottom of the dependency graph, we get the following:

Here we display a human-readable form of the PFR statement. This is coming from the corresponding portion of the blueprint, which also comes with a human-readable proof of this statement that relies on other statements in the project:

However, this part of the proof has not yet been formalized in Lean. Observe that the “pfr” bubble is white, but has a green border. This means that the statement of PFR has been formalized in Lean, but not the proof; and the proof itself is not ready to be formalized, because some of the prerequisites (in particular, entropy-pfr) do not even have their statements formalized yet. If we click on the “Lean” link below the description of PFR in the dependency graph, we are led to the (auto-generated) Lean documentation for this assertion:

This is what a typical theorem in Lean looks like (after a procedure known as “pretty printing”). There are a number of hypotheses stated before the colon: that G is a finite elementary abelian 2-group (this is how we have chosen to formalize the finite field vector spaces {\mathbb F}_2^n), that A is a non-empty subset of G, and that the cardinality of A+A is at most K times the cardinality of A. (The hypothesis that A is non-empty was not stated in the LaTeX version of the conjecture, but we realized it was necessary in the formalization, and will update the LaTeX blueprint shortly to reflect this.) The statement after the colon is the conclusion: that A can be contained in the sum c+H of a subgroup H of G and a set c of cardinality at most 2K^{12}.

The astute reader may notice that the above theorem seems to be missing one or two details, for instance it does not explicitly assert that H is a subgroup. This is because the “pretty printing” suppresses some of the information in the actual statement of the theorem, which can be seen by clicking on the “Source” link:

Here we see that H is required to have the “type” of an additive subgroup of G. (Lean’s language revolves very strongly around types, but for this tour we will not go into detail about what a type is exactly.) The prominent “sorry” at the bottom of this theorem asserts that a proof is not yet provided for this theorem, but the intention of course is to replace this “sorry” with an actual proof eventually.
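
For readers who have not seen a “sorry” before, here is a minimal sketch of the general shape of such a statement. It is a toy example that is unrelated to PFR: the name toy_statement and its hypotheses are hypothetical and do not appear in the project. The hypotheses sit before the colon, the conclusion after it, and “sorry” stands in for the missing proof (Lean accepts the file, but emits a warning).

```lean
import Mathlib

-- Hypothetical toy statement, not taken from the PFR project.
-- Hypotheses (`n`, `hn`) come before the colon; the conclusion comes after it.
theorem toy_statement (n : ℕ) (hn : 0 < n) : 1 ≤ n := by
  sorry  -- placeholder: the proof has not been supplied yet
```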

Filling in this “sorry” is too hard to do right now, so let’s look for a simpler task to accomplish for now. Here is a simple intermediate lemma “ruzsa-nonneg” that shows up in the proof:

The expression d[X; Y] refers to something called the entropic Ruzsa distance between X and Y, which is something that is defined elsewhere in the project, but for the current discussion it is not important to know its precise definition. The bubble is blue with a green border, which means that the statement has been formalized, and the proof is ready to be formalized also. The blueprint has indicated that this lemma can be deduced from just one preceding lemma, called “ruzsa-diff”:

“ruzsa-diff” is also blue and bordered in green, so it has the same current status as “ruzsa-nonneg”: the statement is formalized, and the proof is ready to be formalized also, but the proof has not been written in Lean yet. The quantity H[X], by the way, refers to the Shannon entropy of X, defined elsewhere in the project, but for this discussion we do not need to know its definition.

Looking at Lemma 3.11 and Lemma 3.13 it is clear how the former will imply the latter: the quantity |H[X] - H[Y]| is clearly non-negative! (There is a factor of 2 present in Lemma 3.11, but it can be easily canceled out.) So it should be an easy task to fill in the proof of Lemma 3.13 assuming Lemma 3.11, even if we still don’t know how to prove Lemma 3.11 yet. Let’s first look at the Lean code for each lemma. Lemma 3.11 is formalized as follows:

Again we have a “sorry” to indicate that this lemma does not currently have a proof. The Lean notation (as well as the name of the lemma) differs a little from the LaTeX version for technical reasons that we will not go into here. (Also, the variables X, \mu, Y, \mu' are introduced at an earlier stage in the Lean file; again, we will ignore this point for the ensuing discussion.) Meanwhile, Lemma 3.13 is currently formalized as

OK, let’s now try to fill in the latter “sorry”. In my local copy of the PFR GitHub repository, I open up the relevant Lean file in my editor (Visual Studio Code) and navigate to the “sorry” of “rdist_nonneg”. The accompanying “Lean infoview” then shows the current state of the Lean proof:

Here we see a number of ambient hypotheses (e.g., that G is an additive commutative group, that X is a map from \Omega to G, and so forth; many of these hypotheses are not actually relevant for this particular lemma), and at the bottom we see the goal we wish to prove.

OK, so now I’ll try to prove the claim. This is accomplished by applying a series of “tactics” to transform the goal and/or hypotheses. The first step I’ll do is to put in the factor of 2 that is needed to apply Lemma 3.11. This I will do with the “suffices” tactic, writing in the proof

I now have two goals (and two “sorries”): one to show that 0 \leq 2 d[X;Y] implies 0 \leq d[X;Y], and the other to show that 0 \leq 2 d[X;Y]. (The yellow squiggly underline indicates that this lemma has not been fully proven yet due to the presence of “sorry”s.)
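
The shape of this step can be sketched in a stand-alone setting. The sketch below is not the project's code: it replaces d[X;Y], H[X] and H[Y] by hypothetical real numbers d, a and b, and assumes the bound of Lemma 3.11 as a hypothesis h_diff. It also uses the “suffices … by” form, which may differ slightly from the syntax used in the project at the time.

```lean
import Mathlib

-- Hypothetical stand-ins: `d` for d[X;Y], `a` for H[X], `b` for H[Y].
-- `h_diff` plays the role of Lemma 3.11 and is simply assumed here.
example (a b d : ℝ) (h_diff : |a - b| ≤ 2 * d) : 0 ≤ d := by
  suffices h2 : 0 ≤ 2 * d by
    sorry  -- first goal: deduce 0 ≤ d from h2
  sorry    -- second goal: prove 0 ≤ 2 * d
```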

Let’s fill in the first “sorry”. The tactic state now looks like this (cropping out some irrelevant hypotheses):

Here I can use a handy tactic “linarith”, which solves any goal that can be derived by linear arithmetic from existing hypotheses:
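
In isolation, this branch amounts to the following small fact about real numbers. This is a sketch with a hypothetical real number d standing in for d[X;Y], not the project's own code.

```lean
import Mathlib

-- `d` is a hypothetical stand-in for d[X;Y].
-- `linarith` finds the hypothesis `h2` in the context and divides out the factor of 2.
example (d : ℝ) (h2 : 0 ≤ 2 * d) : 0 ≤ d := by
  linarith
```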

This works, and now the tactic state reports no goals left to prove on this branch, so we move on to the remaining sorry, in which the goal is now to prove 0 \leq 2 d[X;Y]:

Here we will try to invoke Lemma 3.11. I add the following lines of code:

I again have two subgoals, one to prove the bound |H[X]-H[Y]| \leq 2 d[X;Y] (which I will call “h”), and then to deduce the previous goal 0 \leq 2 d[X;Y] from h. For the first, I know I should invoke the lemma “diff_ent_le_rdist” that is encoding Lemma 3.11. One way to do this is to try the tactic “exact?”, which will automatically search to see if the goal can already be deduced immediately from an existing lemma. It reports:

So I try this, and it works, leaving me with the final “sorry”:

I can again use the “exact?” tactic and follow its suggestion to establish the matching bound 0 \leq |H[X] - H[Y]|:

(One can find documentation for the “abs_nonneg” method here. There are several other search engines available to locate this method as well; for this tour I will just focus on using “exact?”.) To fill in the final “sorry”, I try “exact?” one last time, to figure out how to combine h and h' to give the desired goal, and it works!

(Note that the yellow squiggly underline has now disappeared. The documentation for “ge_trans” may be found here. The reader may observe that this method uses the \geq relation rather than the \leq relation, but in Lean the assertions X \geq Y and Y \leq X are “definitionally equivalent”, allowing tactics such as “exact” to use them interchangeably.)
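
Putting the steps together, the whole argument can be replayed in a stand-alone sketch. As before, this is not the project's code: a, b and d are hypothetical real numbers standing in for H[X], H[Y] and d[X;Y], and the hypothesis h_diff takes the place of the lemma “diff_ent_le_rdist”. The Mathlib lemmas abs_nonneg and ge_trans are the ones named above.

```lean
import Mathlib

-- Hypothetical stand-ins: `a` for H[X], `b` for H[Y], `d` for d[X;Y].
-- `h_diff` is assumed in place of the project's lemma `diff_ent_le_rdist`.
example (a b d : ℝ) (h_diff : |a - b| ≤ 2 * d) : 0 ≤ d := by
  -- Step 1: reduce to the statement with the factor of 2; `linarith` closes the reduction.
  suffices h2 : 0 ≤ 2 * d by linarith
  -- Step 2: the bound corresponding to Lemma 3.11.
  have h : |a - b| ≤ 2 * d := h_diff
  -- Step 3: absolute values are non-negative.
  have h' : 0 ≤ |a - b| := abs_nonneg (a - b)
  -- Step 4: chain the two inequalities; ≥ and ≤ are interchangeable here.
  exact ge_trans h h'
```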

It is possible to compactify this proof quite a bit by cutting out several intermediate steps (a procedure sometimes known as “code golf”):
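
One possible compressed form, again in a stand-alone setting with hypothetical real numbers a, b, d and an assumed hypothesis h_diff in place of Lemma 3.11, passes the chained inequality directly to linarith as an extra fact. The project's own one-line proof may be phrased differently.

```lean
import Mathlib

-- Hypothetical stand-ins: `a` for H[X], `b` for H[Y], `d` for d[X;Y].
-- The bracketed term proves 2 * d ≥ 0, and `linarith` removes the factor of 2.
example (a b d : ℝ) (h_diff : |a - b| ≤ 2 * d) : 0 ≤ d := by
  linarith [ge_trans h_diff (abs_nonneg (a - b))]
```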

And now the proof is done! The current version of Blueprint does not automatically verify the proof (even though it does compile in Lean), so we have to manually update the blueprint as well. The LaTeX for Lemma 3.13 currently looks like this:

I add the “\leanok” macro to the proof, to flag that the proof has now been formalized:

I then push everything back up to the master GitHub repository. The blueprint will take quite some time (about half an hour) to rebuild, but eventually it does, and the dependency graph (which Blueprint has for some reason decided to rearrange a bit) now shows “ruzsa-nonneg” in green:

And so the formalization of PFR moves a little bit closer to completion. (Of course, this was a particularly easy lemma to formalize, that I chose to illustrate the process; one can imagine that most other lemmas will take a bit more work.)

That concludes the brief tour! If you are interested in learning more about the project, you can follow the Zulip chat stream; you can also download Lean and work on the PFR project yourself, using a local copy of the GitHub repository and sending pull requests to the master copy if you have managed to fill in one or more of the “sorry”s in the current version.