Primitive sets and von Mangoldt chains: Erdős Problem #1196 and beyond

What's new 2026-05-04

Boris Alexeev, Kevin Barreto, Yanyang Li, Jared Duker Lichtman, Liam Price, Jibran Iqbal Shah, Quanyu Tang, and I have just uploaded to the arXiv our paper Primitive sets and von Mangoldt chains: Erdős Problem #1196 and beyond. This paper (which is a work in progress) represents our efforts to digest and document the recent flurry of developments around the following problem of Erdős, Sárközy, and Szemerédi on primitive sets:

Conjecture 1 (Erdős problem #1196) Suppose that {A \subset [x,+\infty)} is a primitive set, which means that no element of {A} divides another. Then

\displaystyle  \sum_{n \in A} \frac{1}{n \log n} \le 1 + o(1)

as {x \rightarrow \infty}.

One can show that the upper bound of {1 + o(1)} is best possible up to the {o(1)} error by taking {A} to be the set of products of {k} primes for a suitable parameter {k}. This was one of the most well-known open problems in the study of primitive sets, and had attracted a number of partial results (for instance, Lichtman was able to show the upper bound of {e^\gamma \frac{\pi}{4} + o(1) \approx 1.399}). It was thus notable that this problem was first solved by an autonomous AI query (by the fifth author) a few weeks ago. This solution introduced a proof technique – based on Markov chains in the divisibility poset – which in retrospect is very natural for controlling primitive sets, but which had not been explicitly used in the previous literature, although many of the arguments in that literature implicitly involved a specific Markov chain which we call the downwards Mertens chain. The new proof instead revolved around a different Markov chain, which we call the downwards von Mangoldt chain, and which manages to neatly avoid the “{e^\gamma}”-type losses in the previous Mertens-based arguments and resolve Conjecture 1. In this paper we develop the Markov chain approach more systematically, and show that it settles several further conjectures concerning primitive sets, while also providing simpler proofs of some previous results in the literature. More precisely, in addition to Conjecture 1, we establish the following:

Theorem 2 (Erdős primitive set conjecture, #164) For any primitive {A} consisting of numbers greater than {1},

\displaystyle  \sum_{n \in A} \frac{1}{n \log n} \le \sum_p \frac{1}{p \log p}.

Theorem 3 (Odd Banks–Martin) Let {k\ge 1} and suppose {A} is a primitive set consisting of odd numbers with at most {k} prime factors. Then

\displaystyle \sum_{n \in A} \frac{1}{n \log n} \leq \sum_{n \in \mathbb N_k(\mathcal Q)} \frac{1}{n \log n},

where {\mathcal Q} denotes the primes appearing as factors of elements of {A}, and {\mathbb N_k({\mathcal Q})} is the collection of products of {k} primes from {\mathcal Q}.

Theorem 4 ({2} is Erdős-strong) If {A} is a primitive set consisting of even numbers, then

\displaystyle  \sum_{n \in A} \frac{1}{n \log n} \le \frac{1}{2\log 2}.

Theorem 5 (Ahlswede–Khachatrian–Sárközy) If {A} is a primitive set, then

\displaystyle  \sum_{n \in A \cap [y/x,y]} \frac{1}{n} \ll \frac{\log x}{\sqrt{\log\log x}}

whenever {3 \leq x \leq y}.

Theorem 6 (Erdős–Sárközy–Szemerédi, #1217) Let {A \subset {\bf N}} be such that the upper doubly logarithmic density

\displaystyle  \Delta := \limsup_{x\rightarrow\infty}\frac{1}{\log\log x} f(A \cap [1,x])

is positive. Then there exists a strictly increasing infinite divisibility chain

\displaystyle  n_0 | n_1 | n_2 | \dots

in {A} such that

\displaystyle  \limsup_{x\rightarrow\infty}\frac{1}{\log\log x} \# \{ i: n_i \leq x \} \geq \Delta.

Theorem 2 and Theorem 5 had been previously established by Lichtman and by Ahlswede–Khachatrian–Sárközy respectively, but the Markov chain formalism gives shorter (and more unified) proofs of both. Theorems 3, 4, and 6 were open conjectures that can now be settled by this method. These results were obtained with varying levels of AI involvement, ranging from completely autonomous AI queries, to traditional pen-and-paper calculations, to various hybrid approaches (for instance, with humans suggesting key inequalities that could then be rapidly tested numerically, or even proved, by various AI tools).

— 1. Chain/antichain duality and Markov chains —

I’ll now discuss the basic method of proof and try to motivate the main ideas, which have become much clearer in retrospect. Primitive sets can be viewed as antichains in the divisibility poset {({\bf N},|)}, in which the partial ordering is given by the divisibility relation {a|b}. So, one can pose the following more abstract question: given a general poset {({\mathcal N}, \leq)} and a weight function {\nu : {\mathcal N} \rightarrow {\bf R}^+}, what is the maximal value of {\sum_{n \in A} \nu(n)} as {A} ranges over all antichains in {{\mathcal N}}?

One can attack this problem using the well-known duality between antichains and chains {C} (totally ordered subsets of {{\mathcal N}}): any antichain and any chain meet in at most one point, thus one has

\displaystyle  \sum_{n \in A} 1_{n \in C} \leq 1

for any chain {C} and any antichain {A}. In particular, if one has a measure {\mu} on the space {{\mathcal C}} of all chains (viewed as a compact subspace of the power set of {{\mathcal N}}, equipped with the product topology) with the property that

\displaystyle  \mu \{ C : n \in C \} \geq \nu(n) \ \ \ \ \ (1)

for all {n \in {\mathcal N}}, then by integrating the previous inequality against {\mu} and using Tonelli’s theorem one would obtain the upper bound

\displaystyle  \sum_{n \in A} \nu(n) \leq \mu({\mathcal C}).
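As a toy illustration of the underlying intersection inequality (my own sanity check, not code from the paper), one can enumerate all chains and antichains in a small divisor poset, say the divisors of {36}, and verify by brute force that each chain meets each antichain in at most one point:

```python
from itertools import combinations

def comparable(a, b):
    # a and b are comparable in the divisibility poset if one divides the other
    return a % b == 0 or b % a == 0

# The poset: divisors of 36 under divisibility
N = [d for d in range(1, 37) if 36 % d == 0]

subsets = [c for r in range(len(N) + 1) for c in combinations(N, r)]
chains = [S for S in subsets
          if all(comparable(a, b) for a, b in combinations(S, 2))]
antichains = [S for S in subsets
              if all(not comparable(a, b) for a, b in combinations(S, 2))]

# Every chain meets every antichain in at most one element
assert all(len(set(C) & set(A)) <= 1 for C in chains for A in antichains)
```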

In fact this duality is completely tight:

Proposition 7 (Chain/antichain duality) Let {({\mathcal N}, \leq)} be a poset, let {\nu : {\mathcal N} \rightarrow {\bf R}^+} be a weight function, and let {M > 0}. Then the following are equivalent:
  • (i) {\sum_{n \in A} \nu(n) \leq M} for all antichains {A}.
  • (ii) There exists a measure {\mu} of total mass at most {M} on the space of chains, such that (1) holds for all {n \in {\mathcal N}}.

Proof: We have already indicated how (ii) implies (i). Now we need to show that (i) implies (ii). A standard compactness argument allows us to reduce to the case when {{\mathcal N}} is finite. If (i) holds, then we also have

\displaystyle  \sum_{n \in {\mathcal N}} f(n) \nu(n) \leq M \ \ \ \ \ (2)

for all {f : {\mathcal N} \rightarrow {\bf R}} in the Stanley chain polytope, defined as the convex hull of the indicator functions of antichains. By a classic result of Stanley, this polytope can also be defined as the space of all {f : {\mathcal N} \rightarrow {\bf R}} obeying the inequalities

\displaystyle  0 \leq f(n) \hbox{ for all } n \in {\mathcal N} \ \ \ \ \ (3)

and

\displaystyle  \sum_{n \in C} f(n) \leq 1 \hbox{ for all chains } C. \ \ \ \ \ (4)

Applying linear programming duality (or the Farkas lemma), we conclude that the inequality (2) must be a non-negative linear combination of the inequalities (3), (4) (as well as the trivial inequality {0 \leq 1}). Equivalently, we can find non-negative weights {\mu(C)} for each chain {C} such that

\displaystyle  \sum_{C} \mu(C) \leq M

and

\displaystyle  \sum_{C: n \in C} \mu(C) \geq \nu(n)

for all {n \in {\mathcal N}}. The claim follows by viewing {\mu} as a measure on the space of chains. \Box

Thus, the “universal” problem of obtaining a uniform upper bound on {\sum_{n \in A} \nu(n)} for all antichains {A} is replaced with the equivalent “existential” dual problem of exhibiting a single measure {\mu} on chains of controlled mass, which hits each element {n} of the poset with a mass of at least the original weight {\nu(n)}. Such problems are therefore reduced to finding a sufficiently clever construction of such a measure {\mu}. If the mass of {\mu} is normalized to equal {1}, this becomes a probability problem: find a random chain process in the poset {{\mathcal N}} that hits each element {n} of the poset with sufficiently high probability. (Though in our paper we found it more convenient for technical reasons not to normalize the measure, and to allow the mass to take values other than {1}.)

It turns out (in a manner that was not explicitly appreciated in past literature) that particularly good choices of random chain to use here can come from Markov chains. (Here, the term “chain” is now being used in two different ways, but fortunately the order-theoretic concept of a chain and the Markov process-theoretic concept of a chain will be quite compatible in this discussion.) There will be two types of Markov chains on the poset {{\mathcal N}} that will be relevant: downwards Markov chains and upwards Markov chains. Here is our notation for a downwards Markov chain:

Definition 8 (Downwards Markov chain) Let {({\mathcal N}, \leq)} be a poset, and suppose we designate some subset {{\mathcal A}} of {{\mathcal N}} to be the “absorbing states” (in practice these will be the minimal elements of {{\mathcal N}}, although they do not have to be). A downwards Markov chain on {{\mathcal N}} with absorbing states {{\mathcal A}} is a collection of transition probabilities {P(n \searrow m)} for {n, m \in {\mathcal N}} obeying the following axioms:
  • (i) {P(n \searrow m) \geq 0}, and {P(n \searrow m) = 0} unless either {m < n} and {n \notin {\mathcal A}}, or {m = n} and {n \in {\mathcal A}}.
  • (ii) For any {n \in {\mathcal N}}, one has {\sum_{m \in {\mathcal N}} P(n \searrow m) = 1}.
(Thus for instance one has {P(n \searrow n)=1} for any absorbing state {n \in {\mathcal A}}.)

Given such a downwards Markov chain and an initial state {n_0 \in {\mathcal N}}, one can generate a random decreasing sequence {n_0 \geq n_1 \geq n_2 \geq \dots} by having each {n_i} transition to {n_{i+1}} with probability {P(n_i \searrow n_{i+1})} after conditioning on the past history {n_0,\dots,n_i} of the chain. This sequence will (almost surely) be strictly decreasing until it hits an absorbing state, in which case it stays there forever, although if the descending chain condition is not satisfied it is also possible for the sequence to be strictly decreasing indefinitely. We let {{\mathbb P}_{n_0 \searrow}} denote the law of this decreasing sequence. This construction is already enough to recover the Lubell–Yamamoto–Meshalkin (LYM) inequality:

Theorem 9 (LYM inequality) If {({\mathcal N}, \subset)} is the power set of {\{1,\dots,n\}} with the inclusion partial ordering, and {A} is an antichain in this poset (i.e., a Sperner family), then

\displaystyle  \sum_{S \in A} \frac{1}{\binom{n}{|S|}} \leq 1.

Proof: We introduce the downwards Markov chain with absorbing state {\emptyset} in which each non-empty subset of {\{1,\dots,n\}} of some cardinality {0 < k \leq n} transitions to a {k-1}-element subset chosen uniformly at random (i.e., with probability {1/k} of transitioning to each). If we start the descending sequence from the maximal element {\{1,\dots,n\}} of {{\mathcal N}}, then one can easily check that each {S \in {\mathcal N}} is hit with probability {1/\binom{n}{|S|}}. Applying Proposition 7 with {\nu(S) = 1/\binom{n}{|S|}}, {M=1}, and {\mu = {\mathbb P}_{\{1,\dots,n\} \searrow}}, we obtain the claim. \Box
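For small {n}, the hitting probabilities in this proof can be verified exactly with rational arithmetic. The following Python snippet (an illustration of mine, not code from the paper) propagates the law of the chain down the Boolean lattice for {n=5} and checks that each subset {S} is hit with probability exactly {1/\binom{n}{|S|}}:

```python
from fractions import Fraction
from itertools import combinations
from math import comb

n = 5
universe = frozenset(range(1, n + 1))

# Law of the downwards chain that removes a uniformly random element,
# started from the full set {1,...,n}.
hit = {universe: Fraction(1)}
for k in range(n, 0, -1):
    for c in combinations(sorted(universe), k):
        S = frozenset(c)
        p = hit.get(S, Fraction(0))
        for x in S:
            # S transitions to S \ {x} with probability 1/|S| = 1/k
            T = S - {x}
            hit[T] = hit.get(T, Fraction(0)) + p / k

# Each subset S is hit with probability exactly 1 / C(n, |S|)
for S, p in hit.items():
    assert p == Fraction(1, comb(n, len(S)))
```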

In the above argument we fixed the initial location {n_0} of the Markov chain, but more generally one can start with any source mass {b : {\mathcal N} \rightarrow {\bf R}^+} and work with the measure {\mu = \sum_{n_0 \in {\mathcal N}} b(n_0) {\mathbb P}_{n_0 \searrow}} for the purposes of applying Proposition 7.

One can also define upwards Markov chains {P(n \nearrow m)} in exact analogy with downwards Markov chains {P(n \searrow m)} (reversing the order in the poset), which now generate random increasing sequences {n_0 \leq n_1 \leq n_2 \leq \dots} rather than decreasing sequences. There is a useful adjoint construction that can convert a downwards Markov chain into an upwards Markov chain: if we have a positive weight {\nu : {\mathcal N} \rightarrow (0,+\infty)} which is invariant under the chain in the sense that

\displaystyle  \nu(n) = \sum_{m \in {\mathcal N} : m > n} \nu(m) P(m \searrow n)

for any {n \in {\mathcal N}}, then we can define an adjoint upwards Markov chain {P(n \nearrow m)} (with no absorbing state) by the formula

\displaystyle  P(n \nearrow m) = \frac{\nu(m)}{\nu(n)} P(m \searrow n)

for any {n < m} in {{\mathcal N}}. More generally, if {\nu} is merely sub-invariant in the sense that

\displaystyle  \nu(n) \leq \sum_{m \in {\mathcal N} : m > n} \nu(m) P(m \searrow n)

for all {n \in {\mathcal N}}, one can still construct an adjoint upwards chain as before, but now one must also add an additional absorbing maximal state {\infty} to ensure that the transition probabilities still sum to one.
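To make the adjoint construction concrete, here is a small worked example of mine (not taken from the paper): for the LYM chain on the Boolean lattice, the weight {\nu(S) = 1/\binom{n}{|S|}} is invariant in the above sense, and the adjoint upwards chain works out to add a uniformly random missing element. Both facts reduce to binomial identities that can be checked with exact arithmetic:

```python
from fractions import Fraction
from math import comb

n = 6

def nu(k):
    # Weight nu(S) = 1 / C(n, |S|) for a k-element subset S of {1,...,n}
    return Fraction(1, comb(n, k))

for k in range(n):
    # Invariance: a k-set S has n - k immediate supersets T, each of which
    # transitions down to S with probability 1/(k+1).
    assert nu(k) == (n - k) * nu(k + 1) * Fraction(1, k + 1)
    # Adjoint chain: P(S up-to T) = nu(T)/nu(S) * P(T down-to S) = 1/(n - k),
    # i.e. add a uniformly random element not yet in S.
    p_up = nu(k + 1) / nu(k) * Fraction(1, k + 1)
    assert p_up == Fraction(1, n - k)
    # The upwards transition probabilities out of S sum to one
    assert (n - k) * p_up == 1
```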

A downwards or upwards Markov chain, when equipped with an invariant or sub-invariant weight, also induces a flow network on the poset, in which an edge from {n} to {m} is assigned a flow capacity of

\displaystyle  \nu(m) P(m \searrow n) = \nu(n) P(n \nearrow m).

One can rewrite the Markov chain arguments in the paper in terms of such flow networks, in which case the arguments often boil down to an application of the discrete divergence theorem, giving very short proofs of many of the above results; see the paper for more discussion. However, we chose to focus more on the Markov chain approach in our presentation, as this formalism is also natural and could potentially be more flexible for further applications.

— 2. The Mertens and von Mangoldt chains —

For the purpose of analyzing primitive sets, there are two downwards chains on the natural numbers (with absorbing state {1}) that, in retrospect, are particularly natural to use:

  • The downwards Mertens chain, in which each {n>1} transitions deterministically to {n/P(n)}, where {P(n)} is the largest prime factor of {n}; and
  • the downwards von Mangoldt chain, in which each {n>1} transitions to {n/q} with probability {\Lambda(q)/\log n} for each prime power {q} dividing {n}, with {\Lambda} the von Mangoldt function.

The von Mangoldt weight {\Lambda} is a natural choice here thanks to the fundamental identity

\displaystyle  \sum_{q|n} \Lambda(q) = \log n

which encodes the fundamental theorem of arithmetic. The two chains are similar in many ways to each other: the von Mangoldt process favors dividing by the largest prime factor, but does not require it.
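The identity, and the consequent fact that the von Mangoldt transition probabilities out of each {n>1} sum to one, are easy to spot-check numerically; the snippet below (illustrative only, not from the paper) does so for all {n < 500}:

```python
from math import isclose, log

def mangoldt(q):
    # Lambda(q) = log p if q = p^j for some prime p and j >= 1, else 0
    for p in range(2, q + 1):
        if q % p == 0:  # p is the smallest prime factor of q
            while q % p == 0:
                q //= p
            return log(p) if q == 1 else 0.0
    return 0.0

for n in range(2, 500):
    divs = [q for q in range(2, n + 1) if n % q == 0]
    # Fundamental identity: sum of Lambda(q) over q | n equals log n
    assert isclose(sum(mangoldt(q) for q in divs), log(n))
    # Hence P(n -> n/q) = Lambda(q)/log n is a probability distribution
    assert isclose(sum(mangoldt(q) / log(n) for q in divs), 1.0)
```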

The Mertens chain generates deterministic downward divisibility chains

\displaystyle 1 | p_1 | p_1 p_2 | \dots | p_1 \dots p_k

starting from a product of {k} primes {p_1 \leq p_2 \leq \dots \leq p_k}, and as such this process was implicit in much of the previous literature on primitive sets. However, it does not interact quite so well with the {\frac{1}{n \log n}} weight, which is neither invariant nor sub-invariant with respect to this process. Intuitively, the process of dividing {n} by {P(n)} tends to increasingly select for numbers {n} for which {P(n)} is smaller than expected. Instead, the natural invariant measure for this chain is the Mertens weight

\displaystyle  \nu_{\mathrm{Mertens}}(n) := \frac{e^\gamma}{n} \prod_{p < P(n)} \left(1-\frac{1}{p}\right) \approx \frac{1}{n \log P(n)};

the verification that this is indeed an invariant weight is a nice exercise in telescoping series. Taking the adjoint of the downwards Mertens chain with respect to this weight and running that chain from {1} gives the upwards Mertens divisibility chain

\displaystyle  1 | p_1 | p_1 p_2 | \dots

in which each {n} transitions to {np} for some prime {p \geq P(n)} with probability {\frac{1}{p} \prod_{P(n) \leq p' < p} \left(1-\frac{1}{p'}\right)}. A routine induction shows that each {n} is hit by this chain with a probability of {\frac{1}{n} \prod_{p < P(n)} \left(1-\frac{1}{p}\right)}; this for instance gives a weak version

\displaystyle  \sum_{n \in A} \frac{1}{n \log n} \le e^\gamma + o(1)

of Conjecture 1, and similarly for the other results discussed above.
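Since primes are attached in nondecreasing order, the upwards Mertens chain reaches a given {n} along a unique path, so the hitting probability {\frac{1}{n} \prod_{p < P(n)} (1-\frac{1}{p})} can be checked exactly with rational arithmetic. The following Python sketch (my own illustration, not code from the paper) does this for a few values of {n}:

```python
from fractions import Fraction

def primes_up_to(m):
    sieve = [True] * (m + 1)
    sieve[0] = sieve[1] = False
    for i in range(2, int(m ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i::i] = [False] * len(sieve[i * i::i])
    return [i for i, b in enumerate(sieve) if b]

PRIMES = primes_up_to(1000)

def prime_factors(n):
    # Prime factors of n with multiplicity, in nondecreasing order
    out = []
    for p in PRIMES:
        while n % p == 0:
            out.append(p)
            n //= p
    assert n == 1, "n must be 1000-smooth for this sketch"
    return out

def hitting_prob(n):
    # Probability that the upwards Mertens chain started at 1 passes through n;
    # since primes are attached in nondecreasing order, the path is unique.
    prob, floor = Fraction(1), 2
    for p in prime_factors(n):
        step = Fraction(1, p)
        for pp in PRIMES:
            if floor <= pp < p:
                step *= 1 - Fraction(1, pp)
        prob *= step
        floor = p
    return prob

for n in [2, 12, 30, 97, 210, 360]:
    largest = max(prime_factors(n))
    expected = Fraction(1, n)
    for pp in PRIMES:
        if pp < largest:
            expected *= 1 - Fraction(1, pp)
    assert hitting_prob(n) == expected
```

For instance, for {n = 12} the unique path is {1 | 2 | 4 | 12}, with probability {\frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{6} = \frac{1}{24} = \frac{1}{12}(1 - \frac{1}{2})}.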

The key innovation (which was uncovered by the AI-assisted proofs, though not quite in the notation and framework presented here) is to switch to the von Mangoldt chain, which removes the bias towards numbers whose largest prime factor is small. Indeed, the weight {\frac{1}{n \log n}} now turns out to be sub-invariant (after removing {n=1}) under this chain, and there is a modification

\displaystyle  \nu_\Lambda(n) := \int_1^\infty \frac{\log n}{\zeta(s) n^s}\ ds = \frac{1 + o(1)}{n \log n}

of this weight which turns out to be perfectly invariant. (We have an interpretation of this formula in terms of a zeta process that couples together various zeta distributions into a continuous divisibility chain; see the paper for further details.) Taking the adjoint with respect to this weight (or with the original weight {\frac{1}{n \log n}}) eliminates the {e^\gamma} loss in the previous argument, and gives one of the proofs of Conjecture 1 recorded in our paper (there are several other variants of this method that we also present).
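One can also test the asymptotic {\nu_\Lambda(n) = \frac{1+o(1)}{n \log n}} numerically. The sketch below is my own illustration (the truncation parameters and the crude Euler–Maclaurin approximation to {\zeta} are ad hoc choices, not from the paper); it computes the integral for {n = 10^5} and finds a ratio {n \log n \cdot \nu_\Lambda(n)} somewhat below {1}, consistent with a secondary term of size {O(1/\log n)}:

```python
from math import exp, log

def zeta(s, K=1000):
    # Truncated Dirichlet series with a crude Euler-Maclaurin tail correction
    return sum(k ** -s for k in range(1, K + 1)) \
        + K ** (1 - s) / (s - 1) - 0.5 * K ** -s

def nu_lambda(n, T=40.0, steps=2000):
    # Trapezoid rule for the integral of log(n) / (zeta(s) n^s) over s > 1,
    # after the substitution s = 1 + t/log(n), truncated at t = T
    L = log(n)
    h = T / steps
    total = 0.0
    for i in range(steps + 1):
        t = i * h
        s = 1.0 + t / L
        # Integrand vanishes at s = 1 since 1/zeta(s) -> 0 there
        f = 0.0 if t == 0.0 else L / (zeta(s) * exp(s * L))
        total += f * (0.5 if i in (0, steps) else 1.0)
    return total * h / L  # ds = dt / log n

n = 10 ** 5
ratio = n * log(n) * nu_lambda(n)
print(ratio)
```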

One slight defect of the von Mangoldt chain, as compared to the Mertens chain, is that it can “jump over” primitive sets (such as the set of products of {k} primes) due to the fact that it will sometimes multiply or divide by a power of a prime rather than a prime itself. This turns out to be a technical difficulty for many of our applications, resulting in a need to make various small ad hoc modifications to the von Mangoldt chain to eliminate this type of jump.

In order to establish some crucial sub-invariance properties, it turns out (after some standard manipulations) to be useful to obtain good bounds on the negative log-derivative

\displaystyle  - \frac{\zeta'}{\zeta}(s) = \sum_{n=1}^\infty \frac{\Lambda(n)}{n^s}

of the Riemann zeta function {\zeta} in the region {s > 1}. Here it turns out that there is a clean (and rather efficient) upper bound

\displaystyle  - \frac{\zeta'}{\zeta}(s) \leq \frac{\log 2}{2^{s-1}-1}

which is equivalent to the non-decreasing nature of the Dirichlet eta function

\displaystyle  \eta(s) = (1-2^{1-s}) \zeta(s)

in this region. There are many proofs of this fact in the literature, but I would like to record a particularly cute proof that is in the spirit of other arguments in the paper: one can interpret {\eta(s)} probabilistically as

\displaystyle  \eta(s) = {\mathbb E} \frac{1}{1+e^{-X_s}}

where {X_s} is a gamma random variable of shape {s} and scale {1}. Because the sum of independent copies of {X_s} and {X_t} has the distribution of {X_{s+t}}, one can couple together all the {X_s} so that they are increasing in {s}; since {x \mapsto \frac{1}{1+e^{-x}}} is an increasing function, the claim follows.
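Both the displayed bound on {-\frac{\zeta'}{\zeta}} and the probabilistic identity for {\eta(s)} are easy to spot-check numerically; here is an illustrative Python snippet of mine (not from the paper). Note that truncating the von Mangoldt sum only decreases the left-hand side, so the first check is one-sided but safe:

```python
import random
from math import exp, log

def mangoldt(q):
    # Lambda(q) = log p if q is a power of a prime p, else 0
    for p in range(2, q + 1):
        if q % p == 0:
            while q % p == 0:
                q //= p
            return log(p) if q == 1 else 0.0
    return 0.0

def eta(s, terms=100_000):
    # Alternating Dirichlet series for the eta function
    return sum((-1) ** (k - 1) * k ** -s for k in range(1, terms + 1))

# One-sided check of -zeta'/zeta(s) <= log 2 / (2^(s-1) - 1): the truncated
# sum of Lambda(k) k^{-s} only undercounts the left-hand side.
for s in (1.1, 1.5, 2.0, 3.0):
    partial = sum(mangoldt(k) / k ** s for k in range(2, 2000))
    assert partial <= log(2) / (2 ** (s - 1) - 1)

# Monte Carlo check of eta(s) = E[ 1/(1 + exp(-X_s)) ], X_s ~ Gamma(s, 1),
# with a fixed seed for reproducibility.
random.seed(0)
N = 200_000
for s in (1.0, 2.0, 3.0):
    mc = sum(1.0 / (1.0 + exp(-random.gammavariate(s, 1.0)))
             for _ in range(N)) / N
    assert abs(mc - eta(s)) < 0.02
```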

— 3. Further directions —

Will Sawin and Ofir Gorodetsky have obtained analogues of several of the above results for function field and permutation models respectively; we briefly discuss these in the paper, although we do not plan to cover these models in depth. We also note another recent use of this technique to solve a separate Erdős problem (#858) relating to antichains in a variant of the divisibility poset.

The zeta process that we have discovered hints at an emerging theory of the “developmental anatomy of integers”, which differs from the existing topic of the anatomy of integers in that it views a large integer {n} (and its prime factor “organs”) not as a static entity, but rather as an evolving process in which primes (or powers of primes, in the case of the von Mangoldt process) are added to or removed from the integer over time. With this perspective, primitive sets can be viewed as singular moments in such a developmental process, which are encountered at most once in the life cycle of a given integer. It seems of interest to study this developmental perspective further.

The paper is currently a work in progress; we have released an early version due to the public interest in this problem. We plan to explore some further applications, and also to formalize more of the above results in Lean (currently two of the six main theorems are formalized), before submitting the paper for publication. The situation here highlights a distinction I have recently drawn between three components of the problem-solving process in mathematics, namely proof generation, proof verification, and proof digestion. In this particular case, the first two steps were extremely rapid thanks to modern AI tools; however, properly digesting the AI-generated proofs into a coherent exposition that places the arguments in context with both past literature and future directions remains a slower process that requires expert human attention.