247B, Notes 4: almost everywhere convergence of Fourier series
What's new 2020-05-14
This set of notes discusses aspects of one of the oldest questions in Fourier analysis, namely the nature of convergence of Fourier series.
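If $f \colon \mathbb{R}/\mathbb{Z} \to \mathbb{C}$ is an absolutely integrable function, its Fourier coefficients $\hat f \colon \mathbb{Z} \to \mathbb{C}$ are defined by the formula
$$\hat f(n) := \int_{\mathbb{R}/\mathbb{Z}} f(x) e^{-2\pi i n x}\, dx.$$
If $f$ is smooth, then the Fourier coefficients $\hat f$ are absolutely summable, and we have the Fourier inversion formula
$$f(x) = \sum_{n \in \mathbb{Z}} \hat f(n) e^{2\pi i n x}$$
where the series here is uniformly convergent. In particular, if we define the partial summation operators
$$S_N f(x) := \sum_{n=-N}^{N} \hat f(n) e^{2\pi i n x}$$
then $S_N f$ converges uniformly to $f$ when $f$ is smooth.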
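As a quick numerical sanity check of the uniform convergence just described, here is a minimal sketch. The test function $e^{\cos(2\pi x)}$, the sample counts and the helper names `fourier_coefficient` and `partial_sum` are all illustrative choices of ours, and the Fourier coefficients are only approximated by Riemann sums.

```python
import cmath
import math

def fourier_coefficient(f, n, samples=256):
    """Approximate the n-th Fourier coefficient of a 1-periodic f by an equispaced Riemann sum."""
    return sum(f(j / samples) * cmath.exp(-2j * math.pi * n * j / samples)
               for j in range(samples)) / samples

def partial_sum(coeffs, N, x):
    """S_N f(x): sum of coeffs[n] * e^{2 pi i n x} over |n| <= N."""
    return sum(coeffs[n] * cmath.exp(2j * math.pi * n * x) for n in range(-N, N + 1))

def f(x):
    # An illustrative smooth 1-periodic function (our choice, not from the notes).
    return math.exp(math.cos(2 * math.pi * x))

coeffs = {n: fourier_coefficient(f, n) for n in range(-12, 13)}
grid = [j / 200 for j in range(200)]
cutoffs = (1, 2, 4, 8, 12)

# Sup-norm error of S_N f on the grid, for increasing N.
sup_errors = [max(abs(partial_sum(coeffs, N, x) - f(x)) for x in grid) for N in cutoffs]
for N, err in zip(cutoffs, sup_errors):
    print(N, err)
```

For this particular function the printed errors should decay very rapidly in $N$, reflecting the rapid decay of the Fourier coefficients of a smooth function.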
What if $f$ is not smooth, but merely lies in an $L^p(\mathbb{R}/\mathbb{Z})$ class for some $1 \leq p \leq \infty$? The Fourier coefficients $\hat f$ remain well-defined, as do the partial summation operators $S_N$. The question of convergence in norm is relatively easy to settle:
Exercise 1
- (i) If $1 < p < \infty$ and $f \in L^p(\mathbb{R}/\mathbb{Z})$, show that $S_N f$ converges in $L^p$ norm to $f$. (Hint: first use the boundedness of the Hilbert transform to show that $S_N$ is bounded in $L^p$ uniformly in $N$.)
- (ii) If $p = 1$ or $p = \infty$, show that there exists $f \in L^p(\mathbb{R}/\mathbb{Z})$ such that the sequence $S_N f$ is unbounded in $L^p$ (so in particular it certainly does not converge in $L^p$ norm to $f$). (Hint: first show that $S_N$ is not bounded in $L^p$ uniformly in $N$, then apply the uniform boundedness principle in the contrapositive.)
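To get a feel for the failure of uniform boundedness behind part (ii) of the exercise, one can compute the $L^1$ norms of the Dirichlet kernels $D_N(x) = \sin((2N+1)\pi x)/\sin(\pi x)$, which are the convolution kernels of the $S_N$. The sketch below is only a numerical illustration (the midpoint rule, the sample count and the function names are our own choices); it is not a substitute for the argument requested in the exercise.

```python
import math

def dirichlet_kernel(N, x):
    """D_N(x) = sum of e^{2 pi i n x} over |n| <= N, in closed form."""
    s = math.sin(math.pi * x)
    if abs(s) < 1e-12:
        return 2 * N + 1  # limiting value at integer x
    return math.sin((2 * N + 1) * math.pi * x) / s

def lebesgue_constant(N, samples=20000):
    """Midpoint-rule approximation of the integral of |D_N| over one period."""
    return sum(abs(dirichlet_kernel(N, (j + 0.5) / samples))
               for j in range(samples)) / samples

constants = {N: lebesgue_constant(N) for N in (1, 4, 16, 64)}
for N, value in constants.items():
    print(N, value)
```

The values grow slowly but without bound; each multiplication of $N$ by $4$ adds roughly $\frac{4}{\pi^2} \log 4$, consistent with the classical logarithmic growth of these norms.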
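The question of pointwise almost everywhere convergence turned out to be a significantly harder problem:
Theorem 2 (Pointwise almost everywhere convergence)
- (i) (Kolmogorov) There exists $f \in L^1(\mathbb{R}/\mathbb{Z})$ such that $S_N f(x)$ is unbounded in $N$ for almost every $x$.
- (ii) (Carleson) For every $f \in L^2(\mathbb{R}/\mathbb{Z})$, $S_N f(x)$ converges to $f(x)$ as $N \to \infty$ for almost every $x$.
- (iii) (Hunt) For every $1 < p \leq \infty$ and $f \in L^p(\mathbb{R}/\mathbb{Z})$, $S_N f(x)$ converges to $f(x)$ as $N \to \infty$ for almost every $x$.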
Note from Hölder’s inequality that $L^2(\mathbb{R}/\mathbb{Z})$ contains $L^p(\mathbb{R}/\mathbb{Z})$ for all $p \geq 2$, so Carleson’s theorem covers the $p \geq 2$ case of Hunt’s theorem. We remark that the precise threshold near $L^1$ between Kolmogorov-type divergence results and Carleson-Hunt pointwise convergence results, in the category of Orlicz spaces, is still an active area of research; see this paper of Lie for further discussion.
Carleson’s theorem in particular was a surprisingly difficult result, lying just out of reach of classical methods (as we shall see later, the result is much easier if we smooth either the function or the summation method by a tiny bit). Nowadays we realise that the reason for this is that Carleson’s theorem essentially contains a frequency modulation symmetry in addition to the more familiar translation symmetry and dilation symmetry. This basically rules out the possibility of attacking Carleson’s theorem with tools such as Calderón-Zygmund theory or Littlewood-Paley theory, which respect the latter two symmetries but not the former. Instead, tools from “time-frequency analysis” that essentially respect all three symmetries should be employed. We will illustrate this by giving a relatively short proof of Carleson’s theorem due to Lacey and Thiele. (There are other proofs of Carleson’s theorem, including Carleson’s original proof, its modification by Hunt, and a later time-frequency proof by Fefferman; see Remark 18 below.)
— 1. Equivalent forms of almost everywhere convergence of Fourier series —
A standard technique to prove almost everywhere convergence results is by first establishing a weak-type estimate of an associated maximal function. For instance, the Lebesgue differentiation theorem is usually established with the assistance of the Hardy-Littlewood maximal inequality; see for instance this previous blog post. A remarkable observation of Stein, known as Stein’s maximal principle, allows one to reverse this implication in certain cases by exploiting a symmetry of the problem. Here is the principle specialised to the application of pointwise convergence of Fourier series, and also combined with a transference principle of Kenig and Tomas:
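Proposition 3 (Equivalent forms of almost everywhere convergence) Let $1 \leq p \leq 2$. Then the following statements are equivalent:
- (i) For every $f \in L^p(\mathbb{R}/\mathbb{Z})$, one has $S_N f(x) \to f(x)$ as $N \to \infty$ for almost every $x$.
- (ii) There does not exist $f \in L^p(\mathbb{R}/\mathbb{Z})$ such that $\sup_N |S_N f(x)| = \infty$ for almost every $x$.
- (iii) One has the maximal inequality
$$\| \sup_N |S_N f| \|_{L^{p,\infty}(\mathbb{R}/\mathbb{Z})} \lesssim_p \|f\|_{L^p(\mathbb{R}/\mathbb{Z})} \qquad (1)$$
for all smooth $f$, where the weak $L^p$ norm is defined as
$$\|f\|_{L^{p,\infty}(\mathbb{R}/\mathbb{Z})} := \sup_{\lambda > 0} \lambda\, |\{ x : |f(x)| \geq \lambda \}|^{1/p}$$
and $|E|$ denotes the Lebesgue measure of a set $E$ (which in this setting is a subset of the unit circle).
- (iv) One has the maximal inequality
$$\| \sup_{N \in \mathbb{Z}} |1_{D \leq N} f| \|_{L^{p,\infty}(\mathbb{R}/\mathbb{Z})} \lesssim_p \|f\|_{L^p(\mathbb{R}/\mathbb{Z})}$$
for all smooth $f$, where $1_{D \leq N} f(x) := \sum_{n \leq N} \hat f(n) e^{2\pi i n x}$.
- (v) One has the maximal inequality
$$\| \sup_{N \in \mathbb{R}} |1_{D \leq N} f| \|_{L^{p,\infty}(\mathbb{R})} \lesssim_p \|f\|_{L^p(\mathbb{R})}$$
for all Schwartz functions $f$, where $1_{D \leq N}$ denotes the Fourier multiplier operator
$$\widehat{1_{D \leq N} f}(\xi) := 1_{\xi \leq N} \hat f(\xi).$$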
Among other things, this proposition equates the qualitative property (i) of almost everywhere convergence to the quantitative property (iii) of a maximal inequality. This equivalence (first observed by Calderón) is similar in spirit to the uniform boundedness principle (see e.g. Corollary 1 of this previous blog post). The restriction on $p$ is needed for just one implication (from (ii) to (iii)) in the arguments below, and arises due to the use of Khintchine’s inequality at one point. The equivalence of (iv) and (v) is part of a more general principle of transference that allows one to pass back and forth between periodic domains such as $\mathbb{R}/\mathbb{Z}$ and non-periodic domains such as $\mathbb{R}$ (or, on the Fourier side, between discrete domains such as $\mathbb{Z}$ and continuous domains such as $\mathbb{R}$) if the estimates in question enjoy suitable scaling symmetries. We will use the formulation (v), as it enjoys the most symmetries.
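To make the weak-type quasi-norm appearing in these maximal inequalities concrete, here is a small sketch that evaluates $\sup_{\lambda > 0} \lambda\, |\{|f| \geq \lambda\}|^{1/p}$ for a function sampled on cells of equal measure. We assume this common form of the definition; the discretisation and the helper names are ours. The example $f(x) = 1/x$ on $(0,1]$ is the standard function that lies in weak $L^1$ but not in $L^1$.

```python
def weak_lp_quasinorm(values, p, total_measure=1.0):
    """sup over lambda > 0 of lambda * |{|f| >= lambda}|^(1/p), f constant on equal cells."""
    cell = total_measure / len(values)
    ordered = sorted((abs(v) for v in values), reverse=True)
    # The supremum is attained at lambda equal to one of the sampled values:
    # for lambda = ordered[k], at least k + 1 cells satisfy |f| >= lambda.
    return max(v * ((k + 1) * cell) ** (1.0 / p) for k, v in enumerate(ordered))

def lp_norm(values, p, total_measure=1.0):
    """Ordinary L^p norm of the same step function."""
    cell = total_measure / len(values)
    return (sum(abs(v) ** p for v in values) * cell) ** (1.0 / p)

def sample_reciprocal(n):
    """Midpoint samples of f(x) = 1/x on (0, 1], using n cells of measure 1/n."""
    return [n / (j + 0.5) for j in range(n)]

coarse, fine = sample_reciprocal(1000), sample_reciprocal(100000)
weak_coarse, weak_fine = weak_lp_quasinorm(coarse, 1), weak_lp_quasinorm(fine, 1)
l1_coarse, l1_fine = lp_norm(coarse, 1), lp_norm(fine, 1)
print(weak_coarse, weak_fine)  # stays bounded under refinement
print(l1_coarse, l1_fine)      # grows logarithmically under refinement
```

The weak quasi-norm is stable as the discretisation is refined, while the $L^1$ norm diverges logarithmically; this is the kind of gap that a weak-type estimate tolerates and a strong-type estimate does not.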
Proof: We first show that (iii) implies (i). If (1) holds for all smooth , then certainly for all finite one has
for all smooth , and hence for all by a standard limiting argument. Taking limits as (using variants of Fatou’s lemma) we conclude that (1) holds for all . This implies in particular from the triangle inequality that
for all $f \in L^p(\mathbb{R}/\mathbb{Z})$. On the other hand, the left-hand side vanishes whenever $f$ is smooth, since Fourier series converge uniformly in this case. Since smooth functions are dense in $L^p(\mathbb{R}/\mathbb{Z})$, we conclude from standard limiting arguments that the left-hand side in fact vanishes for all $f \in L^p(\mathbb{R}/\mathbb{Z})$, giving the claim (i).
Clearly (i) implies (ii). Now we assume that (iii) fails and use this to show that (ii) fails as well. From the failure of (iii) and monotone convergence, for any one can find , a measurable subset of , a finite , and such that
In particular, has positive measure. By homogeneity we may normalise . At this stage, nothing prevents the measure of from being much smaller than ; but we can exploit translation invariance to increase the measure of to be comparable to as follows. Let be the integer part of . We claim that there exist translations of whose union has measure comparable to :
This is easiest to establish by the probabilistic method (which in this context we might call the random translation method). If we select uniformly and independently at random we see that every point will lie in a given translate (or equivalently, that lies in ) with probability , hence
Integrating in and using the Fubini-Tonelli theorem, we conclude that
and hence there exist deterministic choices of for which
By definition of the RHS is comparable to , giving the claim (clearly the left-hand side cannot exceed ).
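The random translation argument above can be simulated directly. The sketch below discretises the circle into a grid, takes a set of measure $\varepsilon$ (an arc, purely for illustration), and averages the measure of the union of about $1/\varepsilon$ independent uniform translates. The grid size, the choice of set, the seed and the function names are all assumptions of ours.

```python
import random

def union_measure_of_translates(set_cells, grid, shifts):
    """Measure of the union of the translates of a set on a discretised circle."""
    covered = set()
    for s in shifts:
        covered.update((c + s) % grid for c in set_cells)
    return len(covered) / grid

def average_union_measure(cells_in_set=100, grid=2000, trials=200, seed=0):
    """Average measure of the union of K = grid // cells_in_set random translates."""
    rng = random.Random(seed)
    set_cells = range(cells_in_set)      # an arc of measure cells_in_set / grid
    K = grid // cells_in_set             # integer part of 1 / measure
    total = 0.0
    for _ in range(trials):
        shifts = [rng.randrange(grid) for _ in range(K)]
        total += union_measure_of_translates(set_cells, grid, shifts)
    return total / trials

average = average_union_measure()
expected = 1 - (1 - 100 / 2000) ** 20    # each point is missed with probability (1 - |E|)^K
print(average, expected)
```

The empirical average should sit close to $1 - (1-\varepsilon)^{K}$, which is comparable to $1$ (roughly $1 - 1/e$) no matter how small $\varepsilon$ is.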
Now consider the randomised linear combination
of translates of , where are random Bernoulli signs. From Khintchine’s inequality and the hypothesis we have
hence by construction of and (4)
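Khintchine's inequality, used in the step above, says that the $p$-th moment of a random sign sum $\sum_k \epsilon_k a_k$ is comparable to the corresponding power of the $\ell^2$ norm of the coefficients. A minimal sketch that checks this by exact enumeration of all sign patterns for a short coefficient vector (the vector and the function names are illustrative choices of ours):

```python
import itertools
import math

def sign_moment(coeffs, p):
    """Exact E|sum_k eps_k a_k|^p over all equally likely sign patterns."""
    total = 0.0
    for signs in itertools.product((1, -1), repeat=len(coeffs)):
        total += abs(sum(s * a for s, a in zip(signs, coeffs))) ** p
    return total / 2 ** len(coeffs)

def l2_norm(coeffs):
    return math.sqrt(sum(abs(a) ** 2 for a in coeffs))

coeffs = [1.0, 0.5, 2.0, 1.5, 0.25, 3.0, 1.0, 1.0]
second_moment = sign_moment(coeffs, 2)                  # equals the squared l^2 norm exactly
ratio_p1 = sign_moment(coeffs, 1) / l2_norm(coeffs)     # comparable to 1
print(second_moment, l2_norm(coeffs) ** 2, ratio_p1)
```

For $p = 1$ the ratio always lies between $1/\sqrt{2}$ and $1$; the upper bound is just Cauchy-Schwarz, and the lower bound is the sharp constant in Khintchine's inequality.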
Now we study the behaviour of when . Since is a convolution operator, it commutes with translations, and hence
for each . On the other hand, from (3) we have
and hence there exists such that
In particular, the square function
is at least . Meanwhile, from Khintchine’s inequality and (7) we have
for all . Applying the Paley-Zygmund inequality (setting , for instance) we conclude that
(for suitable choices of implied constants), so in particular
Integrating in using (5), and applying the Fubini-Tonelli theorem, we conclude that
hence by (6) one has
In particular, there exists a deterministic choice of signs (and hence of ) for which
On the other hand, the left-hand side is at most . We conclude that for every , we can find a smooth function with
and a finite , as well as a set of measure , such that
for all .
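The Paley-Zygmund inequality used in the argument above states that a non-negative random variable $Z$ with finite second moment satisfies $\mathbf{P}(Z > \theta \mathbf{E} Z) \geq (1-\theta)^2 (\mathbf{E} Z)^2 / \mathbf{E} Z^2$ for $0 \leq \theta \leq 1$. Here is a small exact check for $Z = |\sum_k \epsilon_k a_k|^2$ with random signs; the coefficients and the name `paley_zygmund_check` are illustrative assumptions of ours.

```python
import itertools

def paley_zygmund_check(coeffs, theta):
    """Return (P(Z > theta * EZ), Paley-Zygmund lower bound) for Z = |sum eps_k a_k|^2."""
    z_values = [sum(s * a for s, a in zip(signs, coeffs)) ** 2
                for signs in itertools.product((1, -1), repeat=len(coeffs))]
    mean = sum(z_values) / len(z_values)
    second = sum(z * z for z in z_values) / len(z_values)
    prob = sum(1 for z in z_values if z > theta * mean) / len(z_values)
    bound = (1 - theta) ** 2 * mean ** 2 / second
    return prob, bound

prob, bound = paley_zygmund_check([1.0, 0.5, 2.0, 1.5, 0.25, 3.0, 1.0, 1.0], theta=0.5)
print(prob, bound)
```

For sign sums one has $\mathbf{E} Z^2 \leq 3 (\mathbf{E} Z)^2$, so with $\theta = 1/2$ the lower bound is at least $1/12$ regardless of the coefficients; this uniformity is what makes the inequality useful in the argument.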
Applying this fact iteratively (each time choosing to be sufficiently large depending on all previous choices), we can construct a sequence of smooth functions , finite , and sets for such that
- (a) for all .
- (b) for all .
- (c) for all (note that the right-hand side is finite since the are smooth).
- (d) for all (note that the left-hand side is bounded by ).
By randomly translating the (and ), and using the Borel-Cantelli lemma, we may assume without loss of generality that almost every lies in infinitely many of the . If one then sets
we see that , and from the triangle inequality we see that for almost every we have
for infinitely many , which implies in particular that
for almost all . This shows that (ii) must fail, as required.
Now we show the equivalence of (iii) and (iv). From the identity $S_N = 1_{D \leq N} - 1_{D \leq -N-1}$ and the triangle inequality we see that (iv) implies (iii). Conversely, suppose that (iii) holds. We will take a limit using frequency modulations. For any smooth , we apply (iii) to the modulated function , where is a natural number and , to get
Since , we conclude that
restricting to the range for some given natural number , we conclude for that
Since is smooth, it has rapidly decreasing Fourier coefficients, which implies that converges uniformly to zero as . We conclude that
and then we obtain (iv) by monotone convergence.
Now we assume (iv) and work to establish (v). The idea here is to use a rescaling argument, viewing as the limit as of the large circle (in physical space) or the fine lattice (in frequency space).
By limiting arguments we may assume that is compactly supported on some interval . Let be a large scaling parameter, and consider the periodic function defined by
For large enough, this function is smooth and supported on the interval , with norm
The Fourier coefficients of are given by
so that
Applying (iv), we see that for any , we have
Rescaling by , we conclude that
We can let range over the reals rather than the integers as this does not affect the constraint . Rescaling by , we see that for any compact intervals , we have
By uniform Riemann integrability and the rapid decrease of
uniformly for , . We conclude that
By monotone convergence we may replace with , and we then obtain (v).
Finally, we assume (v) and establish (iv). By a limiting argument it suffices to establish (iv) for trigonometric polynomials , that is to say periodic functions whose Fourier coefficients are supported in for some natural number . Let be a non-zero Schwartz function whose Fourier transform is supported in , and for a given scaling parameter let denote the Schwartz function
For sufficiently large one easily checks that
The Fourier transform of can be calculated as
hence (for large enough)
and thus
From (v) we conclude that for any we have
For large enough, the left-hand side is
for some depending on . Dividing by and replacing by , we obtain the claim (iv).
Exercise 4 For , let denote the Fejér summation operators
- (i) For any , establish the pointwise bound
where is the Hardy-Littlewood maximal function
- (ii) Show that for , one has for almost all .
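The contrast between Fejér summation and plain partial summation is easy to see numerically. The sketch below uses the square wave $\mathrm{sign}(\sin(2\pi x))$, whose Fourier series is $\frac{4}{\pi} \sum_{k \text{ odd}} \frac{\sin(2\pi k x)}{k}$, and assumes the normalisation in which the Fejér sum of order $N$ weights frequency $n$ by $(1 - |n|/N)_+$ (the notes' own normalisation may differ slightly). The function names are ours.

```python
import math

def weighted_square_wave_sum(x, N, weight):
    """Sum over odd k <= N of weight(k) * (4/pi) * sin(2 pi k x) / k."""
    return sum(weight(k) * 4 / math.pi * math.sin(2 * math.pi * k * x) / k
               for k in range(1, N + 1, 2))

def dirichlet_sum(x, N):
    """Plain partial sum S_N of the square wave."""
    return weighted_square_wave_sum(x, N, lambda k: 1.0)

def fejer_sum(x, N):
    """Cesaro mean of S_0, ..., S_{N-1}: frequency k gets weight (1 - k/N)_+."""
    return weighted_square_wave_sum(x, N, lambda k: max(0.0, 1 - k / N))

N = 51
grid = [(j + 0.5) / 2000 for j in range(2000)]
max_dirichlet = max(abs(dirichlet_sum(x, N)) for x in grid)
max_fejer = max(abs(fejer_sum(x, N)) for x in grid)
print(max_dirichlet, max_fejer)
```

The partial sums overshoot the value $1$ near the jump (the Gibbs phenomenon), whereas the Fejér sums never exceed $1$ in absolute value, because the Fejér kernel is non-negative with total mass $1$. This positivity is what places Fejér summation within reach of the Hardy-Littlewood maximal function.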
Exercise 5 (Pointwise convergence of Fourier integrals) Let be such that the conclusion of Proposition 3(v) holds. Show that for any , one has for almost all , where is defined for Schwartz functions by the formula
and then extended to by density.
Exercise 6 Let . Suppose that is such that one has the restriction estimate
for all Schwartz functions , where denotes the surface measure on the sphere . Conclude that
for all Schwartz functions . (This observation is due to Bourgain. In particular, by Marcinkiewicz interpolation, implies for all .)
We are now ready to establish Kolmogorov’s theorem (Theorem 2(i)); our arguments are loosely based on the original construction of Kolmogorov (though he was not in possession at the time of the Stein maximal principle). In view of the equivalence between (ii) and (v) in Proposition 3, it suffices to show that the maximal operator
fails to be of weak-type on Schwartz functions. Recalling that the Hilbert transform
is also a Fourier multiplier operator
some routine calculations then show that
for any Schwartz function . By the triangle inequality, it then suffices to show that the maximal operator
fails to be of weak type on Schwartz functions.
To motivate the construction, note from a naive application of the triangle inequality that
If the function was absolutely integrable, then by Young’s inequality we would conclude that the maximal operator was strong type , and hence also weak type . Thus any counterexample must somehow exploit the logarithmic divergence of the integral of . However, there are two potential sources of cancellation that could ameliorate this divergence: the sign of the Hilbert kernel , and the phase . But because of the supremum in , we can select the frequency parameter as we please, as long as it depends only on and not on . The idea is then to choose (and the support of ) to remove both sources of cancellation as much as possible.
We turn to the details. Let be a large natural number, and then select widely separated frequency scales
In order to assist with removing cancellation in the phases later, we will require these scales to be integers. The precise choice of scales is not too important as long as they are widely separated and integer valued, but for the sake of concreteness one could for instance set . Let be a bump function of total mass supported on , and let be the Schwartz function
thus is an approximation (in a weak sense) to the sum of Dirac masses , with the frequency scale of the approximation to increasing rapidly in . We easily compute the norm of :
Now we estimate for in the interval for some natural number ; note the set of all such has measure . In this range we will test the maximal operator at the frequency cutoff :
As is supported in , we see (for large enough) that avoids the support of and we can replace the principal value integral with the ordinary integral. Substituting (9) and performing some linear changes of variable and using the support of , we conclude that
As is an integer, the phase is equal to . We also cancel out the phase as being independent of , thus
For , we exploit the oscillatory nature of the phase through an integration by parts, leading to the bound
(one could even gain a factor of here if desired, but we will not need it). Summing, we have
For , we instead exploit the near-constant nature of the phase by writing
and similarly
to conclude that
Summing and combining with (11), we conclude (from the rapidly increasing nature of the ) that
and thus (for large)
Comparing this with (10) we contradict the conclusion of Proposition 3(v), giving the claim.
Remark 7 In 1926, Kolmogorov refined his construction to obtain a function whose Fourier sums diverged everywhere (not just almost everywhere).
Exercise 8 (Rademacher-Menshov theorem)
- (i) Let be some square-integrable functions on a probability space , with a power of two. By performing a suitable Whitney type decomposition (similar to that used in Section 3 of Notes 1), establish the pointwise bound
where for each , ranges over dyadic intervals of the form with . If furthermore the are orthogonal to each other, establish the maximal inequality
- (ii) If is a trigonometric polynomial with at most non-zero coefficients for some , use part (i) to establish the bound
- (iii) If lies in the Sobolev space
for some , use (ii) to show that for almost every .
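The combinatorial heart of the decomposition in part (i) of this exercise is that any initial segment of the integers splits into at most logarithmically many dyadic blocks, one for each binary digit of its length. A minimal sketch of this splitting follows; the half-open, zero-based indexing convention and the name `dyadic_blocks` are our own, and the exercise may index its intervals differently.

```python
def dyadic_blocks(n):
    """Split {0, ..., n-1} into disjoint dyadic blocks [j * 2^k, (j+1) * 2^k), largest first."""
    blocks, start = [], 0
    for k in reversed(range(n.bit_length())):
        if n & (1 << k):                      # binary digit k of n is set
            blocks.append((start, start + (1 << k)))
            start += 1 << k
    return blocks

values = list(range(1, 101))                  # stand-ins for f_1, ..., f_100
n = 13
blocks = dyadic_blocks(n)
partial_sum = sum(values[:n])
blockwise_sum = sum(sum(values[a:b]) for a, b in blocks)
print(blocks, partial_sum, blockwise_sum)
```

Since each partial sum is a sum of at most $\log_2 N + 1$ dyadic block sums, with at most one block per scale, the Cauchy-Schwarz inequality across scales is what produces the logarithmic loss in the resulting maximal inequality.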
— 2. Carleson’s theorem —
We now begin the proof of Carleson’s theorem (Theorem 2(ii)), loosely following the arguments of Lacey and Thiele (we briefly comment on other approaches at the end of these notes). In view of Proposition 3, it suffices to establish the weak-type bound
for Schwartz functions . Because of the supremum, the expression depends sublinearly on rather than linearly; however there is a trick to reduce matters to considering linear estimates. By selecting, for each , to be a frequency which attains (or nearly attains) the supremal value of , it suffices to establish the linearised estimate
for all measurable functions , where is the operator
One can think of this operator as the (Kohn-Nirenberg) quantisation of the rough symbol . Unfortunately this symbol is far too rough for us to be able to use pseudodifferential operator tools from the previous set of notes. Nevertheless, the “time-frequency analysis” mindset of trying to efficiently decompose phase space into rectangles consistent with the uncertainty principle will remain very useful.
To avoid some very minor technicalities we can assume (by a limiting argument) that never takes values equal to a dyadic rational ; in particular, it belongs to at most one dyadic interval of each given length scale.
The next step is to dualise the weak norm to linearise the dependence on even further:
Exercise 9 Let , let be a -finite measure space, let be a measurable function, and let . Show that the following claims are equivalent (up to changes in the implied constants in the asymptotic notation):
- (i) One has .
- (ii) For every subset of of finite measure, the function is absolutely integrable on , and
In view of this exercise, we see that it suffices to obtain the bound
for all Schwartz , all sets of finite measure, and all measurable functions . Actually only the restriction of to is relevant here, so one can view as a function just on if desired. The operator can be viewed as the quantisation of the (very rough) symbol , that is to say the indicator function of the region lying underneath the graph of :
A notable feature of the estimate (12) is that it enjoys three different symmetries, each of which is “non-compact” in the sense that it is parameterised by a parameter taking values in a non-compact space such as or :
- (i) (Translation symmetry) For any spatial shift , both sides of (12) remain unchanged if we replace by , the set by the translate , and the function by .
- (ii) (Dilation symmetry) For any scaling factor , both sides of (12) become multiplied by the same scaling factor if we replace by , by the dilate , and the function by .
- (iii) (Modulation symmetry) For any frequency shift , both sides of (12) remain unchanged if we replace by , do not modify the set , and replace the function by .
Each of these symmetries corresponds to a different symmetry of phase space , namely spatial translation , dilation , and frequency translation respectively. As a general rule of thumb, if one wants to prove a delicate estimate such as (12) that is invariant with respect to one or more non-compact symmetries, then one should use tools that are similarly invariant (or approximately invariant) with respect to these symmetries. Thus for instance Littlewood-Paley theory or Calderón-Zygmund theory would not be suitable tools to use here, as they are only invariant with respect to translation and dilation symmetry but absolutely fail to have any modulation symmetry properties (these theories prescribe a privileged role to the frequency origin, or equivalently they isolate functions of mean zero as playing a particularly important role).
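The modulation symmetry is the least familiar of the three, so it may help to see a discrete analogue of it: on a finite cyclic group, conjugating the frequency cutoff $1_{D \leq N}$ by a modulation of frequency $\xi$ simply shifts the cutoff to $N + \xi$. The sketch below checks this for a trigonometric polynomial sampled at $M$ points; the discretisation, the test function and all names are illustrative assumptions of ours, and the frequencies are kept well inside $[-M/2, M/2)$ to avoid wrap-around.

```python
import cmath
import math

M = 32  # number of sample points on the discretised circle (illustrative)

def dft(samples):
    """Fourier coefficients of the samples, indexed by n in [-M/2, M/2)."""
    return {n: sum(samples[j] * cmath.exp(-2j * math.pi * n * j / M) for j in range(M)) / M
            for n in range(-M // 2, M // 2)}

def frequency_cutoff(samples, N):
    """Apply the multiplier 1_{D <= N}: keep only the frequencies n <= N."""
    coeffs = dft(samples)
    return [sum(c * cmath.exp(2j * math.pi * n * j / M) for n, c in coeffs.items() if n <= N)
            for j in range(M)]

def modulate(samples, xi):
    """Multiply by the character e^{2 pi i xi x}, shifting all frequencies by xi."""
    return [cmath.exp(2j * math.pi * xi * j / M) * s for j, s in enumerate(samples)]

# A trigonometric polynomial with frequencies in [-4, 4] and arbitrary coefficients.
f = [sum((1 / (1 + n * n) + 0.1j * n) * cmath.exp(2j * math.pi * n * j / M)
         for n in range(-4, 5)) for j in range(M)]

xi, N = 3, 1
lhs = frequency_cutoff(modulate(f, xi), N + xi)   # cut off the modulated function at N + xi
rhs = modulate(frequency_cutoff(f, N), xi)        # modulate the function cut off at N
discrepancy = max(abs(a - b) for a, b in zip(lhs, rhs))
print(discrepancy)
```

Because the supremum in the maximal operator runs over all cutoffs $N$, this shift is invisible to it, which is why the estimate has no preferred frequency origin.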
Besides the need to respect the symmetries of the problem, one of the main difficulties in establishing (12) is that the expression couples together the function with the function in a rather complicated way (via the frequency variable ). We would like to try to decouple this interaction by making and instead interact with simpler objects (such as “wave packets”), rather than being coupled directly to each other. To motivate the decomposition to use, we begin with a heuristic discussion. In analogy to the Whitney type decompositions used in Notes 1, one can split
for almost all choices of and (at least if have the same sign), where range over pairs of dyadic intervals that are “close” in the sense that and that and are not adjacent, but their parents are adjacent, and with to the left of . If one ignores the caveats and blindly substitutes in the decomposition (13), the expression in the left of (12) becomes
To decouple further, we will try to decompose into “rank one” operators. More precisely, we manipulate
It will be convenient to try to discretise this integral average. From the uncertainty principle, modifying by should only modify approximately by a phase, so the integral here is roughly constant at spatial scales . So we heuristically have
If we now define a tile to be a rectangle in phase space of the form
where are dyadic intervals and with unit area , we see that every in the above sum is associated to a tile . The interval is then similarly associated to a nearby tile , and we write to indicate the relationship between the two tiles (they share the same spatial interval , but lies just above ). We can then approximately write the left-hand side of (12) as
where each $\phi_P$ is an $L^2$-normalised “wave packet” that is roughly localised to $P$ in phase space. This approximate form of (12) has achieved the goal of decoupling the function from the data , as they both now interact with the tile pair rather than through each other. Note also that the set of tiles obeys an approximate version of the three symmetries that (12) does. Firstly, the set of tiles is invariant under dilations if is a power of two; secondly, once one fixes the scales of the tiles, the remaining set of tiles is invariant under spatial translations by integer multiples of the spatial scale , and under frequency translations by integer multiples of . (We will need the discrete and nested nature of the tiles for some subsequent combinatorial arguments, and it turns out to be worthwhile to accept a slightly degraded form of the three basic symmetries of the problem in return for such a discretisation.)
We now make the above heuristic decomposition rigorous. For any dyadic interval , let denote the left child interval, and the right child interval. We fix a bump function supported on normalised to have norm ; henceforth we permit all implied constants in the asymptotic notation to depend on . For each interval let denote the rescaled function
noting that this is a bump function supported on . We will establish the estimate
where ranges over all dyadic intervals. We assume (15) for now and see why it implies (12). The left-hand side of (15) is not quite dilation or frequency modulation invariant, but we can fix this by an averaging argument as follows. Applying the modulation invariance, we see for any that
since
we thus have
We temporarily truncate to a finite range of scales, and use the triangle inequality, to obtain
for any finite . For fixed , the expression
is periodic in with period , with average equal to
which we can rewrite as
which one can rewrite further (using the change of variables ) as
where
Hence if we average over all in (say) , we conclude that
and hence on sending to infinity
Using dilation symmetry, we also see that
for any . Averaging this for with Haar measure , we conclude that
But as is a bump function supported in , one has
The quantity is a non-zero constant, hence
which is (12).
It remains to prove (15). As in the heuristic discussion, we approximately decompose the convolution into a sum over tiles. We have
Motivated by this, we define as before a tile to be a rectangle with dyadic intervals with ; we also split each such tile into an upper half and a lower half . We refer to as the spatial scale of the tile, and the reciprocal as the frequency scale. For each tile define the wave packet
which is a Schwartz function with Fourier support in (in fact it is supported in ) that is normalised to have norm and is localised spatially near , so morally it has “phase space support” in the lower half of the tile. We will later establish the estimate
for all and sets of finite measure (cf. (14)), where ranges over the set of all tiles. For now, we show why this estimate implies (15) and hence (12). Just as (12) was obtained from (15) by averaging over dilation and frequency modulations, we shall recover (15) from (17) by averaging over spatial translations. As before, we first temporarily restrict the size range of and use the triangle inequality to obtain
Applying translation symmetry, we conclude that
for any . The left-hand side may be rewritten as
where we extend the definition of to translated tiles in the obvious fashion. The expression inside the absolute values is periodic in with period , and averages to
which by (31) simplifies to
and so on averaging in and then sending to infinity we recover (15).
It remains to establish (17). It is convenient to introduce the sets
so that the target estimate (17) simplifies slightly to
As advertised, we have now decoupled the influences of and the influences of (which determine the sets ), as these quantities now only directly interact with the wave packets , rather than with each other. Moreover, in some sense only interacts with the lower half of the tile (as this is where is concentrated), while and only interact with the upper half of the tile.
One advantage of this “model” formulation of the problem is that one can naturally build up to the full problem by trying to establish estimates of the form
where is some smaller set of tiles. For instance, if we can prove (19) for all finite collections of tiles, then by monotone convergence we recover the required estimate.
The key problem here is that tiles have three degrees of freedom: scale, spatial location, and frequency location, corresponding to the three symmetries of dilation, spatial translation, and frequency modulation of the original estimate (12). But one can warm up by looking at families of tiles that only exhibit two or fewer degrees of freedom, in a way that slowly builds up the various techniques we will need to apply to establish the general case:
The case of a single tile
We begin with the simplest case of a single tile (so that there are zero degrees of freedom):
On the one hand, as the wave packet is normalised in $L^2$, by Cauchy-Schwarz we have
From the construction of we see that we have the pointwise bounds
where we use to denote the following variant of the indicator function that has a non-trivial tail:
In particular from Hölder’s inequality we have the bounds
thanks to the trivial bound , so on taking geometric means we have
and the claim (20) follows.
The case of separated tiles of fixed scale
Now we let be a collection of tiles all of a fixed spatial scale (so that we have the two parameters of spatial and frequency location, but not the scale parameter). Among other things, this makes the tiles in essentially disjoint (i.e., disjoint ignoring sets of measure zero). This disjointness manifests itself in two useful ways. Firstly, we claim that we can improve the trivial bound
secondly, we claim that we can improve the Cauchy-Schwarz bound (21) to
If we assume (24), (25) for now, then by combining (24) with (23) we have
and then from (25) and Cauchy-Schwarz we obtain the required bound (19) in this case.
Now let us see why (24) is true. To motivate the argument, suppose that had no tail outside of , so that one could replace by in (22). Then would have
and as the tiles are all essentially disjoint the claim (24) would then follow from summing in , since each contributes to at most one of the sets . Now we have to deal with the contribution of the tails. We can bound
For each , there is at most one dyadic interval of the fixed length such that . Thus in the above sum is fixed, and only can vary; from (22) we then see that , giving (24).
Now we prove (25). The intuition here is that the essential disjointness of the tiles makes the approximately orthogonal, so that (25) should be a variant of Bessel’s inequality. We exploit this approximate orthogonality by a $TT^*$ method, which we perform here explicitly. By duality we have
for some coefficients with , so by Cauchy-Schwarz it suffices to show that
From the Fourier support of we see that the inner product vanishes unless the intervals overlap, which by the equal sizes of forces . In this case we can use (22) to bound the inner product by
and then a routine application of Schur’s test gives (26). This establishes (25), giving (19) in the case of tiles of equal dimensions.
The case of a regular -tree
Now we attack some cases where the tiles can vary in scale. In phase space, a key geometric difficulty now arises from the fact that tiles may start partially overlapping each other, in contrast to the previous case in which the essential disjointness of the tile set was crucial in establishing the key estimates (24), (25). However, because we took care to restrict the intervals of the tiles to be dyadic, there are only a limited number of ways in which two tiles can overlap. Given two rectangles and , we define the relation if and ; this is clearly a partial order on rectangles. The key observation is as follows: if two tiles overlap, then either or . Similarly if are replaced by their upper tiles or by their lower tiles . Note that if are tiles with , then one of or holds (and the only way both inequalities can hold simultaneously is if ).
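The key observation about overlapping tiles can be verified by brute force on a finite family. In the sketch below a tile is a pair of dyadic intervals with reciprocal lengths, and we assume the ordering $I \times \omega \leq I' \times \omega'$ means $I \subset I'$ and $\omega' \subset \omega$; this convention, the finite window of tiles and all function names are assumptions of ours. Exact rational arithmetic avoids any rounding issues.

```python
from fractions import Fraction

def dyadic(k, j):
    """The dyadic interval [j * 2^k, (j+1) * 2^k) as a pair of exact endpoints."""
    length = Fraction(2) ** k
    return (j * length, (j + 1) * length)

def contains(outer, inner):
    return outer[0] <= inner[0] and inner[1] <= outer[1]

def overlaps(a, b):
    """True when the two intervals intersect in a set of positive length."""
    return max(a[0], b[0]) < min(a[1], b[1])

def tile(k, j, m):
    """Tile I x omega with |I| = 2^k and |omega| = 2^(-k), hence of area one."""
    return (dyadic(k, j), dyadic(-k, m))

def leq(P, Q):
    """Assumed ordering: spatial interval shrinks, frequency interval grows."""
    return contains(Q[0], P[0]) and contains(P[1], Q[1])

def tiles_overlap(P, Q):
    return overlaps(P[0], Q[0]) and overlaps(P[1], Q[1])

# All tiles of scales 2^-2, ..., 2^2 inside the phase-space window [0, 4) x [0, 4).
tiles = [tile(k, j, m)
         for k in range(-2, 3)
         for j in range(1 << (2 - k))
         for m in range(1 << (2 + k))]

violations = [(P, Q) for P in tiles for Q in tiles
              if tiles_overlap(P, Q) and not (leq(P, Q) or leq(Q, P))]
print(len(tiles), len(violations))
```

No overlapping pair fails to be comparable, while plenty of disjoint pairs are incomparable, so the relation is a genuine partial order rather than a total one.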
As was first observed by Fefferman, a key configuration of tiles that needs to be understood for these sorts of problems is that of a tree.
Definition 10 Let be a tile. A tree with top is a collection of tiles with the property that for all . (For minor technical reasons it is convenient to not require the top to actually lie in the tree , though this is often the case.) We write for the spatial support of the tree, and for the frequency support of the tree top. If we in fact have for all , we say that is a -tree; similarly if for all , we say that is a -tree. (Thus every tree can be partitioned into a -tree and a -tree with the same top as the original tree.)
The tiles in a tree can vary in scale and in spatial location, but once these two parameters are given, the frequency location is fixed, so a tree can again be viewed as a “two-parameter” subfamily of the three-parameter family of tiles.
We now prove (19) in the case when is a -tree , thus for all . Here, the factors will all “collide” with each other and there will be no orthogonality to exploit here; on the other hand, there will be a lot of “disjointness” in the that can be exploited instead.
To illustrate the key ideas (and to help motivate the arguments for the general case) we will also make the following “regularity” hypotheses: there exist two quantities (which we will refer to as the energy and mass of the tree respectively) for which we have the upper bounds
for all ; informally, these estimates assert that is size “on average” on the tiles in the tree, and similarly that has density on all tiles in the tree. (These are slightly oversimplified versions of the energy and mass concept; we will refine these notions later.) For technical reasons we also need to generalise (28) to
for any tile with for some . (Informally, (29) asserts that a sort of “Hardy-Littlewood maximal function” of is bounded by on the tree.)
We also assume that we have the reverse bounds for the tree top:
It will be through a combination of both these lower and upper bounds that we can obtain a bound (19) that does not involve either or .
We will use (27), (28), (29) to establish the tree estimate
Note from (30) and Cauchy-Schwarz that
and from (31) and Cauchy-Schwarz one similarly has
and so (32) recovers the desired estimate (19).
It remains to establish the tree estimate (32). It will be convenient to use the tree to partition the real line into dyadic intervals that are naturally “adapted to” the geometry of the tree (or more precisely to the spatial intervals of the tree) in a certain way (in a manner reminiscent of a Whitney decomposition).
Exercise 11 (Whitney-type decomposition associated to a tree) Let be a non-empty tree. Show that there exists a family of dyadic intervals with the following properties:
- (i) The intervals in form a partition of (up to sets of measure zero).
- (ii) For each and any with , we have .
- (iii) For each , there exists with and .
(Hint: one can choose to be the collection of all dyadic intervals whose dilate does not contain any , and which is maximal with respect to set inclusion.)
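The construction suggested in the hint can be sketched computationally. Below, the dilate of a dyadic interval $J$ is assumed to be the interval $3J$ with the same centre and three times the length (the exact dilation factor is an assumption on our part), and for a given point we search for the largest dyadic interval containing it whose dilate contains none of the spatial intervals of the tree. The sample intervals, the starting scale `k_min` and the function names are illustrative.

```python
from fractions import Fraction

def dyadic_containing(x, k):
    """The dyadic interval of length 2^k containing x, as exact endpoints (left, right)."""
    length = Fraction(2) ** k
    left = (Fraction(x) // length) * length
    return (left, left + length)

def triple(J):
    """The interval with the same centre as J and three times its length."""
    length = J[1] - J[0]
    return (J[0] - length, J[1] + length)

def whitney_interval(x, tree_intervals, k_min=-20):
    """Largest dyadic J containing x such that 3J contains no interval of the tree."""
    def admissible(J):
        T = triple(J)
        return not any(T[0] <= I[0] and I[1] <= T[1] for I in tree_intervals)
    k = k_min
    assert admissible(dyadic_containing(x, k)), "k_min is not small enough"
    # Admissibility is monotone in the scale, so climb until the parent fails.
    while admissible(dyadic_containing(x, k + 1)):
        k += 1
    return dyadic_containing(x, k)

# Spatial intervals of a small illustrative tree.
tree = [(Fraction(0), Fraction(1)), (Fraction(0), Fraction(1, 2)), (Fraction(1, 4), Fraction(1, 2))]
points = [Fraction(j, 37) for j in range(-74, 111)]
intervals = sorted({whitney_interval(x, tree) for x in points})
for J in intervals:
    print(float(J[0]), float(J[1]))
```

Intervals near the small tiles of the tree come out short and those far from the tree come out long, which is the Whitney-type behaviour the exercise asks for; by maximality any two of the selected intervals are either equal or disjoint.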
We can of course assume that the tree is non-empty, since (32) is trivial for empty sets of tiles. We apply the partition from Exercise 11. By the triangle inequality, we can bound the left hand side of (32) by
which by (27), (22) may be bounded by
We first dispose of the narrow tiles in which . By Exercise 11(ii) this forces . From (28) we have
(say). For each fixed spatial scale , the intervals in the tree are all essentially disjoint, so a routine calculation then shows
(say), so that
which from Exercise 11(ii) implies that the contribution of the case to (32) is acceptable.
Now we consider the wide tiles in which . From Exercise 11(ii) this case is only possible if and . Thus the are now restricted to an interval of length , and it will suffice to establish the local estimate
for each . Note that for each fixed spatial scale , there is at most one choice of frequency interval with and , thus for fixed the set is independent of . We may then sum in for each such scale to conclude
Now we make the crucial observation that in a -tree , the intervals are all essentially disjoint, hence the are disjoint as well. As these sets are also contained in , we conclude that
From Exercise 11(iii) and (29) (choosing a tile with spatial scale and within of , and with for the tile provided by Exercise 11(iii)) we have
giving the claim.
The case of a regular -tree
We now complement the previous case by establishing (19) for (certain types of) -trees . The situation is now reversed: there is a lot of “collision” in the , but on the other hand there is now some “orthogonality” in the that can be exploited.
As before we will assume some regularity on the -tree , namely that there exist for which one has the upper bounds
for all (note this is slightly stronger than (27)), as well as the bound (29) for any tile with for some . We complement this with the matching lower bounds
and (31).
As before we will focus on establishing the tree estimate (32). From (31) and Cauchy-Schwarz as before we have
As we now have a -tree, the tiles become disjoint (up to null sets), and we can obtain an almost orthogonality estimate:
Exercise 12 (Almost orthogonality) For any -tree , show that
for all complex numbers , and use this to deduce the Bessel-type inequality
From this exercise and (34) we see that
and so the desired bound (19) will follow from the tree estimate (32).
In this case it will be convenient to linearise the sum to remove the absolute value signs; more precisely, to show (32) it suffices to show that
for any complex numbers of magnitude . Again we may assume that the tree is non-empty, and use the partition from Exercise 11, to split the left-hand side as
The contribution of the narrow tiles can be disposed of as before without any additional difficulty, so we focus on estimating the contribution
of the wide tiles. As before, in order for this sum to be non-empty has to be contained in an neighbourhood of .
The main difficulty here is the dependence of on . We rewrite
so that the above expression can be written as
Now for a key geometric observation: the intervals are nested (and decrease when increases), so the condition is equivalent to a condition of the form for some scale depending on . Thus the above sum can be written as
The point of this observation is that the integrand can now be expressed as a sort of “Littlewood-Paley projection” of the function
to the region of frequency space corresponding to those intervals with :
Exercise 13 Establish the pointwise estimate
for all where ranges over all intervals (not necessarily dyadic) containing .
From (29) and Exercise 11(iii) as before we have
and so we can bound the sum over the wide tiles, in the form rewritten above, by
which one can bound in terms of the Hardy-Littlewood maximal function of , followed by Cauchy-Schwarz and the Hardy-Littlewood inequality, and finally Exercise 12, as
On the other hand, from (33) we have
for every . By grouping the tiles in according to their maximal elements (which necessarily have essentially disjoint spatial intervals) and applying the above inequality to each such group and summing, we conclude that
and the tree estimate (32) follows.
The general case
We are now ready to handle the general case of an arbitrary finite collection of tiles. Motivated by the previous discussion, we define two quantities:
Definition 14 (Energy and mass) For any non-empty finite collection of tiles, we define the energy to be the quantity
where ranges over all -trees in , and the mass to be the quantity
where is the set
(thus for instance ). By convention, we declare the empty set of tiles to have energy and mass equal to zero.
Note here that the definition of mass has been modified slightly from previous arguments, in that we now use instead of . However, this turns out to be an acceptable modification, in the sense that we still continue to have the analogue of (32):
Since has an norm of , we also have the trivial bound
for any finite collection of tiles .
The strategy is now to try to partition an arbitrary family of tiles into collections of disjoint trees (or “forests”, if you will) whose energy , mass , and spatial scale are all under control, apply Exercise 15 to each tree, and sum. To do this we rely on two key selection results, which are vaguely reminiscent of the Calderón-Zygmund decomposition:
Proposition 16 (Energy selection) Let be a collection of tiles with
for some . Then one can partition into a collection of disjoint trees with
together with a remainder set with
Proposition 17 (Mass selection) Let be a collection of tiles with
for some . Then one can partition into a collection of disjoint trees with
together with a remainder set with
Let us assume these two propositions for now and see how these (together with Exercise 15) establish the required estimate (19) for an arbitrary collection of tiles. We may assume without loss of generality that and are non-zero. Rearranging the above two propositions slightly, we see that if is a finite collection of tiles such that
for some integer then after applying Proposition 16 followed by Proposition 17, we can partition into a disjoint collection of trees with
together with a remainder with
Note that any finite collection of tiles will obey (38) for some sufficiently large and negative . Starting with this and then iterating indefinitely, and discarding any empty families, we can therefore partition any finite collection of tiles as
where are collections of trees (empty for all but finitely many ) such that
and (39) holds, and is a residual collection of tiles with
We can then bound the left-hand side of (19) by
From Exercise 15 applied to individual tiles and (41) we see that the second term in this expression vanishes. For the first term, we use Exercise 15, (40), (36) to bound this sum by
which by (39) is bounded by
which sums to as required.
It remains to establish the energy and mass selection lemmas. We begin with the mass selection claim, Proposition 17. Let denote the set of all tiles with for some and such that
Let denote the set of tiles in that are maximal with respect to the tile partial order. (Note that the left-hand side of (42) is bounded by , so there is an upper bound to the spatial scales of the tiles involved here.) Then every tile in is either less than or equal to a tile in , or is such that
for all . Thus if we let be the collection of tiles of the second form, and let be the collection of trees with tree top associated to each , we obtain the required partition
with
and it remains to establish the bound
This will be a (rather heavily disguised) variant of the Hardy-Littlewood maximal inequality. By construction, the tree tops are essentially disjoint, and one has
for all such tree tops. To motivate the argument, suppose for the sake of discussion that we had the stronger estimate
By the essential disjointness of the , the sets
are also essentially disjoint subsets of , hence
and the claim (43) would then follow. Now we do not quite have (44); but from the pigeonhole principle we see that for each there is a natural number such that
(say), where denotes the interval with the same center as but times the length (this is not quite a dyadic interval). We now restrict attention to those associated to a fixed choice of . Let denote the corresponding dilated tiles, then we have
Unfortunately, the are no longer disjoint. However, by the greedy algorithm (repeatedly choosing tiles that are maximal in the tile ordering), we can find a collection such that
- (i) All the dilated tree tops are essentially disjoint.
- (ii) For every with , there is such that intersects and .
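The greedy selection can be illustrated by a simplified sketch. The Python fragment below is not the selection used in the proof: as an assumption made purely for illustration, the dilated tiles are modelled as axis-parallel rectangles `(x0, x1, y0, y1)`, and "maximal in the tile ordering" is replaced by "largest spatial length first". With these simplifications the output has the analogues of the two properties above: the selected rectangles are pairwise essentially disjoint, and every discarded rectangle meets a selected rectangle of at least the same spatial length.

```python
def overlaps(a, b):
    """True if rectangles a, b = (x0, x1, y0, y1) overlap in positive area."""
    return a[0] < b[1] and b[0] < a[1] and a[2] < b[3] and b[2] < a[3]


def greedy_select(rects):
    """Greedily keep rectangles, longest spatial side first.

    A rectangle is kept only if it is essentially disjoint from every
    rectangle kept so far; otherwise it is discarded.
    """
    chosen = []
    for r in sorted(rects, key=lambda r: r[1] - r[0], reverse=True):
        if not any(overlaps(r, c) for c in chosen):
            chosen.append(r)
    return chosen


# Hypothetical sample data: three rectangles in one frequency band that
# collide with each other, one in a different band, and one far away.
rects = [(0, 8, 0, 1), (2, 4, 0, 1), (6, 10, 0, 1), (2, 4, 1, 2), (20, 21, 0, 1)]
chosen = greedy_select(rects)
```

In this sample, the two rectangles discarded by the loop both meet the longest rectangle `(0, 8, 0, 1)`, so their contribution can be charged to it; this charging step is the role played by property (ii) in the argument.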
From property (i) and (45) we have
On the other hand, from property (ii) we see that the sum of all the for all with associated to a single is . Putting the two statements together we see that
and on summing in we obtain the required claim (43).
Finally, we prove the energy selection claim, Proposition 16. The basic idea is to extract all the high-energy trees from in such a way that the -tree components of those trees are sufficiently “disjoint” from each other that a useful Bessel inequality, generalising Exercise 12, may be deployed. Implementing this strategy correctly turns out however to be slightly delicate. We perform the following iterative algorithm to generate a partition
as well as a companion collection of -trees as follows.
- Step 1. Initialise and .
- Step 2. If then STOP. Otherwise, go on to Step 3.
- Step 3. Since we now have , contains a -tree for which
Among all such , choose one for which the midpoint of the frequency is minimal. (The reason for this rather strange choice will be made clearer shortly.)
- Step 4. Add to , add the larger tree (with the same top as ) to , then remove from . We also remove the adjacent trees and from and also place them into . Now return to Step 2.
This procedure terminates in finite time to give a partition (46) with , and with the trees coming in triplets all associated to a -tree in with the same spatial scale as , with all the -trees disjoint and obeying the estimates
(both the upper and lower bounds will be important for this argument). It will then suffice to show that
by (48), it then suffices to show the Bessel type inequality
Now we make a crucial observation: not only are the trees in disjoint (in the sense that no tile belongs to two of these trees), but the lower tiles are also essentially disjoint. Indeed we claim an even stronger disjointness property: if , are such that , then is not only disjoint from the larger dyadic interval , but is in fact disjoint from the even larger interval . To see this, suppose for contradiction that and . There are three possibilities to rule out:
- is equal to . This can be ruled out because any two lower frequency intervals associated to a -tree are either equal or disjoint.
- was selected after was. To rule this out, observe that contains the parent of , and hence , , or . Thus, when was selected, should have been placed with one of the three trees associated to and would therefore not have been available for inclusion into , a contradiction.
- was selected before was. If this case held, then the midpoint of would have to be greater than or equal to that of , otherwise would not have a minimal midpoint at the time of its selection. But is contained in , which is contained in , which lies below , which contains , which contains the midpoint of ; thus the midpoint of lies strictly below that of , a contradiction.
If the were perfectly orthogonal to each other, this disjointness would be more than enough to establish (49). Unfortunately we only have imperfect orthogonality, and we have to work a little harder. As usual, we turn to a $TT^*$ type argument. We can write the left-hand side of (49) as
so by Cauchy-Schwarz it suffices to show that
By the triangle inequality, the left-hand side may be bounded by
As has Fourier support in , we see that vanishes unless and overlap. By symmetry it suffices to consider the cases and .
First let us consider the contribution of . Using Young’s inequality and symmetry, we may bound this contribution by
A direct calculation using (22) reveals that
so the contribution of this case is at most
as desired.
Now we deal with the case when , which by the preceding discussion implies that and lies outside of . Here we use (37) to bound
and
and then we can bound this contribution by
Direct calculation using (22) reveals that
(say), and also
so we obtain a bound of
which is acceptable by (48). This finally finishes the proof of Proposition 16, which in turn completes the proof of Carleson’s theorem.
Remark 18 The Lacey-Thiele proof of Carleson’s theorem given above relies on a decomposition of a tileset in a way that controls both energy and mass. The original proof of Carleson dispenses with mass (or with the function ), and focuses on controlling maximal operators that (in our notation) are basically of the form
To control such functions, one iterates a decomposition similar to Proposition 16 to partition into trees with good energy control, and establishes pointwise control of the contribution of each tree outside of an exceptional set. See Section 4 of this article of Demeter for an exposition in the simplified setting of Walsh-Fourier analysis. The proof of Fefferman takes the opposite tack, dispensing with energy and focusing on bounding the operator norm of the linearised operator
Roughly speaking, the strategy is to iterate a version of Proposition 16 to partition into “forests” of disjoint trees, though in Fefferman’s argument some additional work is invested into obtaining even better disjointness properties on these forests than is given here. See Section 5 of this article of Demeter for an exposition in the simplified setting of Walsh-Fourier analysis.
A modification of the above arguments used to establish the weak estimate can also establish restricted weak-type estimates for any :
Exercise 19 For any sets of finite measure, and any measurable function , show that
for any . (Hint: repeat the previous analysis with , but supplement it with an additional energy bound coming from a suitably localised version of Exercise 12.)
The bound (51) is also true for , yielding Hunt’s theorem, but this requires some additional arguments of Calderón-Zygmund type, involving the removal of an exceptional set defined using the Hardy-Littlewood maximal function:
Exercise 20 (Hunt’s theorem) Let be of finite non-zero measure, and let be a measurable function. Let be the exceptional set
for a large absolute constant ; note from the Hardy-Littlewood inequality that if is large enough.
- (i) If is a finite collection of tiles with for all , show that
(Hint: By using (22) and the disjointness of the when is fixed, first establish the estimate
whenever is a natural number and is an interval with and \|f\|_{L^2(R)} |E|^{1/2}.)
- (ii) If is a finite collection of tiles with for all , show that . (For a given tree , one can introduce the dyadic intervals as in Exercise 11, then perform a Calderón-Zygmund type decomposition of , splitting it into a “good” function bounded pointwise by , plus “bad functions” that are supported on the intervals and have mean zero. See this paper of Grafakos, Terwilleger, and myself for details.)
- (iii) For any finite collection of tiles for all
- (iv) Show that (51) holds for all , and conclude Theorem 2(iii).
Remark 21 The methods of time-frequency analysis given here can handle several other operators that, like the Carleson operator, exhibit scaling, translation, and frequency modulation symmetries. One model example is the bilinear Hilbert transform
for . The methods in this set of notes were used by Lacey and Thiele to establish the estimates
for with (these estimates have since been strengthened and extended in a number of ways). We only give the briefest of sketches here. Much as Carleson’s theorem can be reduced to a bound (19), the above estimates can be reduced to the estimation of a model sum
where is a certain collection of triples of tiles with common spatial interval and frequency intervals varying along a certain one-parameter family for each fixed choice of spatial interval. One then uses a variant of Proposition 16 to partition into “-trees”, “-trees”, and “-trees”, the contribution of each of which can be controlled by the energies of on such trees, times the length of the spatial support of the tree, in analogy with Exercise 15. See for instance the text of Muscalu and Schlag for more discussion and further results.
Remark 22 The concepts of mass and energy can be abstracted into a framework of spaces associated to outer measures (as opposed to the classical setup of spaces associated to countably additive measures), in which the mass and energy selection propositions can be viewed as consequences of an abstract Carleson embedding theorem, and the calculations establishing estimates such as (19) from such propositions and a tree estimate can be viewed as consequences of an “outer Hölder inequality”. See this paper of Do and Thiele for details.