A generalized Cauchy-Schwarz inequality via the Gibbs variational formula

What's new 2023-12-11

Let {S} be a non-empty finite set. If {X} is a random variable taking values in {S}, the Shannon entropy {H[X]} of {X} is defined as

\displaystyle H[X] = -\sum_{s \in S} {\bf P}[X = s] \log {\bf P}[X = s].
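As a quick illustration, here is a minimal Python sketch of this definition (using numpy and the natural logarithm, to match the conventions here; the probability vector is just an illustrative stand-in for the distribution of {X}):

import numpy as np

def shannon_entropy(p):
    # H[X] = -sum_s P[X = s] log P[X = s], with the convention 0 log 0 = 0
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# the uniform distribution on a 4-element set has entropy log 4
print(shannon_entropy([0.25, 0.25, 0.25, 0.25]), np.log(4))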

There is a nice variational formula that lets one compute logs of sums of exponentials in terms of this entropy:

Lemma 1 (Gibbs variational formula) Let {f: S \rightarrow {\bf R}} be a function. Then

\displaystyle  \log \sum_{s \in S} \exp(f(s)) = \sup_X {\bf E} f(X) + {\bf H}[X] \ \ \ \ \ (1)

where {X} ranges over all random variables taking values in {S}.

Proof: Note that shifting {f} by a constant affects both sides of (1) the same way, so we may normalize {\sum_{s \in S} \exp(f(s)) = 1}. Then {\exp(f(s))} is now the probability distribution of some random variable {Y}, and the identity (1) can be rewritten as

\displaystyle  0 = \sup_X \sum_{s \in S} {\bf P}[X = s] \log {\bf P}[Y = s] -\sum_{s \in S} {\bf P}[X = s] \log {\bf P}[X = s].

But this is precisely the Gibbs inequality, with equality when {X} has the same distribution as {Y}. (The expression inside the supremum can also be written as {-D_{KL}(X||Y)}, where {D_{KL}} denotes the Kullback-Leibler divergence. One can also interpret this inequality as a special case of the Fenchel–Young inequality relating the conjugate convex functions {x \mapsto e^x} and {y \mapsto y \log y - y}.) \Box
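As a quick sanity check of Lemma 1, the following Python sketch (the six-element set {S} and the values of {f} are arbitrary illustrative choices) verifies numerically that the Gibbs distribution {{\bf P}[X = s] \propto \exp(f(s))} attains the supremum in (1), while randomly chosen distributions never exceed it:

import numpy as np

rng = np.random.default_rng(0)

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

f = rng.normal(size=6)                  # an arbitrary function f: S -> R on a 6-element set S
lhs = np.log(np.sum(np.exp(f)))         # log sum_s exp(f(s))

gibbs = np.exp(f) / np.sum(np.exp(f))   # the Gibbs distribution P[X = s] proportional to exp(f(s))
print(lhs, f @ gibbs + entropy(gibbs))  # these two numbers agree: the supremum is attained

for _ in range(5):                      # random distributions stay below the supremum
    p = rng.dirichlet(np.ones(6))
    assert f @ p + entropy(p) <= lhs + 1e-12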

In this note I would like to use this variational formula (which is also known as the Donsker-Varadhan variational formula) to give another proof of the following inequality of Carbery.

Theorem 2 (Generalized Cauchy-Schwarz inequality) Let {n \geq 0}, let {S, T_1,\dots,T_n} be finite non-empty sets, and let {\pi_i: S \rightarrow T_i} be functions for each {i=1,\dots,n}. Let {K: S \rightarrow {\bf R}^+} and {f_i: T_i \rightarrow {\bf R}^+} be positive functions for each {i=1,\dots,n}. Then

\displaystyle  \sum_{s \in S} K(s) \prod_{i=1}^n f_i(\pi_i(s)) \leq Q \prod_{i=1}^n (\sum_{t_i \in T_i} f_i(t_i)^{n+1})^{1/(n+1)}

where {Q} is the quantity

\displaystyle  Q := (\sum_{(s_0,\dots,s_n) \in \Omega_n} K(s_0) \dots K(s_n))^{1/(n+1)}

where {\Omega_n} is the set of all tuples {(s_0,\dots,s_n) \in S^{n+1}} such that {\pi_i(s_{i-1}) = \pi_i(s_i)} for {i=1,\dots,n}.

Thus for instance, the claim is trivially an equality when {n=0}. When {n=1}, the inequality reads

\displaystyle  \sum_{s \in S} K(s) f_1(\pi_1(s)) \leq (\sum_{s_0,s_1 \in S: \pi_1(s_0)=\pi_1(s_1)} K(s_0) K(s_1))^{1/2}

\displaystyle  ( \sum_{t_1 \in T_1} f_1(t_1)^2)^{1/2},

which is easily proven by Cauchy-Schwarz, while for {n=2} the inequality reads

\displaystyle  \sum_{s \in S} K(s) f_1(\pi_1(s)) f_2(\pi_2(s))

\displaystyle  \leq (\sum_{s_0,s_1, s_2 \in S: \pi_1(s_0)=\pi_1(s_1); \pi_2(s_1)=\pi_2(s_2)} K(s_0) K(s_1) K(s_2))^{1/3}

\displaystyle (\sum_{t_1 \in T_1} f_1(t_1)^3)^{1/3} (\sum_{t_2 \in T_2} f_2(t_2)^3)^{1/3}

which can also be proven by elementary means. However, even for {n=3}, the existing proofs require the “tensor power trick” in order to reduce to the case when the {f_i} are step functions (in which case the inequality can be proven elementarily, as discussed in the above paper of Carbery).
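Although the proofs become nontrivial for larger {n}, the inequality itself is easy to test by brute force on small examples. Here is a small Python sketch of such a test for the {n=2} case (the set sizes and the random choices of {K}, {f_1}, {f_2}, {\pi_1}, {\pi_2} below are purely illustrative):

import itertools
import numpy as np

rng = np.random.default_rng(1)
nS, nT1, nT2 = 5, 3, 3                       # small sets S, T_1, T_2

for _ in range(100):                         # random instances of the n = 2 inequality
    K  = rng.random(nS) + 0.1                # positive K: S -> R^+
    f1 = rng.random(nT1) + 0.1               # positive f_1: T_1 -> R^+
    f2 = rng.random(nT2) + 0.1               # positive f_2: T_2 -> R^+
    pi1 = rng.integers(0, nT1, size=nS)      # pi_1: S -> T_1
    pi2 = rng.integers(0, nT2, size=nS)      # pi_2: S -> T_2

    lhs = np.sum(K * f1[pi1] * f2[pi2])

    # Q^3 = sum over tuples (s_0, s_1, s_2) with pi_1(s_0) = pi_1(s_1) and pi_2(s_1) = pi_2(s_2)
    Q3 = sum(K[s0] * K[s1] * K[s2]
             for s0, s1, s2 in itertools.product(range(nS), repeat=3)
             if pi1[s0] == pi1[s1] and pi2[s1] == pi2[s2])

    rhs = Q3 ** (1/3) * np.sum(f1**3) ** (1/3) * np.sum(f2**3) ** (1/3)
    assert lhs <= rhs + 1e-9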

We now prove this inequality. We write {K(s) = \exp(k(s))} and {f_i(t_i) = \exp(g_i(t_i))} for some functions {k: S \rightarrow {\bf R}} and {g_i: T_i \rightarrow {\bf R}}. If we take logarithms in the inequality to be proven and apply Lemma 1, the inequality becomes

\displaystyle  \sup_X {\bf E} k(X) + \sum_{i=1}^n g_i(\pi_i(X)) + {\bf H}[X]

\displaystyle  \leq \frac{1}{n+1} \sup_{(X_0,\dots,X_n)} {\bf E} k(X_0)+\dots+k(X_n) + {\bf H}[X_0,\dots,X_n]

\displaystyle  + \frac{1}{n+1} \sum_{i=1}^n \sup_{Y_i} (n+1) {\bf E} g_i(Y_i) + {\bf H}[Y_i]

where {X} ranges over random variables taking values in {S}, {(X_0,\dots,X_n)} ranges over tuples of random variables taking values in {\Omega_n}, and each {Y_i} ranges over random variables taking values in {T_i}. Comparing the suprema (for a given {X}, one takes {Y_i := \pi_i(X)}), the claim now reduces to

Lemma 3 (Conditional expectation computation) Let {X} be an {S}-valued random variable. Then there exists an {\Omega_n}-valued random variable {(X_0,\dots,X_n)} such that each {X_i} has the same distribution as {X}, and

\displaystyle  {\bf H}[X_0,\dots,X_n] = (n+1) {\bf H}[X]

\displaystyle - {\bf H}[\pi_1(X)] - \dots - {\bf H}[\pi_n(X)].

Proof: We induct on {n}. When {n=0} we just take {X_0 = X}. Now suppose that {n \geq 1}, and the claim has already been proven for {n-1}, thus one has already obtained an {\Omega_{n-1}}-valued tuple {(X_0,\dots,X_{n-1})} with each of {X_0,\dots,X_{n-1}} having the same distribution as {X}, and

\displaystyle  {\bf H}[X_0,\dots,X_{n-1}] = n {\bf H}[X] - {\bf H}[\pi_1(X)] - \dots - {\bf H}[\pi_{n-1}(X)].

By hypothesis, {\pi_n(X_{n-1})} has the same distribution as {\pi_n(X)}. For each value {t_n} attained by {\pi_n(X)}, we can take conditionally independent copies of {(X_0,\dots,X_{n-1})} and {X} conditioned on the events {\pi_n(X_{n-1}) = t_n} and {\pi_n(X) = t_n} respectively, and then concatenate them to form a tuple {(X_0,\dots,X_n)} in {\Omega_n}, with {X_n} a further copy of {X} that is conditionally independent of {(X_0,\dots,X_{n-1})} relative to {\pi_n(X_{n-1}) = \pi_n(X_n)}. One can then use the entropy chain rule to compute

\displaystyle  {\bf H}[X_0,\dots,X_n] = {\bf H}[\pi_n(X_n)] + {\bf H}[X_0,\dots,X_n| \pi_n(X_n)]

\displaystyle  = {\bf H}[\pi_n(X_n)] + {\bf H}[X_0,\dots,X_{n-1}| \pi_n(X_n)] + {\bf H}[X_n| \pi_n(X_n)]

\displaystyle  = {\bf H}[\pi_n(X)] + {\bf H}[X_0,\dots,X_{n-1}| \pi_n(X_{n-1})] + {\bf H}[X_n| \pi_n(X_n)]

\displaystyle  = {\bf H}[\pi_n(X)] + ({\bf H}[X_0,\dots,X_{n-1}] - {\bf H}[\pi_n(X_{n-1})])

\displaystyle + ({\bf H}[X_n] - {\bf H}[\pi_n(X_n)])

\displaystyle  ={\bf H}[X_0,\dots,X_{n-1}] + {\bf H}[X_n] - {\bf H}[\pi_n(X_n)]

and the claim now follows from the induction hypothesis. \Box
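The inductive step above can also be checked numerically. The following Python sketch carries out the {n=1} construction (with a random distribution for {X} and a random map {\pi_1}, both purely illustrative): it builds the joint law of two conditionally independent copies of {X} given the common value of {\pi_1}, and confirms that {{\bf H}[X_0,X_1] = 2 {\bf H}[X] - {\bf H}[\pi_1(X)]}:

import numpy as np

rng = np.random.default_rng(2)
nS, nT = 6, 3                                # small sets S and T_1

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

pX  = rng.dirichlet(np.ones(nS))             # distribution of X on S (all entries positive)
pi1 = rng.integers(0, nT, size=nS)           # pi_1: S -> T_1

pT = np.array([pX[pi1 == t].sum() for t in range(nT)])   # distribution of pi_1(X)

# joint law of (X_0, X_1): conditionally independent copies of X given pi_1(X_0) = pi_1(X_1)
joint = np.zeros((nS, nS))
for s0 in range(nS):
    for s1 in range(nS):
        if pi1[s0] == pi1[s1]:
            joint[s0, s1] = pX[s0] * pX[s1] / pT[pi1[s0]]

print(entropy(joint.ravel()))                # H[X_0, X_1]
print(2 * entropy(pX) - entropy(pT))         # 2 H[X] - H[pi_1(X)]; the two printed values agree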

With a little more effort, one can replace {S} by a more general measure space (and use differential entropy in place of Shannon entropy), to recover Carbery’s inequality in full generality; we leave the details to the interested reader.