A generalized Cauchy-Schwarz inequality via the Gibbs variational formula
What's new 2023-12-11
Let $\Omega$ be a non-empty finite set. If $X$ is a random variable taking values in $\Omega$, the Shannon entropy $H[X]$ of $X$ is defined as

$$H[X] := -\sum_{\omega \in \Omega} \mathbb{P}[X = \omega] \log \mathbb{P}[X = \omega].$$
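(For concreteness, here is a minimal Python sketch of this definition; the probability vectors are arbitrary illustrative choices. The uniform distribution on $N$ elements attains the maximum entropy $\log N$.)

```python
import numpy as np

def shannon_entropy(p):
    """H[X] = -sum_w P[X=w] * log P[X=w], with the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -(p * np.log(p)).sum()

print(shannon_entropy([0.5, 0.25, 0.25]))      # 1.5 * log(2), approx. 1.0397
print(shannon_entropy([0.25] * 4), np.log(4))  # uniform on 4 points attains log(4)
```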
There is a nice variational formula that lets one compute logs of sums of exponentials in terms of this entropy:

Lemma 1 (Gibbs variational formula) Let $f: \Omega \to \mathbb{R}$ be a function. Then

$$\log \sum_{\omega \in \Omega} \exp(f(\omega)) = \sup_X \left( \mathbb{E} f(X) + H[X] \right), \quad (1)$$

where $X$ ranges over random variables taking values in $\Omega$.
Proof: Note that shifting $f$ by a constant affects both sides of (1) the same way, so we may normalize $\sum_{\omega \in \Omega} \exp(f(\omega)) = 1$. Then $\exp(f)$ is now the probability distribution of some random variable $Y$, and the inequality can be rewritten as

$$\sum_{\omega \in \Omega} \mathbb{P}[X = \omega] \log \mathbb{P}[Y = \omega] - \sum_{\omega \in \Omega} \mathbb{P}[X = \omega] \log \mathbb{P}[X = \omega] \leq 0,$$

with equality when $X$ has the same distribution as $Y$.
But this is precisely the Gibbs inequality. (The expression inside the supremum can also be written as $-D_{KL}(X \| Y)$, where $D_{KL}$ denotes the Kullback–Leibler divergence. One can also interpret this inequality as a special case of the Fenchel–Young inequality relating the conjugate convex functions $x \mapsto e^x$ and $y \mapsto y \log y - y$.)

In this note I would like to use this variational formula (which is also known as the Donsker–Varadhan variational formula) to give another proof of the following inequality of Carbery.
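As a quick numerical sanity check of Lemma 1 (a Python sketch; the six-element set and the function $f$ are arbitrary illustrative choices), one can verify that the supremum in (1) is attained by the Gibbs distribution $\mathbb{P}[X = \omega] \propto \exp(f(\omega))$, while randomly sampled distributions stay below the log-sum-exp:

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary function f on a six-element set Omega (illustrative choice).
f = rng.normal(size=6)

def entropy(p):
    """Shannon entropy of a probability vector, with 0 log 0 = 0."""
    p = p[p > 0]
    return -(p * np.log(p)).sum()

# Left-hand side of (1): the log-sum-exp of f.
lhs = np.log(np.exp(f).sum())

# The supremum is attained by the Gibbs distribution p(w) proportional to exp(f(w))...
gibbs = np.exp(f) / np.exp(f).sum()
assert np.isclose(lhs, (gibbs * f).sum() + entropy(gibbs))

# ...while every other distribution on Omega stays below the log-sum-exp.
for _ in range(1000):
    p = rng.dirichlet(np.ones(6))
    assert (p * f).sum() + entropy(p) <= lhs + 1e-9
```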
Theorem 2 (Generalized Cauchy–Schwarz inequality) Let $n \geq 0$, let $\Omega, \Omega_1, \dots, \Omega_n$ be finite non-empty sets, and let $\pi_i: \Omega \to \Omega_i$ be functions for each $i = 1, \dots, n$. Let $K: \Omega \to \mathbb{R}^+$ and $f_i: \Omega_i \to \mathbb{R}^+$ be positive functions for each $i = 1, \dots, n$. Then

$$\sum_{\omega \in \Omega} K(\omega) \prod_{i=1}^n f_i(\pi_i(\omega)) \leq Q \prod_{i=1}^n \left( \sum_{\omega_i \in \Omega_i} f_i(\omega_i)^{n+1} \right)^{1/(n+1)},$$

where $Q$ is the quantity

$$Q := \left( \sum_{(\omega^{(0)}, \dots, \omega^{(n)}) \in \Omega^{\otimes n+1}} K(\omega^{(0)}) \cdots K(\omega^{(n)}) \right)^{1/(n+1)},$$

where $\Omega^{\otimes n+1}$ is the set of all tuples $(\omega^{(0)}, \dots, \omega^{(n)}) \in \Omega^{n+1}$ such that $\pi_i(\omega^{(i-1)}) = \pi_i(\omega^{(i)})$ for $i = 1, \dots, n$.
Thus for instance, the claim is trivial for $n = 0$. When $n = 1$, the inequality reads

$$\sum_{\omega \in \Omega} K(\omega) f_1(\pi_1(\omega)) \leq \left( \sum_{\omega^{(0)}, \omega^{(1)} \in \Omega: \pi_1(\omega^{(0)}) = \pi_1(\omega^{(1)})} K(\omega^{(0)}) K(\omega^{(1)}) \right)^{1/2} \left( \sum_{\omega_1 \in \Omega_1} f_1(\omega_1)^2 \right)^{1/2},$$
which is easily proven by Cauchy–Schwarz, while for $n = 2$ the inequality reads

$$\sum_{\omega \in \Omega} K(\omega) f_1(\pi_1(\omega)) f_2(\pi_2(\omega)) \leq \left( \sum_{\substack{\omega^{(0)}, \omega^{(1)}, \omega^{(2)} \in \Omega: \\ \pi_1(\omega^{(0)}) = \pi_1(\omega^{(1)}); \ \pi_2(\omega^{(1)}) = \pi_2(\omega^{(2)})}} K(\omega^{(0)}) K(\omega^{(1)}) K(\omega^{(2)}) \right)^{1/3} \left( \sum_{\omega_1 \in \Omega_1} f_1(\omega_1)^3 \right)^{1/3} \left( \sum_{\omega_2 \in \Omega_2} f_2(\omega_2)^3 \right)^{1/3},$$

which can also be proven by elementary means. However even for $n = 3$, the existing proofs require the "tensor power trick" in order to reduce to the case when the $f_i$ are step functions (in which case the inequality can be proven elementarily, as discussed in the above paper of Carbery).
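Before turning to the proof, here is a quick numerical sanity check of the $n = 1$ and $n = 2$ cases (a Python sketch; the finite sets, the maps $\pi_i$, and the weights are randomly generated illustrative choices, not anything from Carbery's paper):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)

# Illustrative finite sets Omega, Omega_1, Omega_2 with random maps pi_i
# and random positive weights K, f_1, f_2.
n_omega, n_omega1, n_omega2 = 8, 3, 4
pi1 = rng.integers(n_omega1, size=n_omega)
pi2 = rng.integers(n_omega2, size=n_omega)
K = rng.random(n_omega) + 0.1
f1 = rng.random(n_omega1) + 0.1
f2 = rng.random(n_omega2) + 0.1

# n = 1: sum_w K(w) f1(pi1(w)) <= Q * (sum f1^2)^(1/2), where Q^2 sums
# K(w0) K(w1) over pairs with pi1(w0) = pi1(w1).
lhs1 = sum(K[w] * f1[pi1[w]] for w in range(n_omega))
Q1 = sum(K[w0] * K[w1]
         for w0, w1 in product(range(n_omega), repeat=2)
         if pi1[w0] == pi1[w1]) ** (1 / 2)
assert lhs1 <= Q1 * (f1 ** 2).sum() ** (1 / 2) + 1e-12

# n = 2: the triple sum runs over (w0, w1, w2) with pi1(w0) = pi1(w1)
# and pi2(w1) = pi2(w2).
lhs2 = sum(K[w] * f1[pi1[w]] * f2[pi2[w]] for w in range(n_omega))
Q2 = sum(K[w0] * K[w1] * K[w2]
         for w0, w1, w2 in product(range(n_omega), repeat=3)
         if pi1[w0] == pi1[w1] and pi2[w1] == pi2[w2]) ** (1 / 3)
assert lhs2 <= Q2 * (f1 ** 3).sum() ** (1 / 3) * (f2 ** 3).sum() ** (1 / 3) + 1e-12
```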
We now prove this inequality. We write $K = \exp(k)$ and $f_i = \exp(g_i)$ for some functions $k: \Omega \to \mathbb{R}$ and $g_i: \Omega_i \to \mathbb{R}$. If we take logarithms in the inequality to be proven and apply Lemma 1, the inequality becomes

$$\sup_X \left( \mathbb{E} k(X) + \sum_{i=1}^n \mathbb{E} g_i(\pi_i(X)) + H[X] \right) \leq \frac{1}{n+1} \sup_{(X^{(0)}, \dots, X^{(n)})} \left( \sum_{j=0}^n \mathbb{E} k(X^{(j)}) + H[X^{(0)}, \dots, X^{(n)}] \right) + \frac{1}{n+1} \sum_{i=1}^n \sup_{Y_i} \left( (n+1) \mathbb{E} g_i(Y_i) + H[Y_i] \right),$$

where $X$ ranges over random variables taking values in $\Omega$, $(X^{(0)}, \dots, X^{(n)})$ range over tuples of random variables taking values in $\Omega^{\otimes n+1}$, and $Y_i$ range over random variables taking values in $\Omega_i$. Comparing the suprema (matching each $X$ on the left with the choices $Y_i := \pi_i(X)$ on the right), the claim now reduces to

Lemma 3 (Conditional expectation computation) Let $X$ be an $\Omega$-valued random variable. Then there exists an $\Omega^{\otimes n+1}$-valued random variable $(X^{(0)}, \dots, X^{(n)})$, where each $X^{(i)}$ has the same distribution as $X$, and

$$H[X^{(0)}, \dots, X^{(n)}] = (n+1) H[X] - \sum_{i=1}^n H[\pi_i(X)].$$
Proof: We induct on $n$. When $n = 0$ we just take $X^{(0)} := X$. Now suppose that $n \geq 1$, and the claim has already been proven for $n - 1$, thus one has already obtained a tuple $(X^{(0)}, \dots, X^{(n-1)}) \in \Omega^{\otimes n}$ with each $X^{(i)}$ having the same distribution as $X$, and

$$H[X^{(0)}, \dots, X^{(n-1)}] = n H[X] - \sum_{i=1}^{n-1} H[\pi_i(X)].$$
By hypothesis, $\pi_n(X^{(n-1)})$ has the same distribution as $\pi_n(X)$. For each value $\omega_n$ attained by $\pi_n(X)$, we can take conditionally independent copies of $(X^{(0)}, \dots, X^{(n-1)})$ and $X$ conditioned to the events $\pi_n(X^{(n-1)}) = \omega_n$ and $\pi_n(X) = \omega_n$ respectively, and then concatenate them to form a tuple $(X^{(0)}, \dots, X^{(n)})$ in $\Omega^{\otimes n+1}$, with $X^{(n)}$ a further copy of $X$ that is conditionally independent of $(X^{(0)}, \dots, X^{(n-1)})$ relative to $\pi_n(X^{(n-1)}) = \pi_n(X^{(n)})$. One can then use the entropy chain rule to compute

$$\begin{aligned} H[X^{(0)}, \dots, X^{(n)}] &= H[\pi_n(X^{(n)})] + H[X^{(0)}, \dots, X^{(n)} \,|\, \pi_n(X^{(n)})] \\ &= H[\pi_n(X^{(n)})] + H[X^{(0)}, \dots, X^{(n-1)} \,|\, \pi_n(X^{(n)})] + H[X^{(n)} \,|\, \pi_n(X^{(n)})] \\ &= H[X^{(0)}, \dots, X^{(n-1)}] + H[X] - H[\pi_n(X)], \end{aligned}$$

and the claim now follows from the induction hypothesis.

With a little more effort, one can replace $\Omega$ by a more general measure space (and use differential entropy in place of Shannon entropy), to recover Carbery's inequality in full generality; we leave the details to the interested reader.
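As a small illustration of this construction in the $n = 1$ case (a Python sketch; the set $\Omega$, the map $\pi_1$, and the law of $X$ are arbitrary illustrative choices), one can write down the coupling explicitly as a joint probability matrix and confirm the entropy identity of Lemma 3:

```python
import numpy as np

rng = np.random.default_rng(2)

def H(p):
    """Shannon entropy of a probability vector, with 0 log 0 = 0."""
    p = p[p > 0]
    return -(p * np.log(p)).sum()

# Illustrative choice: X takes values in Omega = {0,...,5}, pi_1 maps into {0,1}.
n_omega, n_omega1 = 6, 2
pi1 = rng.integers(n_omega1, size=n_omega)
pX = rng.dirichlet(np.ones(n_omega))                           # law of X
pPi = np.array([pX[pi1 == j].sum() for j in range(n_omega1)])  # law of pi_1(X)

# Joint law of (X^(0), X^(1)): conditionally independent copies of X given pi_1.
joint = np.zeros((n_omega, n_omega))
for a in range(n_omega):
    for b in range(n_omega):
        if pi1[a] == pi1[b]:
            joint[a, b] = pX[a] * pX[b] / pPi[pi1[a]]

# Both marginals recover the law of X, and the Lemma 3 identity holds for n = 1:
# H[X^(0), X^(1)] = 2 H[X] - H[pi_1(X)].
assert np.allclose(joint.sum(axis=1), pX)
assert np.allclose(joint.sum(axis=0), pX)
assert np.isclose(H(joint.ravel()), 2 * H(pX) - H(pPi))
```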