Entropy of a Mixture
Given a pair $(p_0, p_1)$ of probability density functions and an interpolation factor $\lambda \in [0, 1],$ consider the mixture $p_\lambda = (1 - \lambda) p_0 + \lambda p_1.$ How does the entropy $H(p_\lambda) = -\E_{X \sim p_\lambda} \ln p_\lambda(X)$ of this mixture vary as a function of $\lambda$? The widget below shows what this situation looks like for a pair of discrete distributions on the plane. (Drag to adjust $\lambda.$)
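If you'd like to trace the curve numerically rather than with the widget, here is a minimal sketch in Python; the distributions `p0` and `p1` below are arbitrary hypothetical choices on a four-point space.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats of a discrete distribution, with 0 ln 0 = 0."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Hypothetical example distributions on a four-point space.
p0 = np.array([0.7, 0.1, 0.1, 0.1])
p1 = np.array([0.1, 0.2, 0.3, 0.4])

for lam in np.linspace(0.0, 1.0, 5):
    p_lam = (1 - lam) * p0 + lam * p1
    print(f"lambda = {lam:.2f}   H(p_lambda) = {entropy(p_lam):.4f} nats")
```

Plotting these values against $\lambda$ reproduces the bulging curve from the widget.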
It turns out that entropy is concave as a function of probabilities. This explains the "upward bulge" of our curve. It might also be intuitive to you that, when $p_0$ and $p_1$ are less similar, this curve should bulge upwards more. What exactly does its shape tell us? In this short post, we'll see how questions about $H(p_\lambda)$ lead to information about the relationship between $p_0$ and $p_1.$ For example, we'll see how Jensen-Shannon divergence, KL divergence, and $\chi^2$ divergence can all be read off from $H(p_\lambda).$ Actually, this function seems to hold a lot of information about the relationship between $p_0$ and $p_1$—perhaps everything you would ever want to know.
Entropy being concave means that the inequality $H(p_\lambda) \ge (1 - \lambda) H(p_0) + \lambda H(p_1)$ holds for all pairs $(p_0, p_1)$ and scalars $\lambda \in [0, 1].$ Actually, this kind of inequality has a very important interpretation: in general, the gap in a Jensen inequality on entropy measures mutual information.
To see how this works, let's introduce a Bernoulli variable $A.$ Let $A = 1$ with probability $\lambda$ and $A = 0$ otherwise, and let $X$ be a draw from $p_1$ when $A = 1$ and a draw from $p_0$ when $A = 0.$ Overall, $X \sim p_\lambda.$ In terms of the variables $X$ and $A,$ the weighted average of entropies $(1 - \lambda) H(p_0) + \lambda H(p_1)$ is called the conditional entropy $H(X|A).$ We can interpret conditional entropy as the average surprisal we pay on our prediction of $X$ if we're given the value of $A$ in advance. Our inequality above states that $H(X|A)$ is never larger than $H(X),$ which I hope is intuitive information-theoretically! Mutual information $I(X;A)$ measures the gap $H(X) - H(X|A)$; in other words, $H(p_\lambda) = (1 - \lambda) H(p_0) + \lambda H(p_1) + I(X;A).$ For two variables $X$ and $A$ in general, we would write $H(\E_A p(x|A)) = \E_A H(p(x|A)) + I(X;A)$ where $p(x|A)$ is understood as a random distribution depending on $A.$ The bound $H(\E_A p(x|A)) \ge \E_A H(p(x|A))$ implied by non-negativity of mutual information is called a Jensen inequality, and it follows from concavity of $H.$
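As a sanity check, here is a small numerical sketch, reusing the hypothetical `p0` and `p1` from above and computing $I(X;A)$ directly from the joint distribution of $(A, X),$ that the Jensen gap equals the mutual information.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

p0 = np.array([0.7, 0.1, 0.1, 0.1])
p1 = np.array([0.1, 0.2, 0.3, 0.4])
lam = 0.3
p_lam = (1 - lam) * p0 + lam * p1

# Joint distribution of (A, X): row 0 is A = 0, row 1 is A = 1.
joint = np.vstack([(1 - lam) * p0, lam * p1])

# I(X;A) = H(X) + H(A) - H(X, A).
mi = entropy(p_lam) + entropy(np.array([1 - lam, lam])) - entropy(joint.ravel())

# Jensen gap: H(p_lambda) minus the weighted average of entropies.
gap = entropy(p_lam) - ((1 - lam) * entropy(p0) + lam * entropy(p1))

print(mi, gap)  # the two values agree up to floating-point error
```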
Now, the idea that $I(X;A)$ measures how much knowing $A$ decreases our expected surprisal about $X$ leads to a way to express $I(X;A)$ in terms of KL divergences between the distributions $p_0,$ $p_1$ and $p_\lambda.$ Explicitly,
$$\begin{align*}
I(X;A) & = \E_{A,X} \bigl[ \ln p(X|A) - \ln p(X) \bigr] \\
& = \E_A \KL(\text{posterior on } X \Vert \text{prior on } X) \\
& = (1 - \lambda) \KL(p_0 \Vert p_\lambda) + \lambda \KL(p_1 \Vert p_\lambda).
\end{align*}$$
We can also put our two variables in the other order and ask how much $X$ tells us about $A.$ Where $\KL(x \Vert y) = x \ln(x/y) + (1 - x) \ln((1 - x)/(1 - y))$ denotes the KL divergence between two Bernoulli distributions with rates $x$ and $y,$ we find that
$$\begin{align*}
I(X;A) & = \E_X \KL(\text{posterior on } A \Vert \text{prior on } A) \\
& = \E_X \KL\left( \frac{\lambda p_1}{p_\lambda} \,\middle\Vert\, \lambda \right) \\
& = \E_X \biggl[ -\frac{(1 - \lambda) p_0}{p_\lambda} \left( \ln (1 - \lambda) - \ln \frac{(1 - \lambda) p_0}{p_\lambda} \right) \\
& \phantom{= \E_X \biggl[} - \frac{\lambda p_1}{p_\lambda} \left( \ln \lambda - \ln \frac{\lambda p_1}{p_\lambda} \right) \biggr] \\
& = \E_X \biggl[ \frac{(1 - \lambda) p_0}{p_\lambda} \ln \frac{p_0}{p_\lambda} + \frac{\lambda p_1}{p_\lambda} \ln \frac{p_1}{p_\lambda} \biggr],
\end{align*}$$
which is the same as our earlier expression with the order of integration swapped. (Just write the expectation as an integral and cancel out the factors of $p_\lambda$!) There are a lot of ways to think about this kind of expression.
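To make the order-of-integration swap concrete, here is a quick check, with the same hypothetical `p0` and `p1`, that the $A$-averaged KL divergences and the $X$-averaged Bernoulli KL give the same number.

```python
import numpy as np

p0 = np.array([0.7, 0.1, 0.1, 0.1])
p1 = np.array([0.1, 0.2, 0.3, 0.4])
lam = 0.3
p_lam = (1 - lam) * p0 + lam * p1

def kl(p, q):
    """KL divergence between discrete distributions with matching support."""
    return np.sum(p * np.log(p / q))

def kl_bern(x, y):
    """KL divergence between Bernoulli distributions with rates x and y."""
    return x * np.log(x / y) + (1 - x) * np.log((1 - x) / (1 - y))

# Average over A of KL(posterior on X || prior on X).
over_A = (1 - lam) * kl(p0, p_lam) + lam * kl(p1, p_lam)

# Average over X of KL(posterior on A || prior on A).
posterior_A = lam * p1 / p_lam               # P(A = 1 | X = x)
over_X = np.sum(p_lam * kl_bern(posterior_A, lam))

print(over_A, over_X)  # identical up to floating-point error
```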
After a little more rearranging, we can also show that $I(X;A)$ is an $f$-divergence between the distributions $p_0$ and $p_1,$ meaning it can be written in the form $\E_{p_0} f(p_0/p_1)$ for some convex function $f$ with a few technical properties. When $\lambda = 1/2,$ $I(X;A)$ is called the Jensen-Shannon divergence. Jensen-Shannon divergence sounds like a pretty natural measurement of the discrepancy between two distributions; if I secretly flip a coin and send you either a draw from $p_0$ or $p_1$ depending on the outcome, $\operatorname{JSD}(p_0, p_1)$ measures how much I've implicitly told you about my coin flip. Unlike KL divergence, it is bounded and symmetric.
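For concreteness, here is the Jensen-Shannon divergence computed two ways for the hypothetical `p0` and `p1` above: as the entropy gap at $\lambda = 1/2$ and as the average KL divergence to the midpoint mixture.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def kl(p, q):
    return np.sum(p * np.log(p / q))

p0 = np.array([0.7, 0.1, 0.1, 0.1])
p1 = np.array([0.1, 0.2, 0.3, 0.4])
m = 0.5 * (p0 + p1)  # the lambda = 1/2 mixture

jsd_as_gap = entropy(m) - 0.5 * (entropy(p0) + entropy(p1))
jsd_as_kl = 0.5 * kl(p0, m) + 0.5 * kl(p1, m)

print(jsd_as_gap, jsd_as_kl)  # equal, symmetric in (p0, p1), and never more than ln 2
```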
So far, we've seen that the bulge in the curve $H(p_\lambda)$ measures mutual information $I(X;A)$ between a draw from $p_\lambda$ and a fictitious latent variable $A,$ and we've remembered how to write down $I(X;A)$ explicitly. However, what's really interesting is how this quantity changes as a function of $\lambda.$ For starters, is it always true that "mixing" $p_0$ with a small factor of another distribution $p_1$ increases entropy? A little thought should convince you this is not necessarily true unless $p_0$ is a Dirac distribution: any infinitesimal perturbation of $p_0$ can be achieved by "mixing in" some distribution $p_1,$ and the only local minima of entropy are the Dirac distributions.
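Here is one concrete counterexample, with hypothetical numbers: when $p_1$ piles its mass where $p_0$ is already likely, a small amount of mixing lowers the entropy.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

p0 = np.array([0.4, 0.3, 0.2, 0.1])
p1 = np.array([1.0, 0.0, 0.0, 0.0])  # a Dirac at p0's most probable point

for lam in [0.0, 0.05, 0.1, 0.2]:
    print(lam, entropy((1 - lam) * p0 + lam * p1))  # entropy decreases as lam grows
```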
To find out when mixing in $p_1$ increases entropy, let's differentiate. Using that $\frac{d}{d\lambda}\big|_{\lambda = 0} H(p + \lambda q) = -\int q \ln p \, dx$ for a perturbation $q$ that integrates to zero (here $q = p_1 - p_0$), we find
$$\begin{align*}
\frac{d}{d\lambda}\bigg|_{\lambda = 0} H(p_\lambda) & = \int (p_0 - p_1) \ln p_0 \, dx \\
& = H(p_1, p_0) - H(p_0)
\end{align*}$$
where $H(p_1, p_0)$ denotes the cross-entropy $-\E_{p_1} \ln p_0.$ (Throughout this post, we'll assume that $p_0 > 0$ and that limits mercifully commute with integrals. This second assumption is always true, for example, when $p_0$ and $p_1$ are distributions over a finite set.)
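Here is a finite-difference check of this slope formula, again using the hypothetical `p0` and `p1` from the first sketch.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

p0 = np.array([0.7, 0.1, 0.1, 0.1])
p1 = np.array([0.1, 0.2, 0.3, 0.4])

cross = -np.sum(p1 * np.log(p0))          # cross-entropy H(p1, p0)
predicted_slope = cross - entropy(p0)     # H(p1, p0) - H(p0)

eps = 1e-6
numerical_slope = (entropy((1 - eps) * p0 + eps * p1) - entropy(p0)) / eps

print(predicted_slope, numerical_slope)   # agree up to the finite-difference error
```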
I'm not sure if there's any name for this quantity $H(p_1, p_0) - H(p_0).$ I've started calling it "proclivity" and denoting it by $\operatorname{Pc}(p_1, p_0).$ We can relate proclivity to entropy and KL divergence by $\operatorname{Pc}(p_1, p_0) = \KL(p_1 \Vert p_0) + H(p_1) - H(p_0).$ However, unlike KL divergence, proclivity can be either positive or negative. For instance, $\operatorname{Pc}(\delta_y, p) = \E_p \ln p(x) - \ln p(y)$ when $\delta_y$ is a Dirac distribution. You can play around with the proclivity between two distributions in the widget below. (Drag to adjust the center of the red distribution.)
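Finally, a small sketch of proclivity for the same hypothetical `p0` and `p1`, checking the identity above and showing that, depending on its arguments, proclivity can come out positive or negative.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]))

def kl(p, q):
    return cross_entropy(p, q) - entropy(p)

def proclivity(p1, p0):
    """Pc(p1, p0) = H(p1, p0) - H(p0)."""
    return cross_entropy(p1, p0) - entropy(p0)

p0 = np.array([0.7, 0.1, 0.1, 0.1])
p1 = np.array([0.1, 0.2, 0.3, 0.4])

# The identity Pc(p1, p0) = KL(p1 || p0) + H(p1) - H(p0).
print(proclivity(p1, p0), kl(p1, p0) + entropy(p1) - entropy(p0))  # equal, positive here

# A Dirac at p0's most probable point has negative proclivity toward p0.
delta = np.array([1.0, 0.0, 0.0, 0.0])
print(proclivity(delta, p0))  # equals E_p0[ln p0(x)] - ln p0(y) < 0 in this example
```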