
Entropy of a Mixture

Given a pair $(p_0, p_1)$ of probability density functions and an interpolation factor $\lambda \in [0, 1]$, consider the mixture $p_\lambda = (1 - \lambda) p_0 + \lambda p_1$. How does the entropy $H(p_\lambda) = -\E_{X \sim p_\lambda} \ln p_\lambda(X)$ of this mixture vary as a function of $\lambda$? The widget below shows what this situation looks like for a pair of discrete distributions on the plane. (Drag to adjust $\lambda$.)
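
To make the curve concrete, here is a minimal numerical sketch in Python (the distributions `p0`, `p1` and the `entropy` helper are my own illustrative choices, not from the post): it evaluates $H(p_\lambda)$ for two small discrete distributions on a grid of $\lambda$ and compares it to the straight line between $H(p_0)$ and $H(p_1)$.

```python
import numpy as np

def entropy(p):
    """Shannon entropy -sum p ln p in nats, with the convention 0 ln 0 = 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

# Two strictly positive discrete distributions (the post later assumes p0 > 0).
p0 = np.array([0.6, 0.2, 0.1, 0.1])
p1 = np.array([0.1, 0.1, 0.3, 0.5])

for lam in np.linspace(0.0, 1.0, 6):
    p_lam = (1 - lam) * p0 + lam * p1                    # the mixture p_lambda
    chord = (1 - lam) * entropy(p0) + lam * entropy(p1)  # straight line between endpoints
    print(f"lambda={lam:.1f}  H(p_lambda)={entropy(p_lam):.4f}  chord={chord:.4f}")
```

The `H(p_lambda)` column always sits at or above the `chord` column: that is the "upward bulge" discussed next.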

It turns out that entropy is concave as a function of probabilities. This explains the "upward bulge" of our curve. It might also be intuitive to you that, when $p_0$ and $p_1$ are less similar, this curve should bulge upwards more. What exactly does its shape tell us? In this short post, we'll see how questions about $H(p_\lambda)$ lead to information about the relationship between $p_0$ and $p_1$. For example, we'll see how Jensen-Shannon divergence, KL divergence, and $\chi^2$ divergence can all be read off from $H(p_\lambda)$. Actually, this function seems to hold a lot of information about the relationship between $p_0$ and $p_1$, perhaps everything you would ever want to know.

Entropy being concave means that the inequality $H(p_\lambda) \ge (1 - \lambda) H(p_0) + \lambda H(p_1)$ holds for all pairs $(p_0, p_1)$ and scalars $\lambda \in [0, 1]$. Actually, this kind of inequality has a very important interpretation: in general, the gap in a Jensen inequality on entropy measures mutual information.

To see how this works, let's introduce a Bernoulli variable $A$. Let $A = 1$ with probability $\lambda$ and $A = 0$ otherwise, and let $X$ be a draw from $p_1$ when $A = 1$ and a draw from $p_0$ when $A = 0$. Overall, $X \sim p_\lambda$. In terms of the variables $X$ and $A$, the weighted average of entropies $(1 - \lambda) H(p_0) + \lambda H(p_1)$ is called the conditional entropy $H(X|A)$. We can interpret conditional entropy as the average surprisal we pay on our prediction of $X$ if we're given the value of $A$ in advance. Our inequality above states that $H(X|A)$ is never larger than $H(X)$, which I hope is intuitive information-theoretically! Mutual information $I(X;A)$ measures the gap $H(X) - H(X|A)$; in other words, $H(p_\lambda) = (1 - \lambda) H(p_0) + \lambda H(p_1) + I(X;A)$. For two variables $X$ and $A$ in general, we would write $H(\E_A p(x|A)) = \E_A H(p(x|A)) + I(X;A)$, where $p(x|A)$ is understood as a random distribution depending on $A$. The bound $H(\E_A p(x|A)) \ge \E_A H(p(x|A))$ implied by non-negativity of mutual information is called a Jensen inequality, and it follows from concavity of $H$.
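
Here is a short continuation of the sketch above (reusing `p0`, `p1` and `entropy`; the choice `lam = 0.3` is arbitrary): it builds the joint distribution of $(A, X)$, computes $I(X;A)$ directly from its definition, and checks that it equals the Jensen gap $H(p_\lambda) - [(1 - \lambda) H(p_0) + \lambda H(p_1)]$.

```python
lam = 0.3
p_lam = (1 - lam) * p0 + lam * p1

# Joint distribution P(A = a, X = x): row 0 is (1 - lam) * p0, row 1 is lam * p1.
joint = np.vstack([(1 - lam) * p0, lam * p1])
pA = joint.sum(axis=1)   # marginal of A: (1 - lam, lam)
pX = joint.sum(axis=0)   # marginal of X: exactly p_lam

# I(X;A) = sum over (a, x) of P(a, x) ln [ P(a, x) / (P(a) P(x)) ].
mask = joint > 0
I_XA = np.sum(joint[mask] * np.log(joint[mask] / np.outer(pA, pX)[mask]))

gap = entropy(p_lam) - ((1 - lam) * entropy(p0) + lam * entropy(p1))
print(np.isclose(I_XA, gap))   # True: the Jensen gap is the mutual information
```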

Now, the idea that $I(X;A)$ measures how much knowing $A$ decreases our expected surprisal about $X$ leads to a way to express $I(X;A)$ in terms of KL divergences between the distributions $p_0$, $p_1$ and $p_\lambda$. Explicitly,

$$
\begin{align*}
I(X;A) & = \E_{A,X} \bigl[ \ln p(X|A) - \ln p(X) \bigr] \\
& = \E_A \KL(\text{posterior on } X \Vert \text{prior on } X) \\
& = (1 - \lambda) \KL(p_0 \Vert p_\lambda) + \lambda \KL(p_1 \Vert p_\lambda).
\end{align*}
$$

We can also put our two variables in the other order and ask how much $X$ tells us about $A$. Where $\KL(x \Vert y) = x \ln(x/y) + (1 - x) \ln((1 - x)/(1 - y))$ denotes the KL divergence between two Bernoulli distributions with rates $x$ and $y$, we find that

$$
\begin{align*}
I(X;A) & = \E_X \KL(\text{posterior on } A \Vert \text{prior on } A) \\
& = \E_X \KL\left( \frac{\lambda p_1}{p_\lambda} \,\middle\Vert\, \lambda \right) \\
& = \E_X \biggl[ -\frac{(1 - \lambda) p_0}{p_\lambda} \left( \ln (1 - \lambda) - \ln \frac{(1 - \lambda) p_0}{p_\lambda} \right) \\
& \phantom{= \E_X \biggl[ } - \frac{\lambda p_1}{p_\lambda} \left( \ln \lambda - \ln \frac{\lambda p_1}{p_\lambda} \right) \biggr] \\
& = \E_X \biggl[ \frac{(1 - \lambda) p_0}{p_\lambda} \ln \frac{p_0}{p_\lambda} + \frac{\lambda p_1}{p_\lambda} \ln \frac{p_1}{p_\lambda} \biggr],
\end{align*}
$$

which is the same as our earlier expression with the order of integration swapped. (Just write the expectation as an integral and cancel out the factors of $p_\lambda$!) There are a lot of ways to think about this kind of expression.
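
Continuing the numerical sketch (reusing `p0`, `p1`, `lam`, and `p_lam` from above; the helpers `kl` and `bern_kl` are mine): both readings of $I(X;A)$, conditioning on $A$ and conditioning on $X$, give the same number.

```python
def kl(p, q):
    """KL(p || q) for discrete distributions, assuming q > 0 wherever p > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

# Conditioning on A: a (1 - lam, lam)-weighted average of KL(p_i || p_lam).
I_from_A = (1 - lam) * kl(p0, p_lam) + lam * kl(p1, p_lam)

# Conditioning on X: expected KL between the Bernoulli posterior lam * p1 / p_lam
# on A and the Bernoulli prior lam, averaged over X ~ p_lam.
def bern_kl(x, y):
    """KL divergence between Bernoulli(x) and Bernoulli(y)."""
    out = 0.0
    if x > 0:
        out += x * np.log(x / y)
    if x < 1:
        out += (1 - x) * np.log((1 - x) / (1 - y))
    return out

I_from_X = sum(p_lam[i] * bern_kl(lam * p1[i] / p_lam[i], lam)
               for i in range(len(p_lam)) if p_lam[i] > 0)

print(np.isclose(I_from_A, I_from_X))   # True: same mutual information
```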

After a little more rearranging, we can also show that $I(X;A)$ is an $f$-divergence between the distributions $p_0$ and $p_1$, meaning it can be written in the form $\E_{p_1} f(p_0/p_1)$ for some convex function $f$ with a few technical properties. When $\lambda = 1/2$, $I(X;A)$ is called the Jensen-Shannon divergence. Jensen-Shannon divergence sounds like a pretty natural measurement of the discrepancy between two distributions; if I secretly flip a fair coin and send you either a draw from $p_0$ or $p_1$ depending on the outcome, $\operatorname{JSD}(p_0, p_1)$ measures how much I've implicitly told you about my coin flip. Unlike KL divergence, it is bounded and symmetric.
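
A quick check of these last two properties, reusing the `kl` helper (the `jsd` function is my own): Jensen-Shannon divergence is symmetric in its arguments and never exceeds $\ln 2$ nats, since you can never learn more than one bit about the coin flip.

```python
def jsd(p, q):
    """Jensen-Shannon divergence: the mutual information above at lam = 1/2."""
    m = 0.5 * (np.asarray(p, float) + np.asarray(q, float))   # equal-weight mixture
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

print(np.isclose(jsd(p0, p1), jsd(p1, p0)))   # symmetric
print(jsd(p0, p1) <= np.log(2))               # bounded by ln 2 (one bit, in nats)
```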

So far, we've seen that the bulge in the curve $H(p_\lambda)$ measures mutual information $I(X;A)$ between a draw from $p_\lambda$ and a fictitious latent variable $A$, and we've remembered how to write down $I(X;A)$ explicitly. However, what's really interesting is how this quantity changes as a function of $\lambda$. For starters, is it always true that "mixing" $p_0$ with a small factor of another distribution $p_1$ increases entropy? A little thought should convince you this is not necessarily true unless $p_0$ is a Dirac distribution: any infinitesimal perturbation of $p_0$ can be achieved by "mixing in" some distribution $p_1$, and the only local minima of entropy are the Dirac distributions.

To find out when mixing in $p_1$ increases entropy, let's differentiate. Using that $\frac{d}{d\lambda}\big|_{\lambda = 0} H(p + \lambda q) = -\int q \ln p \, dx$ for perturbations $q$ that integrate to zero, we find

$$
\begin{align*}
\frac{d}{d\lambda}\bigg|_{\lambda = 0} H(p_\lambda) & = \int (p_0 - p_1) \ln p_0 \, dx \\
& = H(p_1, p_0) - H(p_0),
\end{align*}
$$

where $H(p_1, p_0)$ denotes the cross-entropy $-\E_{p_1} \ln p_0$. (Throughout this post, we'll assume that $p_0 > 0$ and that limits mercifully commute with integrals. This second assumption is always true, for example, when $p_0$ and $p_1$ are distributions over a finite set.)
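
As a sanity check on this derivative, here is a small finite-difference experiment continuing the sketch above (the `cross_entropy` helper and the step size `eps` are my own choices):

```python
def cross_entropy(p, q):
    """Cross-entropy H(p, q) = -E_p ln q, assuming q > 0 wherever p > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(q[nz]))

eps = 1e-6
p_eps = (1 - eps) * p0 + eps * p1                     # p_lambda at lambda = eps
numeric_slope = (entropy(p_eps) - entropy(p0)) / eps  # forward difference
analytic_slope = cross_entropy(p1, p0) - entropy(p0)  # H(p1, p0) - H(p0)
print(numeric_slope, analytic_slope)                  # should agree to ~1e-5
```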

I'm not sure if there's any name for this quantity $H(p_1, p_0) - H(p_0)$. I've started calling it "proclivity" and denoting it by $\operatorname{Pc}(p_1, p_0)$. We can relate proclivity to entropy and KL divergence by $\operatorname{Pc}(p_1, p_0) = \KL(p_1 \Vert p_0) + H(p_1) - H(p_0)$. However, unlike KL divergence, proclivity can be either positive or negative. For instance, $\operatorname{Pc}(\delta_y, p) = \E_p \ln p(x) - \ln p(y)$ when $\delta_y$ is a Dirac distribution, which is positive exactly when $p(y) < e^{-H(p)}$. You can play around with the proclivity between two distributions in the widget below. (Drag to adjust the center of the red distribution.)
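
To close the loop numerically (reusing `entropy`, `cross_entropy` and `kl`; `proclivity` and the two near-Dirac distributions are my own): the sketch below checks the KL identity and shows proclivity taking both signs.

```python
def proclivity(p1, p0):
    """Pc(p1, p0) = H(p1, p0) - H(p0): the slope of H(p_lambda) at lambda = 0."""
    return cross_entropy(p1, p0) - entropy(p0)

# Pc(p1, p0) = KL(p1 || p0) + H(p1) - H(p0).
print(np.isclose(proclivity(p1, p0), kl(p1, p0) + entropy(p1) - entropy(p0)))

# Near-Dirac distributions: mixing mass onto the mode of p0 lowers entropy
# (negative proclivity), while mixing mass into its tail raises it (positive).
at_mode = np.array([0.97, 0.01, 0.01, 0.01])
at_tail = np.array([0.01, 0.01, 0.01, 0.97])
print(proclivity(at_mode, p0), proclivity(at_tail, p0))
```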
