Entropy of a Mixture
Given a pair $(p_0, p_1)$ of probability density functions and an interpolation factor $\lambda \in [0, 1]$, consider the mixture $p_\lambda = (1 - \lambda) p_0 + \lambda p_1$. How does the entropy $H(p_\lambda) = -\E_{X \sim p_\lambda} \ln p_\lambda(X)$ of this mixture vary as a function of $\lambda$? The widget below shows what this situation looks like for a pair of discrete distributions on the plane. (Drag to adjust $\lambda$.)
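If you'd like to experiment away from the widget, here is a minimal numerical sketch. The two discrete distributions `p0` and `p1` are made up for illustration; the printed values of $H(p_\lambda)$ trace out the bulging curve shown in the widget.

```python
import numpy as np

# Two made-up discrete distributions on a common four-point support.
p0 = np.array([0.40, 0.30, 0.20, 0.10])
p1 = np.array([0.10, 0.20, 0.30, 0.40])

def entropy(p):
    """Shannon entropy in nats."""
    return -np.sum(p * np.log(p))

# Tabulate the entropy of the mixture p_lambda over a grid of lambda values.
for lam in np.linspace(0.0, 1.0, 11):
    p_lam = (1 - lam) * p0 + lam * p1
    print(f"lambda = {lam:.1f}   H(p_lambda) = {entropy(p_lam):.4f}")
```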
It turns out that entropy is concave as a function of probabilities. This explains the "upward bulge" of our curve. It might also be intuitive to you that, when $p_0$ and $p_1$ are less similar, this curve should bulge upwards more. What exactly does its shape tell us? In this short post, we'll see how questions about $H(p_\lambda)$ lead to information about the relationship between $p_0$ and $p_1$. For example, we'll see how Jensen-Shannon divergence, KL divergence, and $\chi^2$ divergence can all be read off from $H(p_\lambda)$. Actually, this function seems to hold a lot of information about the relationship between $p_0$ and $p_1$—perhaps everything you would ever want to know.
Entropy being concave means that the inequality
$$H(p_\lambda) \ge (1 - \lambda) H(p_0) + \lambda H(p_1)$$
holds for all pairs $(p_0, p_1)$ and scalars $\lambda \in [0, 1]$. Actually, this kind of inequality has a very important interpretation: in general, the gap in a Jensen inequality on entropy measures mutual information.
To see how this works, let's introduce a Bernoulli variable $A$. Let $A = 1$ with probability $\lambda$ and $A = 0$ otherwise, and let $X$ be a draw from $p_1$ when $A = 1$ and a draw from $p_0$ when $A = 0$. Overall, $X \sim p_\lambda$. In terms of the variables $X$ and $A$, the weighted average of entropies $(1 - \lambda) H(p_0) + \lambda H(p_1)$ is called the conditional entropy $H(X|A)$. We can interpret conditional entropy as the average surprisal we pay on our prediction of $X$ if we're given the value of $A$ in advance. Our inequality above states that $H(X|A)$ is never larger than $H(X)$, which I hope is intuitive information-theoretically! Mutual information $I(X;A)$ measures the gap $H(X) - H(X|A)$; in other words,
$$H(p_\lambda) = (1 - \lambda) H(p_0) + \lambda H(p_1) + I(X;A).$$
For two variables $X$ and $A$ in general, we would write $H(\E_A p(x|A)) = \E_A H(p(x|A)) + I(X;A)$ where $p(x|A)$ is understood as a random distribution depending on $A$. The bound $H(\E_A p(x|A)) \ge \E_A H(p(x|A))$ implied by non-negativity of mutual information is called a Jensen inequality, and it follows from concavity of $H$.
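Here is a small numeric check of this decomposition, again with made-up `p0` and `p1`: the Jensen gap computed from entropies matches the mutual information computed directly from the joint distribution of $(A, X)$.

```python
import numpy as np

p0 = np.array([0.40, 0.30, 0.20, 0.10])
p1 = np.array([0.10, 0.20, 0.30, 0.40])
lam = 0.3

def entropy(p):
    return -np.sum(p * np.log(p))

# Joint distribution of (A, X): row a, column x holds P(A = a, X = x).
joint = np.vstack([(1 - lam) * p0, lam * p1])
p_x = joint.sum(axis=0)            # marginal of X, i.e. p_lambda
p_a = joint.sum(axis=1)            # marginal of A: (1 - lam, lam)
mutual_info = np.sum(joint * np.log(joint / np.outer(p_a, p_x)))

jensen_gap = entropy(p_x) - ((1 - lam) * entropy(p0) + lam * entropy(p1))
print(mutual_info, jensen_gap)     # the two agree up to floating-point error
```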
Now, the idea that $I(X;A)$ measures how much knowing $A$ decreases our expected surprisal about $X$ leads to a way to express $I(X;A)$ in terms of KL divergences between the distributions $p_0$, $p_1$ and $p_\lambda$. Explicitly,
$$\begin{align*}
I(X;A) & = \E_{A,X} \bigl[ \ln p(X|A) - \ln p(X) \bigr] \\
& = \E_A \KL(\text{posterior on } X \Vert \text{prior on } X) \\
& = (1 - \lambda) \KL(p_0 \Vert p_\lambda) + \lambda \KL(p_1 \Vert p_\lambda).
\end{align*}$$
We can also put our two variables in the other order and ask how much $X$ tells us about $A$. Where $\KL(x \Vert y) = x \ln(x/y) + (1 - x) \ln((1 - x)/(1-y))$ denotes the KL divergence between two Bernoulli distributions with rates $x$ and $y$, we find that
$$\begin{align*}
I(X;A) & = \E_X \KL(\text{posterior on } A \Vert \text{prior on } A) \\
& = \E_X \KL\left( \frac{\lambda p_1}{p_\lambda} \,\middle\Vert\, \lambda \right) \\
& = \E_X \biggl[ -\frac{(1 - \lambda) p_0}{p_\lambda} \left( \ln (1 - \lambda) - \ln \frac{(1 - \lambda) p_0}{p_\lambda} \right) \\
& \phantom{= \E_X \biggl[ } -\frac{\lambda p_1}{p_\lambda} \left( \ln \lambda - \ln \frac{\lambda p_1}{p_\lambda} \right) \biggr] \\
& = \E_X \biggl[ \frac{(1 - \lambda) p_0}{p_\lambda} \ln \frac{p_0}{p_\lambda} + \frac{\lambda p_1}{p_\lambda} \ln \frac{p_1}{p_\lambda} \biggr],
\end{align*}$$
which is the same as our earlier expression with the order of integration swapped. (Just write the expectation as an integral and cancel out the factors of $p_\lambda$!) There are a lot of ways to think about this kind of expression.
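The agreement of these two expressions is easy to check numerically; a minimal sketch with made-up `p0` and `p1`:

```python
import numpy as np

p0 = np.array([0.40, 0.30, 0.20, 0.10])
p1 = np.array([0.10, 0.20, 0.30, 0.40])
lam = 0.3
p_lam = (1 - lam) * p0 + lam * p1

def kl(p, q):
    return np.sum(p * np.log(p / q))

def kl_bernoulli(x, y):
    return x * np.log(x / y) + (1 - x) * np.log((1 - x) / (1 - y))

# Average over A of KL(posterior on X || prior on X):
via_A = (1 - lam) * kl(p0, p_lam) + lam * kl(p1, p_lam)

# Average over X ~ p_lambda of KL(posterior on A || prior on A):
posterior_A1 = lam * p1 / p_lam
via_X = np.sum(p_lam * kl_bernoulli(posterior_A1, lam))

print(via_A, via_X)   # equal up to floating-point error
```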
After a little more rearranging, we can also show that $I(X;A)$ is an $f$-divergence between the distributions $p_0$ and $p_1$, meaning it can be written in the form $\E_{p_0} f(p_0/p_1)$ for some convex function $f$ with a few technical properties. When $\lambda = 1/2$, $I(X;A)$ is called the Jensen-Shannon divergence. Jensen-Shannon divergence sounds like a pretty natural measurement of the discrepancy between two distributions; if I secretly flip a coin and send you either a draw from $p_0$ or $p_1$ depending on the outcome, $\operatorname{JSD}(p_0, p_1)$ measures how much I've implicitly told you about my coin flip. Unlike KL divergence, it is bounded and symmetric.
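For instance, here is a sketch (made-up `p0`, `p1`) computing $\operatorname{JSD}(p_0, p_1)$ both as an average of KL divergences to the midpoint mixture and as the Jensen gap at $\lambda = 1/2$; either way it comes out the same, and it never exceeds $\ln 2$ nats.

```python
import numpy as np

p0 = np.array([0.40, 0.30, 0.20, 0.10])
p1 = np.array([0.10, 0.20, 0.30, 0.40])

def entropy(p):
    return -np.sum(p * np.log(p))

def kl(p, q):
    return np.sum(p * np.log(p / q))

m = 0.5 * (p0 + p1)                                  # the midpoint mixture
jsd_from_kl = 0.5 * kl(p0, m) + 0.5 * kl(p1, m)
jsd_from_gap = entropy(m) - 0.5 * entropy(p0) - 0.5 * entropy(p1)
print(jsd_from_kl, jsd_from_gap, np.log(2))          # the two JSD values agree; both <= ln 2
```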
So far, we've seen that the bulge in the curve $H(p_\lambda)$ measures mutual information $I(X;A)$ between a draw from $p_\lambda$ and a fictitious latent variable $A$, and we've remembered how to write down $I(X;A)$ explicitly. However, what's really interesting is how this quantity changes as a function of $\lambda$. For starters, is it always true that "mixing" $p_0$ with a small factor of another distribution $p_1$ increases entropy? A little thought should convince you this is not necessarily true unless $p_0$ is a Dirac distribution: any infinitesimal perturbation of $p_0$ can be achieved by "mixing in" some distribution $p_1$, and the only local minima of entropy are the Dirac distributions.
To find out when mixing in $p_1$ increases entropy, let's differentiate. Using that
$$\left.\frac{d}{d\lambda}\right|_{\lambda = 0} H(p + \lambda q) = -\int q \ln p \, dx$$
for any perturbation $q$ with $\int q \, dx = 0$ (as is the case for $q = p_1 - p_0$), we find
$$\begin{align*}
\left.\frac{d}{d\lambda}\right|_{\lambda = 0} H(p_\lambda) & = \int (p_0 - p_1) \ln p_0 \, dx \\
& = H(p_1, p_0) - H(p_0)
\end{align*}$$
where $H(p_1, p_0)$ denotes the cross-entropy $-\E_{p_1} \ln p_0$. (Throughout this post, we'll assume that $p_0 > 0$ and that limits mercifully commute with integrals. This second assumption is always true, for example, when $p_0$ and $p_1$ are distributions over a finite set.)
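A quick finite-difference sanity check of this derivative, again with made-up `p0` and `p1`:

```python
import numpy as np

p0 = np.array([0.40, 0.30, 0.20, 0.10])
p1 = np.array([0.10, 0.20, 0.30, 0.40])

def entropy(p):
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

eps = 1e-6
slope_numeric = (entropy((1 - eps) * p0 + eps * p1) - entropy(p0)) / eps
slope_formula = cross_entropy(p1, p0) - entropy(p0)
print(slope_numeric, slope_formula)   # agree to several decimal places
```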
I'm not sure if there's any name for this quantity $H(p_1, p_0) - H(p_0)$. I've started calling it "proclivity" and denoting it by $\operatorname{Pc}(p_1, p_0)$. We can relate proclivity to entropy and KL divergence by
$$\operatorname{Pc}(p_1, p_0) = \KL(p_1 \Vert p_0) + H(p_1) - H(p_0).$$
However, unlike KL divergence, proclivity can be either positive or negative. For instance, $\operatorname{Pc}(\delta_y, p) = \E_p \ln p(x) - \ln p(y)$ when $\delta_y$ is a Dirac distribution. You can play around with the proclivity between two distributions in the widget below. (Drag to adjust the center of the red distribution.)
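A small sketch with made-up distributions checks the relation to KL divergence and shows proclivity taking both signs: a near-Dirac distribution concentrated on $p$'s most likely outcome has negative proclivity relative to $p$, while one concentrated on a rare outcome has positive proclivity.

```python
import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

def kl(p, q):
    return np.sum(p * np.log(p / q))

def proclivity(p1, p0):
    return cross_entropy(p1, p0) - entropy(p0)

p = np.array([0.40, 0.30, 0.20, 0.10])
near_dirac_mode = np.array([0.97, 0.01, 0.01, 0.01])   # concentrated on p's most likely outcome
near_dirac_tail = np.array([0.01, 0.01, 0.01, 0.97])   # concentrated on p's least likely outcome

for q in (near_dirac_mode, near_dirac_tail):
    # The two columns agree; the first row is negative, the second positive.
    print(proclivity(q, p), kl(q, p) + entropy(q) - entropy(p))
```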
Proclivity also arises when differentiating cross-entropy with respect to temperature. Explicitly, given a family of distributions $p_\beta \propto \exp(-\beta E)$ depending on an inverse temperature $\beta$, with the energy $E$ chosen so that $p_1 = p$, it turns out that
$$\left.\frac{d}{d\beta}\right|_{\beta = 1} H(q, p_\beta) = \operatorname{Pc}(q, p).$$
This is the justification for my name: the proclivity $\operatorname{Pc}(q, p)$ is negative exactly when $\ln p$ is "predisposed" to being a good predictive distribution for $q$, in the sense that decreasing temperature causes a first-order decrease in cross-entropy loss. If we arrange the four quantities of the form $-\E_{p_i} \ln p_j$ in a little square, proclivities go at right angles to KL divergences:
$$\begin{CD}
H(q) @>\operatorname{Pc}(p, q)>> H(p, q) \\
@V\KL(q \Vert p)VV @AA\KL(p \Vert q)A \\
H(q, p) @<<\operatorname{Pc}(q, p)< H(p)
\end{CD}$$
(In this diagram, an arrow $A \xrightarrow{C} B$ stands for an equation $A + C = B$.)
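Here's a finite-difference check of the temperature derivative under the assumptions above: the family is $p_\beta \propto p^\beta$ (i.e. $E = -\ln p$), and `p`, `q` are made up.

```python
import numpy as np

p = np.array([0.40, 0.30, 0.20, 0.10])
q = np.array([0.10, 0.20, 0.30, 0.40])

def entropy(a):
    return -np.sum(a * np.log(a))

def cross_entropy(a, b):
    return -np.sum(a * np.log(b))

def p_beta(beta):
    """Temperature family exp(-beta * E) with E = -ln p, i.e. p ** beta renormalized."""
    w = p ** beta
    return w / w.sum()

eps = 1e-5
slope_numeric = (cross_entropy(q, p_beta(1 + eps)) - cross_entropy(q, p_beta(1 - eps))) / (2 * eps)
proclivity = cross_entropy(q, p) - entropy(p)
print(slope_numeric, proclivity)   # agree
```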
The derivative of mutual information is a more familiar quantity. In general you can check that
$$\frac{d}{d\lambda} I(X;A) = \KL(p_1 \Vert p_\lambda) - \KL(p_0 \Vert p_\lambda),$$
and at $\lambda = 0$ we simply have $I(X;A) = \lambda \KL(p_1 \Vert p_0) + O(\lambda^2)$. Moreover, since $I(X;A)$ is a concave function of $\lambda$, the first-order approximation is an upper bound: $I(X;A) \le \lambda \KL(p_1 \Vert p_0)$. To wrap up this post, let's think about what this approximation means and take a look at the higher-order terms of the Taylor series for $I(X;A)$ near $\lambda = 0$.
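Both facts are easy to see numerically; a sketch with made-up `p0`, `p1`:

```python
import numpy as np

p0 = np.array([0.40, 0.30, 0.20, 0.10])
p1 = np.array([0.10, 0.20, 0.30, 0.40])

def entropy(p):
    return -np.sum(p * np.log(p))

def kl(p, q):
    return np.sum(p * np.log(p / q))

def mutual_info(lam):
    p_lam = (1 - lam) * p0 + lam * p1
    return entropy(p_lam) - (1 - lam) * entropy(p0) - lam * entropy(p1)

# Check the derivative formula at an interior value of lambda.
lam, eps = 0.3, 1e-6
p_lam = (1 - lam) * p0 + lam * p1
slope_numeric = (mutual_info(lam + eps) - mutual_info(lam - eps)) / (2 * eps)
slope_formula = kl(p1, p_lam) - kl(p0, p_lam)
print(slope_numeric, slope_formula)                    # agree

# Check the upper bound I(X;A) <= lambda * KL(p1 || p0).
for lam in (0.05, 0.25, 0.5, 0.75, 1.0):
    print(lam, mutual_info(lam), lam * kl(p1, p0))     # middle column <= right column
```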
There's a remarkable interpretation of our upper bound in terms of predictive surprisal, which goes as follows. Suppose we want to predict the variable $X$. Given access to $A$, we can predict $X$ with only $(1 - \lambda) H(p_0) + \lambda H(p_1)$ nats of surprisal per sample by using $p_0$ when $A = 0$ and $p_1$ otherwise. On the other hand, always using $p_0$ as our predictive distribution costs $(1 - \lambda) H(p_0) + \lambda (H(p_1) + \KL(p_1 \Vert p_0))$ nats of surprisal per sample, which is only $\lambda \KL(p_1 \Vert p_0)$ more than what we paid when we knew $A$. Since mutual information bounds the amount by which knowledge of $A$ can decrease our average surprisal compared to the best predictor that is ignorant of $A$, this proves that $I(X;A) \le \lambda \KL(p_1 \Vert p_0)$!
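Here's that comparison as a sketch (made-up `p0`, `p1`): the $A$-aware predictor versus the predictor that always uses $p_0$.

```python
import numpy as np

p0 = np.array([0.40, 0.30, 0.20, 0.10])
p1 = np.array([0.10, 0.20, 0.30, 0.40])
lam = 0.1
p_lam = (1 - lam) * p0 + lam * p1

def entropy(p):
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

def kl(p, q):
    return np.sum(p * np.log(p / q))

cost_knowing_A = (1 - lam) * entropy(p0) + lam * entropy(p1)   # use p0 or p1 as appropriate
cost_always_p0 = cross_entropy(p_lam, p0)                      # ignore A and always predict with p0
mutual_info = entropy(p_lam) - cost_knowing_A

print(cost_always_p0 - cost_knowing_A, lam * kl(p1, p0))       # these two agree exactly
print(mutual_info, lam * kl(p1, p0))                           # and mutual information is smaller
```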
Another point of view on our approximation comes from that classic Bayesian example of testing for a rare condition. If there is a small chance that $A = 1$ and we observe the result of a diagnostic test $X$, how much do we expect to learn?
If we're already pretty sure that $A = 0$ and this turns out to be the case, we don't expect to get much information from the test. This is true in terms of information gain; in response to a value $X$, Bayes' rule instructs us to increase the log likelihood of our belief that $A = 0$ by $\ln (p_0(X)/p_\lambda(X))$, but in expectation over $X \sim p_0$ this causes only an update of
$$\E_{p_0} \left[\ln \frac{p_0}{p_\lambda}\right] = \KL(p_0 \Vert p_\lambda) = O(\lambda^2).$$
The main value of such a test is in detecting the unexpected event that $A = 1$. Again thinking in terms of information gain, we compute that the expected update to the log likelihood of $A = 1$ when $X$ is drawn from $p_1$ is
$$\E_{p_1} \left[ \ln \frac{p_1}{p_\lambda} \right] = \KL(p_1 \Vert p_\lambda) = \KL(p_1 \Vert p_0) + O(\lambda).$$
Since $A = 1$ with probability $\lambda$, we conclude overall that the test result $X$ gives us $\lambda \KL(p_1 \Vert p_0) + O(\lambda^2)$ nats of information about $A$ on average, in agreement with our calculation above.
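A sketch of this accounting, with made-up `p0`, `p1` and a small prior probability $\lambda$ of the condition:

```python
import numpy as np

p0 = np.array([0.40, 0.30, 0.20, 0.10])   # test-result distribution when A = 0
p1 = np.array([0.10, 0.20, 0.30, 0.40])   # test-result distribution when A = 1
lam = 0.02                                # small prior probability that A = 1
p_lam = (1 - lam) * p0 + lam * p1

def kl(p, q):
    return np.sum(p * np.log(p / q))

gain_when_A0 = kl(p0, p_lam)   # expected update toward A = 0: tiny, O(lambda^2)
gain_when_A1 = kl(p1, p_lam)   # expected update toward A = 1: close to KL(p1 || p0)
total = (1 - lam) * gain_when_A0 + lam * gain_when_A1

print(gain_when_A0, gain_when_A1, kl(p1, p0))
print(total, lam * kl(p1, p0))  # overall information about A is about lambda * KL(p1 || p0)
```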
You can see the behavior of $I(X;A)$ for small values of $\lambda$ in the widget below. I've graphed $\lambda$ on a logarithmic scale. The shaded region shows the range $0 \le I(X;A) \le H(A)$ of possible values mutual information could take, the bold line shows the true value of $I(X;A)$, and the faint line shows the upper bound of $\lambda \KL(p_1 \Vert p_0)$.
As you move the center of the red distribution to the right, our upper bound becomes worse because the approximation $\KL(p_1 \Vert p_\lambda) = \KL(p_1 \Vert p_0) + O(\lambda)$ deteriorates—the function $\KL(p_1 \Vert q)$ is coming closer to having a singularity at $q = p_0$.
Finally, let's look at higher-order terms of the series for $I(X;A)$. A simple calculation shows that
$$I(X;A) = \lambda \KL(p_1 \Vert p_0) - \sum_{k = 2}^\infty \frac{(-1)^k}{k(k - 1)} C_k \lambda^k$$
where the $C_k$ are the moments
$$C_k = \E_{p_0} \left(\frac{p_1 - p_0}{p_0}\right)^k.$$
One way to derive this is by thinking about the power series for entropy $H(p)$ as a function of the density function $p$. It suffices to know the power series for $-(q + p) \ln (q + p)$ in $p$, which is
$$-q \ln q - p \ln q - p - \sum_{k = 2}^\infty \frac{(-1)^k}{k(k - 1)} \frac{p^k}{q^{k - 1}}.$$
(In our application the perturbation is $p = \lambda(p_1 - p_0)$, which integrates to zero, so the lone $-p$ term drops out after integrating over $x$.) One interesting feature of this series is its second-order term, which involves the quantity
$$\E_{p_0} \left(\frac{p_1 - p_0}{p_0}\right)^2 = \int \frac{(p_1 - p_0)^2}{p_0} \, dx.$$
This is sometimes called the Neyman $\chi^2$ divergence, another well-known $f$-divergence used in hypothesis testing.
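A quick check of the series against a direct computation, truncating at $k = 20$ (with made-up `p0`, `p1`; note the series only converges when $\lambda$ is small enough that $\lambda |p_1 - p_0| < p_0$ everywhere):

```python
import numpy as np

p0 = np.array([0.40, 0.30, 0.20, 0.10])
p1 = np.array([0.10, 0.20, 0.30, 0.40])

def entropy(p):
    return -np.sum(p * np.log(p))

def kl(p, q):
    return np.sum(p * np.log(p / q))

def mutual_info(lam):
    p_lam = (1 - lam) * p0 + lam * p1
    return entropy(p_lam) - (1 - lam) * entropy(p0) - lam * entropy(p1)

def series(lam, K=20):
    total = lam * kl(p1, p0)
    ratio = (p1 - p0) / p0
    for k in range(2, K + 1):
        C_k = np.sum(p0 * ratio ** k)          # the k-th moment C_k
        total -= (-1) ** k / (k * (k - 1)) * C_k * lam ** k
    return total

for lam in (0.05, 0.1, 0.2):
    print(lam, mutual_info(lam), series(lam))  # close agreement for small lambda
```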
We've seen so far that, besides the entropies $H(p_0)$ and $H(p_1)$, the function $H(p_\lambda)$ tells us about $\KL(p_0 \Vert p_1)$, $\operatorname{JSD}(p_0, p_1)$, and the $\chi^2$ divergence $\chi^2(p_0, p_1)$. These quantities, along with $H(p_\lambda)$ itself, can all be written in the form $\E_{p_0} f\left( \frac{p_0}{p_1} \right)$ for some functions $f$. In fact, our series expansion says more: $H(p_\lambda)$ determines all the moments of $p_1(X)/p_0(X)$ under $X \sim p_0$! Under some regularity conditions—for example, if $p_0$ and $p_1$ have finite support—we conclude that $H(p_\lambda)$ fully describes the distribution of $p_0/p_1$. Note that the distribution of $p_0/p_1$ under $p_0$ determines its distribution under $p_1$, and indeed under $p_\lambda$ for any $\lambda$, so we can speak generally of the "distribution of likelihood ratios between two distributions."