
Six (and a half) intuitions for KL divergence

Why This Matters

Understanding KL divergence is crucial for the tech industry and consumers because it provides insights into how models and algorithms measure differences between data distributions, impacting areas like machine learning, data compression, and hypothesis testing. Grasping its properties helps improve model accuracy, optimize data encoding, and interpret probabilistic systems more effectively.

Key Takeaways

Link to LessWrong post here.

KL divergence is a topic which crops up in a ton of different places in information theory and machine learning, so it's important to understand well. Unfortunately, it has some properties which seem confusing at first pass (e.g. it isn't symmetric like we would expect from most distance measures, and it can be unbounded as probabilities go to zero). Over time I've come across lots of different ways of developing good intuitions for it. This post is my attempt to collate all these intuitions and to identify the underlying commonalities between them. I hope that for everyone reading this, there will be at least one that you haven't come across before and that improves your overall understanding!

One other note: there is some overlap between these intuitions (some of them are pretty much just rephrasings of others), so you might want to just browse the ones that look interesting to you. Also, I expect a large fraction of the value of this post (maybe >50%) comes from the summary, so you might just want to read that and skip the rest!

Summary

1. Expected surprise

D_KL(P || Q) = how much more surprised you expect to be when observing data drawn from P, if you falsely believe the distribution is Q, versus if you know the true distribution.
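This "extra expected surprise" reading can be checked numerically: the surprise of an outcome x under a belief Q is -log Q(x), so D_KL(P || Q) is the cross-entropy H(P, Q) minus the entropy H(P). A minimal sketch, using made-up distributions over three outcomes (the numbers are illustrative, not from the post):

```python
import numpy as np

# Hypothetical discrete distributions over 3 outcomes (illustrative numbers).
p = np.array([0.7, 0.2, 0.1])  # true distribution P
q = np.array([0.4, 0.4, 0.2])  # falsely believed distribution Q

# Expected surprise under P if you believe Q: cross-entropy H(P, Q).
cross_entropy = -(p * np.log(q)).sum()
# Expected surprise if you know the true distribution: entropy H(P).
entropy = -(p * np.log(p)).sum()

# KL divergence = the extra expected surprise from holding the wrong belief.
kl = cross_entropy - entropy

# The direct formula sum_x P(x) log(P(x)/Q(x)) agrees.
kl_direct = (p * np.log(p / q)).sum()
```

Note that the difference is always non-negative: believing the wrong distribution can only add expected surprise, never reduce it.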

2. Hypothesis Testing

D_KL(P || Q) = the amount of evidence we expect to get for P over Q in hypothesis testing, if P is true.
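Here "evidence" from one observation x is the log-likelihood ratio log(P(x)/Q(x)), and its expectation under P is exactly D_KL(P || Q). A small simulation sketch, assuming an illustrative pair of hypotheses about a biased coin (the numbers are my own, not from the post):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypotheses about a coin's outcome distribution (illustrative numbers).
p = np.array([0.8, 0.2])  # hypothesis P (the true one)
q = np.array([0.5, 0.5])  # hypothesis Q

# Evidence per observation x is the log-likelihood ratio log(P(x)/Q(x)).
# Sampling from P, the average evidence converges to D_KL(P || Q).
n = 200_000
samples = rng.choice(2, size=n, p=p)
avg_evidence = np.log(p[samples] / q[samples]).mean()

kl = (p * np.log(p / q)).sum()
```

By the law of large numbers, `avg_evidence` approaches `kl` as n grows, which is why KL divergence controls how fast a hypothesis test accumulates evidence for the true hypothesis.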

3. MLEs

If P is an empirical distribution of data, D_KL(P || Q) is minimised (over Q) when Q is the maximum likelihood estimator for P.
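This can be sketched with a Bernoulli family: the MLE for coin-flip data is the sample mean, and scanning D_KL(P || Q) over candidate parameters recovers exactly that value. The data below is made up for illustration:

```python
import numpy as np

# Made-up binary data; its empirical distribution plays the role of P.
data = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
p1 = data.mean()                    # empirical P(x=1)
p = np.array([1 - p1, p1])

def kl_to_bernoulli(q1):
    """D_KL(P || Bernoulli(q1)) for the empirical distribution P."""
    q = np.array([1 - q1, q1])
    return (p * np.log(p / q)).sum()

# Scan candidate parameters; the minimiser should be the MLE (sample mean).
grid = np.linspace(0.01, 0.99, 99)
best_q = grid[np.argmin([kl_to_bernoulli(q1) for q1 in grid])]
```

The reason: D_KL(P || Q) = H(P, Q) - H(P), and H(P) doesn't depend on Q, so minimising KL over Q is the same as minimising cross-entropy, which is the same as maximising the average log-likelihood of the data under Q.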

... continue reading