KL-divergence is a topic which crops up in a ton of different places in information theory and machine learning, so it's important to understand well. Unfortunately, it has some properties which seem confusing at first pass (e.g. it isn't symmetric like we would expect from most distance measures, and it can be unbounded as probabilities go to zero). There are lots of different ways to develop good intuitions for it that I've come across in the past. This post is my attempt to collate all these intuitions, and to identify the underlying commonalities between them. I hope that for everyone reading this, there will be at least one that you haven't come across before and that improves your overall understanding!
One other note: there is some overlap between these intuitions (some of them are pretty much rephrasings of others), so you might want to just browse the ones that look interesting to you. Also, I expect a large fraction of the value of this post (maybe >50%) comes from the summary, so you might just want to read that and skip the rest!
Summary
1. Expected surprise
$D_{KL}(P||Q)$ = how much more surprised you expect to be when observing data drawn from $P$, if you falsely believe the distribution is $Q$, compared to if you know the true distribution.
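The "expected surprise" framing can be checked numerically: the expected surprisal under the false belief (the cross-entropy) minus the expected surprisal under the truth (the entropy) is exactly the KL-divergence. A minimal sketch, with two hypothetical discrete distributions:

```python
import math

# Hypothetical distributions over three outcomes.
P = [0.7, 0.2, 0.1]  # the true distribution
Q = [0.5, 0.3, 0.2]  # the distribution we falsely believe

# Expected surprise if we believe Q (cross-entropy of P with Q)...
expected_surprise_wrong = sum(p * -math.log(q) for p, q in zip(P, Q))
# ...minus expected surprise if we know the truth (entropy of P)...
expected_surprise_right = sum(p * -math.log(p) for p in P)

# ...equals D_KL(P || Q) computed directly from its definition.
kl = sum(p * math.log(p / q) for p, q in zip(P, Q))
assert math.isclose(expected_surprise_wrong - expected_surprise_right, kl)
```

Note that the gap is always non-negative (Gibbs' inequality), which is one way to see why $D_{KL}$ is always $\geq 0$.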
2. Hypothesis Testing
$D_{KL}(P||Q)$ = the amount of evidence we expect to get for $P$ over $Q$ in hypothesis testing, if $P$ is true.
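Concretely, the evidence a single observation $x$ gives for $P$ over $Q$ is its log-likelihood ratio $\log_2 \frac{P(x)}{Q(x)}$ (in bits), and taking the expectation of that under $P$ gives the KL-divergence. A quick sketch, reusing the same two hypothetical distributions:

```python
import math

P = [0.7, 0.2, 0.1]  # the true hypothesis
Q = [0.5, 0.3, 0.2]  # the alternative hypothesis

# Evidence from one observation of outcome i is log2(P[i]/Q[i]) bits;
# its expectation under the true distribution P is D_KL(P || Q) in bits.
expected_evidence_bits = sum(p * math.log2(p / q) for p, q in zip(P, Q))

# On average we gain evidence for the true hypothesis, never lose it.
assert expected_evidence_bits > 0
```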
3. MLEs
If $P$ is an empirical distribution of data, $D_{KL}(P||Q)$ is minimised (over $Q$) when $Q$ is the maximum likelihood estimator for $P$.
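A toy illustration of the MLE connection, using a hypothetical coin-flip dataset: the candidate Bernoulli parameter that minimises the KL-divergence from the empirical distribution is the sample frequency, i.e. the maximum likelihood estimate.

```python
import math

# Hypothetical data: 7 heads and 3 tails → empirical distribution P.
heads, tails = 7, 3
n = heads + tails
P = [heads / n, tails / n]

def kl_to_bernoulli(q):
    """D_KL(P || Bernoulli(q)) for the empirical coin-flip distribution."""
    return P[0] * math.log(P[0] / q) + P[1] * math.log(P[1] / (1 - q))

# Scan a grid of candidate parameters: the minimiser is q = heads/n,
# exactly the maximum likelihood estimate.
candidates = [i / 100 for i in range(1, 100)]
best_q = min(candidates, key=kl_to_bernoulli)
assert math.isclose(best_q, heads / n)
```

This is the same fact in two guises: maximising the log-likelihood of the data under $Q$ is equivalent to minimising the cross-entropy of $P$ with $Q$, which differs from $D_{KL}(P||Q)$ only by the ($Q$-independent) entropy of $P$.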