
A Short Introduction to Optimal Transport and Wasserstein Distance

These notes provide a brief introduction to optimal transport theory, prioritizing intuition over mathematical rigor. A more rigorous presentation would require some additional background in measure theory. Other good introductory resources for optimal transport theory include:

Why Optimal Transport Theory?

A fundamental problem in statistics and machine learning is to come up with useful measures of “distance” between pairs of probability distributions. Two desirable properties of a distance function are symmetry and the triangle inequality. Unfortunately, many notions of “distance” between probability distributions do not satisfy these properties. These weaker notions of distance are often called divergences. Perhaps the most well-known divergence is the Kullback-Leibler (KL) divergence:

\[\begin{equation} D_{KL}(P \| Q) = \int p(x) \log \left ( \frac{p(x)}{q(x)} \right ) \mathrm{d}x \end{equation}\]

where $P$ and $Q$ denote probability distributions with density functions $p$ and $q$. While the KL divergence is incredibly useful and fundamental in information theory, it also has its shortcomings.
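As a quick numerical illustration (a sketch that is not part of the original notes), the code below evaluates two Gaussian densities on a grid and approximates the KL divergence with `scipy.stats.entropy`, which normalizes its inputs and computes $\sum_i p_i \log(p_i / q_i)$.

```python
# A minimal sketch (assumed setup, not from the original notes): approximate
# D_KL(P || Q) for two Gaussians by evaluating their densities on a grid.
import numpy as np
from scipy.stats import norm, entropy

x = np.linspace(-10, 10, 2001)           # evaluation grid
p = norm.pdf(x, loc=0.0, scale=1.0)      # density of P
q = norm.pdf(x, loc=1.0, scale=1.5)      # density of Q

# entropy(p, q) normalizes its inputs and returns sum(p_i * log(p_i / q_i)),
# which approximates the integral form of D_KL(P || Q) above.
print("D_KL(P || Q) ~", entropy(p, q))
print("D_KL(Q || P) ~", entropy(q, p))   # not equal: KL is asymmetric
```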

For instance, one of the first things we learn about the KL divergence is that it is not symmetric: \(D_{KL}(P \| Q) \neq D_{KL}(Q \| P)\). This is arguably not a huge problem, since various symmetrized analogues of the KL divergence exist. A bigger problem is that the divergence may be infinite if the supports of $P$ and $Q$ are not equal. Below we sketch three examples of 1D distributions for which \(D_{KL}(P \| Q) = D_{KL}(Q \| P) = +\infty\).

Three examples with infinite KL divergence. These density functions are infinitely far apart according to the KL divergence, since in each case there are intervals of $x$ where $q(x) = 0$ but $p(x) > 0$, so the ratio $p(x)/q(x)$, and hence the integrand, blows up.
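To make the support-mismatch failure concrete, here is a small numerical sketch (a hypothetical example, not from the original notes) with two uniform densities on disjoint intervals: as soon as $P$ puts mass where $Q$ has none, the divergence is infinite in both directions.

```python
# Sketch: the KL divergence is infinite as soon as P has mass where Q has none.
import numpy as np

def kl_divergence(p, q, dx):
    """Discretized D_KL(P || Q); returns +inf if p > 0 anywhere q == 0."""
    mask = p > 0
    if np.any(q[mask] == 0):
        return np.inf
    return np.sum(p[mask] * np.log(p[mask] / q[mask])) * dx

x = np.linspace(0.0, 10.0, 1001)
dx = x[1] - x[0]

# P uniform on [0, 1], Q uniform on [2, 3]: disjoint supports.
p = np.where((x >= 0) & (x <= 1), 1.0, 0.0)
q = np.where((x >= 2) & (x <= 3), 1.0, 0.0)

print(kl_divergence(p, q, dx))  # inf
print(kl_divergence(q, p, dx))  # inf -- infinite in both directions
```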

Intuitively, some of these distribution pairs seem “closer” to each other than others. But the KL divergence says that they are all infinitely far apart. One way of circumventing this is to smooth (i.e. add blur to) the distributions before computing the KL divergence, so that the supports of $P$ and $Q$ match. However, choosing the bandwidth parameter of the smoothing kernel is not always straightforward.
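A rough sketch of this smoothing workaround (again hypothetical, with the bandwidths chosen arbitrarily): blur both discretized densities with a Gaussian kernel before computing the divergence, and note how strongly the answer depends on the bandwidth.

```python
# Sketch: blur both densities so their supports overlap, then compute KL.
# The result depends heavily on the (arbitrary) choice of bandwidth.
import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy.stats import entropy

x = np.linspace(0.0, 10.0, 1001)
p = np.where((x >= 0) & (x <= 1), 1.0, 0.0)    # uniform on [0, 1]
q = np.where((x >= 2) & (x <= 3), 1.0, 0.0)    # uniform on [2, 3]

for bandwidth in [0.1, 0.5, 1.0]:               # kernel std. dev., in x-units
    sigma = bandwidth / (x[1] - x[0])           # convert to grid units
    # A tiny floor keeps the log finite where the truncated kernel still
    # leaves exact zeros; the smaller the bandwidth, the more it dominates.
    p_s = gaussian_filter1d(p, sigma) + 1e-12
    q_s = gaussian_filter1d(q, sigma) + 1e-12
    print(f"bandwidth={bandwidth}: D_KL ~ {entropy(p_s, q_s):.2f}")
```

Running the loop makes the drawback concrete: especially at small bandwidths, the reported divergence is driven largely by the smoothing choices rather than by $P$ and $Q$ themselves.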
