Speculative Sampling
The idea of speculative sampling is to sample from a draft distribution and still obtain samples that are distributed exactly as if they had been drawn from the target distribution.
We have a target distribution $p(x)$ and a draft distribution $q(x)$. Each token $x_i$ therefore has two probabilities, $p(x_i)$ and $q(x_i)$, and exactly one of the following holds:
either $p(x_i) > q(x_i)$,
or $p(x_i) \leq q(x_i)$.
If we sample directly from $q(x)$, the samples follow $q(x)$ rather than the target distribution $p(x)$. Relative to the target, a token $x_i$ is
over-sampled if $q(x_i) > p(x_i)$,
under-sampled if $q(x_i) < p(x_i)$.
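To make the two cases concrete, here is a small Python sketch that labels each token of a toy vocabulary. The three-token distributions `p` and `q` and the helper name `classify` are made up for illustration and do not come from any real model.

```python
# Hypothetical 3-token vocabulary; the numbers are illustrative only.
p = [0.5, 0.3, 0.2]  # target distribution p(x)
q = [0.2, 0.3, 0.5]  # draft distribution q(x)

def classify(p, q):
    """Label each token by how sampling from q misrepresents it relative to p."""
    labels = []
    for p_i, q_i in zip(p, q):
        if q_i > p_i:
            labels.append("over-sampled")
        elif q_i < p_i:
            labels.append("under-sampled")
        else:
            labels.append("matched")
    return labels

labels = classify(p, q)

# Both distributions sum to 1, so the excess mass on over-sampled tokens
# equals the missing mass on under-sampled tokens.
excess = sum(max(q_i - p_i, 0.0) for p_i, q_i in zip(p, q))
deficit = sum(max(p_i - q_i, 0.0) for p_i, q_i in zip(p, q))
```

The equality of `excess` and `deficit` is what makes a correction possible at all: whatever probability mass is taken away from over-sampled tokens is exactly the amount the under-sampled tokens are missing.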
The core trick of speculative sampling is a rejection step that down-samples the over-sampled tokens and up-samples the under-sampled ones, so that the final samples follow the target distribution $p(x)$ exactly.
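One common way to realize such a rejection step, as described in the speculative decoding literature, is to accept a draft token $x$ with probability $\min(1, p(x)/q(x))$ and, on rejection, to resample from the renormalized residual $\max(p(x) - q(x), 0)$. The sketch below assumes this is the scheme meant here. The function name `speculative_sample`, the toy distributions, and the sample count are all illustrative choices.

```python
import random

def speculative_sample(p, q, rng):
    """Return one token index distributed as p, starting from a draft x ~ q."""
    vocab = range(len(p))
    x = rng.choices(vocab, weights=q)[0]        # draft token x ~ q
    # Accept with probability min(1, p(x) / q(x)). Over-sampled tokens
    # (q > p) are sometimes rejected, which down-samples them.
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    # On rejection, resample from the residual max(p - q, 0). It puts mass
    # only on under-sampled tokens, which up-samples them.
    residual = [max(p_i - q_i, 0.0) for p_i, q_i in zip(p, q)]
    return rng.choices(vocab, weights=residual)[0]

# Toy check with made-up distributions: empirical frequencies should be
# close to p, not q.
rng = random.Random(0)
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
n = 20000
counts = [0] * len(p)
for _ in range(n):
    counts[speculative_sample(p, q, rng)] += 1
freqs = [c / n for c in counts]
```

With enough draws, `freqs` should approach `p` even though every draft token was drawn from `q`. A rejection can only occur for a token with $q(x) > p(x)$, and in that case the residual always has positive mass, so the fallback draw is well defined.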
How to down-sample the over-sampled tokens?
... continue reading