Block Diffusion: Interpolating Autoregressive and Diffusion Language Models
Published on: 2025-07-18 19:18:20
Autoregressive vs. Diffusion Language Models
In the language modeling task, we have a sequence of \( L \) tokens \( \mathbf{x} = (\mathbf{x}^1, \dots, \mathbf{x}^L ) \) drawn from the data distribution \( q(\mathbf{x}) \). We aim to fit a model \( p_\theta(\mathbf{x}) \) to \( q \).
Autoregressive models define a factorized distribution of the form:
\[ \log p_\theta(\mathbf{x}) = \sum_{\ell=1}^L \log p_\theta(\mathbf{x}^\ell \mid \mathbf{x}^{\lt \ell}) \]
However, the sequential dependencies between tokens force AR sampling to proceed one token at a time over \( L \) sequential steps, which can be slow for long sequences.
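To make this sequential bottleneck concrete, here is a minimal sketch (not the paper's code) of autoregressive sampling; the `model` interface, returning per-position logits for a token prefix, is an assumption for illustration:

```python
import torch

@torch.no_grad()
def sample_ar(model, L, bos_id=0, device="cpu"):
    """Sample a sequence token-by-token: L sequential model calls.

    Assumes `model` maps a (1, t) token prefix to (1, t, vocab) logits;
    this interface is illustrative, not the paper's API.
    """
    x = torch.tensor([[bos_id]], device=device)
    for _ in range(L):
        logits = model(x)[:, -1, :]                # p_theta(x^l | x^{<l})
        probs = torch.softmax(logits, dim=-1)
        nxt = torch.multinomial(probs, num_samples=1)
        x = torch.cat([x, nxt], dim=1)             # append sampled token
    return x[:, 1:]                                # drop the BOS token
```

Each iteration depends on the tokens produced by all previous iterations, so the loop cannot be parallelized across positions.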
Diffusion models overcome this limitation by modeling tokens independently, which admits parallel generation. Instead of factorizing left-to-right, they fit a model to undo a forward corruption process \( q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \text{Cat}(\mathbf{x}_t ; Q_t \mathbf{x}_{t-1} ) \) defined by transition matrices \( Q_t \). D3PM (Austin et al.) parameterizes the reverse process as
\[ p_\theta(\mathbf{x}_s \mid \mathbf{x}_t) = \sum_{\mathbf{x}} q(\mathbf{x}_s \mid \mathbf{x}_t, \mathbf{x}) \, p_\theta(\mathbf{x} \mid \mathbf{x}_t), \]
where \( s < t \) and \( p_\theta(\mathbf{x} \mid \mathbf{x}_t) \) is a network trained to predict the clean tokens from the corrupted sequence.
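As a concrete illustration of the forward process, a common choice of \( Q_t \) is the absorbing-state ("masking") matrix, under which the per-step process \( q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) \) induces a simple marginal \( q(\mathbf{x}_t \mid \mathbf{x}_0) \): each token survives with probability \( \alpha_t \) and is otherwise replaced by a [MASK] token. The sketch below samples from that marginal; the function and parameter names are assumptions for illustration, not the paper's code:

```python
import torch

def forward_corrupt(x0, alpha_t, mask_id):
    """Absorbing-state corruption, one common D3PM choice of Q_t.

    Each token of x0 independently stays itself with prob alpha_t and
    jumps to the absorbing [MASK] state otherwise, i.e. the marginal
    q(x_t | x_0) = Cat(x_t; alpha_t * x_0 + (1 - alpha_t) * mask).
    Names and schedule are illustrative, not the paper's code.
    """
    keep = torch.rand_like(x0, dtype=torch.float) < alpha_t
    return torch.where(keep, x0, torch.full_like(x0, mask_id))

# Example: at a noise level with alpha_t = 0.5, roughly half the
# tokens are masked in expectation.
x0 = torch.tensor([5, 17, 42, 3])
xt = forward_corrupt(x0, alpha_t=0.5, mask_id=999)
```

Because every token is corrupted independently, the reverse model can denoise all positions in parallel at each step, which is what enables diffusion sampling in far fewer than \( L \) steps.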