Low-latency streaming generation
Some background on Codec Language Modeling. A codec language model (LM) operates on discrete sequences of tokens from a neural audio codec. Here a codec refers to a pair of functions, an encoder and decoder, that convert audio to and from a discrete, compressed representation while minimizing distortion.
More formally, the encoder is a function mapping raw stereo audio waveforms \(\mathbf{a} \in \mathbb{R}^{T f_s \times 2}\) into matrices of discrete tokens \(\mathbf{x} \in \mathbb{V}_c^{Tf_k \times d_c}\), where \(T\) is the duration in seconds, \(f_s\) the audio sampling rate, \(f_k\) the token frame rate, \(\mathbb{V}_c\) the codec vocabulary, and \(d_c\) the number of tokens per frame. Here \(d_c\) is the “depth” of the residual vector quantization algorithm, which iteratively quantizes the continuous embedding of each audio frame.
The goal of the codec LM is to model these token matrices. For efficiency, an increasingly common approach is to adopt a hierarchical autoregressive framework using a pair of Transformers: one that compresses temporal history into fixed-length embedding vectors (\(\texttt{Temporal}_\theta\)), and another that iteratively decodes tokens depth-wise given the current frame embedding (\(\texttt{Depth}_\phi\)). Letting \(\mathbf{x_i}\) denote the \(i\)-th frame of \(\mathbf{x}\) and \(x_i^j\) its \(j\)-th token, the joint distribution over \(\mathbf{x}\) is modeled autoregressively as: \[ P_{\theta,\phi}(\mathbf{x}) = \prod_{i=1}^{Tf_k} \prod_{j=1}^{d_c} P_\phi(x_i^j \mid \mathbf{x_i^{<j}}, \texttt{Temporal}_{\theta}(\mathbf{x_{<i}})), \] where \(P_\phi(x_i^j \mid \cdot) = \texttt{SoftMax}(\texttt{Depth}_\phi(\cdot))\).
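The factorization above maps onto a nested sampling loop: an outer loop over frames and an inner loop over depth. The following is a minimal sketch of that control flow only. The functions `temporal` and `depth_probs` are hypothetical toy stand-ins for \(\texttt{Temporal}_\theta\) and \(\texttt{SoftMax}(\texttt{Depth}_\phi(\cdot))\), not the actual models, and the single-integer "embedding" is an assumption made to keep the example self-contained.

```python
import random

D_C = 12          # tokens per frame (codec depth), value taken from the text
VOCAB = 2 ** 10   # codec vocabulary size, value taken from the text


def temporal(history):
    """Hypothetical stand-in for Temporal_theta: summarize all past frames.

    A real model would return a fixed-length embedding vector; this toy
    version returns a single integer so the example stays dependency-free.
    """
    return sum(sum(frame) for frame in history) % VOCAB


def depth_probs(frame_prefix, context):
    """Hypothetical stand-in for SoftMax(Depth_phi(.)).

    Returns a probability distribution over the vocabulary that depends on
    the tokens already sampled in this frame and on the temporal context.
    """
    favored = (context + 7 * len(frame_prefix) + sum(frame_prefix)) % VOCAB
    weights = [1.0] * VOCAB
    weights[favored] = 100.0
    total = sum(weights)
    return [w / total for w in weights]


def sample_frames(num_frames, seed=0):
    """Sample a token matrix frame by frame, token by token."""
    rng = random.Random(seed)
    frames = []
    for _ in range(num_frames):                  # product over frames i
        context = temporal(frames)               # Temporal(x_{<i})
        frame = []
        for _ in range(D_C):                     # product over depth j
            probs = depth_probs(frame, context)  # P(x_i^j | x_i^{<j}, context)
            token = rng.choices(range(VOCAB), weights=probs, k=1)[0]
            frame.append(token)
        frames.append(frame)
    return frames
```

Calling `sample_frames(3)` returns three frames of twelve tokens each. The temporal summary is computed once per frame, while the depth model runs once per token, which is the division of labor the hierarchical design relies on.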
At inference time, we generate audio by first sampling a token sequence \(\mathbf{x}' \sim P_{\theta,\phi}(\mathbf{x})\) and then outputting \(\mathbf{a}' = \texttt{Dec}(\mathbf{x}')\), where \(\texttt{Dec}\) is the codec decoder. This describes our base modeling approach, shared with Magenta RealTime. For our codec, we use SpectroStream to compress high-fidelity (\(f_s = 48\) kHz) stereo audio into tokens at \(3\) kbps (\(f_k = 25\) Hz, \(d_c = 12\), \(|\mathbb{V}_c| = 2^{10}\)).
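The quoted bitrate follows from the other three numbers: each token carries \(\log_2 |\mathbb{V}_c|\) bits, and there are \(f_k \cdot d_c\) tokens per second. A short sketch of that arithmetic, together with the waveform and token matrix shapes for a given duration (function names are illustrative only):

```python
import math

F_S = 48_000        # audio sampling rate in Hz
F_K = 25            # token frame rate in Hz
D_C = 12            # tokens per frame
VOCAB_SIZE = 2 ** 10


def token_bitrate_bps(frame_rate_hz, depth, vocab_size):
    """Bits per second carried by the token stream."""
    bits_per_token = math.log2(vocab_size)
    return frame_rate_hz * depth * bits_per_token


def shapes(duration_s):
    """Shapes of the stereo waveform and the token matrix for a clip."""
    audio_shape = (duration_s * F_S, 2)
    token_shape = (duration_s * F_K, D_C)
    return audio_shape, token_shape


print(token_bitrate_bps(F_K, D_C, VOCAB_SIZE))  # 3000.0, i.e. 3 kbps
print(shapes(10))                               # ((480000, 2), (250, 12))
```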
Lowering autoregression granularity: from chunk to frame. To achieve streaming audio generation, we need to enforce two constraints:
(1) The system must generate at least \(f_k \cdot d_c\) tokens per second.
(2) The decoder must be causal, meaning its output audio for frame \(i\) depends only on \(\mathbf{x_{\leq i}}\).
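With the codec settings given earlier (\(f_k = 25\) Hz, \(d_c = 12\)), the first constraint works out to 300 tokens per second. The sketch below turns that into an average per-token time budget. It assumes tokens are produced strictly one after another and ignores codec decoding and any other overhead, so it is an upper bound on the time available per token rather than a measured figure.

```python
def required_tokens_per_second(frame_rate_hz, depth):
    """Minimum sustained token throughput for real-time streaming."""
    return frame_rate_hz * depth


def per_token_budget_ms(frame_rate_hz, depth):
    """Average time available per token, in milliseconds."""
    return 1000.0 / required_tokens_per_second(frame_rate_hz, depth)


def is_real_time(measured_tokens_per_second, frame_rate_hz, depth):
    """True if a measured throughput keeps up with playback."""
    return measured_tokens_per_second >= required_tokens_per_second(
        frame_rate_hz, depth
    )


print(required_tokens_per_second(25, 12))  # 300
print(is_real_time(280, 25, 12))           # False
```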
In the original Magenta RealTime, we satisfied requirement (1) by performing autoregression on chunks of frames, where each chunk is 2 seconds in duration. This design was chosen to amortize model runtime over the chunk length and so achieve real-time streaming. However, because the system must wait until the next chunk to inject any new user control information, the chunk duration sets a lower bound on control delay: the response time is at least 2 seconds. Instead, Magenta RealTime 2 models individual frames, allowing us to reduce model response time significantly. To ensure continuous streaming generation while operating on single frames, we adopt a decoder-only architecture with local sliding window attention (SWA) in the temporal Transformer.
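The effect of autoregression granularity on control delay can be made concrete: new control information can only enter at a step boundary, so the step duration is a floor on the delay. The sketch below computes only that floor, using the 25 Hz frame rate stated earlier. It says nothing about the end-to-end latency of either system, which also depends on compute, buffering, and audio output.

```python
def control_delay_floor_s(frames_per_step, frame_rate_hz):
    """Duration of one autoregressive step, a lower bound on control delay."""
    return frames_per_step / frame_rate_hz


FRAME_RATE_HZ = 25
CHUNK_FRAMES = 2 * FRAME_RATE_HZ   # a 2-second chunk spans 50 frames

print(control_delay_floor_s(CHUNK_FRAMES, FRAME_RATE_HZ))  # 2.0
print(control_delay_floor_s(1, FRAME_RATE_HZ))             # 0.04
```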
This design has two key advantages: (1) the decoder-only architecture removes the sequential bottleneck introduced by the bidirectional encoder in Magenta RealTime, where the full encoder output has to be materialized before decoding can begin; (2) the rolling attention mechanism lets us extend the context length while keeping the KV cache size fixed. At each step of autoregressive generation, key-value entries for new tokens are written into the cache, and entries older than the window size \(w\) are evicted.
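The write-and-evict behavior described here is that of a fixed-capacity ring buffer. Below is a minimal sketch for a single attention layer, using strings as placeholder keys and values. The class name `RollingKVCache` and its methods are hypothetical and do not come from the Magenta RealTime code.

```python
from collections import deque


class RollingKVCache:
    """Fixed-size key-value cache that keeps only the last `window` entries."""

    def __init__(self, window):
        if window < 1:
            raise ValueError("window must be at least 1")
        self.window = window
        # A deque with maxlen drops the oldest entry when a new one is
        # appended to a full buffer, which is the eviction rule we want.
        self._entries = deque(maxlen=window)

    def append(self, position, key, value):
        """Write the entry for a new token, evicting the oldest if full."""
        self._entries.append((position, key, value))

    def positions(self):
        """Token positions currently visible to attention."""
        return [position for position, _, _ in self._entries]

    def __len__(self):
        return len(self._entries)


cache = RollingKVCache(window=4)
for step in range(10):
    cache.append(step, f"k{step}", f"v{step}")

print(cache.positions())  # [6, 7, 8, 9]
```

However many steps are generated, the cache never holds more than `window` entries, so memory and per-step attention cost stay constant.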
As in previous work, we find that sliding window attention causes model quality to deteriorate significantly once the initial tokens are evicted from the cache. To remedy this, we use a learnable attention sink embedding. To reconcile the finite training length with the receptive field induced by the SWA mechanism, we also set the attention window size so that this effective receptive field does not exceed the training crop length. Finally, we further reduce train/test mismatch and achieve better length generalization by dropping learnable positional embeddings (NoPE), after observing that RoPE hinders generalization beyond the training length. Instead, the model implicitly learns positional information from causal masking and SWA, which naturally extend to arbitrary-length sequences without extrapolation issues.
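Two of these ideas lend themselves to a small sketch. The first function builds a causal sliding-window mask with one extra, always-visible column standing in for the sink; representing the learnable sink as an extra attention slot is an assumption here, since the text does not spell out the mechanism. The other two functions use a common estimate of the receptive field of stacked windowed attention, \(L(w-1)+1\) tokens for \(L\) layers, to pick the largest window that fits within a training crop. The layer count and crop length in the usage lines are hypothetical, not the model's actual configuration.

```python
def swa_mask_with_sink(seq_len, window):
    """Boolean attention mask; rows are queries.

    Column 0 is the sink slot (assumed always visible). Column j + 1 is
    token j, visible to query i only if it is one of the last `window`
    tokens up to and including i.
    """
    mask = []
    for i in range(seq_len):
        row = [True]  # sink slot
        for j in range(seq_len):
            row.append(i - window < j <= i)
        mask.append(row)
    return mask


def receptive_field(num_layers, window):
    """Estimated number of past tokens that can influence one output.

    Each layer lets information travel back at most window - 1 positions.
    """
    return num_layers * (window - 1) + 1


def max_window_for_crop(num_layers, crop_len):
    """Largest window whose estimated receptive field fits in the crop."""
    return (crop_len - 1) // num_layers + 1


# Hypothetical configuration: 4 layers, training crops of 250 frames.
w = max_window_for_crop(num_layers=4, crop_len=250)
print(w, receptive_field(4, w))  # 63 249
```

The mask contains no position-dependent terms beyond which slots are visible, so it is the same for any absolute offset in the stream.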