Multimodal learning with next-token prediction for large multimodal models

Tokenizer design

A unified tokenizer discretizes texts, images and videos into compact token sequences using shared codebooks. This enables text and vision information to reside in a common discrete space, facilitating autoregressive modelling. For text tokens and control tokens, we leveraged a byte pair encoding (BPE)-based text tokenizer for tokenization, whereas a vector quantization (VQ)-based visual tokenizer was used to discretize images and videos into compact token sequences.

Text tokenizer

For text tokenization, we adopted Qwen’s tokenizer49, which uses byte-level byte-pair encoding with a vocabulary encompassing 151,643 regular text tokens. To reserve sufficient capacity for template control, we also incorporated 211 special tokens into the tokenizer’s vocabulary.

Vision tokenizer

We trained the vision tokenizer using SBER-MoVQGAN14, which can encode a 4 × 512 × 512 video clip or a 512 × 512 image into 4,096 discrete tokens from a codebook of size 32,768. Our tokenizer achieved 4× compression in the temporal dimension and 8 × 8 compression in the spatial dimension and is applicable to any temporal and spatial resolution. Building on the MoVQGAN architecture50, we incorporated two temporal residual layers with three-dimensional convolution kernels into both the encoder and decoder modules to perform temporal downsampling and enhance video tokenization capabilities. The tokenizer was trained end-to-end on the LAION high-resolution image dataset and the InternVid51 video dataset using combined objective functions of Euclidean norm (L2) loss, learned perceptual image patch similarity (LPIPS) perceptual loss52, generative adversarial network (GAN) loss and commitment loss. Further details on video compression metrics, the impact of codebook size, and comparisons between the unified and standalone image tokenizers are provided in section 1 of the Supplementary Information.

Architecture design

Emu3 uses a decoder-only Transformer with modality-shared embeddings. We used RMSNorm53 for normalization and GQA54 for attention mechanisms, as well as the SwiGLU55 activation function and rotary positional embeddings56. Biases in the qkv and linear projection layers were removed. In addition, a dropout rate of 0.1 was implemented to improve training stability. Overall, the model contains 8.49 billion parameters, including 32 layers with a hidden size of 4,096, intermediate size of 14,336 and 32 attention heads (8 key-value heads). The shared multimodal vocabulary comprises 184,622 tokens, enabling consistent representation across language and vision domains.

Architectural comparisons with diffusion models

To fairly compare the next-token prediction paradigm with diffusion models for visual generation tasks, we used Flan-T5-XL57 as the text encoder and trained both a 1.5B diffusion transformer58,59 and a 1.5B decoder-only transformer60 on the OpenImages61 dataset. The diffusion model leverages the variational autoencoder from SDXL20, whereas the decoder-only transformer uses the video tokenizer in Emu3 to encode images into latent tokens. Both models were trained with identical configurations, including a linear warm-up of 2,235 steps, a constant learning rate of 1 × 10−4 and a global batch size of 1,024. As shown in Fig. 3c, the next-token prediction model consistently converged faster than its diffusion counterpart for equal training samples, challenging the prevailing belief that diffusion architectures are inherently superior for visual generation.

... continue reading