Packing Input Frame Context in Next-Frame Prediction Models for Video Generation
Published on: 2025-08-18 14:17:48
A next-frame (or next-frame-section) prediction model looks like this:
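Roughly, in code (a minimal sketch with hypothetical names, not any library's actual API): the model conditions on all input frames seen so far and diffuses the next frame or section, and streaming simply appends the result and repeats.

```python
import torch

def stream_next_frames(model, history: torch.Tensor, steps: int) -> torch.Tensor:
    """Autoregressively extend a video.
    history: (T, C, H, W) latent frames already available as context.
    model:   any callable mapping the history to the next (S, C, H, W) frames.
    """
    for _ in range(steps):
        new_frames = model(history)                      # diffuse the next frame(s) from the context
        history = torch.cat([history, new_frames], 0)    # append and keep streaming
    return history
```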
The idea is that we can encode the input frames into some GPU memory layout like this:
So we have many input frames and want to diffuse some new frames.
This chart shows the logical GPU memory layout - the frame images themselves are not stitched together.
In other words, it shows the context length of each input frame.
Each frame is encoded with a different patchifying kernel to achieve this.
For example, in HunyuanVideo, a 480p frame is typically 1536 tokens when using a (1, 2, 2) patchifying kernel.
If the kernel is changed to (2, 4, 4), the same frame becomes 192 tokens.
In this way, we can change the context length of each frame.
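To make the arithmetic concrete: the token count is just the latent shape divided by the patchifying kernel along each axis. The latent shape below is an assumption picked so the numbers line up with the example above, not the exact HunyuanVideo latent resolution.

```python
def tokens_per_frame(latent_shape, kernel):
    """Context tokens a latent frame occupies after patchifying.
    latent_shape, kernel: (t, h, w). Assumes the shape divides evenly (a sketch)."""
    t, h, w = latent_shape
    kt, kh, kw = kernel
    return (t // kt) * (h // kh) * (w // kw)

latent = (2, 48, 64)                            # hypothetical latent shape for illustration
print(tokens_per_frame(latent, (1, 2, 2)))      # 1536 tokens -> fine detail, long context
print(tokens_per_frame(latent, (2, 4, 4)))      # 192 tokens  -> 8x coarser, short context
```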
The "more important" frames are given more GPU resources (context length) - in this example, F0 is the most important as it is the nearest frame to the "next-frame prediction" target.
Because the per-frame context lengths shrink geometrically the further a frame is from the prediction target, the total context length converges to a constant. This gives O(1) computation complexity for streaming - yes, a constant, not even O(n log n) or O(n).
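A minimal numeric sketch of why: assume (purely for illustration - the real kernel schedule is not spelled out here) that each step back in time shrinks a frame's token count by the same 8x factor as the (1, 2, 2) to (2, 4, 4) change above. The per-frame context lengths then form a geometric series, so the total context is bounded by a constant no matter how many input frames are streamed.

```python
def total_context(num_input_frames: int, base_tokens: int = 1536, shrink: int = 8) -> int:
    """Total context length when each older frame gets `shrink`x fewer tokens.
    age 0 (the frame nearest the prediction target) gets the finest kernel, i.e. the
    most tokens; very old frames round down to 0 tokens and are effectively dropped."""
    return sum(base_tokens // (shrink ** age) for age in range(num_input_frames))

for n in (1, 4, 16, 256, 4096):
    print(n, total_context(n))   # 1536, 1755, 1755, 1755, 1755 - bounded, independent of n
```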