4Real-Video-V2: Feedforward Reconstruction for 4D Scene Generation

¹Snap Inc. ²KAUST

4Real-Video-V2 uses a feedforward architecture to compute a 4D spatio-temporal grid of video frames and a set of 3D Gaussian particles for each time step. It has two main components: a 4D video diffusion model and a feedforward reconstruction model.


4Real-Video-V2 is a major upgrade over 4Real-Video: it introduces a new 4D video diffusion model architecture that adds no parameters to the base video model. The key to the new design is a sparse attention pattern in which each token attends to other tokens in the same frame, at the same timestamp, or from the same viewpoint. This design scales easily to large pre-trained video models, is efficient to train, and generalizes well.
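The sparse attention pattern described above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names, the view-major token ordering, and the dense boolean mask are assumptions made for illustration, and a real model would more likely use a sparse or fused attention kernel. The sketch only shows which token pairs are allowed to interact.

```python
import numpy as np


def view_time_attention_mask(num_views, num_times, tokens_per_frame):
    """Return a boolean mask where entry [i, j] is True if token i may attend to token j.

    Tokens are assumed (hypothetically) to be flattened in (view, time, token) order.
    """
    # View index and time index of every flattened token.
    view_idx = np.repeat(np.arange(num_views), num_times * tokens_per_frame)
    time_idx = np.tile(np.repeat(np.arange(num_times), tokens_per_frame), num_views)

    same_view = view_idx[:, None] == view_idx[None, :]  # same viewpoint, any time
    same_time = time_idx[:, None] == time_idx[None, :]  # same timestamp, any view

    # Tokens in the same frame share both view and time, so they are
    # already covered by either term.
    return same_view | same_time


def masked_attention(q, k, v, mask):
    """Single-head scaled dot-product attention restricted to the allowed pairs."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)           # block disallowed pairs
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Under these assumptions, each token attends to (num_times + num_views - 1) * tokens_per_frame tokens rather than all num_views * num_times * tokens_per_frame. The mask only restricts which pairs interact and introduces no learned weights, which is consistent with the claim that the design adds no parameters to the base video model.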