Over the last few months, many AI boosters have become increasingly interested in generative video models and their apparent ability to show at least limited emergent knowledge of the physical properties of the real world. That kind of learning could underpin a robust version of a so-called "world model," which would represent a major breakthrough in generative AI's real-world capabilities.

Recently, Google DeepMind tried to add some scientific rigor to the question of how well video models can actually learn about the real world from their training data. In the bluntly titled paper "Video Models are Zero-shot Learners and Reasoners," the researchers used Google's Veo 3 model to generate thousands of videos designed to test its abilities across dozens of tasks related to perceiving, modeling, manipulating, and reasoning about the real world.

In the paper, the researchers boldly claim that Veo 3 "can solve a broad variety of tasks it wasn’t explicitly trained for" (that's the "zero-shot" part of the title) and that video models "are on a path to becoming unified, generalist vision foundation models." But digging into the actual results of those experiments, the researchers seem to be grading today's video models on a bit of a curve, assuming that future progress will smooth out many of today's highly inconsistent results.

Passing with an 8 percent grade

To be sure, Veo 3 does achieve impressive and consistent results on some of the dozens of tasks the researchers tested. The model was able to generate plausible video of actions like robotic hands opening a jar or throwing and catching a ball reliably across 12 trial runs, for instance. Veo 3 showed similarly perfect or near-perfect results on tasks like deblurring and denoising images, filling in empty spaces in complex images, and detecting the edges of objects in an image.