While large language models (LLMs) have mastered text (and other modalities to some extent), they lack the physical “common sense” to operate in dynamic, real-world environments. This has limited the deployment of AI in areas like manufacturing and logistics, where understanding cause and effect is critical.
Meta’s latest model, V-JEPA 2, takes a step toward bridging this gap by learning a world model from video and physical interactions.
V-JEPA 2 is aimed at AI applications that must predict outcomes and plan actions in unpredictable environments with many edge cases, offering a path toward more capable robots and advanced automation in physical settings.
How a ‘world model’ learns to plan
Humans develop physical intuition early in life by observing their surroundings. If you see a ball thrown, you instinctively know its trajectory and can predict where it will land. V-JEPA 2 learns a similar “world model,” which is an AI system’s internal simulation of how the physical world operates.
The model is built on three core capabilities that are essential for enterprise applications: understanding what is happening in a scene, predicting how the scene will change based on an action, and planning a sequence of actions to achieve a specific goal. As Meta states in its blog, its “long-term vision is that world models will enable AI agents to plan and reason in the physical world.”
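To make those three capabilities concrete, here is a minimal sketch of one simple way an agent could plan with a learned predictor: sample candidate action sequences, imagine their outcomes, and keep the sequence whose imagined outcome lands closest to the goal. The function names (`understand`, `predict`, `goal_distance`) and the random-shooting strategy are illustrative assumptions, not Meta's actual planning procedure.

```python
import random

# Hypothetical interface illustrating "understand, predict, plan".
# `understand`, `predict`, and `goal_distance` are placeholder callables,
# not real V-JEPA 2 APIs.
def plan(observation, goal, candidate_actions,
         understand, predict, goal_distance,
         horizon=5, samples=64):
    """Pick the action sequence whose imagined outcome is closest to the goal."""
    state = understand(observation)                # 1. understand the current scene
    best_score, best_plan = float("inf"), None
    for _ in range(samples):
        actions = [random.choice(candidate_actions) for _ in range(horizon)]
        imagined = state
        for action in actions:                     # 2. predict the effect of each action
            imagined = predict(imagined, action)
        score = goal_distance(imagined, goal)      # 3. plan: keep the best sequence found
        if score < best_score:
            best_score, best_plan = score, actions
    return best_plan
```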
The model’s architecture, called the Video Joint Embedding Predictive Architecture (V-JEPA), consists of two key parts. An “encoder” watches a video clip and condenses it into a compact numerical summary, known as an embedding. This embedding captures the essential information about the objects and their relationships in the scene. A second component, the “predictor,” then takes this summary and imagines how the scene will evolve, generating a prediction of what the next summary will look like.
V-JEPA is composed of an encoder and a predictor (source: Meta blog)
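A short PyTorch-style sketch of that encoder/predictor split is below. The module names, layer sizes, and small MLPs are stand-ins (the real model uses large vision transformers); the point it illustrates is that the predictor operates on compact embeddings rather than raw pixels.

```python
import torch
import torch.nn as nn

EMBED_DIM = 256  # illustrative embedding size, not Meta's actual dimension

class VideoEncoder(nn.Module):
    """Condenses a video clip into a compact numerical summary (embedding)."""
    def __init__(self, clip_features: int = 1024, embed_dim: int = EMBED_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(clip_features, 512), nn.GELU(), nn.Linear(512, embed_dim)
        )

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        return self.net(clip)  # (batch, embed_dim) summary of the scene

class Predictor(nn.Module):
    """Takes the current embedding (plus an action) and predicts the next embedding."""
    def __init__(self, embed_dim: int = EMBED_DIM, action_dim: int = 7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim + action_dim, 512), nn.GELU(), nn.Linear(512, embed_dim)
        )

    def forward(self, embedding: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([embedding, action], dim=-1))

# Usage: embed the current clip, then "imagine" the scene after a candidate action.
encoder, predictor = VideoEncoder(), Predictor()
clip = torch.randn(1, 1024)    # placeholder clip features
action = torch.randn(1, 7)     # placeholder robot action
current = encoder(clip)
predicted_next = predictor(current, action)  # prediction lives in embedding space
```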
This architecture is the latest evolution of the JEPA framework, which was first applied to images with I-JEPA and now advances to video, demonstrating a consistent approach to building world models.