Tech News
← Back to articles

Qwen3-VL can scan two-hour videos and pinpoint nearly every detail

read original related products more articles

Jonathan writes for THE DECODER about how AI tools can improve both work and creative projects.

Content Summary

A few months after launching Qwen3-VL, Alibaba has released a detailed technical report on the open multimodal model. The data shows the system excels at image-based math tasks and can analyze hours of video footage.

Ad

The system handles massive data loads, processing two-hour videos or hundreds of document pages within a 256,000-token context window.

In "needle-in-a-haystack" tests, the flagship 235-billion-parameter model located individual frames in 30-minute videos with 100 percent accuracy. Even in two-hour videos containing roughly one million tokens, accuracy held at 99.5 percent. The test works by inserting a semantically important "needle" frame at random positions in long videos, which the system must then find and analyze.

Share Recommend our article Share

In published benchmarks, the Qwen3-VL-235B-A22B model often beats Gemini 2.5 Pro, OpenAI GPT-5, and Claude Opus 4.1 - even when competitors use reasoning features or high thinking budgets. The model dominates visual math tasks, scoring 85.8 percent on MathVista compared to GPT-5's 81.3 percent. On MathVision, it leads with 74.6 percent, ahead of Gemini 2.5 Pro (73.3 percent) and GPT-5 (65.8 percent).

Ad

Ad THE DECODER Newsletter The most important AI news straight to your inbox. ✓ Weekly ✓ Free ✓ Cancel at any time Please leave this field empty

... continue reading