SnapBench
Inspired by Pokémon Snap (1999). A VLM pilots a drone through a 3D world to locate and identify creatures.
Architecture
%%{init: {'theme': 'base', 'themeVariables': { 'background': '#ffffff', 'primaryColor': '#ffffff'}}}%%
flowchart LR
    subgraph Controller["**Controller** (Rust)"]
        C[Orchestration]
    end
    subgraph VLM["**VLM** (OpenRouter)"]
        V[Vision-Language Model]
    end
    subgraph Simulation["**Simulation** (Zig/raylib)"]
        S[Game State]
    end
    C -->|"screenshot + prompt"| V
    C <-->|"cmds + state<br/>**UDP:9999**"| S
    style Controller fill:#8B5A2B,stroke:#5C3A1A,color:#fff
    style VLM fill:#87CEEB,stroke:#5BA3C6,color:#1a1a1a
    style Simulation fill:#4A7C23,stroke:#2D5A10,color:#fff
    style C fill:#B8864A,stroke:#8B5A2B,color:#fff
    style V fill:#B5E0F7,stroke:#87CEEB,color:#1a1a1a
    style S fill:#6BA33A,stroke:#4A7C23,color:#fff
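The controller-to-simulation link in the diagram can be sketched as a single UDP round trip. This is a minimal illustration only, not the project's code: the helper name `send_command`, the plain-text payload, the buffer size, and the 2-second timeout are all assumptions. Only the transport (UDP) and the port (9999) come from the diagram.

```rust
use std::io;
use std::net::{SocketAddr, UdpSocket};
use std::time::Duration;

/// Send one text command to the simulation and wait for a single reply datagram.
/// Hypothetical helper: the plain-text wire format is assumed for illustration.
fn send_command(sim_addr: SocketAddr, command: &str) -> io::Result<String> {
    // Bind to an ephemeral local port; the OS picks a free one.
    let socket = UdpSocket::bind("127.0.0.1:0")?;
    // Avoid blocking forever if the simulation is not running.
    socket.set_read_timeout(Some(Duration::from_secs(2)))?;
    socket.send_to(command.as_bytes(), sim_addr)?;

    // Read one datagram back (for example, the updated drone state).
    let mut buf = [0u8; 2048];
    let (len, _from) = socket.recv_from(&mut buf)?;
    Ok(String::from_utf8_lossy(&buf[..len]).into_owned())
}
```

With the simulation listening locally, a call might look like `send_command("127.0.0.1:9999".parse().unwrap(), "forward")`; the host and the command string here are placeholders.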
Overview
The simulation generates procedural terrain and spawns creatures (cat, dog, pig, sheep) for the drone to discover. It handles drone physics and collision detection, and accepts 8 movement commands plus identify and screenshot. The Rust controller captures frames from the simulation, constructs prompts enriched with position and state data, then parses VLM responses into executable command sequences. The objective: locate and successfully identify 3 creatures, where identify succeeds when the drone is within 5 units of a target.
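The parsing step and the identify rule described above can be sketched roughly as follows. The command names, the comma/whitespace token format, the 3D Euclidean distance, and the inclusive range comparison are assumptions made for illustration; only the counts (8 movement commands, 3 creatures) and the 5-unit range come from the overview.

```rust
/// Identify succeeds within this many world units (value from the overview).
const IDENTIFY_RANGE: f32 = 5.0;
/// Number of successful identifications needed to complete a run.
const TARGET_COUNT: usize = 3;

#[derive(Debug, Clone, Copy, PartialEq)]
struct Vec3 { x: f32, y: f32, z: f32 }

impl Vec3 {
    /// Straight-line (Euclidean) distance between two points.
    fn distance(self, other: Vec3) -> f32 {
        let (dx, dy, dz) = (self.x - other.x, self.y - other.y, self.z - other.z);
        (dx * dx + dy * dy + dz * dz).sqrt()
    }
}

/// Eight movement commands plus identify and screenshot.
/// The variant names are hypothetical; the source only gives the counts.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Command {
    Forward, Back, Left, Right, Up, Down, TurnLeft, TurnRight,
    Identify,
    Screenshot,
}

/// Map one token to a command; unknown tokens yield `None`.
fn parse_command(token: &str) -> Option<Command> {
    match token.trim().to_ascii_lowercase().as_str() {
        "forward" => Some(Command::Forward),
        "back" => Some(Command::Back),
        "left" => Some(Command::Left),
        "right" => Some(Command::Right),
        "up" => Some(Command::Up),
        "down" => Some(Command::Down),
        "turn_left" => Some(Command::TurnLeft),
        "turn_right" => Some(Command::TurnRight),
        "identify" => Some(Command::Identify),
        "screenshot" => Some(Command::Screenshot),
        _ => None,
    }
}

/// Turn a VLM reply into a command sequence, skipping tokens that are not commands.
fn parse_response(reply: &str) -> Vec<Command> {
    reply
        .split(|c: char| c == ',' || c.is_whitespace())
        .filter(|t| !t.is_empty())
        .filter_map(parse_command)
        .collect()
}

/// Identify succeeds when the drone is within range of the creature
/// (inclusive comparison is an assumption).
fn identify_succeeds(drone: Vec3, creature: Vec3) -> bool {
    drone.distance(creature) <= IDENTIFY_RANGE
}

/// The run is complete once enough creatures have been identified.
fn objective_met(identified: usize) -> bool {
    identified >= TARGET_COUNT
}
```

Silently dropping unrecognized tokens keeps the sketch short; a real controller might instead report them back to the model in the next prompt.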
Demo video: demo_3x.mov
Gotta catch 'em all?
I gave 7 frontier LLMs a simple task: pilot a drone through a 3D voxel world and find 3 creatures.
Only one could do it.