Tech News

A verification layer for browser agents: Amazon case study


This post is a technical report on four runs of the same Amazon shopping flow. The purpose is to isolate one claim: reliability comes from verification, not from giving the model more pixels or more parameters.

Sentience is used here as a verification layer: each step is gated by explicit assertions over structured snapshots. This makes it feasible to use small local models as executors, while reserving larger models for planning (reasoning) when needed. No vision models are required for the core loop in the local runs discussed below.
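A verification gate of this kind can be sketched in a few lines. The names below (`Snapshot`, `Step`, `run_gated`) are illustrative assumptions, not Sentience's actual API; the point is the shape of the loop, where every action is followed by an explicit assertion over structured state:

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch of a step-gated executor loop; not the Sentience API.

@dataclass
class Snapshot:
    """Structured page state (e.g. derived from the DOM, not pixels)."""
    url: str
    elements: dict  # selector -> visible text

@dataclass
class Step:
    name: str
    act: Callable[[], Snapshot]           # executor action; returns new snapshot
    check: Callable[[Snapshot], bool]     # explicit assertion over the snapshot

def run_gated(steps: list[Step]) -> dict:
    passed = 0
    for step in steps:
        snap = step.act()
        if not step.check(snap):
            # Drift surfaces as an explicit FAIL, never as silent progress.
            return {"steps_passed": passed, "success": False, "failed": step.name}
        passed += 1
    return {"steps_passed": passed, "success": True}
```

Because each `check` is a plain predicate over structured state, a small local model can serve as the executor: the gate, not the model, decides whether a step counted.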

Key findings

Finding                                                                                          | Evidence (from logs / report)
A fully autonomous run can complete with local models when verification gates every step.        | Demo 3 re-run: "Steps passed: 7/7" and "success: True"
Token efficiency can be engineered by interface design (structure + filtering), not model choice. | Demo 0 report: estimated ~35,000 → 19,956 tokens (~43% reduction)
"Verification > intelligence" is the practical lesson.                                           | Planner drift surfaces as an explicit FAIL/mismatch rather than silent progress
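The ~43% figure in the second row follows directly from the reported token counts:

```python
# Token reduction from structure + filtering (Demo 0 report numbers).
before, after = 35_000, 19_956
reduction = (before - after) / before
print(f"{reduction:.0%}")  # prints "43%"
```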

[Video] End-to-end run clip (Amazon: search → first product → add to cart → checkout), with verification gating each step.

Key datapoints:

Metric   | Demo 0 (cloud baseline)  | Demo 3 (local autonomy)
Success  | 1/1 run                  | 7/7 steps (re-run)
Duration | ~60,000 ms               | 405,740 ms
Tokens   | 19,956 (after filtering) | 11,114

Task (constant across runs): Amazon → Search “thinkpad” → Click first product → Add to cart → Proceed to checkout
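One way to make that constant task verifiable is to pair each action with an explicit postcondition over the structured snapshot. The step names, state keys, and predicates below are illustrative assumptions, not Sentience's actual schema:

```python
# Illustrative plan encoding for the constant Amazon task.
# State dicts and predicates are assumed, not Sentience's real format.
PLAN = [
    ("open",        lambda s: "amazon" in s.get("url", "")),
    ("search",      lambda s: s.get("results", 0) > 0),
    ("click_first", lambda s: "/dp/" in s.get("url", "")),   # product detail page
    ("add_to_cart", lambda s: s.get("cart_count", 0) >= 1),
    ("checkout",    lambda s: "checkout" in s.get("url", "")),
]

def verify(plan, states):
    """Count steps whose postcondition holds, stopping at the first FAIL."""
    passed = 0
    for (name, check), state in zip(plan, states):
        if not check(state):
            return passed, False
        passed += 1
    return passed, True

happy_path = [
    {"url": "https://www.amazon.com"},
    {"url": "https://www.amazon.com/s?k=thinkpad", "results": 24},
    {"url": "https://www.amazon.com/dp/B0EXAMPLE"},
    {"cart_count": 1},
    {"url": "https://www.amazon.com/checkout"},
]
print(verify(PLAN, happy_path))  # prints "(5, True)"
```

A run that stalls before checkout would report, say, `(3, False)` rather than claiming success, which is exactly the FAIL/mismatch behavior the findings table describes.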

First principles: structure > pixels

Screenshot-based agents use pixels as the control plane. That often fails in predictable ways: ambiguous click targets, undetected navigation failures, and “progress” without state change.
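A structured snapshot makes those failure modes checkable instead of silent. A minimal sketch, assuming a hypothetical element schema (role/text/selector; not Sentience's actual format):

```python
# Resolving a click target from structure rather than pixels:
# ambiguity becomes an explicit error, not a best-guess click.

def find_target(snapshot: list[dict], role: str, text: str) -> dict:
    """Resolve a unique element by role + visible text, or fail loudly."""
    matches = [e for e in snapshot if e["role"] == role and text in e["text"]]
    if len(matches) != 1:
        raise LookupError(f"{len(matches)} candidates for {role!r} / {text!r}")
    return matches[0]

snapshot = [
    {"role": "button", "text": "Add to Cart", "selector": "#add-to-cart-button"},
    {"role": "button", "text": "Buy Now", "selector": "#buy-now-button"},
]
print(find_target(snapshot, "button", "Add to Cart")["selector"])  # prints "#add-to-cart-button"
```

The same structure supports the other two failure modes: a navigation assertion compares the post-action URL against an expected pattern, and "progress without state change" is caught when the new snapshot equals the old one.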
