
Agentic Pelican on a Bicycle



Simon Willison has been running his own informal model benchmark for years: “Generate an SVG of a pelican riding a bicycle.” It’s delightfully absurd—and surprisingly revealing. Even the model labs now reference the benchmark in the marketing campaigns announcing new models.

Simon’s traditional approach is zero-shot: throw the prompt at the model, get SVG back. Maybe—if you’re lucky—you get something resembling a pelican on a bicycle.

Nowadays everyone is talking about agents. Models running in a loop using tools. Sometimes they have vision capabilities, too. They can look at what they just created, cringe a little, and try again. The agentic loop—generate, assess, improve—seems like a natural fit for such a task.

So I ran a different experiment: what if we let models iterate on their pelicans? What if they could see their own output and self-correct?

The Prompt

Generate an SVG of a pelican riding a bicycle

- Convert the .svg to .jpg using chrome devtools, then look at the .jpg using your vision capabilities.
- Improve the .svg based on what you see in the .jpg and what's still to improve.
- Keep iterating in this loop until you're satisfied with the generated svg.
- Keep the .jpg for every iteration along the way.

Besides the file system and a command line, the models had access to the Chrome DevTools MCP server (for SVG-to-JPG conversion) and their own multimodal vision capabilities. They could see what they’d drawn, identify problems, and iterate. The loop continued until they declared satisfaction.

I used the Chrome DevTools MCP server to give every model the same rasterizer. Without this, models would fall back to whatever SVG-to-image conversion they prefer or have available locally—ImageMagick, Inkscape, browser screenshots, whatever. Standardizing the rendering removes one variable from the equation.
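The control flow the models ran can be sketched in Python. To be clear, this is my reconstruction, not code from the experiment: `generate_svg`, `render_to_jpg`, and `assess` are hypothetical stand-ins for the model’s SVG generation, the Chrome DevTools rasterization, and the vision check, stubbed out here so the loop itself is runnable.

```python
from pathlib import Path
from typing import Optional, Tuple

# Stub: in the real run, the model writes or edits the SVG itself,
# guided by feedback from the previous iteration.
def generate_svg(feedback: Optional[str]) -> str:
    return "<svg xmlns='http://www.w3.org/2000/svg'><!-- pelican, bicycle --></svg>"

# Stub: the real run rasterized via the Chrome DevTools MCP server,
# so every model used the same renderer.
def render_to_jpg(svg: str, out: Path) -> None:
    out.write_bytes(svg.encode())  # placeholder for actual rendering

# Stub: the real run fed the .jpg to the model's vision capabilities.
def assess(jpg: Path, iteration: int) -> Tuple[bool, str]:
    satisfied = iteration >= 3  # a model would judge the image instead
    return satisfied, "wheels overlap; beak too short"

def agentic_loop(workdir: Path, max_iters: int = 10) -> list:
    """Generate, render, assess, improve -- keeping every iteration's .jpg."""
    feedback = None
    kept = []
    for i in range(1, max_iters + 1):
        svg = generate_svg(feedback)
        jpg = workdir / f"iteration_{i}.jpg"
        render_to_jpg(svg, jpg)   # one .jpg kept per iteration
        kept.append(jpg)
        satisfied, feedback = assess(jpg, i)
        if satisfied:
            break
    return kept
```

The `max_iters` cap is my addition; the actual prompt left termination entirely to the model’s judgment.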

The prompt itself is deliberately minimal. I could have steered the iterative loop with more specific guidance—“focus on anatomical accuracy,” “prioritize mechanical realism,” “ensure visual balance.” But that would defeat the point. Simon’s original benchmark is beautifully unconstrained, and I wanted to preserve that spirit. The question isn’t “can models follow detailed improvement instructions?” It’s “when left to their own judgment, what do they choose to fix?”
