Computer vision projects rarely go exactly as planned, and this one was no exception. The idea was simple: Build a model that could look at a photo of a laptop and identify any physical damage — things like cracked screens, missing keys or broken hinges. It seemed like a straightforward use case for image models and large language models (LLMs), but it quickly turned into something more complicated.
Along the way, we ran into issues with hallucinations, unreliable outputs and images that were not even laptops. To solve these, we ended up applying an agentic framework in an atypical way — not for task automation, but to improve the model’s performance.
In this post, we will walk through what we tried, what didn’t work and how a combination of approaches eventually helped us build something reliable.
Where we started: Monolithic prompting
Our initial approach was fairly standard for a multimodal model. We used a single, large prompt to pass an image into an image-capable LLM and asked it to identify visible damage. This monolithic prompting strategy is simple to implement and works decently for clean, well-defined tasks. But real-world data rarely plays along.
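To make the starting point concrete, here is a minimal sketch of what a monolithic prompting call can look like. The article does not name the model, provider or prompt wording, so the OpenAI-style client, the "gpt-4o" model string and the prompt text below are illustrative assumptions rather than the authors' actual implementation.

```python
import base64
from openai import OpenAI  # assumption: any image-capable chat API works similarly

client = OpenAI()

# Assumed prompt text; the real prompt is not shown in the article.
MONOLITHIC_PROMPT = (
    "You are inspecting a photo of a laptop. "
    "List any visible physical damage (cracked screen, missing keys, "
    "broken hinges, dents, scratches) as a JSON array of findings, "
    "each with 'type' and 'location' fields."
)

def assess_damage(image_path: str) -> str:
    # Encode the image so it can be sent inline with the prompt.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    # Monolithic prompting: the entire task rides on one model call,
    # with no separate validation or junk-image check.
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: the article does not name the model used
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": MONOLITHIC_PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```

Because everything hinges on a single call, any weakness in the prompt or the input image surfaces directly in the output, which is where the problems below came from.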
We ran into three major issues early on:
Hallucinations: The model would sometimes invent damage that did not exist or mislabel what it was seeing.
Junk image detection: It had no reliable way to flag images that were not even laptops; pictures of desks, walls or people occasionally slipped through and received nonsensical damage reports.
Inconsistent accuracy: The combination of these problems made the model too unreliable for operational use.