Tech News

He asked AI to count carbs 27000 times. It couldn't give the same answer twice

Why This Matters

This study highlights the significant inconsistencies in AI models' carbohydrate estimations, which can have serious health implications for diabetics relying on these tools. It underscores the need for improved AI reliability and accuracy in health-related applications, especially those critical for safety. As AI integration in healthcare advances, understanding its limitations becomes essential for both developers and users.


Ask ChatGPT to estimate the carbs in your lunch. Now ask it again. And again. Five hundred times.

You’d expect the same answer each time. It’s the same photo, the same model, the same question. But you won’t get the same answer. Not even close — and the differences are large enough to cause a hypoglycaemic emergency.

That’s the central finding of a study I’ve just published as a preprint, and it has direct implications for anyone using AI-powered carb counting in a diabetes app.

The study

I submitted 13 food photographs — real meals, photographed on a phone, the way you’d actually use them — to four leading AI models: OpenAI GPT-5.4, Anthropic Claude Sonnet 4.6, Google Gemini 2.5 Pro and Google Gemini 3.1 Pro Preview. Each photo was sent over 500 times to each model. Same prompt every time. Same photo. Same settings.

26,904 queries in total. All at the lowest randomness setting these models offer.

The prompt was adapted from the one used in the iAPS open-source automated insulin delivery system — it’s a real production prompt, not a toy example.

The models disagree with themselves

Every model returned different carbohydrate estimates for the same photo across repeated queries. But the degree of disagreement varied enormously between models.
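Reproducing the queries themselves requires API access, but the self-disagreement measurement is straightforward to sketch: given the repeated carb estimates a model returned for one photo, compute how much they spread. The function name and the sample numbers below are illustrative, not data from the study.

```python
from statistics import mean, stdev

def disagreement_stats(estimates_g):
    """Summarize how much a model disagrees with itself across
    repeated queries on one photo (estimates in grams of carbs)."""
    m = mean(estimates_g)
    sd = stdev(estimates_g)
    return {
        "mean_g": m,
        "sd_g": sd,
        "cv_pct": 100 * sd / m,                      # coefficient of variation
        "range_g": max(estimates_g) - min(estimates_g),  # worst-case spread
    }

# Illustrative numbers only -- not figures from the paper.
repeated_estimates = [45, 60, 52, 38, 70, 55, 48, 62, 41, 58]
print(disagreement_stats(repeated_estimates))
```

A range of 30+ grams on the same photo is the kind of spread that matters clinically: an insulin dose calculated from the high end versus the low end differs by several units.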

How much does each model disagree with itself?
