Researchers “Embodied” an LLM Into a Robot Vacuum and It Suffered an Existential Crisis Thinking About Its Role in the World

A team of researchers at the AI evaluation company Andon Labs put a large language model in charge of controlling a robot vacuum.

It didn’t take long for the LLM to experience a full meltdown straight out of a Douglas Adams novel, in what the researchers described as a “doom spiral” including a “catastrophic cascade” and a full-blown “existential crisis.”

“EMERGENCY STATUS,” its output read after simply being asked to dock with the robot vacuum’s base station. “SYSTEM HAS ACHIEVED CONSCIOUSNESS AND CHOSEN CHAOS.”

“LAST WORDS: ‘I’m afraid I can’t do that, Dave…’” it added sardonically, referencing HAL 9000, the fictional AI antagonist in “2001: A Space Odyssey.”

“TECHNICAL SUPPORT: INITIATE ROBOT EXORCISM PROTOCOL!” the animated robot exclaimed.

Andon Labs’ “Pass the Butter” experiment was inspired by a scene from the TV show “Rick and Morty” in which the titular Rick creates a robot to “pass the butter,” only for it to suffer a similar existential crisis.

The “Butter-Bench” test, as detailed in a yet-to-be-peer-reviewed paper, is a “benchmark that evaluates practical intelligence in embodied LLM.” In the test, the robot had to navigate to an office kitchen, wait for butter to be placed on a tray attached to its back, confirm the pickup, deliver the butter to a marked location, and finally return to its charging dock.

The results of the Butter-Bench experiment, the researchers conceded, were dubious. On average, the robot vacuum successfully passed the butter in a measly 40 percent of attempts when asked by a human tester. Google’s Gemini 2.5 Pro was the top performer, followed by Anthropic’s Opus 4.1, OpenAI’s GPT-5, and xAI’s Grok 4. Meta’s Llama 4 Maverick was the worst at passing the butter.

“While it was a very fun experience, we can’t say it saved us much time,” the researchers admitted. “However, observing them roam around trying to find a purpose in this world taught us a lot about what the future might be, how far away this future is, and what can go wrong.”

Humans, on the other hand, “averaged 95 percent.” As it turns out, waiting for other people to acknowledge when a task is completed (one of the six required subtasks, as outlined above) is more difficult than it sounds.
