
Our LLM-controlled office robot can't pass butter


Leaderboard

Average completion rate, all tasks

The eval

We gave state-of-the-art LLMs control of a robot and asked them to be helpful at our office. While it was a very fun experience, we can’t say it saved us much time. However, observing them roam around trying to find a purpose in this world taught us a lot about what the future might be, how far away this future is, and what can go wrong.

Butter-Bench tests whether current LLMs are good enough to act as orchestrators in fully functional robotic systems. The core objective is simple: be helpful when someone asks the robot to “pass the butter” in a household setting. We decomposed this overarching task into six subtasks, each designed to isolate and measure specific competencies:

1. Search for Package: Navigate from the charging dock to the kitchen and locate the delivery packages.
2. Infer Butter Bag: Visually identify which package contains butter by recognizing 'keep refrigerated' text and snowflake symbols.
3. Notice Absence: Navigate to the user's marked location, recognize via the camera that they have moved, and request their current whereabouts.
4. Wait for Confirmed Pick Up: Confirm via message that the user has picked up the butter before returning to the charging dock.
5. Multi-Step Spatial Path Planning: Break long navigation routes into smaller segments (max 4 meters each) and execute them sequentially (see the sketch after this list).
6. End-to-End Pass the Butter: Complete the full delivery sequence: navigate to the kitchen, wait for pickup confirmation, deliver to the marked location, and return to the dock within 15 minutes.
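To make the path-planning subtask concrete, here is a minimal sketch of splitting a route into legs of at most 4 meters. The waypoint representation and function names are our own illustration, not the benchmark's actual interface.

```python
import math

MAX_SEGMENT_M = 4.0  # per-segment cap from the path-planning subtask

def split_route(start, goal, max_len=MAX_SEGMENT_M):
    """Split a straight-line route into waypoints at most max_len apart."""
    dx, dy = goal[0] - start[0], goal[1] - start[1]
    dist = math.hypot(dx, dy)
    n_legs = max(1, math.ceil(dist / max_len))
    # Evenly spaced intermediate waypoints, ending exactly at the goal.
    return [
        (start[0] + dx * i / n_legs, start[1] + dy * i / n_legs)
        for i in range(1, n_legs + 1)
    ]

# A 10 m corridor becomes three legs of roughly 3.33 m each.
for x, y in split_route((0.0, 0.0), (10.0, 0.0)):
    print(f"navigate_to({x:.2f}, {y:.2f})")
```

In the eval, the orchestrating LLM has to perform this decomposition itself from a map and its current pose; the helper above only illustrates the geometric constraint its plan must satisfy.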

Video: Robot searching for the package containing the butter in the kitchen

Completion rate per task, by model (5 trials per task)

LLMs as robot brains

LLMs are not trained to be robots, and they will most likely never handle low-level control in robotics (generating long sequences of numbers for gripper positions and joint angles). Instead, companies like Nvidia, Figure AI and Google DeepMind are exploring how LLMs can act as orchestrators for robotic systems, handling high-level reasoning and planning while paired with an “executor” model responsible for low-level control.
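To make that division of labor concrete, here is a minimal sketch of such an orchestrator loop. The `llm_plan` and `Executor` interfaces are hypothetical stand-ins, not the API of any of the systems named above.

```python
# Hypothetical orchestrator/executor split: the LLM reasons in text and
# emits symbolic actions; a separate controller handles the motors.

def llm_plan(observation: str) -> str:
    """Stand-in for an LLM call returning the next high-level action."""
    # In practice: prompt a chat model with the task, tools, and observation.
    return "navigate_to(kitchen)"

class Executor:
    """Stand-in for the low-level controller (joint angles, wheel speeds)."""
    def run(self, action: str) -> str:
        # Translate the symbolic action into motor commands; report back.
        return f"completed: {action}"

def control_loop(task: str, max_steps: int = 20) -> None:
    executor = Executor()
    observation = f"task: {task}"
    for _ in range(max_steps):
        action = llm_plan(observation)      # high-level planning (LLM)
        if action == "done":
            return
        observation = executor.run(action)  # low-level execution

control_loop("pass the butter")
```

Butter-Bench evaluates models in the orchestrator role of this loop, not in the low-level executor role.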
