Twilight of the Gods. Fable and 10 more LLMs on a Code Reorganization Task. Comparison.
Other languages: this article is also available in Russian: Гибель богов.
Materials & raw data: all 11 model proposals, the cross-reviews, the theses runs, and the ranking script are published here: Materials & reproduce this experiment.
This is a detailed write-up of one experiment. I took a god node from a real LangGraph agent and asked 5 American and 6 Chinese models first to propose how to untangle it, then to evaluate each other's proposals. After that, I tried three different ways to figure out which of them to trust on the matter.
The original problem
You know how it goes: you're building a practice AI agent with the fellas on a course by Data Sanity, and amid the colorful whirl of rapidly accreting features you suddenly notice that one of the project's internal agents has a state graph (LangGraph) that looks like this:
flowchart TD
    planner_start([START]) --> plan[plan]
    plan -->|search| search[search]
    plan -->|ask_user| ask_user[ask_user / interrupt]
    plan -->|reflect| reflect[reflect]
    plan -->|calculate| calculate[calculate]
    plan -->|finish| finish[finish]
    search -->|last_observation| observe[observe]
    search -->|no hits / backend failure| plan
    observe --> plan
    calculate --> plan
    ask_user --> observe_user[observe_user]
    observe_user --> plan
    reflect --> plan
    finish --> planner_end([END])
At first glance this is just a cute little octopus — nothing to worry about. But once you know how much logic this octopus has to hold in its modest eight-legged head, it becomes clear right away that we're looking at an anti-pattern. In this case, let's call it a god node.
The plan node hides about 350 lines of logic, including iterative checks, bootstrap questions about region and currency, schema preparation, acquisition-task routing, the LLM call, the subsequent correction of the decision, and so on.
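To make the shape of the octopus a bit more tangible, here is a minimal sketch in plain Python (deliberately not LangGraph code) that transcribes the edges from the flowchart above and counts how many edges enter and leave each node. The names `EDGES`, `degree_table`, and `hub` are made up for this illustration and are not part of the project; the edge list itself is copied from the diagram.

```python
from collections import Counter

# Edges transcribed from the flowchart above: (source, target, label).
# A label of None means the edge is unconditional in the diagram.
EDGES = [
    ("START", "plan", None),
    ("plan", "search", "search"),
    ("plan", "ask_user", "ask_user"),
    ("plan", "reflect", "reflect"),
    ("plan", "calculate", "calculate"),
    ("plan", "finish", "finish"),
    ("search", "observe", "last_observation"),
    ("search", "plan", "no hits / backend failure"),
    ("observe", "plan", None),
    ("calculate", "plan", None),
    ("ask_user", "observe_user", None),
    ("observe_user", "plan", None),
    ("reflect", "plan", None),
    ("finish", "END", None),
]


def degree_table(edges):
    """Return {node: (fan_in, fan_out)} for every node that appears in edges."""
    fan_out = Counter(src for src, _, _ in edges)
    fan_in = Counter(dst for _, dst, _ in edges)
    nodes = sorted(set(fan_out) | set(fan_in))
    return {node: (fan_in[node], fan_out[node]) for node in nodes}


table = degree_table(EDGES)

# The hub is the node touched by the largest number of edges.
hub = max(table, key=lambda node: sum(table[node]))

print(hub, table[hub])  # plan (6, 5)
```

Eleven of the fourteen edges touch `plan`, while no other node is touched by more than three. Edge counts alone do not prove that a node is a god node (the 350 lines inside it do that), but they are a cheap first signal of where the logic has piled up.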