(See the full results at compilebench.com)
When ChatGPT first launched in 2022, it could barely write short snippets of working code. Today, the best LLMs can generate entire applications from scratch and even win prestigious coding competitions (like IOI 2025).
But can they tackle the messy reality of software development – dependency hell, legacy toolchains, and cryptic compile errors? We created CompileBench to find out.
Based on XKCD 2347 ("Dependency").
We tested 19 state-of-the-art LLMs on 15 real-world tasks using the unmodified source code of open-source projects like curl (HTTP client) and jq (command-line JSON processor).
The goal sounds straightforward – produce a working binary. But achieving it can be surprisingly complex. Our toughest challenges include cross-compiling to Windows or ARM64 and resurrecting 22-year-old source code from 2003 on modern systems. Some agents needed 135 commands and 15 minutes just to produce a single working binary.
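To make "a working binary" for a cross-compilation target more concrete, here is a minimal sketch of one sanity check: reading the file header to see which format and CPU architecture a binary was built for. This is an illustration only, not CompileBench's actual validation logic. The function names `detect_target` and `detect_file` are hypothetical, while the magic numbers and machine codes come from the ELF and PE/COFF specifications.

```python
import struct

# e_machine values from the ELF specification.
ELF_MACHINES = {62: "x86-64", 183: "arm64"}
# Machine values from the PE/COFF specification.
PE_MACHINES = {0x014C: "x86", 0x8664: "x86-64", 0xAA64: "arm64"}


def detect_target(header: bytes) -> str:
    """Guess 'format/architecture' from the first bytes of a binary.

    Hypothetical helper for illustration; returns 'unknown' when the
    header is neither a recognizable ELF nor a PE file.
    """
    # ELF: 4-byte magic, byte 5 is the endianness flag, e_machine at offset 18.
    if header[:4] == b"\x7fELF" and len(header) >= 20:
        endian = "<" if header[5] == 1 else ">"
        (machine,) = struct.unpack_from(endian + "H", header, 18)
        return "elf/" + ELF_MACHINES.get(machine, f"machine-{machine}")

    # PE: "MZ" stub, offset of the PE signature stored at 0x3C,
    # COFF Machine field directly after the 4-byte "PE\0\0" signature.
    if header[:2] == b"MZ" and len(header) >= 0x40:
        (pe_offset,) = struct.unpack_from("<I", header, 0x3C)
        has_signature = header[pe_offset:pe_offset + 4] == b"PE\0\0"
        if has_signature and len(header) >= pe_offset + 6:
            (machine,) = struct.unpack_from("<H", header, pe_offset + 4)
            return "pe/" + PE_MACHINES.get(machine, f"machine-{machine:#x}")

    return "unknown"


def detect_file(path: str) -> str:
    """Read the first 4 KiB of a file and classify it."""
    with open(path, "rb") as f:
        return detect_target(f.read(4096))
```

A header check like this only confirms that the output targets the intended platform. It says nothing about whether the program runs correctly, which still requires executing it (natively or under emulation).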
See the full results later in the article.
The Tasks
Each task in CompileBench follows the same structure. We give the LLM agent: