Even amid the AGI race, a specialized tool can really outperform general‑purpose models, even when that tool is an early‑stage side project.
I am building a side project called SuperVM. It optimizes bytecode and machine code much as an LLM would, but instead of statistical methods it uses deterministic systems, reasoning from facts rather than probabilities (nothing new here; these techniques have been around forever).
All the generated code is available on GitHub.
Experiment
I hand-coded a very simple fractal generator.
The code is deliberately kept small and simple because that’s what coding agents handle best. We already know they struggle with large codebases.
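For context, here is a minimal sketch of what such a generator might look like (illustrative only, not the actual benchmark code; the real version is on GitHub):

```java
import javax.swing.*;
import java.awt.*;
import java.awt.image.BufferedImage;

// Illustrative sketch only: a tiny Mandelbrot renderer driven by repaint().
// Names, sizes, and constants here are hypothetical stand-ins.
public class FractalPanel extends JPanel {
    private static final int W = 800, H = 600, MAX_ITER = 256;
    private final BufferedImage image = new BufferedImage(W, H, BufferedImage.TYPE_INT_RGB);
    private double zoom = 1.0;

    @Override
    protected void paintComponent(Graphics g) {
        super.paintComponent(g);
        renderFrame();
        g.drawImage(image, 0, 0, null);
        zoom *= 1.02;   // animate so every frame does real work
        repaint();      // schedule the next frame
    }

    // The hot loop: one sequential pass over every pixel.
    private void renderFrame() {
        for (int y = 0; y < H; y++) {
            for (int x = 0; x < W; x++) {
                double cr = (x - W / 2.0) / (200.0 * zoom);
                double ci = (y - H / 2.0) / (200.0 * zoom);
                double zr = 0, zi = 0;
                int iter = 0;
                while (zr * zr + zi * zi < 4.0 && iter < MAX_ITER) {
                    double t = zr * zr - zi * zi + cr;
                    zi = 2 * zr * zi + ci;
                    zr = t;
                    iter++;
                }
                image.setRGB(x, y, Color.HSBtoRGB(iter / 64f, 0.8f, iter < MAX_ITER ? 1f : 0f));
            }
        }
    }

    public static void main(String[] args) {
        JFrame frame = new JFrame("Fractal");
        frame.add(new FractalPanel());
        frame.setSize(W, H);
        frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
        frame.setVisible(true);
    }
}
```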
Then I gave each Copilot the same one‑line prompt: “make it faster.”
Result
I measured the average FPS (frames per second) after 99 frames had been displayed.
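Concretely, the measurement boils down to something like this (a hypothetical helper, not the actual harness):

```java
// Sketch of the measurement: average FPS over the first 99 frames.
// Hypothetical helper, not the actual harness from the repo.
public class FpsMeter {
    private final long start = System.nanoTime();
    private int frames = 0;

    // Call once per displayed frame, e.g. at the end of paintComponent().
    public void onFrame() {
        if (++frames == 99) {
            double seconds = (System.nanoTime() - start) / 1e9;
            System.out.printf("Average FPS over 99 frames: %.1f%n", frames / seconds);
        }
    }
}
```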
| Model | Average FPS |
| --- | --- |
| Original | 13.8 |
| Sonnet | 39 |
| GPT-4o | 49.3 |
| SuperVM | 99.8 |
Original (13.8 FPS)
Copilot GPT-4o (49.3 FPS)
SuperVM (99.8 FPS)
Limitations
All numbers are cold‑start FPS; there was no JIT warm‑up for any variant and no JMH harness (a sketch of what one would look like follows this list).
SuperVM compiled the fractal in seconds while the Copilots ran for a few minutes.
I tested different prompts, but the simple one worked best.
For the AIs, I picked the best out of 10 generations each. I didn't do that for SuperVM (its output is deterministic, so it wasn't warranted).
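For reference, a warmed-up measurement with JMH would look roughly like this (a minimal sketch; the inline pixel loop is a toy stand-in for the real renderer):

```java
import org.openjdk.jmh.annotations.*;
import java.util.concurrent.TimeUnit;

// Minimal JMH sketch of a warmed-up measurement. The inline pixel loop is a
// toy stand-in for the real renderer; sizes and iteration counts are assumptions.
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@Warmup(iterations = 5)
@Measurement(iterations = 10)
@Fork(1)
@State(Scope.Thread)
public class FractalBench {

    @Benchmark
    public int renderOneFrame() {   // throughput in ops/s ~ warmed-up FPS
        int acc = 0;
        for (int y = 0; y < 600; y++) {
            for (int x = 0; x < 800; x++) {
                double cr = (x - 400) / 200.0, ci = (y - 300) / 200.0;
                double zr = 0, zi = 0;
                int iter = 0;
                while (zr * zr + zi * zi < 4.0 && iter < 256) {
                    double t = zr * zr - zi * zi + cr;
                    zi = 2 * zr * zi + ci;
                    zr = t;
                    iter++;
                }
                acc += iter;
            }
        }
        return acc; // returned so JMH keeps the loop alive
    }
}
```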
Why SuperVM wins
SuperVM edits bytecode directly and leverages formal proofs. That gives it access to broader and more reliable optimization techniques.
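To make "edits bytecode directly" concrete: tools in this space typically go through a bytecode-manipulation library such as ASM. Here is a toy pass-through example of the technique (my illustration of the general approach; SuperVM's actual internals may differ):

```java
import org.objectweb.asm.*;

// Toy illustration of direct bytecode editing with ASM (a sketch of the
// general technique, not SuperVM's actual internals).
public class BytecodeTouch {
    public static byte[] instrument(byte[] classBytes) {
        ClassReader reader = new ClassReader(classBytes);
        ClassWriter writer = new ClassWriter(reader, ClassWriter.COMPUTE_FRAMES);
        reader.accept(new ClassVisitor(Opcodes.ASM9, writer) {
            @Override
            public MethodVisitor visitMethod(int access, String name, String desc,
                                             String sig, String[] ex) {
                MethodVisitor mv = super.visitMethod(access, name, desc, sig, ex);
                return new MethodVisitor(Opcodes.ASM9, mv) {
                    @Override
                    public void visitJumpInsn(int opcode, Label label) {
                        // A real optimizer would analyze loops (back-edges)
                        // here and rewrite them; this sketch passes them through.
                        super.visitJumpInsn(opcode, label);
                    }
                };
            }
        }, 0);
        return writer.toByteArray();
    }
}
```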
Original Code
ChatGPT found an embarrassingly parallel loop and applied parallel() to it.
ChatGPT-4o generated code
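In spirit, the change amounts to this one-liner (a sketch of the pattern, not the verbatim generated code; renderRow() is a hypothetical stand-in for the per-row pixel loop):

```java
import java.util.stream.IntStream;

// Sketch of the pattern ChatGPT-4o applied: the sequential per-row loop
// becomes a parallel stream. Not the verbatim generated code.
public class ParallelRender {
    static final int H = 600;

    static void renderRow(int y) {
        // ... compute one row of pixels (independent per row) ...
    }

    static void renderFrameSequential() {
        for (int y = 0; y < H; y++) renderRow(y);   // original
    }

    static void renderFrameParallel() {
        IntStream.range(0, H).parallel()             // the one-line fix
                 .forEach(ParallelRender::renderRow);
    }
}
```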
SuperVM proved that the pixel‑iteration loop was side‑effect‑free, split it into n long-running worker threads, and inserted an order‑preserving queue so the GUI's repaint() still runs sequentially. It did more work and showed a deeper understanding of the program.
SuperVM's decompiled dequeuing code for repaint
SuperVM's long-running worker thread
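As I read the transformation, the pattern is roughly the following (my own sketch, not SuperVM's decompiled output; renderFrame() and display() are stubs):

```java
import java.util.concurrent.*;
import javax.swing.SwingUtilities;

// Sketch of the pattern SuperVM applied: n long-running workers render
// frames in parallel, and an order-preserving hand-off keeps the GUI's
// repaint() strictly sequential. Not the decompiled code itself.
public class OrderedPipeline {
    static final int WORKERS = Runtime.getRuntime().availableProcessors();

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(WORKERS);
        BlockingQueue<Future<int[]>> inFlight = new LinkedBlockingQueue<>(WORKERS);

        // Producer: submit frames to the pool; queue order = frame order.
        new Thread(() -> {
            for (int frame = 0; ; frame++) {
                final int f = frame;
                try {
                    inFlight.put(pool.submit(() -> renderFrame(f)));
                } catch (InterruptedException e) {
                    return;
                }
            }
        }).start();

        // Consumer: take futures in submission order, so frames are
        // displayed sequentially even though they render in parallel.
        while (true) {
            try {
                int[] pixels = inFlight.take().get();
                SwingUtilities.invokeLater(() -> display(pixels)); // repaint path
            } catch (InterruptedException | ExecutionException e) {
                break;
            }
        }
    }

    static int[] renderFrame(int frame) { return new int[800 * 600]; } // stub
    static void display(int[] pixels) { /* draw + repaint() */ }
}
```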
Again, all the generated code is available on GitHub.
Conclusions
Of course, the experimental setup is too narrow to draw sweeping conclusions, but these textbook‑size programs are exactly where LLMs ought to shine. And they don't. SuperVM clearly wins, doubling the frame rate of the best LLM output while compiling in seconds.
This points to the limits of general tools versus specialized tools, and to how they actually complement each other.
LLMs are brilliant at creating code.
SuperVM optimizes that code by reasoning directly over the bytecode.
This complementarity fascinates me, and I'll test SuperVM on a more general benchmark as soon as I have a little time outside my day job.
I'm also curious whether it can optimize inference serving. After all, loop parallelization is the bread and butter of MLIR, and this tool seems to go much further than the state of the art.