
The mechanics of autonomous software translation


2026 started with a boom of AI-assisted autonomous translations. On January 14th, Cursor published their post on Scaling long-running autonomous coding, in which they created translations of a browser, a Java LSP, a Windows emulator, and Excel. This was followed by an Anthropic post on Building a C compiler with a team of parallel Claudes, which only further fanned the flames of the hype. Both of these posts garnered lots of positive attention but failed to live up to the expectations set by the demos: the Cursor browser drew plenty of well-deserved critique, and people had a good laugh when the C compiler that could supposedly compile the Linux kernel failed on a Hello World example. I view both of these as initial attempts at translating production-grade software products using an immature translation engine, and getting broken results. Speculatively, the models seem capable enough to do this translation, so my position is that the translation harnesses themselves aren't good enough yet, or that the total budget required is much higher than what these attempts spent. I'll go into the economics of translation in the translation as a function of money section.

To make my position clear, I think we'll get better and better demos of these autonomous translations throughout 2026, and maybe even some decent autonomous translation products by the end of the year. This article starts with a technical dive into how these translations work in How does AI translate software?, followed by my personal analysis of the question "is translation capability really useful, and if so, how?" in How can we derive value out of translations?. In The next frontier I'll make some more predictions, and I'll finish with a discussion of a world where ubiquitous translation of software is possible.

I should start this section with a disclaimer about its title: AI currently does not translate software. If AI translated software, it would be like waving a magic wand; we would say "give me a Rust version of Doom", and voila, we would get one. What actually happens right now is that people use LLMs as neural search engines. The model proposes translations, which are then accepted or rejected by a translation harness: a concrete evaluator that decides whether the translation has succeeded. These harnesses are designed by humans, in these cases experts who understand the mistakes LLMs make and know how to build a robust testing loop with continuous improvements. This may change in the future, when the models themselves can write harnesses good enough for translation; that, I think, is the point where the terminology should shift, not now. The fundamental change today is that these translations have become economically viable because of model capabilities. I feel the need to say this because whenever one of these translations drops on the timeline, the vendors hype it up to suggest it indicates a broader software engineering ability, with which I fully disagree.

As such, let's build a very dumb translation loop without LLMs:

import random
import string

def translate(source_code):
    # `run` executes a program on a test input; it's assumed to exist elsewhere.
    while True:
        # Propose a completely random candidate program.
        target_code = "".join(
            random.choices(string.ascii_letters + string.digits,
                           k=random.randint(1, 1_000_000))
        )
        # Check semantic equivalence on a batch of random inputs.
        equivalent = True
        for _ in range(1000):
            test_case = random.randbytes(random.randint(1, 1_000_000))
            if run(source_code, test_case) != run(target_code, test_case):
                equivalent = False
                break
        if equivalent:
            return target_code

There's a very cool result in probability theory called the infinite monkey theorem, which states that a monkey hitting keys at random on a typewriter keyboard for an infinite amount of time will almost surely type any given text, such as the complete works of William Shakespeare. We don't really care about Shakespeare's complete works right now, but we might care about translating old COBOL software into Java without changing its semantics, so it would be immensely useful if we had these infinite monkeys, with infinite time on our hands, to produce modern Java equivalents of some old COBOL programs.

While we don't have infinite randomized labor coupled with infinite time, we have something that comes close: AI models that generate code and are astonishingly good at following instructions. These models unsurprisingly cost a decent amount of money to run, but they let us sample from a much better distribution of possible translations than was previously possible. We can even guide this sampling by augmenting our instructions with feedback from the testing harness, so the model's suggestions are self-improving within this translation loop. We can also modularize the software into multiple units or modules, each validated separately, which means the model doesn't even have to generate the whole thing; it can generate each smaller unit, and we can compose the result together.
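To make the guided loop concrete, here is a toy sketch of it. The "model" and "harness" below are stand-ins I invented for illustration: candidate programs are tiny Python expressions, `llm_propose` is a random search that avoids previously failed candidates (a crude proxy for feedback-conditioned sampling), and the harness checks equivalence on random inputs.

```python
import random

def run(program, x):
    # Toy interpreter: a "program" is just a Python expression in x.
    return eval(program, {"x": x})

# A tiny, fixed candidate pool standing in for the model's output space.
CANDIDATES = ["x + x", "x * 3", "x - 1", "2 * x", "x"]

def llm_propose(failed):
    # Hypothetical model call: sample a candidate, conditioned on feedback
    # by never re-proposing something the harness already rejected.
    remaining = [c for c in CANDIDATES if c not in failed]
    return random.choice(remaining)

def harness_failures(source, target, cases):
    # The harness: report every input where behavior diverges.
    return [x for x in cases if run(source, x) != run(target, x)]

def guided_translate(source, max_iters=10):
    cases = [random.randint(-100, 100) for _ in range(1000)]
    failed = set()
    for _ in range(max_iters):
        target = llm_propose(failed)
        if not harness_failures(source, target, cases):
            return target
        failed.add(target)  # feedback for the next proposal
    return None  # budget exhausted

result = guided_translate("x * 2")
```

Because rejected candidates feed back into the proposer, the loop converges in at most a handful of iterations here, whereas the pure monkey version above would essentially never terminate.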

There's an economic balance here:

translation cost ≈ (cost per iteration) × (expected iterations until “good enough”) + (harness engineering + oversight)
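To get a feel for the shape of this formula, here is a back-of-the-envelope calculation. Every number below is an illustrative assumption, not a measurement from any of the demos:

```python
cost_per_iteration = 2.00       # assumed: dollars of inference per attempt
expected_iterations = 5_000     # assumed: attempts until "good enough"
harness_and_oversight = 30_000  # assumed: harness engineering + human review

translation_cost = (cost_per_iteration * expected_iterations
                    + harness_and_oversight)
print(translation_cost)  # 40000.0
```

Note that even halving the expected iterations only shaves a quarter off the total under these assumptions; the fixed harness and oversight cost dominates until it, too, comes down.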

As models and harnesses get better, the expected number of iterations to reach equivalence will drop, and assuming the cost of generation itself doesn't increase too much, we should expect the cost of translation to drop significantly and to see many more of these results. There are two significant developments on the model side here: first, models are getting better at following specifications; second, they are getting better at operating in the control plane. In addition to doing the translations themselves, we can trust the model to make judgements within the harness, such as invoking subagents to translate a logical module and producing a smaller random test for it, essentially letting the model dictate how much of the translation is successfully completed. I'm not aware of any current demos that do this, but I would expect it to happen.
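Since no current demo confirms this control-plane setup, the following is purely a speculative sketch of its shape: a planner decomposes the program into modules, subagents translate each one, and a small per-module harness validates before composing. All function names are invented, and the "translation" here is a trivial placeholder (uppercasing) just to make the control flow runnable.

```python
# Speculative control-plane sketch; all names and logic are invented.

def plan_modules(source):
    # In a real system, an LLM would propose this decomposition.
    return source["modules"]

def translate_module(module):
    # Stand-in for a subagent call; the real work would be a model
    # translating this module into the target language.
    return {"name": module["name"], "code": module["code"].upper()}

def module_passes(module, translated):
    # Stand-in per-module harness: a tiny, module-local equivalence check.
    return translated["code"].lower() == module["code"]

def translate_program(source):
    translated = []
    for module in plan_modules(source):
        candidate = translate_module(module)
        if not module_passes(module, candidate):
            raise RuntimeError(f"module {module['name']} failed validation")
        translated.append(candidate)
    # Compose the independently validated modules into the final program.
    return "\n".join(m["code"] for m in translated)

program = {"modules": [{"name": "io", "code": "read input"},
                       {"name": "core", "code": "compute"}]}
result = translate_program(program)
```

The point of the structure is that each module's test is far smaller than a whole-program test, so failures localize and the expected iterations per unit of code shrink.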
