
DeiMOS – A Superoptimizer for the MOS 6502

Why This Matters

DeiMOS is a superoptimizer designed for the MOS 6502 microprocessor, aiming to find the most efficient machine code sequences for specific tasks through exhaustive search. Its development highlights the potential for highly optimized code in legacy systems and offers insights into advanced optimization techniques that can be applied to simpler architectures. This can influence both retro computing enthusiasts and modern compiler research by demonstrating the power of superoptimization in constrained environments.

What is a Superoptimizer Anyway?

A superoptimizer is a tool that seeks to generate the optimal machine code sequence for a given computational task, aiming for the shortest or fastest possible implementation. Unlike traditional compilers that apply a set of predefined optimization rules and heuristics to improve code efficiency, a superoptimizer exhaustively searches the entire space of possible instruction sequences to find the one that perfectly meets the specified functionality with optimal performance. This exhaustive search allows superoptimizers to discover highly efficient code sequences. As you can imagine, this isn't particularly fast and scales very poorly as the programs grow longer. DeiMOS is one such program, targeting the MOS 6502.

Why the 6502?

The MOS 6502 is an 8-bit microprocessor from 1975 that was extremely popular in video game consoles and computers of the era, e.g. the NES and Commodore 64. It's a nice target for a superoptimizer due to its simplicity and significance. Its instruction set is relatively small and well-defined, consisting of a limited number of opcodes and addressing modes. This simplicity reduces the size of the search space that the program needs to explore, making it more feasible to exhaustively analyze possible instruction sequences for optimal performance. Additionally, the 6502 lacks the advanced features found in modern CPUs, such as a large register set or branch prediction, which means there is greater potential for a superoptimizer to uncover non-obvious, provably efficient code sequences.

Test Generation and Verification

First, we need some way of describing the algorithm we would like to find. This is fairly straightforward: the user supplies two functions, one that generates an initial system state, and one that verifies that the output state is correct for any given input state. After a candidate program has been generated, we run test generation, emulation, and verification once for every possible variation of the input, aborting if any of them proves to be wrong. On a modern CPU this wouldn't really work – checking each possible value of even a single 64-bit integer would be prohibitively slow. Since the 6502 is an 8-bit CPU, however, we commonly only have to run at most 256 tests to verify that the function we've found is correct. In the majority of cases the program is rejected after the first test, so verification doesn't really slow down the process at all.
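
The loop above can be sketched in a few lines of Python. All the names here (make_state, verify, emulate, is_correct) are illustrative stand-ins, not DeiMOS's actual API, and the "emulator" just runs a Python callable so the harness itself can be demonstrated:

```python
# A minimal sketch of the test-generate-emulate-verify loop, assuming a
# hypothetical emulate() function and a single 8-bit input. All names are
# illustrative, not DeiMOS's real API.

def make_state(x):
    """User-supplied: build an initial system state for input x.
    Here the 'state' is just a dict holding the accumulator."""
    return {"A": x}

def verify(x, state):
    """User-supplied: check the output state for input x.
    Example spec: the program should double the accumulator (mod 256)."""
    return state["A"] == (x * 2) & 0xFF

def emulate(program, state):
    """Stand-in emulator: 'program' is a Python callable acting on the
    state, so the harness itself can be demonstrated."""
    return program(state)

def is_correct(program):
    # An 8-bit input means at most 256 tests; reject on first failure,
    # which is the common case and keeps verification cheap.
    for x in range(256):
        state = emulate(program, make_state(x))
        if not verify(x, state):
            return False
    return True

good = lambda s: {"A": (s["A"] * 2) & 0xFF}   # behaves like ASL A
bad  = lambda s: {"A": (s["A"] + 2) & 0xFF}   # wrong for most inputs

print(is_correct(good))  # True
print(is_correct(bad))   # False
```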

An Initial Naïve Attempt

The most naïve implementation possible of a superoptimizer is simply generating all combinations of bytes, and then emulating each of them in order. For a program of length 4, there are 256^4 possible programs, or about 4 billion of them. A modern computer can check all of these in "reasonable" time, but the number of interesting programs no longer than 4 bytes isn't very large. The only things this method has going for it are that it's easy to implement, and that you can be certain the program you found is actually the smallest one (assuming your emulator is written correctly). Other than that, it is unacceptably slow.

The first straightforward optimization is to realize that while every byte in 6502 machine code represents an instruction that does something, a whole lot of them are undocumented, and most of those just jam the CPU. If the first byte of a candidate program causes the CPU to halt, there's no point in checking any program that starts with this byte, so we can skip all 256^3 programs that share it. The logic isn't quite as simple as "if we crash, increase the current byte by 1 and set all the bytes after it to zero", because the program might have visited deeper bytes before this one due to branches and jumps. Instead, we store the deepest point the emulator has reached so far and increase that one.

Another optimization is to not even generate an instruction that you know is useless. That way, you mostly only feed the emulator programs that will actually complete. I did this by writing a lookup table, where the key is the current opcode and the value is the next valid one. All in all, this let me generate optimal programs of a bit more than 4 bytes. Not great, not terrible. But we can do better!
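
The prefix-skipping idea can be demonstrated on a toy instruction space. The sketch below is an assumption-laden simplification (4 "opcodes", programs of length 3, linear execution with no branches, opcode 2 jams), but the odometer-with-pruning mechanics are the same:

```python
# A toy sketch of prefix skipping over a tiny instruction space: 4 pretend
# opcodes, programs of 3 bytes, and opcode 2 treated as a jamming
# instruction. Any candidate whose first executed byte jams lets us skip
# the whole subtree that shares that prefix.

ALPHABET = 4          # pretend instruction set: opcodes 0..3
LENGTH = 3            # program length in bytes
JAM = {2}             # opcodes that halt the emulated CPU

def jams_at(program):
    """Return the index of the first jamming byte, or None.
    (A real emulator would follow branches; here execution is linear.)"""
    for i, op in enumerate(program):
        if op in JAM:
            return i
    return None

def enumerate_programs():
    program = [0] * LENGTH
    tested = 0
    while True:
        tested += 1
        depth = jams_at(program)
        # If byte `depth` jams, no program sharing this prefix can work:
        # bump that byte and clear everything after it.
        bump = depth if depth is not None else LENGTH - 1
        program[bump + 1:] = [0] * (LENGTH - bump - 1)
        while bump >= 0:
            program[bump] += 1
            if program[bump] < ALPHABET:
                break
            program[bump] = 0
            bump -= 1
        else:
            return tested

# Naively there are 4**3 = 64 candidates; skipping jammed prefixes
# visits only 40 of them.
print(enumerate_programs())  # 40
```

With a real 6502 opcode space the savings are far larger, since each jamming first byte removes 256^3 candidates at once.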

More Dakka

Most modern computers have a whole lot of cores, so writing a program that only uses a single one is pretty sad. We need to support multithreading, but it's not as simple as it may seem. For the naïve implementation this would have been easy: just split the space of possible programs into chunks, and let each core calculate its part. This doesn't work that well once you start adding optimizations that terminate early, however. Consider a cluster with 256 cores that splits the task into 256 parts, one for every core. About half of them would terminate instantly, since their first byte is an instruction that jams the (emulated) CPU. After that they'd simply sit idle waiting for the others to complete.

The solution to this problem is having the master thread generate all possible combinations of N-byte prefixes that don't crash. A worker then simply grabs one of these prefixes from the queue, and tests every program that starts with those N bytes. That way, all workers are maximally busy all the time, and the master process can approximate how much time is left by checking how many prefixes remain in the queue. In addition, we don't actually use threads: each worker is its own process, and communicates with the master process over regular TCP. The overhead is negligible and it enables us to set up arbitrarily large networked clusters.
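
The prefix-queue scheme can be sketched with the same toy instruction space as before (opcodes 0..3, opcode 2 jams). This is a simplification in two ways: real DeiMOS workers are separate processes talking to the master over TCP rather than threads sharing a queue, and search_subtree here just counts candidates instead of testing them:

```python
# A sketch of the prefix-queue work distribution: the master enqueues every
# viable 1-byte prefix, and workers pull prefixes until the queue is empty.
from queue import Queue, Empty
from threading import Thread

ALPHABET, JAM, TOTAL_LEN = 4, {2}, 3

def viable_prefixes():
    # Only prefixes that don't jam are worth handing out; a jammed prefix
    # would make its worker terminate instantly and then sit idle.
    return [(op,) for op in range(ALPHABET) if op not in JAM]

def search_subtree(prefix):
    # Placeholder for "test every program starting with these bytes":
    # here we just count the candidates in the subtree.
    return ALPHABET ** (TOTAL_LEN - len(prefix))

def worker(q, results):
    while True:
        try:
            prefix = q.get_nowait()
        except Empty:
            return
        results.append((prefix, search_subtree(prefix)))

q = Queue()
for p in viable_prefixes():
    q.put(p)
results = []
threads = [Thread(target=worker, args=(q, results)) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()

print(len(results))                  # 3 viable prefixes handled
print(sum(n for _, n in results))    # 48 candidates covered in total
```

The remaining queue length doubles as a cheap progress estimate, exactly as described above.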

Limiting the Instruction Generator

Consider a simple program, for example finding the optimal way of multiplying an integer by five. Before we even start our superoptimizer, we know that a whole lot of 6502 instructions aren't going to be useful for the task. Do we really need to generate every instruction that writes to every single memory location? Probably not. The simple solution is to give the user the ability to restrict which instructions will be generated. Most important here is limiting the addresses written to and read from, as well as disabling entire classes of instructions that are unlikely to be useful: absolute jumps, "indirect" instructions, anything related to the stack, interrupts, or decimal mode. The user can also limit the number of branches that can be generated.
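
One way such a restriction could look is sketched below. The opcode table and config fields are invented for illustration, not DeiMOS's real configuration format; the point is just that instructions the user has declared useless are dropped before they are ever generated:

```python
# A sketch of user-configurable instruction filtering. Each entry is
# (mnemonic, instruction class, target address or None); both the table
# and the config shape are hypothetical.

OPCODES = [
    ("LDA #$05",  "immediate", None),
    ("STA $10",   "store",     0x10),
    ("STA $2000", "store",     0x2000),
    ("JMP $1234", "jump",      0x1234),
    ("PHA",       "stack",     None),
    ("BRK",       "interrupt", None),
    ("ASL A",     "shift",     None),
]

config = {
    "writable": range(0x00, 0x20),               # scratch space only
    "disabled": {"jump", "stack", "interrupt"},  # whole classes off
}

def allowed(op):
    mnemonic, cls, addr = op
    if cls in config["disabled"]:
        return False
    if addr is not None and addr not in config["writable"]:
        return False
    return True

kept = [m for m, c, a in OPCODES if allowed((m, c, a))]
print(kept)   # ['LDA #$05', 'STA $10', 'ASL A']
```

Every filtered opcode shrinks the branching factor of the search, which compounds at every program position.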

Checkpointable Emulation

Still, it's too slow. There's another optimization we can add, this time in the emulator. Each time execution reaches a byte that hasn't been executed yet, we store the entire CPU and memory state in a state cache. When the emulator discovers that the program isn't valid – for example, it has reached the end, but the test is failing – it doesn't need to re-emulate the entire program; it can just rewind time to just before the last byte was executed, change that byte, and continue executing from there.

One issue with this is the size of the state cache. The 6502 can reference 65536 bytes of memory, and writing all of that to RAM every time we execute an instruction is prohibitively expensive. To solve this, we don't actually store every byte – only the range of bytes that the user has deemed okay for us to write and/or read from. Since most programs will need at most a few bytes of scratch space, storing the state turns out to be pretty cheap. In addition, we can strip out the parts of the 6502 that the user has disabled – for example, if we don't care about the stack, we don't have to store the stack pointer. We also don't bother storing the entire PC (Program Counter) register, just the offset from the base address we began executing from. Now we're getting pretty fast. We're not done yet, however.
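
The checkpoint-and-rewind idea can be sketched with a deliberately tiny "CPU": one register, a relative PC offset, and only the user-declared scratch range snapshotted. The instruction set and class layout are illustrative, not the real emulator:

```python
# A sketch of checkpointable emulation: snapshot before each new byte is
# executed, then rewind and mutate only the last byte on failure. Only the
# tracked scratch range is stored, which keeps checkpoints cheap.

SCRATCH = range(0x10, 0x14)   # the only memory we track

class State:
    def __init__(self):
        self.a = 0
        self.pc = 0                      # offset from the program base
        self.mem = {i: 0 for i in SCRATCH}
    def snapshot(self):
        return (self.a, self.pc, dict(self.mem))
    def restore(self, snap):
        self.a, self.pc, mem = snap
        self.mem = dict(mem)

def step(state, op):
    """Toy instruction set: ('inc',) and ('sta', addr)."""
    if op[0] == "inc":
        state.a = (state.a + 1) & 0xFF
    elif op[0] == "sta":
        state.mem[op[1]] = state.a
    state.pc += 1

# Run a candidate, checkpointing before each byte is executed.
checkpoints = []
state = State()
program = [("inc",), ("inc",), ("sta", 0x10)]
for op in program:
    checkpoints.append(state.snapshot())
    step(state, op)

# Suppose the last byte made the test fail: rewind to just before it,
# change only that byte, and continue from there.
state.restore(checkpoints[-1])
step(state, ("sta", 0x11))
print(state.a, state.mem[0x10], state.mem[0x11])   # 2 0 2
```

Since enumeration mostly mutates the deepest bytes, almost every candidate after the first shares a long prefix with its predecessor, and that prefix never has to be re-emulated.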

Pruning Branches

Most programs that we generate will start out doing things that a human can instantly see aren't going to work. Some of these we can detect and discard early. The most obvious ones are those that start out operating on registers or memory locations that have not been written to yet. To catch this, we store a boolean flag for each register and memory location, specifying whether its data is defined or not. If an instruction performs an operation that depends on undefined data, we discard the program early.

Note that the fastest program may – at least in theory – actually operate on undefined data, which means we would miss it. For example, consider a program that consists of a loop, where the first iteration operates on undefined data but the result is then coincidentally discarded or overwritten in such a way that it doesn't matter. Since I cannot prove that such programs don't exist, this optimization can be disabled in the configuration.
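
The defined-data bookkeeping can be sketched as follows. The three-instruction toy set and the function name are hypothetical; the mechanism is simply a per-location flag that flips to true on write and is checked on read:

```python
# A sketch of read-before-write pruning: one defined/undefined flag per
# register and per tracked memory byte. The toy instruction set is
# illustrative.

class Undefined(Exception):
    pass

def prune_check(program, tracked=(0x10, 0x11)):
    """Raise Undefined if the program reads data nothing has written."""
    defined = {"A": False, **{addr: False for addr in tracked}}
    for op in program:
        if op[0] == "lda_imm":            # LDA #imm: defines A
            defined["A"] = True
        elif op[0] == "sta":              # STA addr: reads A, defines addr
            if not defined["A"]:
                raise Undefined("STA with undefined A")
            defined[op[1]] = True
        elif op[0] == "lda":              # LDA addr: reads addr, defines A
            if not defined[op[1]]:
                raise Undefined(f"LDA from undefined ${op[1]:02X}")
            defined["A"] = True
    return True

print(prune_check([("lda_imm", 5), ("sta", 0x10), ("lda", 0x10)]))  # True
try:
    prune_check([("sta", 0x10)])          # stores an undefined accumulator
except Undefined:
    print("pruned")
```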
