DeiMOS
A MOS 6502 Superoptimizer
What is a Superoptimizer Anyway?
A superoptimizer is a tool that seeks to generate the optimal machine code sequence for a given computational task, aiming for the shortest or fastest possible implementation.
Unlike traditional compilers that apply a set of predefined optimization rules and heuristics to improve code efficiency, a superoptimizer exhaustively searches the entire space of possible instruction sequences to find the one that perfectly meets the specified functionality with optimal performance. This exhaustive search allows superoptimizers to discover highly efficient code sequences.
As you can imagine, this isn't particularly fast and scales very poorly as the programs grow longer.
DeiMOS is one such program, targeting the MOS 6502.
Why the 6502?
The MOS 6502 is an 8-bit microprocessor from 1975 that was extremely popular in the video game consoles and computers of its time, e.g. the NES and the Commodore 64.
It's a nice target for a superoptimizer due to its simplicity and significance. Its instruction set is relatively small and well-defined, consisting of a limited number of opcodes and addressing modes. This simplicity reduces the complexity of the search space that the program needs to explore, making it more feasible to exhaustively analyze possible instruction sequences for optimal performance.
Additionally, the 6502 lacks some of the advanced features found in modern CPUs, such as a large register file or branch prediction, which means there is greater potential for a superoptimizer to uncover non-obvious yet provably efficient code sequences.
Test Generation and Verification
First, we need some way of describing the algorithm we would like to find. This is fairly straightforward: the user supplies two functions, one that generates an initial system state, and one that verifies that the output state is correct for any given input state. After a candidate program has been generated, we run the test generation, emulation, and verification once for every possible variation of the input, aborting as soon as any test fails.
On a modern CPU this wouldn't really have worked – checking each possible value of even a single 64-bit integer would have been prohibitively slow. Since the 6502 is an 8-bit CPU however, we commonly only have to run at most 256 tests to verify that the function we've found is correct. In the majority of cases the program is rejected after the first test, so it doesn't really slow down the process at all.
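As a rough Python sketch (the real tool is written in Zig; every name here is made up for illustration), the verification loop looks something like this, with "double the A register" standing in as the specification:

```python
# Hypothetical sketch of the generate/emulate/verify loop. The "program"
# here is just a Python function acting as a stand-in for an emulated
# 6502 candidate.

def make_state(test_input):
    """User-supplied: build an initial system state for one 8-bit input."""
    return {"A": test_input, "X": 0, "mem": {}}

def check(test_input, final_state):
    """User-supplied: accept only if A was doubled (our example spec)."""
    return final_state["A"] == (test_input * 2) & 0xFF

def emulate(program, state):
    # Stand-in for the real emulator.
    return program(state)

def verify(program):
    """Run every possible 8-bit input; abort on the first failure."""
    for test_input in range(256):
        state = make_state(test_input)
        if not check(test_input, emulate(program, state)):
            return False  # most candidates die on the very first test
    return True
```

Since most candidates are rejected by the first of the 256 tests, the full sweep only ever runs for the rare near-correct program.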
An Initial Naïve Attempt
The most naïve possible implementation of a superoptimizer simply generates all combinations of bytes, then emulates each of them in order.
For a program of length 4, there are 256^4 possible programs, or about 4 billion of them. For a modern computer it is possible to check all of these in "reasonable" time, but the number of interesting programs no longer than 4 bytes isn't very large.
The only things this method has going for it are that it's easy to implement and that you can be certain the program you find really is the smallest one (assuming your emulator is written correctly). Other than that, it is unacceptably slow.
The first straightforward optimization is to realize that while every byte value in 6502 machine code represents an instruction that does something, a whole lot of them are undocumented, and most of those just jam the CPU. If the first byte of a candidate program causes the CPU to halt, there's no point in checking any program that starts with that byte, so we can skip all 256^3 programs that share it.
The logic isn't exactly as straightforward as "if we crash, increase the current byte by 1 and set all the bytes after this one to zero", as the program might have visited deeper bytes before this one, due to branches and jumps. Instead, we store the deepest point the emulator has reached so far and increase that one.
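A sketch of that skipping step, assuming the emulator reports the deepest program byte it ever fetched before jamming (names are illustrative, not the real Zig code):

```python
# Advance the search past every program sharing a jamming prefix: bump the
# byte at the deepest position reached, zero everything after it, and carry
# over to shallower bytes when one overflows past 0xFF.

def next_candidate(program, deepest):
    """Return the next candidate program, or None when the space is done."""
    program = list(program)
    while deepest >= 0:
        if program[deepest] != 0xFF:
            program[deepest] += 1
            for i in range(deepest + 1, len(program)):
                program[i] = 0          # restart the suffix from scratch
            return program
        deepest -= 1                    # this byte overflowed; carry upward
    return None                         # every combination has been tried
```

If byte 0 jams, `next_candidate(p, 0)` skips straight over all 256^3 programs that begin with it.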
Another optimization is to not even generate an instruction that you know is useless. That way, you mostly only feed the emulator programs that will actually complete. I did this by writing a lookup table, where the key is the current opcode and the value is the next valid one.
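A minimal sketch of building such a table, using the standard list of NMOS 6502 jam (KIL) opcodes as the "useless" set:

```python
# The twelve opcodes that jam an NMOS 6502.
JAM_OPCODES = {0x02, 0x12, 0x22, 0x32, 0x42, 0x52, 0x62, 0x72,
               0x92, 0xB2, 0xD2, 0xF2}

def build_next_valid():
    """next_valid[op] = smallest opcode above op that doesn't jam.
    A result of 0 means we wrapped past 0xFF (carry to the previous byte)."""
    table = [0] * 256
    for op in range(256):
        nxt = op + 1
        while nxt < 256 and nxt in JAM_OPCODES:
            nxt += 1
        table[op] = nxt & 0xFF
    return table
```

The generator then steps opcodes with one table read instead of trial-emulating each dead byte. The real table also skips instructions the user has disabled, not just jams.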
All in all, this resulted in me being able to generate optimal programs of a bit more than 4 bytes. Not great, not terrible. But we can do better!
More Dakka
Most modern computers have a whole lot of cores, so writing a program that only uses a single one is pretty sad. We need to support multithreading, but it's not as simple as it may seem.
For the naïve implementation, this would have been easy: just split the space of possible programs into chunks, and let each core calculate their part. This doesn't work that well when you start adding optimizations that terminate early, however. Consider a cluster with 256 cores that splits the task into 256 parts, one for every core. About half of them would terminate instantly, since their first byte is an instruction that jams the (emulated) CPU. After that they'd simply sit idle waiting for the other ones to complete.
The solution to this problem is having the master thread generate all possible combinations of N-byte instructions that don't crash. A thread then simply grabs one of these "prefixes" from the queue, and tests every program that starts with these N bytes. That way, all threads are maximally busy all the time, and the master process can approximate how much time is left by checking how many prefixes are left in the queue.
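A sketch of the master's side of that scheme, with the "doesn't jam" check reduced to a simple opcode filter and the workers left out (all names are illustrative):

```python
from collections import deque

JAM_OPCODES = {0x02, 0x12, 0x22, 0x32, 0x42, 0x52, 0x62, 0x72,
               0x92, 0xB2, 0xD2, 0xF2}

def prefix_survives(prefix):
    # Stand-in for "the emulator didn't jam on this prefix".
    return all(b not in JAM_OPCODES for b in prefix)

def make_work_queue(prefix_len):
    """Master side: enqueue every surviving N-byte prefix."""
    queue = deque()
    def extend(prefix):
        if len(prefix) == prefix_len:
            queue.append(bytes(prefix))
            return
        for b in range(256):
            if prefix_survives(prefix + [b]):
                extend(prefix + [b])
    extend([])
    return queue
```

Each worker pops a prefix and exhausts every program starting with it; `len(queue)` doubles as the progress estimate.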
In addition, we don't use threads: each worker is its own process, and communicates with the master process by regular TCP. The overhead is negligible and it enables us to set up arbitrarily large networked clusters.
Limiting the Instruction Generator
Consider a simple program, for example finding the optimal way of multiplying an integer by five. Before we even start our superoptimizer, we know that there are a whole lot of 6502 instructions that aren't going to be useful for the task. Do we really need to generate every instruction that writes to every single memory location? Probably not.
The simple solution to this is to give the user the ability to restrict what instructions will be generated. Most important here is limiting the addresses written to and read from, as well as disabling entire classes of instructions that are unlikely to be useful. This includes things like absolute jumps, "indirect" instructions, anything related to the stack, interrupts, decimal mode, or limiting the number of branches that can be generated.
Checkpointable Emulation
Still, it's too slow. There's another optimization we can add, this time in the emulator.
Each time the execution reaches a byte that hasn't been executed yet, we store the entire CPU and memory in a state cache. When the emulator reaches a point where it discovers that the program isn't valid – for example, it has reached the end, but the test is failing – it doesn't need to re-emulate the entire program; it can just rewind time to just before the last byte is executed, change the byte, then continue executing it from there.
One issue with this is the size of the state cache. The 6502 can reference 65536 bytes of memory, and writing that to RAM every time we execute an instruction is prohibitively expensive.
To solve this, we don't actually store every byte – only the range of bytes that the user has deemed okay for us to write to and/or read from. Since most programs need at most a few bytes of scratch space, storing the state turns out to be pretty cheap. In addition, we can strip out the parts of the 6502 CPU that the user has disabled: for example, if we don't care about the stack, we don't have to store the stack pointer. We also don't bother storing the entire PC (Program Counter) register, just its offset from the base we began executing from.
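In sketch form, a checkpoint only has to capture the live registers, a PC offset, and the user-approved memory window (again, hypothetical names; the real emulator is generated at compile time in Zig):

```python
class Checkpoint:
    """Snapshot taken the first time a given program byte is executed."""

    def __init__(self, cpu, lo, hi):
        self.a, self.x, self.y, self.p = cpu["A"], cpu["X"], cpu["Y"], cpu["P"]
        self.pc_offset = cpu["PC"] - cpu["base"]   # no need for the full PC
        self.window = cpu["mem"][lo:hi]            # only the scratch range

    def restore(self, cpu, lo, hi):
        cpu["A"], cpu["X"], cpu["Y"], cpu["P"] = self.a, self.x, self.y, self.p
        cpu["PC"] = cpu["base"] + self.pc_offset
        cpu["mem"][lo:hi] = self.window
```

Rewinding to just before the last byte is then a handful of copies rather than a 64 KiB memory dump.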
Now we're getting pretty fast. We're not done yet however.
Pruning Branches
Most programs that we generate will start out doing stuff that a human can instantly see isn't going to work. Some of these we can detect and discard early.
The most obvious ones are those that start out doing operations on registers or memory locations that have not been written to yet. To prevent this, we store a boolean flag for each register and memory location, specifying if this data is defined or not. If an instruction starts doing an operation that depends on undefined data, we discard it early.
Note that the fastest program may – at least in theory – actually operate on undefined data, which means we will miss it. For example, consider a program that consists of a loop, where the first iteration of the loop operates on undefined data but the result is then coincidentally discarded or overwritten in such a way that it doesn't matter. Since I cannot prove that such programs don't exist, this optimization can be disabled in the configuration.
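The bookkeeping itself is tiny; a sketch with made-up names, tracking one defined/undefined flag per register and per watched memory location:

```python
def make_defined_map(input_regs, input_addrs, tracked_addrs):
    """Everything starts undefined except what the test case actually sets."""
    defined = {reg: reg in input_regs for reg in ("A", "X", "Y")}
    defined.update({addr: addr in input_addrs for addr in tracked_addrs})
    return defined

def reads_undefined(instr_reads, defined):
    """Discard the candidate early if any operand is still undefined."""
    return any(not defined.get(src, False) for src in instr_reads)
```

A write to a register or address flips its flag to defined, so only genuinely uninitialized reads trigger the discard.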
Warp Speed
Another class of programs that we can prove won't work are those that overwrite their inputs – there is nothing a program that's supposed to "multiply A by 3" can do to succeed if it starts out by writing a zero to A.
To completely solve this scenario, we have to complicate our emulator a bit.
First, instead of emulating a single test case, we emulate multiple test cases simultaneously. Our emulator now becomes a bunch of separate 6502 CPUs running in parallel. This isn't as slow as you might think; every instruction is now easily vectorized over the whole array of CPUs, which is pretty fast.
We call this multi-CPU group a Warp, borrowing the term from GPU programming, where it describes a very similar concept. Just as on a GPU, if we ever encounter a branch that only some of the warp members take, the warp gets split in two and the halves are emulated separately.
Before we can use the warp for anything productive, we need to track a second piece of information – the mapping from each test case input state to each output state. For example, if one wanted to find a fast popcnt method, there would be 256 different inputs to 9 different outputs – one for each possible count of set bits. Each input would then store what output index it expects to result in; for example, both 0x03 and 0x05 would have output index 2, as both have 2 bits set.
Finally, as the emulation progresses, after each instruction we sort the warp entries by their CPU and memory state. This makes it easy to spot two entries in the warp with the exact same state; if we find any, we merge them so that we no longer have to emulate both.
The great benefit of this is that we can now detect when two test-case inputs with different expected output indexes try to merge. This always means the program cannot succeed, and we can discard the entire branch early.
So, putting it all together – suppose we're optimizing the aforementioned popcnt. After executing the first instruction, inputs 0x03 and 0x06 might both produce the same CPU state. Since both expect output 2, we can merge them into a single warp entry. But, if 0x03 and 0x04 produce the same state – and they expect outputs 2 and 1 respectively – we know no future instruction sequence can produce different outputs from identical states, so we prune immediately.
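A sketch of that sort-merge-prune step for the popcnt example, with each warp entry reduced to a `(state, expected_output_index)` pair and an "instruction" modelled as a function on states:

```python
def step_and_merge(warp, instruction):
    """Advance every warp member one instruction, then sort and merge.
    Returns None when two entries with different expected outputs collide,
    which proves the candidate can never succeed."""
    advanced = sorted((instruction(state), out) for state, out in warp)
    merged = []
    for state, out in advanced:
        if merged and merged[-1][0] == state:
            if merged[-1][1] != out:
                return None          # identical states, different outputs
            continue                 # identical states, same output: merge
        merged.append((state, out))
    return merged

# popcnt: input byte -> expected output index = number of set bits
warp = [((a,), bin(a).count("1")) for a in range(256)]
```

Running e.g. an "AND #$0F" (`lambda s: (s[0] & 0x0F,)`) over this warp collapses 0x00 and 0x10 into one state while they expect outputs 0 and 1, so the whole branch is pruned immediately.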
Shadow Instructions
Before I can explain the next optimization, I should describe what I call shadow instructions.
6502 instructions can be one, two, or three bytes long. What happens if a branch instruction jumps into the second or third byte of a longer instruction? The CPU simply decodes that byte as an opcode and executes it. This is quite interesting, as we can use it to squeeze down the size of programs by having opcode arguments serve as opcodes themselves, doing two jobs at once.
To make this faster, we pre-calculate a table of opcode arguments that are simultaneously e.g. a valid memory location and 1-byte instruction. When an instruction is generated, it first checks if its argument is the target of any branch in the program, and if so uses the more limited lookup table. While we're at it, we also prevent the branch from jumping into itself – something which will always cause an infinite loop – as well as jumping to the very next instruction, as that would be the same as a no-op.
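A sketch of precomputing such a table, intersecting the user's allowed zero-page range with a (small, illustrative) set of official one-byte opcodes:

```python
# A subset of the official single-byte 6502 opcodes:
# ASL A, CLC, SEC, TXA, TYA, TAY, TAX, INY, DEX, INX, NOP.
ONE_BYTE_OPCODES = {0x0A, 0x18, 0x38, 0x8A, 0x98, 0xA8, 0xAA,
                    0xC8, 0xCA, 0xE8, 0xEA}

def dual_purpose_addresses(allowed_lo, allowed_hi):
    """Zero-page operands that also decode as a valid one-byte instruction,
    and so are safe landing spots for a branch."""
    return sorted(b for b in range(allowed_lo, allowed_hi + 1)
                  if b in ONE_BYTE_OPCODES)
```

When an instruction's operand is a known branch target, the generator draws from this restricted table instead of the full operand range.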
Branch Templates
To enable the next class of optimizations, we simplify the emulator by first writing branch templates. A branch template describes how many branches a program contains, where they are located, and where each of them points. For each program size, we generate all possible combinations of branch templates beforehand, and then go through them one by one.
The first branch template is always the one where no branches exist, so we always check the straightforward case first where the emulation simply proceeds downwards with no control flow at all. Then we check the case where a branch exists on the first byte, targeting the first valid branch target and so on until all templates are exhausted.
Why bother with this? Because we now know in advance, while generating instructions, whether the instruction being generated will be the target of a branch. For example, consider the situation where we generate a "Load the value 0x02 into register A" instruction. This is two bytes, the second being 0x02. If we know in advance that the second byte is a branch target, we also know the program won't be valid, as 0x02 is one of those opcodes that jam the CPU.
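One way to enumerate templates, assuming for simplicity at most one branch and applying the two filters mentioned earlier (no branch targeting itself, no branch to the immediately following instruction):

```python
from itertools import combinations, product

def branch_templates(size, max_branches=1):
    """Yield templates as tuples of (branch_offset, target_offset) pairs.
    The empty template, i.e. straight-line code, always comes first."""
    yield ()
    positions = range(size - 1)          # a branch occupies two bytes
    for n in range(1, max_branches + 1):
        for pos in combinations(positions, n):
            for targets in product(range(size), repeat=n):
                # skip branches into themselves (p) or falling through (p+2)
                if any(t in (p, p + 2) for p, t in zip(pos, targets)):
                    continue
                yield tuple(zip(pos, targets))
```

The superoptimizer then fixes one template at a time and only generates instruction bytes compatible with it.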
Eliminating Degenerate Sequences
With the branch templates generated beforehand we can now proceed with more optimizations. Since we know beforehand where all loops are located, it is now possible to split the program into multiple linear sections that don't contain control flow.
When generating code inside one such linear section, we can apply optimization rules that detect cases that we know won't be optimal. Consider the case where the previous instruction was the "Clear Carry Flag" instruction. Since generating another "Clear Carry Flag" instruction won't do anything at all, we know this cannot possibly be optimal and can discard that possibility.
In other cases there may be multiple different ways of reaching the same end result; for example, clearing the carry flag and overflow flag may be done in any order. In those cases we select a preferred ordering and discard the other branch.
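Within a linear section these rules boil down to a table of forbidden successor pairs; a sketch covering the two flag examples above (the actual rule set is larger):

```python
CLC, SEC, CLV, LDA_IMM = 0x18, 0x38, 0xB8, 0xA9

REDUNDANT_AFTER = {
    CLC: {CLC, SEC},   # a second carry write makes the first one dead
    SEC: {SEC, CLC},
    CLV: {CLV},
}
# Commuting pairs, listed in their one canonical order (CLV before CLC/SEC).
COMMUTING = {(CLV, CLC), (CLV, SEC)}

def allowed_successor(prev_op, next_op):
    """False when the pair is provably non-optimal inside a linear section."""
    if next_op in REDUNDANT_AFTER.get(prev_op, ()):
        return False
    if (next_op, prev_op) in COMMUTING:
        return False   # the mirrored ordering is the canonical one
    return True
```

Rejecting a pair prunes every program containing it, so even a handful of rules cuts the search space substantially.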
Unofficial Instructions
A fair number of the instructions that the 6502 supports are unofficial: they weren't documented, and their effects (if any) were unintended by the designers, but some actually do various weird and interesting things.
In any case, programmers usually avoid them for several reasons:
Some are unreliable, in that they don't always perform the same operation for every variant or even batch of the 6502.
Their effects are often odd, and keeping them in mind while writing code is difficult. Consider the INS instruction (0xE7): it takes a byte from memory, adds one to it, then subtracts the result (with borrow) from the A register. Most of them do rather situationally useful stuff like that.
There is no universally accepted naming scheme for them, and code you write will likely not be portable between assemblers.
Emulators don't always support them. This could be important when e.g. writing games, as you want your code to run everywhere.
We support optionally generating these instructions, along with an option to interpret them like a specific 6502 flavour – for instance the NES's, if you are planning to write a game for it.
Samples
Multiply the A register by 10
The straightforward way to multiply by 10 is to compute 8x and 2x separately, then add them together. This takes about 8 bytes. The superoptimizer found a 7-byte sequence that uses the unofficial instruction RRA to shave off a byte:
First, two ASL A instructions shift A left twice, computing 4x – but between them, the intermediate value 2x is stashed in memory. The RRA $00 instruction then does two things in a single opcode: it rotates the stored 2x right (recovering x, with the carry flag from the preceding shift ensuring this works correctly even when values overflow), and adds the result back into A – giving 4x + x = 5x. A final ASL A doubles this to 10x.
A human programmer might not have thought to use RRA here, as it's a quirky unofficial instruction that combines a rotate with an addition.
C000 0A ASL A
C001 85 STA $00
C002 00
C003 0A ASL A
C004 67 U RRA $00
C005 00
C006 0A ASL A
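As a sanity check of the trace above, here is a small Python model of just the three instruction types involved (ASL A, STA zero-page, and the unofficial RRA), assuming binary mode:

```python
def mul10(a):
    # C000: ASL A
    carry, a = a >> 7, (a << 1) & 0xFF      # A = 2x
    # C001: STA $00
    mem = a                                  # stash 2x
    # C003: ASL A
    carry, a = a >> 7, (a << 1) & 0xFF      # A = 4x, carry = top bit of 2x
    # C004: RRA $00 - ROR the stored byte, then ADC it into A
    mem, ror_carry = (carry << 7) | (mem >> 1), mem & 1
    total = a + mem + ror_carry
    a = total & 0xFF                         # A = 5x (mod 256)
    # C006: ASL A (the final carry out is simply discarded)
    return (a << 1) & 0xFF                   # A = 10x (mod 256)
```

Checking all 256 inputs confirms the sequence computes 10x mod 256 exactly, overflow cases included.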
max(A, X) -> A
This example showcases a shadow instruction. The byte 8A at address C001 serves two different purposes depending on how execution reaches it.
In normal sequential execution, 8A is never decoded as an opcode at all. It's consumed as the zero-page address operand of STX, and later by CMP. The program stores X into address $8A, then compares A against it. If A >= X, the carry flag is set, the BCC branch is not taken, and execution falls through with A unchanged, which is already the maximum.
If A < X however, the carry flag is clear and BCC jumps back to address C001. Now, instead of being read as an operand, the byte 8A is decoded as an opcode – TXA, which copies X into A. Execution then falls through into the same CMP and BCC again, but this time A and the stored value are equal, so the carry is set and the loop exits. A now holds X, the larger value.
C000 86 STX $8A
C001 8A S TXA
C002 C5 CMP $8A
C003 8A
C004 90 BCC $FB (C001)
C005 FB
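The listing can likewise be walked through mechanically; this sketch hard-codes the four decode positions, including the jump into the middle of CMP's operand:

```python
def max_ax(a, x):
    mem = {}
    pc, carry, steps = 0xC000, 0, 0
    while pc != 0xC006 and steps < 10:       # C006 = fell off the end
        steps += 1
        if pc == 0xC000:    # 86 8A : STX $8A
            mem[0x8A] = x; pc = 0xC002
        elif pc == 0xC001:  # 8A    : TXA (the shadow instruction)
            a = x; pc = 0xC002
        elif pc == 0xC002:  # C5 8A : CMP $8A sets carry iff A >= operand
            carry = 1 if a >= mem[0x8A] else 0; pc = 0xC004
        elif pc == 0xC004:  # 90 FB : BCC back to C001 when carry is clear
            pc = 0xC001 if carry == 0 else 0xC006
    return a
```

Exhausting all 65536 (A, X) pairs confirms the loop always exits with A holding the maximum, taking the back-branch at most once.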
Wrapping up
The code itself is written in Zig*, EUPL-1.2 licensed, and can be found in the GitHub repo here. It works well enough, though some optimizations – such as the degenerate sequence filter – have room for improvement. In addition it does not yet support 3-byte opcodes, or operations modifying the stack.
What program size can you expect to synthesize with this? On my 8-core AMD Ryzen 7 3800X, I can generate sequences of up to ~11 bytes if I let it crunch for an hour or so, though much of this depends on what limitations you put on the optimizer. For instance, trying every possible zero-page memory address is significantly slower than just giving it a few low bytes to play around with.
What if you want to generate the fastest rather than the shortest code? No problem: the tool outputs the Pareto frontier, i.e. every program that is either faster or shorter than any previously found program.
* Why Zig? Partly because I'm a Zig fanboy, but also because it's as fast as C, and the compile-time features really shine here. It can basically construct a custom emulator for each task you give it – like only including certain registers, memory addresses, and opcodes in the precomputed tables if you are interested in tracking them – something which would have required an unsustainably massive amount of preprocessor boilerplate in C.