
Optimizing a 6502 image decoder, from 70 minutes to 1 minute


When I set out to write a program that would allow me to do basic digital photography on the Apple II, I decided I would do it with the Quicktake cameras. It seemed the obvious choice as they were Apple cameras, and their interface to the computer is via serial port.

The scope crept a bit after I managed to decode Quicktake 100 photos: I wanted it to decode Quicktake 150 and Quicktake 200 pictures too. This threw me into image processing much more deeply than I had initially intended. This article explains how I got the Quicktake 150 decoder to a reasonable-ish speed on a 6502 at 1 MHz.

The Quicktake 150 format is proprietary and undocumented. Free software decoders exist, though, in the dcraw project, which was my source for the initial Apple II decoder. Sadly, it is written in C, almost entirely undocumented, and extremely (to me!) hard to read and understand. The compression is based on Huffman coding with variable-length codes (which means bit-shifting), and the image construction involves a lot of 16-bit math. None of this is good on a 6502.

But first I had to rework the original algorithm to work with bands of 20 pixels, for memory reasons. Once I had a functional decoder, it ran perfectly, but it took… seventy minutes to decode a single picture.
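The banding rework can be outlined like this (a hypothetical sketch with illustrative names and sizes, not the actual code): instead of decoding the whole picture into a buffer the Apple II doesn't have the memory for, decode a fixed-height band, consume it, and reuse the same buffer for the next band.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical band-based decode loop: only one band of the image is
 * ever held in memory. WIDTH/HEIGHT/BAND_HEIGHT are illustrative. */
#define WIDTH       640
#define HEIGHT      480
#define BAND_HEIGHT 20

static uint8_t band[BAND_HEIGHT][WIDTH];  /* the only image buffer kept */

/* Stand-ins for the real decoder steps. */
static void decode_rows(uint8_t buf[][WIDTH], int rows) {
    memset(buf, 0x80, (size_t)rows * WIDTH);  /* pretend: mid-gray pixels */
}
static void output_rows(const uint8_t buf[][WIDTH], int rows) {
    (void)buf; (void)rows;  /* dithering/display would happen here */
}

int decode_image(void) {
    int bands = 0;
    for (int y = 0; y < HEIGHT; y += BAND_HEIGHT) {
        int rows = (HEIGHT - y < BAND_HEIGHT) ? HEIGHT - y : BAND_HEIGHT;
        decode_rows(band, rows);  /* decompress just this band */
        output_rows(band, rows);  /* consume it, then reuse the buffer */
        bands++;
    }
    return bands;
}
```

One consequence of any such rework is that the decompressor's state (bit position, Huffman context, previous-row predictors) has to be carried across band boundaries instead of living in one big pass over the image.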

A Quicktake 150 picture, fully decoded

The same picture, dithered for display on a monochrome Apple II

Of course, I didn’t get there that fast. The first implementation was released two years ago, in November 2023. Getting to where I am now took, I think, five or six deep dives, each one costing one or two weeks’ worth of late evenings and full weekends dedicated to progress: wading through hundreds or thousands of debug printf()s, gdb’ing, comparing variables and offsets, etc.

Follow me through the algorithmic iterations that brought the decoding time to under one minute. My implementation is now full assembly, but the commits I will link to here are to the general decoding algorithm, which is easier to read in C.

I have noticed that hand-optimizing assembly yields good results, but optimizing the algorithm itself usually leads to much more impressive speed gains. Doing too many things faster is not as good as doing only the minimum, faster. And that Quicktake 150 decoder sure did useless things, especially in my case, where I don’t care about color and end up with a 256×192 image!

I have made a dedicated repository to track these algorithmic changes. It started here (already a little deobfuscated compared to dcraw), at 301 million x86_64 instructions.
