Mark’s Magic Multiply
This post is about a topic very near and dear to my heart. That’s right: single-precision floating-point multiplication on embedded processors. I’ll start with some background on why I’ve been so invested in this topic recently, walk through the implementations I’ve come up with on my own, and end by dissecting an absolutely ridiculous trick by Mark Owen for floating-point multiplication on 32-bit embedded cores, which was the original inspiration for this post.
⚠️ This post contains floating point. Floating point is known to the State of California to cause confusion and a fear response in mammalian bipeds. The standard recommendation is What Every Computer Scientist Should Know About Floating-Point Arithmetic. The actual IEEE 754-2008 standard is also uncharacteristically concise and readable, provided you ignore the fan fiction about radix != 2. For a more tactile experience try poking ones and zeroes into IEEE 754 Calculator (start with binary16).
Not Hard, Not Soft
Lately I’ve been working on a custom RISC‑V extension called Xh3sfx for accelerating soft floating-point routines. This is a halfway house between having an FPU and not having an FPU, which I feel is an under-explored space. You could call it firm floating point.
When you compile a C program using float variables for a target that lacks floating-point hardware support, the compiler inserts calls to a runtime library like libgcc or compiler-rt to perform the requested operations. This is sometimes called floating point emulation because it fills the role of a hardware FPU, but really it’s just one approach to implementing the floating-point operations specified in IEEE 754.
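To make that concrete: on a soft-float target, a single-precision `a * b` typically becomes a call to the runtime routine `__mulsf3`. Below is a minimal sketch of the kind of work such a routine does using only integer operations. It is not the libgcc, compiler-rt or Xh3sfx implementation. The names `sketch_mulsf3`, `sketch_mul`, `f32_bits` and `bits_f32` are made up for illustration, and the code assumes both inputs and the result are normal numbers (no zeros, subnormals, infinities, NaNs, overflow or underflow).

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a float as its raw binary32 bit pattern, and back. */
uint32_t f32_bits(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
float bits_f32(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* Hypothetical sketch of a soft-float multiply on raw bit patterns.
 * ASSUMPTION: inputs and result are normal numbers; special cases are
 * deliberately not handled. Rounding is round-to-nearest, ties-to-even. */
uint32_t sketch_mulsf3(uint32_t a, uint32_t b)
{
    uint32_t sign = (a ^ b) & 0x80000000u;
    int32_t ea = (int32_t)((a >> 23) & 0xffu);
    int32_t eb = (int32_t)((b >> 23) & 0xffu);

    /* 24-bit significands with the implicit leading one restored. */
    uint32_t ma = (a & 0x007fffffu) | 0x00800000u;
    uint32_t mb = (b & 0x007fffffu) | 0x00800000u;

    /* 24 x 24 -> 48-bit product, in the range [2^46, 2^48). */
    uint64_t prod = (uint64_t)ma * mb;
    int32_t exp = ea + eb - 127;        /* remove one copy of the bias */

    /* Normalise: if bit 47 is set the product is in [2, 4). */
    unsigned shift = 23;
    if (prod & (1ull << 47)) {
        shift = 24;
        exp += 1;
    }

    uint32_t mant = (uint32_t)(prod >> shift);     /* 24 bits, top bit set */
    uint64_t rem  = prod & ((1ull << shift) - 1);  /* discarded low bits */
    uint64_t half = 1ull << (shift - 1);

    /* Round to nearest, ties to even. */
    if (rem > half || (rem == half && (mant & 1u))) {
        mant += 1;
        if (mant == 0x01000000u) {      /* rounding carried out of 24 bits */
            mant >>= 1;
            exp += 1;
        }
    }

    return sign | ((uint32_t)exp << 23) | (mant & 0x007fffffu);
}

/* Convenience wrapper taking and returning float. */
float sketch_mul(float a, float b)
{
    return bits_f32(sketch_mulsf3(f32_bits(a), f32_bits(b)));
}
```

A real runtime routine spends much of its code on exactly the cases this sketch skips, which is where most of the unpleasantness in the format lives.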
Although Xh3sfx is a custom extension, I’m not signing up to maintain and distribute a forked compiler. It’s easier to just replace the compiler runtime routines with accelerated versions. The new routines use a handful of specialised ALU operations to handle the gritty and ugly parts of floating-point formats, mixed in with regular integer instructions for the actual computation. The runtime libraries have a mostly documented and stable API surface. Adding support to your program just requires linking the acceleration library or adding its source files to your build, which is a reasonable approach for embedded firmware.
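For a feel of what the format-handling work looks like in plain integer code, here is a small sketch of classifying and unpacking a binary32 value. This is only an illustration of the IEEE 754 encoding rules: the names `f32_class`, `f32_classify` and `f32_significand` are hypothetical, and nothing here is meant to describe what the Xh3sfx instructions actually do.

```c
#include <stdint.h>

/* Hypothetical names, for illustration only. */
typedef enum {
    F32_ZERO,
    F32_SUBNORMAL,
    F32_NORMAL,
    F32_INF,
    F32_NAN
} f32_class;

/* Classify a raw binary32 bit pattern by its exponent and fraction fields. */
f32_class f32_classify(uint32_t bits)
{
    uint32_t exp  = (bits >> 23) & 0xffu;   /* 8-bit biased exponent */
    uint32_t frac = bits & 0x007fffffu;     /* 23-bit fraction field */

    if (exp == 0)
        return frac ? F32_SUBNORMAL : F32_ZERO;
    if (exp == 0xffu)
        return frac ? F32_NAN : F32_INF;
    return F32_NORMAL;
}

/* Return the significand as an integer. Normal numbers get the implicit
 * leading one at bit 23; zeros and subnormals have no implicit one. */
uint32_t f32_significand(uint32_t bits)
{
    uint32_t exp  = (bits >> 23) & 0xffu;
    uint32_t frac = bits & 0x007fffffu;
    return exp ? (frac | 0x00800000u) : frac;
}
```

Every soft-float operation has to do some version of this before the actual arithmetic starts, and the inverse when packing the result.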
For a nominal fee of a few hundred gates, Xh3sfx gives you single-precision addition in 14 cycles and multiplication in 16 cycles, ignoring function call overhead. (It can do other stuff too; these are just examples.) Qualitatively this turns floating point from “oh god why is this so slow” into something that Just Works™ in general application code and light audio DSP. I originally posted about it on Mastodon here. You can read about the instructions here and see some library routines here.
Multiplying with Xh3sfx
The default single-precision multiply implementation in the Xh3sfx library has the following steps: