Bitwise conversion of doubles using only FP multiplication and addition (2020)

In the words of Tom Lehrer, “this is completely pointless, but may prove useful to some of you some day, perhaps in a somewhat bizarre set of circumstances.”

The problem is as follows: suppose you’re working in a programming environment that provides only an IEEE-754 double-precision floating point (“double”) type, and no operations that can access that type’s representation (such as C++ bitwise cast operations, or Javascript’s DataView object). You have a double and you want to convert it to its bitwise representation as two unsigned 32-bit integers (stored as doubles), or vice versa. This problem comes up from time to time, but I was curious about a different question: how restricted can your programming environment be? Could you do it with just floating point multiplication and addition?

Bitwise conversion using floating point operations can be useful in situations like limited interpreted languages, or C++ constexpr contexts. Generally double to int conversion can be done using a binary search, comparing with powers of two to figure out the bits of the exponent. From there the fraction bits can be extracted, either by binary searching more, or using the knowledge of the exponent to scale the fraction bits into the integer range.

But can it be done without bitwise operations, branches, exponentiation, division, or floating point comparisons?

It seemed improbable at first, but I’ve discovered the answer is yes, multiplication and addition are mostly sufficient, although with a few notable caveats. Even without these restrictions different NaN values cannot be distinguished or generated (without bitwise conversion) in most environments, but using only multiplication and addition it is impossible to convert NaN, Infinity or -Infinity into an unsigned 32-bit value. The other problematic value is “negative zero”, which cannot be differentiated from “positive zero” using addition and multiplication. All my code uses subtraction, although it could be removed by substituting a - b with a + (b * -1) . And finally, this relies on IEEE-754 operations (in the usual rounding mode, “round to nearest, ties to even”), so it wouldn’t work in environments that use unsafe maths optimisations (the default in shader compilers, or enabled by a flag such as /fp:fast in many other compilers).

So, if you just need a solution, here it is, but otherwise stick around for an explanation:

function double_as_uint32s(double) { // Doesn't handle NaN, Infinity or -Infinity. Treats -0 as 0. var a = double, b, c, d, e, f, g, h, i, j, k, l, m, n, low, high; f=2.2250738585072014e-308+a; j=5e-324; b=j+f; b-=f; m=-5e-324; d=m+b; b=4.4989137945431964e+161; d=b*d; d=b*d; g=d*d; d=1.0; g=d-g; h=m+f; f=h-f; f=j+f; f=b*f; f=b*f; f*=f; f=d-f; f*=g; g=-2.2250738585072014e-308+a; h=j+g; h-=g; h=m+h; h=b*h; h=b*h; h*=h; h=d-h; c=m+g; c-=g; c=j+c; c=b*c; c=b*c; c*=c; c=d-c; c*=h; k=c*f; c=5.562684646268003e-309*a; g=j+c; g-=c; g=m+g; g=b*g; g=b*g; g*=g; g=d-g; h=m+c; h-=c; h=j+h; h=b*h; h=b*h; h*=h; h=d-h; g=h*g; h=a*g; g=d-g; c=g*c; g=1024.0*g; f=2.0+g; c+=h; h=7.458340731200207e-155*c; l=1.0000000000000002; g=l*h; g=m+g; e=j+g; e-=g; e=b*e; e=b*e; c=e*c; e=d-e; g=e*h; c=g+c; e=512.0*e; g=8.636168555094445e-78*c; e+=f; f=l*g; f=m+f; h=j+f; f=h-f; f=b*f; f=b*f; c=f*c; f=d-f; g=f*g; f=256.0*f; c=g+c; e=f+e; f=2.938735877055719e-39*c; g=l*f; g=m+g; h=j+g; g=h-g; g=b*g; g=b*g; c=g*c; g=d-g; f=g*f; c=f+c; f=128.0*g; g=5.421010862427522e-20*c; e=f+e; f=l*g; f=m+f; h=j+f; f=h-f; f=b*f; f=b*f; c=f*c; f=d-f; g=f*g; f=64.0*f; c=g+c; e=f+e; i=2.3283064365386963e-10; f=i*c; g=l*f; g=m+g; h=j+g; g=h-g; g=b*g; g=b*g; c=g*c; g=d-g; f=g*f; c=f+c; f=32.0*g; g=1.52587890625e-05*c; e=f+e; f=l*g; f=m+f; h=j+f; f=h-f; f=b*f; f=b*f; c=f*c; f=d-f; g=f*g; f=16.0*f; c=g+c; e=f+e; f=0.00390625*c; g=l*f; g=m+g; h=j+g; g=h-g; g=b*g; g=b*g; c=g*c; g=d-g; f=g*f; c=f+c; f=8.0*g; g=0.0625*c; e=f+e; f=l*g; f=m+f; h=j+f; f=h-f; f=b*f; f=b*f; c=f*c; f=d-f; g=f*g; f=4.0*f; c=g+c; e=f+e; f=0.25*c; g=l*f; g=m+g; h=j+g; g=h-g; g=b*g; g=b*g; c=g*c; g=d-g; f=g*f; c=f+c; f=g+g; e=f+e; n=0.5; f=n*c; g=l*f; g=m+g; h=j+g; g=h-g; g=b*g; g=b*g; c=g*c; g=d-g; f=g*f; c=f+c; e=g+e; f=d-k; g=j+a; g-=a; g=m+g; g=b*g; g=b*g; g*=g; g=d-g; h=m+a; a=h-a; a=j+a; a=b*a; a=b*a; a*=a; a=d-a; a*=g; g=f*a; a=d-a; a=e*a; a+=g; e=l*c; e=m+e; g=j+e; e=g-e; e=b*e; e=b*e; g=n*c; c=e*c; e=d-e; e*=g; c=e+c; e=4.450147717014403e-308+c; g=j+e; g-=e; g=m+g; g=b*g; g=b*g; g*=g; g=d-g; h=m+e; e=h-e; e=j+e; e=b*e; e=b*e; e*=e; e=d-e; e*=g; g=e+e; d-=g; c=d*c; c=b*c; b*=c; c=-4503599627370496.0*f; c+=b; b=i*c; b=-0.4999999998835847+b; b=4503599627370497.0+b; d=-4503599627370497.0+b; b=2147483648.0*e; a=1048576.0*a; a=b+a; b=d+a; a=-4294967296.0*d; a+=c; low=a; high=b; return [low, high]; } function uint32s_as_double(low, high) { var a = low, b = high, c, d, e, f, g, h, i, j, k, l, m; b=9.5367431640625e-07*b; f=-0.4999999998835847; c=f+b; g=4503599627370497.0; c=g+c; e=-4503599627370497.0; c=e+c; d=b-c; c=0.00048828125*c; b=f+c; b=g+b; k=e+b; l=c-k; j=2.2250738585072014e-308; c=j+l; c-=l; i=4.49423283715579e+307; b=i*c; c=1.0; b=c-b; a=2.220446049250313e-16*a; h=-0.00048828125+l; a=d+a; d=b*h; d+=d; h=f+d; h=g+h; h=e+h; d-=h; b+=a; b=j*b; m=1.3407807929942597e+154; h=m*h; h=c+h; b=h*b; b*=h; d+=d; h=f+d; h=g+h; h=e+h; d-=h; h=m*h; h=c+h; b=h*b; d+=d; h=f+d; h=g+h; h=e+h; d-=h; h=1.157920892373162e+77*h; h=c+h; b=h*b; d+=d; h=f+d; h=g+h; h=e+h; d-=h; h=3.402823669209385e+38*h; h=c+h; b=h*b; d+=d; h=f+d; h=g+h; h=e+h; d-=h; h=1.8446744073709552e+19*h; h=c+h; b=h*b; d+=d; h=f+d; h=g+h; h=e+h; d-=h; h=4294967295.0*h; h=c+h; b=h*b; d+=d; h=f+d; h=g+h; h=e+h; d-=h; h=65535.0*h; h=c+h; b=h*b; d+=d; h=f+d; h=g+h; h=e+h; d-=h; h=255.0*h; h=c+h; b=h*b; d+=d; h=f+d; h=g+h; h=e+h; d-=h; h=15.0*h; h=c+h; b=h*b; d+=d; f+=d; f=g+f; e+=f; d-=e; e=3.0*e; e=c+e; b=e*b; d+=d; d=c+d; b=d*b; d=-0.99951171875+l; e=j+d; d=e-d; d=i*d; e=j+a; a=e-a; a=i*a; a=c-a; a=d*a; a=m*a; a=m*a; a-=a; a=b+a; b=k+k; b=c-b; a*=b; return a; }

(I’m mostly joking, but it would be pretty funny to find that code buried in a library someday, and it should be pretty easy to port to just about any language.)

Background: IEEE-754 Doubles

This aims to be a concise explanation of everything you need to know about doubles to understand the rest of the article. Skip it if you know it already.

... continue reading