We have recently released a set of optimized assembly-language routines for basic floating-point arithmetic, in Arm Optimized Routines, under an open-source license.
These functions perform the same operations as hardware floating-point instructions: for example, addition, multiplication, and division. However, they are implemented using only 32-bit integer Arm instructions, for Arm CPUs without hardware floating point. Existing open-source libraries such as libgcc and compiler-rt offer similar functions. Our optimized versions were previously part of the Arm Compiler for Embedded toolchain, which has reached end of life. We are now making them available as open source.
Preparing these functions for publication mainly involved polishing the code and rewriting the comments to improve clarity. However, one of them was much more interesting: the double-precision division function, “ddiv”, as we call it. Polishing it up turned into an adventure in formal verification, and ended up finding a bug in the original version.
In this two-part blog, I’ll tell the story, and share my learning experiences using Gappa and the Rocq prover.
What is “ddiv” and how does it work?
The “ddiv” function implements double-precision floating-point division in accordance with the IEEE 754 standard. It accepts two (64-bit) floating-point numbers as input and returns their quotient as output. Since it only uses integer arithmetic, it treats the inputs and outputs as 64-bit integers. Each integer is divided into three fields: sign, exponent, and mantissa.
Floating-point functions require handling many special cases such as NaNs, infinities, denormalized numbers, overflow, underflow, and signed zeroes. Division also requires handling dividing by zero. However, this discussion focuses on ordinary cases, where nothing strange happens.
In these ordinary cases, the signs and exponents of the input number are easy to deal with. The primary challenge is the mantissas. Each input mantissa is effectively a 53-bit binary number, made by putting a 1 bit on the front of the 52 bits stored in the input value. Given two such mantissas, n (numerator) and d (denominator), the goal is to compute their quotient and round it to the nearest 53-bit binary number, efficiently.
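As a concrete illustration, here is a minimal C sketch of that unpacking. The type and field names are my own inventions, not taken from the ddiv source, which does the equivalent in assembly:

#include <stdint.h>

/* Unpack the three fields of a double held as a 64-bit integer. For a
 * normal (non-denormal) number, prefixing the implicit leading 1 to the
 * 52 stored bits gives the effective 53-bit mantissa. */
typedef struct {
    uint32_t sign;      /* 1 bit */
    uint32_t exponent;  /* 11 bits, biased by 1023 */
    uint64_t mantissa;  /* 53 bits, leading bit set */
} unpacked;

static unpacked unpack(uint64_t bits)
{
    unpacked u;
    u.sign     = (uint32_t)(bits >> 63);
    u.exponent = (uint32_t)(bits >> 52) & 0x7FF;
    u.mantissa = (bits & 0x000FFFFFFFFFFFFFull) | (1ull << 52);
    return u;
}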
There are many ways to approach this challenge. The most obvious is the binary analogue of ordinary pencil-and-paper long division: loop round 53 times, generating one output bit every time. However, we can go significantly faster than that.
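For illustration only, here is what such a loop might look like in C. This is a sketch, not the released code: it assumes both mantissas have their leading bit set and that the caller has arranged n ≥ d, and it leaves out the rounding step that would use the final remainder:

#include <stdint.h>

/* Bit-by-bit mantissa division, one quotient bit per iteration.
 * Given 53-bit mantissas with n >= d, returns floor(n * 2^52 / d),
 * a 53-bit quotient with its leading bit set. */
static uint64_t divide_bitwise(uint64_t n, uint64_t d)
{
    uint64_t q = 0, r = n;
    for (int i = 0; i < 53; i++) {
        q <<= 1;
        if (r >= d) {   /* this output bit is 1 */
            r -= d;
            q |= 1;
        }
        r <<= 1;        /* move on to the next bit */
    }
    return q;           /* the final remainder r would drive rounding */
}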
The approach we took in “ddiv” is to use an approximate method. We start by calculating an approximation to 1 / d (the reciprocal of the denominator). Then we multiply that by n, to get an approximation to the quotient n / d. This strategy can be implemented a lot faster than the bit-by-bit loop, because the CPU’s multiplication instruction does a lot more of the work for us.
The approximations are more precise than the output needs to be. In the version of the code I started with, the approximate reciprocal occupies 61 bits, rather than 53. So, 8 extra bits are generated, which help us decide which way to round the result, but are discarded from the final output. However, those extra bits are not all accurate: the technique we use to calculate the approximate reciprocal introduces some error.
IEEE 754 requires that the result returned from the division operation is the closest possible floating-point number to the true mathematical result. If you round any number in the wrong direction, then you are in violation of the standard. Therefore, we need to be careful about that approximation error. Is it possible that it makes us round a number the wrong way?
To illustrate this, here is a diagram of a small piece of the number line:
The labels show numbers that are exactly representable in double-precision floating point. In the center is an arbitrary number x. The values to the left and right change by the “machine epsilon” ε at each step. This is the minimum separation between floating-point numbers with a particular exponent.
The dotted lines, located halfway between representable numbers, show the rounding boundaries. In the default IEEE 754 mode, numbers are rounded to the nearest output value. This may be either rounding up or down. A number just left of a dotted line rounds down, and one just right rounds up.
The green interval on the left shows one possible output from ddiv’s main division code. The actual value returned is the circle at the left end of the interval. The code uses an approximate method, but it only errs in one direction, so the true quotient is never further left than the green blob. It might be further right, anywhere within the marked interval.
Fortunately, it does not matter. The whole interval lies between the same pair of dotted lines, so any value in that interval will round to the same output value, namely x − ε. Therefore, the uncertainty in our approximate quotient does not affect the final output.
The red line on the right shows what happens if we are not so lucky. In this case, the interval crosses a rounding boundary. The smaller values on the left side of the dotted line round down to x. The larger values on the right side round up to x + ε. The correct output value is uncertain. Our fast but approximate calculation has not managed to narrow down to just one answer. In this case, ddiv runs correction code that performs additional calculation to determine the correct output.
This article does not cover the correction code in detail. The key point is that correction takes significant effort, so it should be avoided where possible.
We calculate an upper bound on the error in ddiv’s approximate method. That tells us the width of one of those intervals of uncertainty, a range of numbers where the true quotient must lie. In most cases, this means we can skip the slower correction code. We simply round and return the approximate value.
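To make that concrete, here is a rough C sketch of how such a bound gets used. The names are my own: q61 is the approximation with 8 extra bits below the final output’s last place, ERR_BOUND is the proven error bound in units of q61’s last place, and slow_correction() stands for the correction code. The real test is more careful than this:

#include <stdint.h>

#define ERR_BOUND 20    /* placeholder value for the proven bound */

extern uint64_t slow_correction(uint64_t q61);  /* hypothetical */

/* A rounding boundary sits at 0x80 in the 8 extra bits. If the whole
 * interval of uncertainty stays on one side of it, round immediately;
 * otherwise, fall back to the correction code. */
static uint64_t round_or_correct(uint64_t q61)
{
    uint64_t frac = q61 & 0xFF;             /* the 8 extra bits */
    if (frac >= 0x80 - ERR_BOUND && frac < 0x80 + ERR_BOUND)
        return slow_correction(q61);        /* too close to call */
    return (q61 + 0x80) >> 8;               /* round to nearest */
}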
ddiv stands or falls on its error bound
The upper bound on the approximation error is central to the ddiv implementation strategy.
The narrower we make that interval of uncertainty, the faster the function runs on average: fewer cases require the slower correction code, and more can use the “fast path”, which rounds and returns the approximate value.
This is a rare case where improved mathematical analysis improves performance without changing the algorithm. By tightening the error bound of the existing algorithm, we only need to tweak a couple of constants in the code. As a result, the entire function becomes faster.
An error bound that is too large reduces performance. An error bound that is too small can cause an incorrect answer to be returned. The error bound must be as tight as possible, but no tighter.
In single precision floating point, the problem is simpler. There are only 2²³ possible values of the denominator d. You can just iterate over all of them and find out how inaccurate your approximation to 1 / d can be, as sketched below. In double precision, there are 2⁵² possible values, and that is too many to loop over. We must rely on mathematical proof, rather than brute-force computation.
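In C, that single-precision brute-force check might look something like this. approx_recip() is a hypothetical stand-in for whatever reciprocal approximation is being validated, assumed here to return a value comparable with floor(2⁴⁷ / d):

#include <stdint.h>
#include <stdio.h>

extern uint32_t approx_recip(uint32_t d);   /* hypothetical routine under test */

int main(void)
{
    uint32_t worst = 0;
    /* Loop over all 2^23 possible 24-bit mantissas (leading bit set). */
    for (uint32_t m = 0; m < (1u << 23); m++) {
        uint32_t d = (1u << 23) | m;        /* mantissa in [2^23, 2^24) */
        /* Scaled-up exact reciprocal: floor(2^47 / d). */
        uint32_t exact = (uint32_t)(((uint64_t)1 << 47) / d);
        uint32_t approx = approx_recip(d);
        uint32_t err = approx > exact ? approx - exact : exact - approx;
        if (err > worst)
            worst = err;
    }
    printf("worst-case error: %u units in the last place\n", worst);
    return 0;
}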
When this ddiv function was first written, I calculated the error bound by hand and recorded all my working in a giant comment in the code. A proof in that form requires an expert mathematician to check that it is correct. While I was polishing up the code and the comments, this seemed like a good moment to try to improve the situation.
What is the alternative to a proof that requires a human to check it? A proof checked by a computer. By formalizing the original error proof, or by generating a replacement that is formalized from the start, we can confirm it is correct simply by re-running the automated checker.
Introducing Gappa
The tool I selected for this task is Gappa. I had heard of it before, but this was the first time I had tried using it.
Gappa is not a general-purpose theorem prover. It is specialized for numerical analysis. Gappa finds or verifies an interval of possible values for a numerical calculation, based on similar intervals for its input values.
As a quick introduction, here is a simple example of Gappa input:
y = x * x;
{ x in [-2, +2] -> y in ? }
The first line defines a calculation. We have some unspecified input x, and we compute a value y = x². The second line defines the question we are asking Gappa about this calculation. If x is in the interval [−2, +2], what interval can you prove y is in?
Gappa reports this:
Results: y in [0, 4]
As expected, a squared real number is never negative. Therefore, x² cannot be less than 0. If the absolute value of x is at most 2, then x² is at most 4.
However, Gappa does not do magic all by itself. There is an optional third component of the input file, coming after the calculation and the question: hints, which tell Gappa how it should try rearranging its algebraic expressions. For example, here is a second input file where Gappa needs some help:
y = x * (10 - x);
{ x in [4,6] -> y in ? }
In this case, Gappa will report that y is in the range [16, 36]. Why? It knows that x is between 4 and 6, and therefore 10 − x is also between 4 and 6. Multiplying two values that are each at least 4 gives a product of at least 4² = 16. If they are both at most 6, then the product is at most 6² = 36.
That is all true, of course, but it is not the full picture. The output is close to 16 only if both the values being multiplied, x and 10 − x, are close to 4. The result is close to 36 only if both those values are close to 6. However, they cannot both do that at once, because x and 10 − x are not independent.
When one value is small, the other one is large. They cannot both be close to 4 or both close to 6. Therefore, Gappa’s lower bound of 16 and upper bound of 36 cannot actually be attained.
The value of y can therefore be bounded more tightly. If you plot the function on a graph, you can see what is really going on:
The value of x(10 − x) is guaranteed to fall in the much smaller interval [24, 25]. It hits its maximum at x = 5, where x(10 − x) = 5² = 25. On either side, it decreases the further x gets from 5, so it is smallest when x is as far from 5 as possible: at x = 4 or x = 6, where the calculation computes 4 × 6 = 24.
Gappa can prove this result, but it needs help. We can provide it with this input file:
y = x * (10 - x);
{ x in [4,6] -> y in ? }
x * (10 - x) -> 25 - (x - 5)*(x - 5);
This is identical to the previous input, except that I have added a third line, after the braced question section. That third line is a hint. It tells Gappa that when it sees the algebraic expression x(10 − x), it should try rearranging it into 25 − (x − 5)².
Gappa first checks that the two expressions are equivalent by multiplying them out. Both come to 10x − x². Once convinced our hint is true, Gappa applies it, and then it can find a better bound. It reasons that if x is in the range [4, 6], then x − 5 is in the range [−1, +1], so (x − 5)² is in the range [0, 1], and therefore 25 minus that is in the range [24, 25]:
Results: y in [24, 25]
These are simplified examples. The normal use of Gappa is to calculate a reliable bound on the error in an approximate calculation. The ‘calculation’ section will normally perform two separate calculations. One delivers the true mathematical value of the result you want, and the other delivers the approximate result from your particular algorithm. Then you ask Gappa to bound the difference between those two values.
In a real-world case, your approximate algorithm is probably written in floating point, fixed point, or integers. As a result, intermediate values are not perfectly accurate real numbers. At many steps, the answers will be implicitly rounded to the nearest number that fits in whatever format you are using. Gappa’s input language comes with a range of ways to specify that rounding explicitly. This allows Gappa to account for the rounding errors introduced at each step when calculating the maximum overall error.
A natural question is whether this replaces one problem of formal verification with another. It is useful to apply Gappa to check the calculation in your function, but what happens if Gappa itself has a bug? Gappa performs complicated analysis, particularly for realistic calculations which have implicit roundings between most operations.
Gappa has a solution to this concern. Gappa can run in a mode where it calculates bounds on the query value and produces a formal proof of those bounds. The proof is generated in a format suitable for the Rocq theorem prover, formerly known as ‘Coq’, renamed in 2025. If Gappa’s output is accepted by Rocq, you can be confident it contains a complete and correct proof.
You do not have to trust Gappa itself to have got all the details of its numerical calculations right. You only need to trust the Rocq verifier which checks every step of Gappa’s working. Theorem proving is complex, but once a proof is expanded into simple enough steps, proof checking is much simpler. This approach minimizes the amount of code that must be trusted.
Translating ddiv’s mantissa division into Gappa
Once I had selected Gappa to verify a bound on ddiv’s calculation, the first step was to explain to Gappa what ddiv’s calculation is.
I mentioned earlier that the basic strategy of ddiv’s calculation is to approximate the reciprocal 1 / d, and then multiply the result by n. More specifically, the reciprocal approximation is calculated using the standard Newton-Raphson iteration from calculus textbooks:
x → x − f(x) / f'(x)
Here, f is a function that is zero at the value you are trying to find, and f' is its derivative.
One difficulty with this technique for division is that the formula itself involves a division. It is easy to think of a function f that has a zero at 1 / d, but several of the obvious choices (such as f(x) = dx − 1) produce an iteration formula that requires you to divide by something. If we already knew how to divide, we would not be trying to write a division routine in the first place.
The secret is to use this function:
f(x) = 1/x − d
When you apply the Newton-Raphson formula to that function, the division by f' turns into a multiplication by x², because f'(x) = −1/x². This cancels out the division by x in the numerator: x − (1/x − d)/(−1/x²) = x + x²(1/x − d) = 2x − dx². You are left with an iteration involving only multiplications and subtractions:
x → x(2 − dx)
Like any Newton-Raphson iteration, this formula doubles the number of accurate bits in every iteration. To see that without calculus, suppose that x is already close to 1 / d, so that dx is close to 1, say 1 − ε. Then d times the next approximation is dx(2 − dx) = (1 − ε)(1 + ε) = 1 − ε². The error is the square of the previous error.
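You can watch the error squaring in ordinary C, using plain doubles rather than the fixed-point arithmetic of the real code:

#include <stdio.h>

int main(void)
{
    double d = 1.75;
    double x = 0.57;    /* rough first guess at 1/d = 0.5714... */
    for (int i = 0; i < 4; i++) {
        /* d*x - 1 is minus the epsilon of the explanation above; each
         * iteration roughly squares it, until the precision of a
         * C double runs out. */
        printf("iteration %d: d*x - 1 = %.3g\n", i, d * x - 1.0);
        x = x * (2.0 - d * x);
    }
    return 0;
}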
Our ddiv function begins with an 8-bit approximation. In three iterations, it increases it to 16, 32 and then (very nearly) 64 bits of precision. Each of these iterations is done in handwritten Arm machine code. They are all different from each other, because when you are converting an 8-bit value to a 16-bit one, it is not necessary to compute 64 bits of any intermediate value.
For example, here is the machine code that computes the first iteration, turning an 8-bit approximation into 16 bits:
LSR r5, r3, #16
MUL r7, r6, r5
RSB r7, r7, #1<<24
MUL r7, r6, r7
LSR r7, r7, #15
The three middle instructions, two MUL and an intervening RSB, are the main part of the formula. They multiply x by d, then subtract from 2, then multiply by x again. As all of this is scaled up to fit in integer registers, “subtract from 2” becomes “subtract from 2²⁴”.
The initial LSR prepares the value of d we will be using, by extracting just its top 16 bits. That is all we really need to make the output of this iteration accurate to 16 bits. By not using any more bits than we need, we avoid multiplications overflowing.
The final LSR discards the unwanted extra bits in the output from the second MUL. Otherwise, we would have a 32-bit value with only 16 of the bits being worthwhile.
To model those five instructions in Gappa, I ended up writing this, where recip08 represents the 8-bit approximation the iteration starts from, and recip16 is the 16-bit approximation it outputs:
recip16 = floor(recip08 * (0x1p24 - recip08 * floor(d / 0x1p48)) / 0x1p15);
This translation looks different from the machine code, both in syntax and semantics. The Gappa version thinks of d as a single 64-bit value, whereas in the machine code it has to be split between two 32-bit Arm registers. Therefore, to extract its top 16 bits, the Gappa code divides by 2⁴⁸ where the machine code divides by 2¹⁶.
Each right shift operation is written as a division by a power of 2 followed by a call to the floor function (“round down to the next smaller integer”). That is how the machine code right-shift works. If I had not written floor explicitly, Gappa would assume that each of those divisions gave a fractional result, with all the low-order bits still there. The error analysis would then be based on that wrong assumption.
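As a quick illustration of that equivalence (not part of the proof), this C check confirms that an unsigned right shift matches the floor-of-a-division model. Both sides here are exact, because a 32-bit integer and its quotient by 2¹⁵ are exactly representable in a double:

#include <assert.h>
#include <math.h>
#include <stdint.h>

int main(void)
{
    uint32_t x = 0xDEADBEEF;
    /* x >> 15 discards the low 15 bits; floor(x / 2^15) does the same. */
    assert((x >> 15) == (uint32_t)floor(x / 0x1p15));
    return 0;
}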
Making Gappa simulate a lookup table
Earlier, I stated that ddiv runs three Newton-Raphson iterations, starting from an 8-bit approximation to 1 / d. Where did that first 8-bit approximation come from?
In the ddiv code, it comes from a lookup table. We use the top 8 bits of d itself (including the leading bit, which is always 1) as an index into a 128-entry lookup table. Each entry holds the best approximation to the reciprocal of the corresponding 8-bit value, as shown in the sketch below.
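In C, the lookup step might look like this, where recip_table is a hypothetical name for the 128-entry table:

#include <stdint.h>

extern const uint8_t recip_table[128];      /* hypothetical table name */

/* d is the mantissa scaled up to fill a 64-bit integer, so its top bit
 * is always 1 and its top 8 bits range over 0x80..0xFF. Subtracting
 * 0x80 turns those into a 0..127 table index. */
static uint8_t initial_recip(uint64_t d)
{
    uint32_t top8 = (uint32_t)(d >> 56);    /* 0x80 .. 0xFF */
    return recip_table[top8 - 0x80];
}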
Gappa’s input language is good at describing arithmetic and rounding, but not lookup tables. So how do we tell Gappa about this part of the algorithm?
I had hoped to achieve this by writing a complicated expression in the ‘question’ section of the input. When you specify the ranges of the input values, Gappa lets you do it with a Boolean expression, using the unusual ASCII operators /\ and \/ for ‘AND’ and ‘OR’ respectively (mimicking the mathematical logic operators ∧ and ∨). I thought perhaps I could treat my variable recip08 as one of the inputs to the whole calculation, and then write a large boolean expression with one clause per range of the input:
{ (d in [0x8000000000000000,0x80FFFFFFFFFFFFFF] /\ recip08 = 0xFF) \/
  (d in [0x8100000000000000,0x81FFFFFFFFFFFFFF] /\ recip08 = 0xFD) \/
  (d in [0x8200000000000000,0x82FFFFFFFFFFFFFF] /\ recip08 = 0xFB) \/
  ...
  (d in [0xFD00000000000000,0xFDFFFFFFFFFFFFFF] /\ recip08 = 0x81) \/
  (d in [0xFE00000000000000,0xFEFFFFFFFFFFFFFF] /\ recip08 = 0x81) \/
  (d in [0xFF00000000000000,0xFFFFFFFFFFFFFFFF] /\ recip08 = 0x80)
  -> ... }
This reads as: “Either d is in this range and recip08 has this value, or d is in that range and recip08 has that value, or” and so on for 128 cases.
This was too much for Gappa. It could not do anything useful with an expression this complicated. It thought for a long time and reported that it could not find a bound on the division error at all.
I resorted to Plan B: separate each of these cases into its own run of Gappa, and just run Gappa 128 times. Each run is given a single range of d, covering the catchment area of one entry in the lookup table, and a single fixed value of recip08, giving the corresponding entry from the table. Then we record the error bound Gappa reports from each of those runs, and take the worst case over all of them.
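One way to produce those 128 inputs is with a small generator program. Here is a sketch of the idea in C; the real details differ, and recip_table is again a hypothetical stand-in for the table in the ddiv source:

#include <stdint.h>
#include <stdio.h>

extern const uint8_t recip_table[128];      /* hypothetical table name */

int main(void)
{
    for (int i = 0; i < 128; i++) {
        uint64_t lo = (uint64_t)(0x80 + i) << 56;
        uint64_t hi = lo | 0x00FFFFFFFFFFFFFFull;
        /* Emit the per-entry question; in a full generator, the shared
         * calculation and hint sections would be printed around it. */
        printf("{ d in [0x%016llX,0x%016llX] /\\ recip08 = 0x%02X -> ... }\n",
               (unsigned long long)lo, (unsigned long long)hi,
               (unsigned)recip_table[i]);
    }
    return 0;
}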
Giving Gappa hints about the algebra
Even with the input ranges narrowed down to a single lookup table entry, Gappa did not output any useful bound on the error. The expectation was that the final 64-bit integer value of the scaled-up approximate quotient should differ from the true value by at most a few tens of units. Instead, Gappa reported that the error could be anything in the range [−2⁵⁶, +2⁵⁶].
This does not mean I made a mistake in my Gappa input file. The cause was the same as in my introductory example earlier, which derived the range [16, 36] for a value actually bounded in the much smaller range [24, 25]. Here the same issue occurred on an even more extreme scale. Gappa needed help rearranging the algebraic expressions into a form that would let it see why the error was small.
There is no easy way to know what you should do in this kind of situation. You can tell there is a problem in the first place because you get an absurdly large error interval as output. You can narrow down to a specific part of the calculation by asking Gappa to analyze the error in a particular intermediate value instead of the final output. Working back, you find the first intermediate value where it had trouble.
However, I could not find any way to get it to output the actual algebraic expression it was considering at the point where it failed. If I had known that, it would have been easy to suggest how it should rewrite that expression. Instead, I had to use a lot of trial and error, until I found a combination which worked.
The biggest problem was that Gappa did not recognize that the Newton-Raphson iteration converges quadratically. That is, the error after an iteration is proportional to the square of the error before, and therefore much smaller, provided the initial error was already small. Here I had a piece of luck, because Newton-Raphson based division is one of the examples in Gappa’s own manual. There was already a worked example of how to give Gappa the right kind of hint: you tell it to rewrite:
x(2 − dx) − 1/d → −d(x − 1/d)²
The left-hand expression represents the absolute error in the output approximation: the difference between it and the true value of 1 / d. The rewritten expression contains the absolute error in the input approximation, squared; multiplying out confirms the two sides agree, since −d(x − 1/d)² = −dx² + 2x − 1/d = x(2 − dx) − 1/d. Gappa can then use the bound it has on the input error x − 1/d to compute a usefully small bound on the output error.
It is one thing to see a worked example applying this to a simple use of Newton-Raphson division. It is another to get it to work in this much more complicated case involving lots of intermediate roundings and scalings. After some struggle, I found I could write a hint like this:
(recip08 * (0x1p24 - recip08 * (d/0x1p48))) / 0x1p15 - Recip16 ->
    (-(recip08 - Recip08) * (recip08 - Recip08) * d / 0x1p63) { d <> 0 };
In this case, the starting expression on the first line looks very like the calculation I showed earlier, but without the floor operations. It describes a mathematically simpler calculation. Gappa works out that the expression we are really computing, with those floors, differs by only a small amount from the one shown in the hint, and takes account of that extra error by itself.
The capitalized variables Recip08 and Recip16 both represent the true mathematical value of 1 / d, but each is scaled by a different power of 2, matching the scaling of the corresponding approximation in the machine code.
The hint must declare that d is not zero, to prevent Gappa getting confused by any attempt to divide by zero. It seemed odd to have to do that when I had already specified a range of values for d that did not include zero. However, the manual warned to do it anyway, and it was right.
There was one other problem. In the very last part of the calculation, I was surprised to have to give Gappa another hint. In ddiv’s final multiplication of the approximate value of 1 / d by n, some low-order terms are discarded to save a few instructions. The output is q + ε, where q is the accurate version of that product and ε is the error introduced by discarding terms. When I asked Gappa to tell me the difference between that and the true quotient Q = n/d, it could not do it until provided with a hint. The hint required rearranging the expression to take that error ε out from between the two values:
(q + ε) − Q → (q − Q) + ε
I was not surprised that Gappa needed help with Newton-Raphson. It is not obvious to humans either (we must be taught it). However, it was a surprise to have to help it with something this simple.
To be continued
In part 2, I talk about one of the most difficult parts of the problem. After writing this Gappa description of the calculation, how can we be sure that it really matches the version in the machine code?
I also describe how this verification exercise revealed a bug in the original code, and how it made the bug easy to fix.
Part 2