Skip to content
Tech News
← Back to articles

Exact UNORM8 to Float

read original more articles

GPUs support UNORM formats that represent a number inside [0,1] as an 8-bit unsigned integer. In exact arithmetic, the conversion to a floating-point number is straightforward: take the integer and divide it by 255. 8-bit integers are for sure machine numbers (exactly represented) in float32 and so is 255, so if you’re willing to do a “proper” divide, that’s the end of it; both inputs are exact, so the result of the division is the same as the result of the computation in exact arithmetic rounded to the nearest float32 (as per active rounding mode anyway), which is the best we can hope for.

The D3D11.3 functional spec chickened out here a bit (granted, this verbiage was already in the D3D10 version as I recall) and added the disclaimer that

Requiring exact (1/2 ULP) conversion precision is acknowledged to be too expensive.

For what it’s worth, I had reason to test this a while back (as part of my GPU BCn decoding experiments) and at the time anyway, all GPUs I got my hands on for testing seemed to do exact conversions anyway. It turns out that doing the conversion exactly is not particularly expensive in HW (that might be a post for another day) and certainly doesn’t require anything like a “real” divider, which usually does not exist in GPUs; correctly rounded float32 divisions, when requested, are usually done as a lengthy instruction sequence (plus possibly an even lengthier fallback handler in rare cases).

A conversion that is close enough for essentially all practical purposes is to simply multiply by the float32 reciprocal, i.e. 1.f / 255.f , instead. This is not the same as the exact calculation, but considering you’re quantizing data into an uint8 format to begin with, is almost certainly way more precision than you need.

What if, as a matter of principle, we want an exact solution, though?

Option 1: work in double (float64)

Not much to say here. Converting the 8-bit value x to double, multiplying by 1. / 255. , then converting the result to float32 happens to give the same results as the exact calc (i.e. dividing x / 255.f ). This is trivial to verify by exhaustive testing. It’s just 256 possible inputs.

Option 2: but I don’t want to use doubles either!

Picky, are we? OK, fine, let’s get to the meat of this post. What if we’d like an exact conversion, would prefer to not use a divide, and would also prefer to avoid using doubles?

... continue reading