I don't normally do follow-ups and never this quick. After posting that last article, it was fun to read the comments on Reddit and Hacker News as they rolled in. I even found other discussions. I couldn't help wonder, "Could I have made it even more performant?". After heading home I decided to take another look.
There was.
Gotta Go Fast
Look at the implementation of the Cg asin() approximation;
double asin_cg(const double x) { // Original Minimax coefficients constexpr double a0 = 1.5707288; constexpr double a1 = -0.2121144; constexpr double a2 = 0.0742610; constexpr double a3 = -0.0187293; // Strip sign const double abs_x = abs(x); // Evaluate polynomial using Horner's method double p = a3 * abs_x + a2; p = p * abs_x + a1; p = p * abs_x + a0; // Apply sqrt term and pi/2 offset const auto x_diff = sqrt(1.0 - abs_x); const double result = (Pi / 2.0) - (x_diff * p); // Restore sign natively return copysign(result, x); }
Does something about how p is computed look a little curious? It can be rewritten to be fully const . While it's not a requirement, it's generally something you should strive for. This was missed in the first pass.
const double p = ((a3 * abs_x + a2) * abs_x + a1) * abs_x + a0;
From here we can do a polynomial expansion and factoring. This is where the magic happens. Showing the work step by step:
p = ((a3 * abs_x + a2) * abs_x + a1) * abs_x + a0 p = (a3 * abs_x * abs_x + a2 * abs_x + a1) * abs_x + a0 p = (a3 * abs_x^2 + a2 * abs_x + a1) * abs_x + a0 p = a3 * abs_x^3 + a2 * abs_x^2 + a1 * abs_x + a0 p = (a3 * abs_x^3 + a2 * abs_x^2) + (a1 * abs_x + a0) p = (a3 * abs_x + a2) * abs_x^2 + (a1 * abs_x + a0)
Taking that last term for p , we have this in code:
... continue reading