I have my own, very fast cos function:
float sine(float x)
{
const float B = 4/pi;
const float C = -4/(pi*pi);
float y = B * x + C * x * abs(x);
// const float Q = 0.775;
const float P = 0.225;
y = P * (y * abs(y) - y) + y; // Q * y + P * y * abs(y)
return y;
}
float cosine(float x)
{
return sine(x + (pi / 2));
}
But now when I profile, I see that acos() is killing the processor. I don't need intense precision. What is a fast way to calculate acos(x) Thanks.
nVidia has some great resources that show how to approximate otherwise very expensive math functions, such as: acos asin atan2 etc etc...
These algorithms produce good results when speed of execution is more important (within reason) than precision. Here's their acos function:
And here are the results for when calculating acos(0.5):
That's pretty close! Depending on your required degree of precision, this might be a good option for you.
Another approach you could take is to use complex numbers. From de Moivre's formula,
ⅈx = cos(π/2*x) + ⅈ*sin(π/2*x)
Let θ = π/2*x. Then x = 2θ/π, so
How can you calculate powers of ⅈ without sin and cos? Start with a precomputed table for powers of 2:
To calculate arbitrary values of ⅈx, approximate the exponent as a binary fraction, and then multiply together the corresponding values from the table.
For example, to find sin and cos of 72° = 0.8π/2:
ⅈ0.8 ≈ ⅈ205/256 = ⅈ0b11001101 = ⅈ1/2 * ⅈ1/4 * ⅈ1/32 * ⅈ1/64 * ⅈ1/256
= 0.3078496400415349 + 0.9514350209690084*ⅈ
To find asin and acos, you can use this table with the Bisection Method:
For example, to find asin(0.6) (the smallest angle in a 3-4-5 triangle):
Each time you increase x, multiply by the corresponding power of ⅈ. Each time you decrease x, divide by the corresponding power of ⅈ.
If we stop here, we obtain acos(0.6) ≈ 13/32*π/2 = 0.6381360077604268 (The "exact" value is 0.6435011087932844.)
The accuracy, of course, depends on the number of iterations. For a quick-and-dirty approximation, use 10 iterations. For "intense precision", use 50-60 iterations.
You can approximate the inverse cosine with a polynomial as suggested by dan04, but a polynomial is a pretty bad approximation near -1 and 1 where the derivative of the inverse cosine goes to infinity. When you increase the degree of the polynomial you hit diminishing returns quickly, and it is still hard to get a good approximation around the endpoints. A rational function (the quotient of two polynomials) can give a much better approximation in this case.
where
has a maximum absolute error of 0.017 radians (0.96 degrees) on the interval (-1, 1). Here is a plot (the inverse cosine in black, cubic polynomial approximation in red, the above function in blue) for comparison:
The coefficients above have been chosen to minimise the maximum absolute error over the entire domain. If you are willing to allow a larger error at the endpoints, the error on the interval (-0.98, 0.98) can be made much smaller. A numerator of degree 5 and a denominator of degree 2 is about as fast as the above function, but slightly less accurate. At the expense of performance you can increase accuracy by using higher degree polynomials.
A note about performance: computing the two polynomials is still very cheap, and you can use fused multiply-add instructions. The division is not so bad, because you can use the hardware reciprocal approximation and a multiply. The error in the reciprocal approximation is negligible in comparison with the error in the acos approximation. On a 2.6 GHz Skylake i7, this approximation can do about 8 inverse cosines every 6 cycles using AVX. (That is throughput, the latency is longer than 6 cycles.)
Here is a great website with many options: https://www.ecse.rpi.edu/Homepages/wrf/Research/Short_Notes/arcsin/onlyelem.html
Personally I went the Chebyshev-Pade quotient approximation with with the following code:
Unfortunately I do not have enough reputation to comment. Here is a small modification of Nvidia's function, that deals with the fact that numbers that should be <= 1 while preserving performance as much as possible.
It may be important since rounding errors can lead number that should be 1.0 to be (oh so slightly) larger than 1.0.
Got spare memory? A lookup table (with interpolation, if required) is gonna be fastest.