Vectorized Trig functions in C?

Posted 2020-06-16 09:39

Question:

I'm looking to compute trig functions in bulk (in blocks of about 1024 values), and I'd like to take advantage of at least some of the parallelism that modern architectures offer.

When I compile a loop such as

for(int i=0; i<SIZE; i++) {
   arr[i]=sin((float)i/1024);
}

GCC won't vectorize it, and says

not vectorized: relevant stmt not supported: D.3068_39 = __builtin_sinf (D.3069_38);

Which makes sense to me. However, I'm wondering if there's a library to do parallel trig computations.

With just a simple Taylor series up to the 11th order, GCC will vectorize all the loops, and I'm getting speeds over twice as fast as a naive sin loop (with bit-exact answers; with a 9th-order series, only the last two of 1600 values are off by a single bit, for a >3x speedup). I'm sure someone has encountered a problem like this before, but when I google, I find no mention of any libraries or the like.

A. Is there something existing already?
B. If not, advice for optimizing parallel trig functions?

EDIT: I found a library called "SLEEF": http://shibatch.sourceforge.net/ which is described in this paper and uses SIMD instructions to calculate several elementary functions. It uses SSE- and AVX-specific code, but I don't think it will be hard to turn it into standard C loops.

Answer 1:

Since you said you were using GCC, it looks like there are some options:

  • http://gruntthepeon.free.fr/ssemath/
    • This implements the functions using SSE and SSE2 instructions.
  • http://www.gamasutra.com/view/feature/4248/designing_fast_crossplatform_simd_.php
    • This has an alternate implementation. Some of the comments are pretty good.

That said, I'd probably look into GPGPU for a solution, maybe writing it in CUDA or OpenCL (if I remember correctly, CUDA supports the sine function). Here are some libraries that look like they might make it easier.

  • https://code.google.com/p/slmath/
  • https://code.google.com/p/thrust/


Answer 2:

Since you are looking to calculate harmonics here, I have some code that addressed a similar problem. It is vectorized already and faster than anything else I have found. As a side benefit, you get the cosine for free.



Answer 3:

What platform are you using? Many libraries of this sort already exist:

  • Intel provides the Vector Math Library (VML) with icc.
  • Apple provides the vForce library as part of the Accelerate framework.
  • HP provides their own Vector Math Library for Itanium (and maybe other architectures, too).
  • Sun provided libmvec with their compiler tools.
  • ...


Answer 4:

Instead of the Taylor series, I would look at the algorithms fdlibm uses. They should get you as much precision with fewer steps.
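As a rough sketch of what that could look like, the block below applies an fdlibm-style sine kernel in a plain loop. The coefficients are written down as recalled from fdlibm's k_sin.c and should be checked against the fdlibm source before use. The names `kernel_sin` and `sin_block` are hypothetical, and the sketch assumes every input already satisfies |x| <= pi/4; fdlibm does a separate range-reduction step first, which is omitted here.

```c
#include <assert.h>
#include <stddef.h>

/* Polynomial coefficients as recalled from fdlibm's k_sin.c;
   verify them against the fdlibm source before relying on them. */
static const double S1 = -1.66666666666666324348e-01;
static const double S2 =  8.33333333332248946124e-03;
static const double S3 = -1.98412698298579493134e-04;
static const double S4 =  2.75573137070700676789e-06;
static const double S5 = -2.50507602534068634195e-08;
static const double S6 =  1.58969099521155010221e-10;

/* sin(x) ~ x + x^3 * (S1 + z*(S2 + z*(S3 + ...))), with z = x^2.
   Assumption: |x| <= pi/4, i.e. range reduction has already happened. */
static inline double kernel_sin(double x)
{
    const double z = x * x;
    const double r = S2 + z * (S3 + z * (S4 + z * (S5 + z * S6)));
    return x + (z * x) * (S1 + z * r);
}

/* Apply the kernel to a whole block; the body is branch-free arithmetic,
   so the compiler has a chance to vectorize the loop. */
void sin_block(const double *in, double *out, size_t n)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = kernel_sin(in[i]);
}
```

The coefficients differ slightly from the Taylor values (for example, S1 is close to but not exactly -1/6) because they are fitted to minimize error over the reduced interval rather than at a single point.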



Answer 5:

My answer was to create my own library to do exactly this, called vectrig: https://github.com/jeremysalwen/vectrig