I'm curious if anyone has advice on how to use SIMD to find lists of prime numbers. Particularly I'm interested how to do this with SSE/AVX.
The two algorithms I have been looking at are trial division and the Sieve of Eratosthenes. I have managed to find a way to use SSE with trial division. I found a faster way to to division which works well for a vector/scalar "Division by Invariant Integers Using Multiplication"http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.2556 Each time I find a prime I form the results to do a fast division and save them. Then the next time I do the division it goes much quicker. Doing this I get a speedup of about 3x (out of 4x). With AVX2 maybe it would be faster as well.
However, trial division is much slower than the Sieve of Eratosthenes and I can't think of any way to use SIMD with the Sieve of Eratosthenes except with some kind of scatter instruction with not even AVX2 has yet. Would a scatter instruction helpful? Is anyone aware of this being used on GPUs for the Sieve of Eratosthenes?
Here is the fastest version of the Sieve of Eratosthenes using OpenMP I know of. Any ideas how to speed this up with SSE/AVX? http://create.stephan-brumme.com/eratosthenes/
Here is the function I use to determine if a number is prime. I operate on eight primes at once (will actually it's 4 at a time done twice without AVX2). I'm using Agner Fog's vectorclass. The idea is that for large values there are unlikely to be eight primes in sequence. If I find a prime among the eight I have to loop over the results in sequence.
inline int is_prime_vec8(Vec8ui num, Divisor_ui_s *buffer, int size) {
Divisor_ui div = Divisor_ui(buffer[0].m, buffer[0].s1, buffer[0].s2);
int val = buffer[0].d;
Vec8ui cmp = -1;
for(int i=0; (val*val)<=num[7]; i++) {
Divisor_ui div = Divisor_ui(buffer[i].m, buffer[i].s1, buffer[i].s2);
val = buffer[i].d;
Vec8ui q = num/div;
cmp &= (q*val != num);
int cnt = _mm_movemask_epi8(cmp.get_low()) || _mm_movemask_epi8(cmp.get_high());
if(cnt == 0) {
return size; //0 primes were found
}
}
num &= cmp; //at least 1 out of 8 values were found to be prime
int tmp[8];
num.store(tmp);
for(int i=0; i<8; i++) {
if(tmp[i]) {
set_ui(tmp[i], &buffer[size++]);
}
}
return size;
}
Here is where I setup to the eight prime candidates. I skip multiples of 2, 3, and 5 doing this.
int find_primes_vec8(Divisor_ui_s *buffer, const int nmax) {
int start[] = {2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47};
int size = sizeof(start)/4;
for(int i=0; i<size; i++) {
set_ui(start[i], &buffer[i]);
}
Vec8ui iv(49, 53, 59, 61, 67, 71, 73, 77);
size-=3;
for(int i=49; i<nmax; i+=30) {
if((i-1)%100000==0) printf("i %d, %f %%\n", i, 100.f*i/(nmax/16));
size = is_prime_vec8(iv, &buffer[3], size);
iv += 30;
}
size+=3;
return size;
}