Finding lists of prime numbers with SIMD - SSE/AVX

2019-02-14 23:57发布

问题:

I'm curious if anyone has advice on how to use SIMD to find lists of prime numbers. Particularly I'm interested how to do this with SSE/AVX.

The two algorithms I have been looking at are trial division and the Sieve of Eratosthenes. I have managed to find a way to use SSE with trial division. I found a faster way to to division which works well for a vector/scalar "Division by Invariant Integers Using Multiplication"http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.2556 Each time I find a prime I form the results to do a fast division and save them. Then the next time I do the division it goes much quicker. Doing this I get a speedup of about 3x (out of 4x). With AVX2 maybe it would be faster as well.

However, trial division is much slower than the Sieve of Eratosthenes and I can't think of any way to use SIMD with the Sieve of Eratosthenes except with some kind of scatter instruction with not even AVX2 has yet. Would a scatter instruction helpful? Is anyone aware of this being used on GPUs for the Sieve of Eratosthenes?

Here is the fastest version of the Sieve of Eratosthenes using OpenMP I know of. Any ideas how to speed this up with SSE/AVX? http://create.stephan-brumme.com/eratosthenes/

Here is the function I use to determine if a number is prime. I operate on eight primes at once (will actually it's 4 at a time done twice without AVX2). I'm using Agner Fog's vectorclass. The idea is that for large values there are unlikely to be eight primes in sequence. If I find a prime among the eight I have to loop over the results in sequence.

inline int is_prime_vec8(Vec8ui num, Divisor_ui_s *buffer, int size) {
    Divisor_ui div = Divisor_ui(buffer[0].m, buffer[0].s1, buffer[0].s2);
    int val = buffer[0].d; 
    Vec8ui cmp = -1;

    for(int i=0; (val*val)<=num[7]; i++) {
        Divisor_ui div = Divisor_ui(buffer[i].m, buffer[i].s1, buffer[i].s2);
        val = buffer[i].d;
        Vec8ui q = num/div; 
        cmp &= (q*val != num);
        int cnt = _mm_movemask_epi8(cmp.get_low()) || _mm_movemask_epi8(cmp.get_high());
        if(cnt == 0) {
            return size;  //0 primes were found
        }
    }
    num &= cmp;  //at least 1 out of 8 values were found to be prime
    int tmp[8];
    num.store(tmp);

    for(int i=0; i<8; i++) {
        if(tmp[i]) {
            set_ui(tmp[i], &buffer[size++]);
        }
    }       
    return size;
}

Here is where I setup to the eight prime candidates. I skip multiples of 2, 3, and 5 doing this.

int find_primes_vec8(Divisor_ui_s *buffer, const int nmax) {
    int start[] = {2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47};
    int size = sizeof(start)/4;

    for(int i=0; i<size; i++) {
        set_ui(start[i], &buffer[i]);
    }

    Vec8ui iv(49, 53, 59, 61, 67, 71, 73, 77);
    size-=3;
    for(int i=49; i<nmax; i+=30) {
        if((i-1)%100000==0) printf("i %d, %f %%\n", i, 100.f*i/(nmax/16));
        size = is_prime_vec8(iv, &buffer[3], size);
        iv += 30;
    }   
    size+=3;

    return size;
}