Trouble generating prime numbers with CUDA

I am just getting started with CUDA, and after going over the vector sum tutorials here I thought I would try something from scratch to really get my legs under me.

That said, I don't know whether the trouble here is a simple fix or a whole series of issues.

The plain English description of my code is as follows:

First, there is a counterClass that has members num and count. By resetting count to 0 whenever count equals num, this counter class keeps track of the remainder when dividing by num as we iterate up through the integers.

I have 2 functions that I want to run in parallel. The first, called count, increments all my counters (in parallel). The second, called check, tests whether any of the counters read 0 (in parallel). If a counter reads 0, then its num divides n evenly, meaning that n isn't prime.
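To make the intent concrete, here is a minimal sequential sketch in plain C of what the two kernels are meant to compute. It is not the CUDA code itself, only a CPU reference to compare against; the names counter_t, tick, none_divide and first_primes are made up for this illustration.

```c
#include <stdlib.h>

/* Same layout as counterClass in the CUDA code: count tracks n % num. */
typedef struct {
    int num;
    int count;
} counter_t;

/* Sequential stand-in for the count kernel: advance one counter by one step,
   wrapping back to 0 when it reaches num. */
void tick(counter_t *c) {
    c->count += 1;
    if (c->count == c->num) {
        c->count = 0;
    }
}

/* Sequential stand-in for the check kernel: returns 1 if no counter reads 0,
   i.e. none of the primes found so far divides the current n. */
int none_divide(const counter_t *counters, int n_counters) {
    for (int i = 0; i < n_counters; ++i) {
        if (counters[i].count == 0) {
            return 0;
        }
    }
    return 1;
}

/* Fills out[0..total-1] with the first `total` primes, seeding with 2 as the
   CUDA main() does. Returns the number of primes written, or -1 if the
   allocation fails. */
int first_primes(int *out, int total) {
    if (total < 1) {
        return 0;
    }
    counter_t *counters = malloc((size_t)total * sizeof *counters);
    if (counters == NULL) {
        return -1;
    }
    counters[0].num = 2;   /* counter for the seed prime 2 */
    counters[0].count = 0; /* 2 % 2 == 0 */
    out[0] = 2;
    int found = 1;
    int n = 2;
    while (found < total) {
        n += 1;
        for (int i = 0; i < found; ++i) {
            tick(&counters[i]);
        }
        if (none_divide(counters, found)) {
            /* n is prime: record it and start a counter for it at 0. */
            counters[found].num = n;
            counters[found].count = 0;
            out[found] = n;
            found += 1;
        }
    }
    free(counters);
    return found;
}
```

Calling first_primes(p, 5) on an int p[5] leaves 2, 3, 5, 7, 11 in p. Since 2 is only the seed and is never printed, I would expect the CUDA version with tPrimes = 5 to print 3, 5, 7 and 11.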

While I would like my code to only print prime numbers, it prints all the numbers...

Here's the code:

#include <stdio.h>
#include <stdlib.h>

typedef struct{
    int num;
    int count;
} counterClass;

counterClass new_counterClass(counterClass aCounter, int by, int count){
    aCounter.num = by;
    aCounter.count = count%by;
    return aCounter;
}

__global__ void count(counterClass *Counters){
    int idx = threadIdx.x+blockDim.x*blockIdx.x;
    Counters[idx].count+=1;
    if(Counters[idx].count == Counters[idx].num){
        Counters[idx].count = 0;
    }
    __syncthreads();
}

__global__ void check(counterClass *Counters, bool *result){
    int idx = threadIdx.x+blockDim.x*blockIdx.x;
    if (Counters[idx].count == 0){
        *result = false;
    }
    __syncthreads();
}

int main(){
    int tPrimes = 5;    // Total Primes to Find
    int nPrimes = 1;    // Number of Primes Found
    bool  *d_result, h_result=true;
    counterClass *h_counters =(counterClass *)malloc(tPrimes*sizeof(counterClass));
    h_counters[0]=new_counterClass(h_counters[0], 2 , 0);
    counterClass *d_counters;
    int n = 2;
    cudaMalloc((void **)&d_counters, tPrimes*sizeof(counterClass));
    cudaMalloc((void **)&d_result, sizeof(bool));
    cudaMemcpy(d_counters, h_counters, tPrimes*sizeof(counterClass), cudaMemcpyHostToDevice);
    while(nPrimes<tPrimes){
        h_result=true;
        cudaMemcpy(d_result, &h_result, sizeof(bool), cudaMemcpyHostToDevice);
        n+=1;
        count<<<1,nPrimes>>>(d_counters);
        check<<<1,nPrimes>>>(d_counters,d_result);
        cudaMemcpy(&h_result, d_result, sizeof(bool), cudaMemcpyDeviceToHost);
        if(h_result){
            printf("%d\n", n);
            cudaMemcpy(h_counters, d_counters, tPrimes*sizeof(counterClass), cudaMemcpyDeviceToHost);
            h_counters[nPrimes]=new_counterClass(h_counters[nPrimes], n , 0);
            nPrimes += 1;
            cudaMemcpy(d_counters, h_counters, tPrimes*sizeof(counterClass), cudaMemcpyHostToDevice);
        }
    }
}

There are some similar questions, such as "CUDA - Sieve of Eratosthenes division into parts", and good examples posted as questions by people seeking to improve their code, such as "CUDA Primes Generation" and "Low performance in CUDA prime number generator". But reading through these hasn't helped me figure out what is going wrong in my code!

Any advice on how to debug more effectively while working with CUDA would be appreciated, and if you can point out what I am doing wrong (because I know it's not the computer's fault) you will have my respect forever.

Edit:

Apparently this issue is only happening for me, so perhaps it's the way I'm running my code...

$ nvcc parraPrimes.cu -o primes
$ ./primes
3
4
5
6

Additionally, running cuda-memcheck as recommended:

$ cuda-memcheck ./primes
========= CUDA-MEMCHECK
3
4
5
6
========= ERROR SUMMARY: 0 errors

The output from dmesg | grep NVRM is as follows:

[    3.480443] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  304.131  Sun Nov  8 21:43:33 PST 2015

nvidia-smi is not installed on my system.

Tags: parallel-processing cuda primes
