I am just getting started with CUDA, and after going over the vector sum tutorials here I thought I would try something from scratch to really get my legs under me.
That said, I don't know whether the trouble here is a simple fix or a whole host of issues.
The plain English description of my code is as follows:
First, there is a counterClass with members num and count. By resetting count to 0 whenever count equals num, each counter keeps track of the remainder of n divided by num as n iterates up through the integers.
I have 2 functions that I want to run in parallel. The first, count, increments all my counters (in parallel); the second, check, tests whether any of the counters reads 0 (in parallel). If a counter reads 0, its num divides n evenly, meaning that n isn't prime.
While I would like my code to print only prime numbers, it prints all the numbers...
Here's the code:
#include <stdio.h>
#include <stdlib.h>

typedef struct{
    int num;
    int count;
} counterClass;

counterClass new_counterClass(counterClass aCounter, int by, int count){
    aCounter.num = by;
    aCounter.count = count%by;
    return aCounter;
}

__global__ void count(counterClass *Counters){
    int idx = threadIdx.x+blockDim.x*blockIdx.x;
    Counters[idx].count+=1;
    if(Counters[idx].count == Counters[idx].num){
        Counters[idx].count = 0;
    }
    __syncthreads();
}

__global__ void check(counterClass *Counters, bool *result){
    int idx = threadIdx.x+blockDim.x*blockIdx.x;
    if (Counters[idx].count == 0){
        *result = false;
    }
    __syncthreads();
}

int main(){
    int tPrimes = 5; // Total Primes to Find
    int nPrimes = 1; // Number of Primes Found
    bool *d_result, h_result=true;
    counterClass *h_counters =(counterClass *)malloc(tPrimes*sizeof(counterClass));
    h_counters[0]=new_counterClass(h_counters[0], 2 , 0);
    counterClass *d_counters;
    int n = 2;
    cudaMalloc((void **)&d_counters, tPrimes*sizeof(counterClass));
    cudaMalloc((void **)&d_result, sizeof(bool));
    cudaMemcpy(d_counters, h_counters, tPrimes*sizeof(counterClass), cudaMemcpyHostToDevice);
    while(nPrimes<tPrimes){
        h_result=true;
        cudaMemcpy(d_result, &h_result, sizeof(bool), cudaMemcpyHostToDevice);
        n+=1;
        count<<<1,nPrimes>>>(d_counters);
        check<<<1,nPrimes>>>(d_counters,d_result);
        cudaMemcpy(&h_result, d_result, sizeof(bool), cudaMemcpyDeviceToHost);
        if(h_result){
            printf("%d\n", n);
            cudaMemcpy(h_counters, d_counters, tPrimes*sizeof(counterClass), cudaMemcpyDeviceToHost);
            h_counters[nPrimes]=new_counterClass(h_counters[nPrimes], n , 0);
            nPrimes += 1;
            cudaMemcpy(d_counters, h_counters, tPrimes*sizeof(counterClass), cudaMemcpyHostToDevice);
        }
    }
}
There are some similar questions ("CUDA - Sieve of Eratosthenes division into parts") and good examples posted as questions by people seeking to improve their code ("CUDA Primes Generation" and "Low performance in CUDA prime number generator"), but reading through these hasn't helped me figure out what is going wrong in my code!
Any advice on how to debug more effectively while working with CUDA would be appreciated, and if you can point out what I am doing wrong (because I know it's not the computer's fault) you will have my respect forever.
Edit:
Apparently this issue is only happening for me, so perhaps it's the way I'm running my code...
$ nvcc parraPrimes.cu -o primes
$ ./primes
3
4
5
6
Additionally, using cuda-memcheck as recommended:
$ cuda-memcheck ./primes
========= CUDA-MEMCHECK
3
4
5
6
========= ERROR SUMMARY: 0 errors
The output from dmesg | grep NVRM is as follows:
[ 3.480443] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 304.131 Sun Nov 8 21:43:33 PST 2015
nvidia-smi is not installed on my system.
Installing nvidia-cuda-toolkit with apt does not install CUDA.
You can install CUDA from NVIDIA's website (use the .deb).