CUDA C++11, array of lambdas, function by index

Asked 2019-07-31 23:43

I am having trouble trying to make a CUDA program manage an array of lambdas by their index. Here is an example that reproduces the problem:

 #include <cuda.h>
 #include <vector>
 #include <stdio.h>
 #include <stdlib.h>
 #include <time.h>
 #include <sys/time.h>
 #include <cassert>

 #define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
 inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true){
     if (code != cudaSuccess) {
         fprintf(stderr,"GPUassert: %s %s %d\n",
         cudaGetErrorString(code), file, line);
         if (abort) exit(code);
     }   
 }

 template<typename Lambda>
 __global__ void kernel(Lambda f){ 
     int t = blockIdx.x * blockDim.x + threadIdx.x;
     printf("device: thread %i: ", t); 
     printf("f() = %i\n", f() );
 }

 int main(int argc, char **argv){
     // arguments
     if(argc != 2){ 
         fprintf(stderr, "run as ./prog i\nwhere 'i' is function index");
         exit(EXIT_FAILURE);
     }   
     int i = atoi(argv[1]);


     // lambdas
     auto lam0 = [] __host__ __device__ (){ return 333; };
     auto lam1 = [] __host__ __device__ (){ return 777; };


     // make vector of functions
     std::vector<int(*)()> v;
     v.push_back(lam0);
     v.push_back(lam1);


     // host: calling a function by index
     printf("host: f() = %i\n", (*v[i])() );


     // device: calling a function by index
     kernel<<< 1, 1 >>>( v[i] ); // does not work
     //kernel<<< 1, 1 >>>( lam0 ); // does work
     gpuErrchk( cudaPeekAtLastError() );
     gpuErrchk( cudaDeviceSynchronize() );
     return EXIT_SUCCESS;
 }

Compiling with

nvcc -arch sm_60 -std=c++11 --expt-extended-lambda main.cu -o prog

The error I get when running is

➜  cuda-lambda ./prog 0
host: f() = 333
device: GPUassert: invalid program counter main.cu 53

It seems that CUDA cannot handle the int(*)() function pointer form (while host C++ handles it properly). On the other hand, each lambda has a distinct type, even when two lambdas are textually identical and have the same signature. Then, how can we achieve function by index in CUDA?

1 Answer

趁早两清
answered 2019-08-01 00:14

There are a few considerations here.

Although you suggest wanting to "manage an array of lambdas", you are actually relying on the implicit conversion of a lambda to a function pointer (possible only when the lambda does not capture).
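
As a host-only sketch of that conversion rule (the names here are illustrative, not from the question's code):

 #include <cstdio>

 int main(){
     auto lam = [](){ return 333; };   // captureless: conversion available
     int (*fp)() = lam;                // implicit conversion to int(*)()
     printf("%i\n", fp());             // prints 333

     int x = 5;
     auto cap = [x](){ return x; };    // capturing: no conversion to int(*)()
     // int (*bad)() = cap;            // would not compile
     return 0;
 }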

When you mark something as __host__ __device__, you are telling the compiler that two copies of it need to be compiled (with two obviously different entry points): one for the CPU, and one for the GPU.
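
For instance (an illustrative fragment, not taken from the question):

 #include <cstdio>

 // Two versions of twice() are emitted, one per compilation target.
 __host__ __device__ int twice(int x){ return 2 * x; }

 __global__ void k(int *out){ *out = twice(21); }   // uses the GPU copy

 int main(){
     int *d, h = 0;
     cudaMalloc(&d, sizeof(int));
     k<<< 1, 1 >>>(d);
     cudaMemcpy(&h, d, sizeof(int), cudaMemcpyDeviceToHost);
     printf("device: %i, host: %i\n", h, twice(21)); // uses the CPU copy
     cudaFree(d);
     return 0;
 }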

When we take a __host__ __device__ lambda and ask it to decay to a function pointer, we are left with the question: which function pointer (entry point) should be chosen? The compiler no longer has the option of carrying the experimental lambda object around; it must choose one or the other (host or device, CPU or GPU) for your vector. Whichever one it chooses, the vector will break if used in the wrong environment.

One takeaway from this is that your two test cases are not the same. In one case (broken) you are passing a function pointer to the kernel (so the kernel template is instantiated for a function pointer argument), and in the other case (working) you are passing a lambda to the kernel (so the kernel template is instantiated for a lambda argument).
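
A small host-only sketch of that difference in deduced type (inspect is a hypothetical helper, not part of the question's code):

 #include <cstdio>
 #include <type_traits>

 template <typename T>
 void inspect(T){
     // Reports whether the deduced argument type is a plain pointer.
     printf("function pointer: %s\n",
            std::is_pointer<T>::value ? "yes" : "no");
 }

 int main(){
     auto lam0 = [](){ return 333; };
     int (*fp)() = lam0;
     inspect(lam0); // "no"  -> unique closure type
     inspect(fp);   // "yes" -> already decayed to an entry-point address
     return 0;
 }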

The problem here, in my view, does not arise simply from the use of a container, but from the type of container you are using. I can demonstrate this in a simple way (see below) by converting your vector to a vector of actual lambda type. In that case we can make the code "work" (sort of), but since every lambda has a unique type, this is an uninteresting demonstration: we can create a multi-element vector, but the only element we can store in it is one of your two lambdas (not both at the same time).

If we use a container that can handle dissimilar types (e.g. std::tuple), perhaps we can make some progress here, but I know of no direct method to index through the elements of such a container. Even if we could, the template kernel accepting a lambda as argument/template type would have to be instantiated once for each lambda (one indirect way to do this is sketched below).
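
For completeness, a sketch of that indirect route in C++11: walk the tuple at compile time and compare the runtime index at each position. launch_by_index is my name for this hypothetical helper; note that it instantiates the kernel once per tuple element, exactly the cost described above.

 #include <cstdio>
 #include <cstddef>
 #include <tuple>

 template <typename Lambda>
 __global__ void kernel(Lambda f){
     printf("device: f() = %i\n", f());
 }

 // Base case: the index walked past the end of the tuple; nothing to launch.
 template <std::size_t N, typename Tuple>
 typename std::enable_if<N == std::tuple_size<Tuple>::value>::type
 launch_by_index(const Tuple&, int){}

 // Recursive case: launch with element N if the runtime index matches,
 // otherwise recurse on N+1. One kernel instantiation per tuple element.
 template <std::size_t N = 0, typename Tuple>
 typename std::enable_if<(N < std::tuple_size<Tuple>::value)>::type
 launch_by_index(const Tuple &t, int i){
     if (i == (int)N) kernel<<< 1, 1 >>>(std::get<N>(t));
     else launch_by_index<N + 1, Tuple>(t, i);
 }

With the lambdas from the question, usage would look like auto funcs = std::make_tuple(lam0, lam1); launch_by_index(funcs, i); compiled with the same --expt-extended-lambda flag.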

In my view, function pointers avoid this particular type "messiness".

Therefore, as an answer to this question:

Then, how can we achieve function by index in CUDA?

I would suggest for the time being that function by index in host code be separated (e.g. two separate containers) from function by index in device code, and for function by index in device code, you use any of the techniques (which don't use or depend on lambdas) covered in other questions, such as this one.
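
As a minimal sketch of the device-side half of that suggestion (plain __device__ functions standing in for the lambdas; the names f0, f1, and d_table are illustrative):

 #include <cstdio>
 #include <cstdlib>

 __device__ int f0(){ return 333; }
 __device__ int f1(){ return 777; }

 // Table of device function pointers, initialized in device code so the
 // stored addresses are device entry points.
 typedef int (*func_t)();
 __device__ func_t d_table[2] = { f0, f1 };

 __global__ void kernel(int i){
     int t = blockIdx.x * blockDim.x + threadIdx.x;
     printf("device: thread %i: f() = %i\n", t, d_table[i]());
 }

 int main(int argc, char **argv){
     if (argc != 2) return EXIT_FAILURE;
     int i = atoi(argv[1]);
     kernel<<< 1, 1 >>>(i);    // the index is resolved on the device
     cudaDeviceSynchronize();
     return EXIT_SUCCESS;
 }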

Here is a worked example (I think) demonstrating the note above, that we can create a vector of lambda "type", and use the resultant element(s) from that vector as lambdas in both host and device code:

$ cat t64.cu
 #include <cuda.h>
 #include <vector>
 #include <stdio.h>
 #include <stdlib.h>
 #include <time.h>
 #include <sys/time.h>
 #include <cassert>

 #define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
 inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true){
     if (code != cudaSuccess) {
         fprintf(stderr,"GPUassert: %s %s %d\n",
         cudaGetErrorString(code), file, line);
         if (abort) exit(code);
     }
 }


 template<typename Lambda>
 __global__ void kernel(Lambda f){
     int t = blockIdx.x * blockDim.x + threadIdx.x;
     printf("device: thread %i: ", t);
     printf("f() = %i\n", f() );
 }

 template <typename T>
 std::vector<T> fill(T L0, T L1){
   std::vector<T> v;
   v.push_back(L0);
   v.push_back(L1);
   return v;
}

 int main(int argc, char **argv){
     // arguments
     if(argc != 2){
         fprintf(stderr, "run as ./prog i\nwhere 'i' is function index");
         exit(EXIT_FAILURE);
     }
     int i = atoi(argv[1]);


     // lambdas
     auto lam0 = [] __host__ __device__ (){ return 333; };
     auto lam1 = [] __host__ __device__ (){ return 777; };

     auto v = fill(lam0, lam0); // note: must use the same lambda twice; lam0 and lam1 have different types

     // make vector of functions
 //    std::vector< int(*)()> v;
 //    v.push_back(lam0);
 //    v.push_back(lam1);


     // host: calling a function by index
     printf("host: f() = %i\n", (*v[i])() );


     // device: calling a function by index
     kernel<<< 1, 1 >>>( v[i] ); // now works: v[i] has lambda type, not function-pointer type
     //kernel<<< 1, 1 >>>( lam0 ); // does work
     gpuErrchk( cudaPeekAtLastError() );
     gpuErrchk( cudaDeviceSynchronize() );
     return EXIT_SUCCESS;
 }

$ nvcc -arch sm_61 -std=c++11 --expt-extended-lambda t64.cu -o t64
$ cuda-memcheck ./t64 0
========= CUDA-MEMCHECK
host: f() = 333
device: thread 0: f() = 333
========= ERROR SUMMARY: 0 errors
$ cuda-memcheck ./t64 1
========= CUDA-MEMCHECK
host: f() = 333
device: thread 0: f() = 333
========= ERROR SUMMARY: 0 errors
$

As mentioned above, this is not sensible code. It is presented to prove a particular point.
