I am having trouble making a CUDA program manage an array of lambdas by their index. Here is an example code that reproduces the problem:
#include <cuda.h>
#include <vector>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#include <cassert>

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true){
    if (code != cudaSuccess) {
        fprintf(stderr, "GPUassert: %s %s %d\n",
                cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

template<typename Lambda>
__global__ void kernel(Lambda f){
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    printf("device: thread %i: ", t);
    printf("f() = %i\n", f() );
}

int main(int argc, char **argv){
    // arguments
    if(argc != 2){
        fprintf(stderr, "run as ./prog i\nwhere 'i' is function index");
        exit(EXIT_FAILURE);
    }
    int i = atoi(argv[1]);

    // lambdas
    auto lam0 = [] __host__ __device__ (){ return 333; };
    auto lam1 = [] __host__ __device__ (){ return 777; };

    // make vector of functions
    std::vector<int(*)()> v;
    v.push_back(lam0);
    v.push_back(lam1);

    // host: calling a function by index
    printf("host: f() = %i\n", (*v[i])() );

    // device: calling a function by index
    kernel<<< 1, 1 >>>( v[i] ); // does not work
    //kernel<<< 1, 1 >>>( lam0 ); // does work
    gpuErrchk( cudaPeekAtLastError() );
    gpuErrchk( cudaDeviceSynchronize() );
    return EXIT_SUCCESS;
}
Compiling with
nvcc -arch sm_60 -std=c++11 --expt-extended-lambda main.cu -o prog
The error I get when running is
➜ cuda-lambda ./prog 0
host: f() = 333
device: GPUassert: invalid program counter main.cu 53
It seems that CUDA cannot handle the int(*)() function pointer form (while host C++ handles it properly). On the other hand, each lambda is treated as a distinct data type, even when two lambdas are identical in code and have the same signature. How, then, can we achieve function-by-index in CUDA?
There are a few considerations here.
Although you suggest wanting to "manage an array of lambdas", you are actually relying on the graceful conversion of a lambda to a function pointer (possible when the lambda does not capture).
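For illustration, the conversion itself is plain host C++; a minimal sketch (the names here are arbitrary) looks like this:

    // a non-capturing lambda implicitly converts to an ordinary function pointer
    auto lam = [](){ return 42; };
    int (*fp)() = lam;       // OK: no captures, so the conversion operator exists
    // auto cap = [fp](){ return fp(); };
    // int (*fp2)() = cap;   // would not compile: a capturing lambda does not convert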
When you mark something as __host__ __device__, you are declaring to the compiler that two copies of that item need to be compiled (with two obviously different entry points): one for the CPU, and one for the GPU.

When we take a __host__ __device__ lambda and ask it to degrade to a function pointer, we are left with the question "which function pointer (entry point) to choose?" The compiler no longer has the option to carry the experimental lambda object around, so it must choose one or the other (host or device, CPU or GPU) for your vector. Whichever one it chooses, the vector could (will) break if used in the wrong environment.

One takeaway from this is that your two test cases are not the same. In one case (broken) you are passing a function pointer to the kernel (so the kernel is templated to accept a function pointer argument), and in the other case (working) you are passing a lambda to the kernel (so the kernel is templated to accept a lambda argument).
The problem here, in my view, is not simply arising out of use of a container, but arising out of the type of container you are using. I can demonstrate this in a simple way (see below) by converting your vector to a vector of actual lambda type. In that case, we can make the code "work" (sort of), but since every lambda has a unique type, this is an uninteresting demonstration. We can create a multi-element vector, but the only element we can store in it is one of your two lambdas (not both at the same time).
If we use a container that can handle dissimilar types (e.g. std::tuple), perhaps we can make some progress here, but I know of no direct method to index through the elements of such a container at runtime. Even if we could, the templated kernel accepting a lambda as argument/template type would have to be instantiated for each lambda. In my view, function pointers avoid this particular type "messiness".
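To make that concrete, here is a minimal sketch (after adding #include <tuple>; it assumes the lam0, lam1, kernel and i from the code above, and tup is an arbitrary name) of what tuple-based selection would look like:

    // std::get needs a compile-time index, so selecting "by index" at runtime
    // degenerates into one branch per element, and each branch instantiates
    // the kernel template for a different lambda type
    auto tup = std::make_tuple(lam0, lam1);
    if (i == 0) kernel<<< 1, 1 >>>( std::get<0>(tup) );
    else        kernel<<< 1, 1 >>>( std::get<1>(tup) );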
Therefore, as an answer to this question:
I would suggest, for the time being, separating function-by-index in host code from function-by-index in device code (e.g. two separate containers), and for function-by-index in device code, using any of the techniques (which don't use or depend on lambdas) covered in other questions, such as this one.
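One such technique, sketched roughly (the names func_t, f0, f1, d_table and kernel_by_index are made up for this sketch): build the table of function pointers in device code, so that the stored addresses are device entry points:

    typedef int (*func_t)();

    __device__ int f0(){ return 333; }
    __device__ int f1(){ return 777; }

    // the table is initialized in device code, so these are device entry points
    __device__ func_t d_table[2] = { f0, f1 };

    __global__ void kernel_by_index(int i){
        printf("device: f() = %i\n", d_table[i]() );
    }

    // host side: kernel_by_index<<< 1, 1 >>>( i );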
Here is a worked example (I think) demonstrating the note above, that we can create a vector of lambda "type", and use the resultant element(s) from that vector as lambdas in both host and device code:
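A minimal sketch of such a code (reusing the kernel template and gpuErrchk macro from the question; only main changes, and only lam0 is involved) might look roughly like this:

    int main(int argc, char **argv){
        if(argc != 2){
            fprintf(stderr, "run as ./prog i\nwhere 'i' is the function index\n");
            exit(EXIT_FAILURE);
        }
        int i = atoi(argv[1]);

        // a single __host__ __device__ lambda
        auto lam0 = [] __host__ __device__ (){ return 333; };

        // a vector of the lambda's own closure type: it can only ever hold
        // copies of lam0, because every lambda has a unique type
        std::vector<decltype(lam0)> v;
        v.push_back(lam0);
        v.push_back(lam0);

        // host: the element retrieved by index is still a lambda
        printf("host: f() = %i\n", v[i]() );

        // device: the templated kernel accepts the lambda element directly
        kernel<<< 1, 1 >>>( v[i] );
        gpuErrchk( cudaPeekAtLastError() );
        gpuErrchk( cudaDeviceSynchronize() );
        return EXIT_SUCCESS;
    }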
As already mentioned, this is not sensible code; it is presented only to prove a particular point.