I'm a novice in OpenCL.
I have an algorithm which uses templates. It worked well with OpenMP parallelization but now the amount of data has grown and the only way to process it is to rewrite it to use OpenCL.
I could easily use MPI to run it on a cluster, but a Tesla-like GPU is much cheaper than a cluster :)
Is there any way to use C++ templates in OpenCL kernel?
Is it possible to expand the templates with a C++ compiler or some other tool and then use the resulting kernel function?
EDIT. The idea of a workaround is to somehow generate C99-compatible code from the templated C++ code.
I found the following about Comeau:
Comeau C++ 4.3.3 is a full and true compiler that performs full syntax checking, full semantic checking, full error checking and all other compiler duties. Input C++ code is translated into internal compiler trees and symbol tables looking nothing like C++ or C. As well, it generates an internal proprietary intermediate form. But instead of using a proprietary back end code generator, Comeau C++ 4.3.3 generates C code as its output. Besides the technical advantages of C++, the C generating aspects of products like Comeau C++ 4.3.3 have been touted as a reason for C++'s success since it was able to be brought to a large number of platforms due to the common availability of C compilers.
The C compiler is used merely and only for the sake of obtaining native code generation. This means that Comeau C++ is tailored for use with specific C compilers on each respective platform. Please note that it is a requirement that tailoring must be done by Comeau. Otherwise, the generated C code is meaningless as it is tied to a specific platform (where platform includes at least the CPU, OS, and C compiler) and furthermore, the generated C code is not standalone. Therefore, it cannot be used by itself (note that this is both a technical and legal requirement when using Comeau C++), and this is why there is not normally an option to see the generated C code: it's almost always unhelpful and the compile process, including its generation, should be considered as internal phases of translation.
There is an old way to emulate templates in pure C.
It is based on including a single file several times (without an include guard).
Since OpenCL has a fully functional preprocessor and allows including files, this trick can be used.
Here is a good explanation:
http://arnold.uthar.net/index.php?n=Work.TemplatesC
It is still much messier than C++ templates: the code has to be split into several parts, and you have to explicitly instantiate each instance of the template. Also, it seems that you cannot do some useful things, like implementing factorial as a recursive template.
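For readers less familiar with the C++ side, the recursive template mentioned above looks roughly like this. It is plain C++ shown for comparison, not OpenCL C, and the name Factorial is just an illustrative choice. The compiler walks through the instantiations on its own, which is the part the include trick does not seem to offer.

```cpp
#include <cstdint>

// Primary template: Factorial<N>::value = N * Factorial<N - 1>::value.
template <unsigned N>
struct Factorial {
    static constexpr std::uint64_t value = N * Factorial<N - 1>::value;
};

// Explicit specialization that ends the recursion: 0! = 1.
template <>
struct Factorial<0> {
    static constexpr std::uint64_t value = 1;
};

// Evaluated entirely at compile time.
static_assert(Factorial<5>::value == 120, "5! should be 120");
```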
Code example
Let's apply the idea to OpenCL. Suppose that we want to calculate the inverse square root by Newton-Raphson iteration (generally not a good idea), and that both the floating-point type and the number of iterations may vary.
First of all, we need a helper header ("templates.h"):
#ifndef TEMPLATES_H_
#define TEMPLATES_H_
#define CAT(X,Y,Z) X##_##Y##_##Z //concatenate words
#define TEMPLATE(X,Y,Z) CAT(X,Y,Z)
#endif
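Then, we write the template function in "NewtonRaphsonRsqrt.cl":
#include "templates.h"
//`real` and `iters` must be #defined by the including file before each #include
real TEMPLATE(NewtonRaphsonRsqrt, real, iters) (real x, real a) {
    int i;
    for (i = 0; i < iters; i++) {
        //cast both constants so the float instance is never promoted to double
        x *= ((real)1.5 - ((real)0.5 * a) * x * x);
    }
    return x;
}
//undefine the parameters so the next instantiation can redefine them cleanly
#undef real
#undef iters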
In your main .cl file, instantiate this template as follows:
#define real float
#define iters 2
#include "NewtonRaphsonRsqrt.cl" //defining NewtonRaphsonRsqrt_float_2
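#pragma OPENCL EXTENSION cl_khr_fp64 : enable //needed for double on OpenCL 1.0/1.1; the device must support fp64
#define real double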
#define iters 3
#include "NewtonRaphsonRsqrt.cl" //defining NewtonRaphsonRsqrt_double_3
#define real double
#define iters 4
#include "NewtonRaphsonRsqrt.cl" //defining NewtonRaphsonRsqrt_double_4
Then you can use it like this:
double prec = TEMPLATE(NewtonRaphsonRsqrt, double, 4) (1.5, 0.5);
float approx = TEMPLATE(NewtonRaphsonRsqrt, float, 2) (1.5f, 0.5f);
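If including the same file several times is inconvenient, a single-file variant of the same idea is to let a function-like macro stamp out each instance. The sketch below is written as host C++ so that it can be compiled and checked without a device; the macro part should carry over to OpenCL C, but that is an assumption rather than something verified here. DEFINE_NEWTON_RAPHSON_RSQRT and check_instances are names made up for this sketch.

```cpp
#include <cassert>
#include <cmath>

#define CAT(X, Y, Z) X##_##Y##_##Z      // paste three tokens with underscores
#define TEMPLATE(X, Y, Z) CAT(X, Y, Z)  // extra level so arguments expand first

// Hypothetical helper macro: defines one instance of the function for the
// given floating-point type and iteration count.
#define DEFINE_NEWTON_RAPHSON_RSQRT(real, iters)                      \
    real TEMPLATE(NewtonRaphsonRsqrt, real, iters)(real x, real a) {  \
        for (int i = 0; i < (iters); i++) {                           \
            x *= ((real)1.5 - ((real)0.5 * a) * x * x);               \
        }                                                             \
        return x;                                                     \
    }

DEFINE_NEWTON_RAPHSON_RSQRT(float, 2)   // defines NewtonRaphsonRsqrt_float_2
DEFINE_NEWTON_RAPHSON_RSQRT(double, 4)  // defines NewtonRaphsonRsqrt_double_4

// Usage: after expansion both instances are ordinary functions.
// x = 1.5 is the initial guess for 1/sqrt(a) with a = 0.5.
inline void check_instances() {
    const double exact = 1.0 / std::sqrt(0.5);
    assert(std::fabs(NewtonRaphsonRsqrt_double_4(1.5, 0.5) - exact) < 1e-9);
    assert(std::fabs(NewtonRaphsonRsqrt_float_2(1.5f, 0.5f) - (float)exact) < 1e-3f);
}
```

The trade-off is that the whole function body lives inside a macro, so every line needs a trailing backslash and compiler diagnostics point at the expansion site rather than at the line inside the body.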
I have written an experimental C++ to OpenCL C source transformation tool. The tool compiles C++ source (even some STL) into LLVM bitcode, and uses a modified version of the LLVM 'C' back-end to translate the bitcode into OpenCL 'C'.
Please see http://dimitri-christodoulou.blogspot.com/2013/12/writing-opencl-kernels-in-c.html
For example, this code using C++11's std::enable_if can be converted into OpenCL 'C' and then executed on the GPU:
#include <type_traits>
template<class T>
T foo(T t, typename std::enable_if<std::is_integral<T>::value >::type* = 0)
{
    return 1;
}
template<class T>
T foo(T t, typename std::enable_if<std::is_floating_point<T>::value >::type* = 0)
{
    return 0;
}
extern "C" void _Kernel_enable_if_int_argument(int* arg0, int* out)
{
    out[0] = foo(arg0[0]);
}
You can have a look at VexCL, which uses expression templates to generate OpenCL kernels. It may give you some ideas on how to make OpenCL work nicely with templates.
Another library that is being actively worked on is Boost.Compute, a layer on top of OpenCL that allows generic C++ code.
The general idea is to create the kernel more or less as a C string and pass it down to the OpenCL runtime for compilation and execution.
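A minimal sketch of that string-building idea, not how VexCL or Boost.Compute actually implement it: a C++ template on the host picks the type name and emits OpenCL C source for that type. ClTypeName and make_axpy_source are hypothetical names, and the axpy kernel is only an illustrative payload. Handing the resulting string to the runtime (for example through clCreateProgramWithSource and clBuildProgram) is left out so that the example needs no OpenCL headers.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Hypothetical trait: maps a C++ type to the spelling of the OpenCL C type.
template <class T> struct ClTypeName;
template <> struct ClTypeName<float>  { static const char* get() { return "float"; } };
template <> struct ClTypeName<double> { static const char* get() { return "double"; } };

// Builds OpenCL C source for y[i] = a * x[i] + y[i], specialised for T.
template <class T>
std::string make_axpy_source(const std::string& kernel_name) {
    assert(!kernel_name.empty());  // the kernel needs a usable name
    const std::string t = ClTypeName<T>::get();
    std::string src;
    if (std::is_same<T, double>::value) {
        // double needs the fp64 extension enabled on OpenCL 1.0/1.1
        src += "#pragma OPENCL EXTENSION cl_khr_fp64 : enable\n";
    }
    src += "__kernel void " + kernel_name + "(" + t + " a,\n"
           "    __global const " + t + "* x,\n"
           "    __global " + t + "* y) {\n"
           "    size_t i = get_global_id(0);\n"
           "    y[i] = a * x[i] + y[i];\n"
           "}\n";
    return src;
}
```

Calling make_axpy_source<float>("axpy_f") and make_axpy_source<double>("axpy_d") yields two independent kernel sources, so the "instantiation" happens on the host and the device compiler only ever sees plain OpenCL C.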
If you're really determined to get it done, you could re-target your C++ compiler of choice to generate NVIDIA PTX (and Clang is likely to be able to do it soon anyway). But this way you'd bind your code to NVIDIA hardware.
Another way is to implement a custom backend for LLVM, based on the current CBE, which will generate pure OpenCL code instead of C.
Note that the new Khronos SYCL standard has native support for C++ templates in OpenCL.
PyOpenCL is now using Mako as its template engine. http://www.makotemplates.org/