Does any one have experience in creating/manipulating GPU machine code, possibly at run-time?
I am interested in modifying GPU assembler code, possibly at run time with minimal overhead. Specifically I'm interested in assembler based genetic programming.
I understand ATI has released ISAs for some of their cards, and nvidia recently released a disassembler for CUDA for older cards, but I am not sure if it is possible to modify instructions in memory at runtime or even before hand.
Is this possible? Any related information is welcome.
These links might be interesting for you, although its easy to find them, so probably you already seen it:
http://www.worldlingo.com/ma/enwiki/en/ARB_(GPU_assembly_language)
http://developer.nvidia.com/object/gpu_programming_guide.html
http://developer.amd.com/gpu/Pages/default.aspx
http://msdn.microsoft.com/en-us/library/bb219840.aspx
http://www.khronos.org/opencl/
http://www.comp.nus.edu.sg/~ashwinna/docs/CS6282_Modeling_the_GPU.pdf
In the CUDA driver API, the module management functions allow an application to load at runtime a "module", which is (roughly) a PTX or cubin file. PTX is the intermediate language, while cubin is an already compiled set of instructions. cuModuleLoadData()
and cuModuleLoadDataEx()
appear to be capable of "loading" the module from a pointer in RAM, which means that no actual file is required.
So your problem seems to be: how to programmatically build a cubin module in RAM ? As far as I know, NVIDIA never released details on the instructions actually understood by their hardware. There is, however, an independent opensource package called decuda which includes "cudasm", a assembler for what the "older" NVIDIA GPU understand ("older" = GeForce 8xxx and 9xxx). I do not know how easy it would be to integrate in a wider application; it is written in Python.
Newer NVIDIA GPU use a distinct instruction set (how much distinct, I do not know), so a cubin for an old GPU ("computing capability 1.x" in NVIDIA/CUDA terminology) may not work on a recent GPU (computing capability 2.x, i.e. "Fermi architecture" such as a GTX 480). Which is why PTX is usually preferred: a given PTX file will be portable across GPU generations.
I've found gpuocelot open-source (BSD Licence) project interesting.
It's "a dynamic compilation framework for PTX". I would call it cpu translator.
"Ocelot currently allows CUDA programs to be executed on NVIDIA GPUs, AMD GPUs, and x86-CPUs". As far as I know, this framework do control-flow and data-flow analysis on PTX Kernel in order to apply proper transformations.
OpenCL is done for that purpose. You provide a program as a string and possibly compile it at runtime. See links provided by other poster.
An assembler for the NVIDIA Fermi ISA: http://code.google.com/p/asfermi
NVIDIA PTX generation and modification
Not sure how low level it is compared to the hardware (likely undocumented?), but it can be generated from C/C++-like GPU languages, modified and reused in a few ways:
OpenCL clGetProgramInfo(program, CL_PROGRAM_BINARIES
+ clCreateProgramWithBinary
: minimal runnable example: How to use clCreateProgramWithBinary in OpenCL?
These are standardized OpenCL API's, which produce and consume implementation defined formats, which in driver version 375.39 for Linux happens to be human readable PTX.
So you can dump the PTX, modify it, and reload.
nvcc
: can compile CUDA GPU-side code to ptx assembly simply with either:
nvcc --ptx a.cu
nvcc
can also compile OpenCL C programs containing both device and host code: Compile and build .cl file using NVIDIA's nvcc Compiler? but I could not find how to get the ptx out with nvcc. Which kind of makes sense since it is just plain C + C strings, and not a magic C superset. This is also suggested by: https://arrayfire.com/generating-ptx-files-from-opencl-code/
And I'm not sure how to recompile the modified PTX and use it as I did with clCreateProgramWithBinary
: How to compile PTX code
Using clGetProgramInfo
, an input CL kernel:
__kernel void kmain(__global int *out) {
out[get_global_id(0)]++;
}
gets compiled to some PTX like:
//
// Generated by NVIDIA NVVM Compiler
//
// Compiler Build ID: CL-21124049
// Cuda compilation tools, release 8.0, V8.0.44
// Based on LLVM 3.4svn
//
.version 5.0
.target sm_20
.address_size 64
// .globl _Z3incPi
.visible .entry _Z3incPi(
.param .u64 _Z3incPi_param_0
)
{
.reg .pred %p<2>;
.reg .b32 %r<4>;
.reg .b64 %rd<5>;
ld.param.u64 %rd1, [_Z3incPi_param_0];
mov.u32 %r1, %ctaid.x;
setp.gt.s32 %p1, %r1, 2;
@%p1 bra BB0_2;
cvta.to.global.u64 %rd2, %rd1;
mul.wide.s32 %rd3, %r1, 4;
add.s64 %rd4, %rd2, %rd3;
ldu.global.u32 %r2, [%rd4];
add.s32 %r3, %r2, 1;
st.global.u32 [%rd4], %r3;
BB0_2:
ret;
}
Then if for example you modify the line:
add.s32 %r3, %r2, 1;
to:
add.s32 %r3, %r2, 2;
and reuse the PTX modified, it actually increments by 2 instead of 1 as expected.