How to create or manipulate GPU assembler?

Does any one have experience in creating/manipulating GPU machine code, possibly at run-time?

I am interested in modifying GPU assembler code, possibly at run time with minimal overhead. Specifically I'm interested in assembler based genetic programming.

I understand ATI has released ISAs for some of their cards, and nvidia recently released a disassembler for CUDA for older cards, but I am not sure if it is possible to modify instructions in memory at runtime or even before hand.

Is this possible? Any related information is welcome.

标签： assembly stream cuda gpgpu

6条回答

够拽才男人

2楼-- · 2019-01-21 23:53

These links might be interesting for you, although its easy to find them, so probably you already seen it:

http://www.worldlingo.com/ma/enwiki/en/ARB_(GPU_assembly_language)

http://developer.nvidia.com/object/gpu_programming_guide.html

http://developer.amd.com/gpu/Pages/default.aspx

http://msdn.microsoft.com/en-us/library/bb219840.aspx

http://www.khronos.org/opencl/

http://www.comp.nus.edu.sg/~ashwinna/docs/CS6282_Modeling_the_GPU.pdf

0人赞添加讨论(0) 举报

傲

3楼-- · 2019-01-21 23:56

In the CUDA driver API, the module management functions allow an application to load at runtime a "module", which is (roughly) a PTX or cubin file. PTX is the intermediate language, while cubin is an already compiled set of instructions. cuModuleLoadData() and cuModuleLoadDataEx() appear to be capable of "loading" the module from a pointer in RAM, which means that no actual file is required.

So your problem seems to be: how to programmatically build a cubin module in RAM ? As far as I know, NVIDIA never released details on the instructions actually understood by their hardware. There is, however, an independent opensource package called decuda which includes "cudasm", a assembler for what the "older" NVIDIA GPU understand ("older" = GeForce 8xxx and 9xxx). I do not know how easy it would be to integrate in a wider application; it is written in Python.

Newer NVIDIA GPU use a distinct instruction set (how much distinct, I do not know), so a cubin for an old GPU ("computing capability 1.x" in NVIDIA/CUDA terminology) may not work on a recent GPU (computing capability 2.x, i.e. "Fermi architecture" such as a GTX 480). Which is why PTX is usually preferred: a given PTX file will be portable across GPU generations.

0人赞添加讨论(0) 举报

看我几分像从前

4楼-- · 2019-01-21 23:56

An assembler for the NVIDIA Fermi ISA: http://code.google.com/p/asfermi

0人赞添加讨论(0) 举报

欢心

5楼-- · 2019-01-22 00:00

NVIDIA PTX generation and modification

Not sure how low level it is compared to the hardware (likely undocumented?), but it can be generated from C/C++-like GPU languages, modified and reused in a few ways:

OpenCL clGetProgramInfo(program, CL_PROGRAM_BINARIES + clCreateProgramWithBinary: minimal runnable example: How to use clCreateProgramWithBinary in OpenCL?

These are standardized OpenCL API's, which produce and consume implementation defined formats, which in driver version 375.39 for Linux happens to be human readable PTX.

So you can dump the PTX, modify it, and reload.
nvcc: can compile CUDA GPU-side code to ptx assembly simply with either:
```
nvcc --ptx a.cu
```
nvcc can also compile OpenCL C programs containing both device and host code: Compile and build .cl file using NVIDIA's nvcc Compiler? but I could not find how to get the ptx out with nvcc. Which kind of makes sense since it is just plain C + C strings, and not a magic C superset. This is also suggested by: https://arrayfire.com/generating-ptx-files-from-opencl-code/

And I'm not sure how to recompile the modified PTX and use it as I did with clCreateProgramWithBinary: How to compile PTX code

Using clGetProgramInfo, an input CL kernel:

__kernel void kmain(__global int *out) {
    out[get_global_id(0)]++;
}

gets compiled to some PTX like:

//
// Generated by NVIDIA NVVM Compiler
//
// Compiler Build ID: CL-21124049
// Cuda compilation tools, release 8.0, V8.0.44
// Based on LLVM 3.4svn
//

.version 5.0
.target sm_20
.address_size 64

    // .globl   _Z3incPi

.visible .entry _Z3incPi(
    .param .u64 _Z3incPi_param_0
)
{
    .reg .pred  %p<2>;
    .reg .b32   %r<4>;
    .reg .b64   %rd<5>;


    ld.param.u64    %rd1, [_Z3incPi_param_0];
    mov.u32     %r1, %ctaid.x;
    setp.gt.s32 %p1, %r1, 2;
    @%p1 bra    BB0_2;

    cvta.to.global.u64  %rd2, %rd1;
    mul.wide.s32    %rd3, %r1, 4;
    add.s64     %rd4, %rd2, %rd3;
    ldu.global.u32  %r2, [%rd4];
    add.s32     %r3, %r2, 1;
    st.global.u32   [%rd4], %r3;

BB0_2:
    ret;
}

Then if for example you modify the line:

add.s32     %r3, %r2, 1;

to:

add.s32     %r3, %r2, 2;

and reuse the PTX modified, it actually increments by 2 instead of 1 as expected.

0人赞添加讨论(0) 举报

爷的心禁止访问

6楼-- · 2019-01-22 00:09

OpenCL is done for that purpose. You provide a program as a string and possibly compile it at runtime. See links provided by other poster.

0人赞添加讨论(0) 举报

聊天终结者

7楼-- · 2019-01-22 00:17

I've found gpuocelot open-source (BSD Licence) project interesting.

It's "a dynamic compilation framework for PTX". I would call it cpu translator.

"Ocelot currently allows CUDA programs to be executed on NVIDIA GPUs, AMD GPUs, and x86-CPUs". As far as I know, this framework do control-flow and data-flow analysis on PTX Kernel in order to apply proper transformations.

0人赞添加讨论(0) 举报

How to create or manipulate GPU assembler?

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间