I want to use assembly code in CUDA C code in order to reduce expensive executions as we do using asm in c programming.
Is it possible?
I want to use assembly code in CUDA C code in order to reduce expensive executions as we do using asm in c programming.
Is it possible?
No, you can't, there is nothing like the asm constructs from C/C++. What you can do is tweak the generated PTX assembly and then use it with CUDA.
See this for an example.
But for GPUs, assembly optimizations are NOT necessary, you should do other optimizations first, such as memory coalescency and occupancy. See the CUDA Best Practices guide for more information.
Since CUDA 4.0, inline PTX is supported by the CUDA toolchain. There is a document in the toolkit that describes it: Using_Inline_PTX_Assembly_In_CUDA.pdf
Below is some code demonstrating use of inline PTX in CUDA 4.0. Note that this code should not be used as a replacement for CUDA's built-in __clz() function, I merely wrote it to explore aspects of the new inline PTX capability.