Timing different sections in a CUDA kernel

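I have a CUDA kernel that calls a series of device functions.

What is the best way to get the execution time of each device function?

What is the best way to get the execution time of a section of code inside one of the device functions?

In my own code, I use the clock() function to get precise timings. For convenience, I have the macros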

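enum {
    tid_this = 0,
    tid_that,
    tid_count
    };
/* One accumulator per timer ID, summed over all blocks. */
__device__ float cuda_timers[ tid_count ];
#ifdef USETIMERS
 /* Thread 0 of each block records the starting tick count. */
 #define TIMER_TIC clock_t tic; if ( threadIdx.x == 0 ) tic = clock();
 /* Thread 0 adds the elapsed ticks to cuda_timers[tid], allowing for counter wrap-around. */
 #define TIMER_TOC(tid) clock_t toc = clock(); if ( threadIdx.x == 0 ) atomicAdd( &cuda_timers[tid] , ( toc > tic ) ? (toc - tic) : ( toc + (0xffffffff - tic) ) );
#else
 #define TIMER_TIC
 #define TIMER_TOC(tid)
#endif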

These can then be used to instrument the device code as follows:

__global__ void mykernel ( ... ) {

    /* Start the timer. */
    TIMER_TIC

    /* Do stuff. */
    ...

    /* Stop the timer and add the result to the "tid_this" counter. */
    TIMER_TOC( tid_this );

    }

You can then read cuda_timers in the host code.

A few notes:

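The timers work on a per-block basis, i.e. if you have 100 blocks executing the same kernel, the sum of all their times will be stored.
Having said that, the timers assume that thread zero is active, so make sure you do not call these macros in a possibly divergent part of the code.
The timers count clock ticks. To get milliseconds, divide the tick count by the device clock rate in kHz, which is the number of ticks per millisecond (equivalently, divide by the clock rate in GHz and then by 10^6).
The timers can slow down your code a bit, which is why I wrapped them in #ifdef USETIMERS so you can switch them off easily.
Although clock() returns integer values of type clock_t, I store the accumulated values as float; otherwise the values would wrap around for kernels that take longer than a few seconds (accumulated over all blocks).
The selection ( toc > tic ) ? (toc - tic) : ( toc + (0xffffffff - tic) ) is necessary in case the clock counter wraps around.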
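
The arithmetic in the last few notes can be sketched in plain host-side C++. This is only an illustration, not part of the macros above: the helper names elapsed_ticks and ticks_to_ms are hypothetical, and the sketch assumes, as the macro does, that the counter is 32 bits wide and that the clock rate is given in kHz.

```cpp
#include <cstdint>

// Hypothetical helper: ticks elapsed between two readings of a counter
// that is assumed to be 32 bits wide. Unsigned subtraction is carried out
// modulo 2^32, so a single wrap-around is handled without a branch.
inline std::uint32_t elapsed_ticks(std::uint32_t tic, std::uint32_t toc) {
    return toc - tic;
}

// Hypothetical helper: convert an accumulated tick count to milliseconds.
// clock_khz is the device clock rate in kHz, i.e. ticks per millisecond.
inline double ticks_to_ms(double ticks, double clock_khz) {
    return ticks / clock_khz;
}
```

For instance, 3,000,000 accumulated ticks on a device running at 1,500,000 kHz (1.5 GHz) correspond to 2 ms. Because the counters are summed over all blocks, this is aggregate time across blocks rather than wall-clock time. Note also that the modular subtraction differs by one tick from the macro's toc + (0xffffffff - tic) expression in the wrap-around case.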

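P.S. This is a copy of my answer to this question, which did not get many points there because what was needed was the time for the whole kernel.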