Timing a CUDA application using events

I am using the following two functions to time different parts (cudaMemcpyHtoD, kernel execution, cudaMemcpyDtoH) of my code (which includes multi-gpus, concurrent kernels on same GPU, sequential execution of kernels, et al). As I understand, these functions would record the time elapsed between the events, but I guess inserting events along the lifetime of the code may result in overheads and inaccuracies. I would like to hear criticisms, general advice to improve these functions and caveat emptors regarding them.

//Create event and start recording
cudaEvent_t *start_event(int device, cudaEvent_t *events, cudaStream_t streamid=0)
{
        cutilSafeCall( cudaSetDevice(device) );
        cutilSafeCall( cudaEventCreate(&events[0]) );
        cutilSafeCall( cudaEventCreate(&events[1]) );
        cudaEventRecord(events[0], streamid);

    return events;
 }

 //Return elapsed time and destroy events
 float end_event(int device, cudaEvent_t *events, cudaStream_t streamid=0)
 {

        float elapsed = 0.0;
        cutilSafeCall( cudaSetDevice(device) );
        cutilSafeCall( cudaEventRecord(events[1], streamid) );
        cutilSafeCall( cudaEventSynchronize(events[1]) );
        cutilSafeCall( cudaEventElapsedTime(&elapsed, events[0], events[1]) );

        cutilSafeCall( cudaEventDestroy( events[0] ) );
        cutilSafeCall( cudaEventDestroy( events[1] ) );

        return elapsed;
 }

Usage:

cudaEvent_t *events;
cudaEvent_t event[2]; //0 for start and 1 for end
...
events = start_event( cuda_device, event, 0 );
<Code to time>
printf("Time taken for the above code... - %f secs\n\n", (end_event(cuda_device, events, 0) / 1000) );

First, if this is for production code, you may want to be able to do something between the second cudaEventRecord and cudaEventSynchronize(). Otherwise, this could reduce the ability of your app to overlap GPU and CPU work.

Next, I would separate event creation and destruction from event recording. I'm not sure of the cost, but in general you might not want to call cudaEventCreate and cudaEventDestroy often.

What I would do is create a class like this

class EventTimer {
public:
  EventTimer() : mStarted(false), mStopped(false) {
    cudaEventCreate(&mStart);
    cudaEventCreate(&mStop);
  }
  ~EventTimer() {
    cudaEventDestroy(mStart);
    cudaEventDestroy(mStop);
  }
  void start(cudaStream_t s = 0) { cudaEventRecord(mStart, s); 
                                   mStarted = true; mStopped = false; }
  void stop(cudaStream_t s = 0)  { assert(mStarted);
                                   cudaEventRecord(mStop, s); 
                                   mStarted = false; mStopped = true; }
  float elapsed() {
    assert(mStopped);
    if (!mStopped) return 0; 
    cudaEventSynchronize(mStop);
    float elapsed = 0;
    cudaEventElapsedTime(&elapsed, mStart, mStop);
    return elapsed;
  }

private:
  bool mStarted, mStopped;
  cudaEvent_t mStart, mStop;
};

Note I didn't include cudaSetDevice() -- seems to me that should be left to the code that uses this class, to make it more flexible. The user would have to ensure the same device is active when start and stop are called.

PS: It is not NVIDIA's intent for CUTIL to be relied upon for production code -- it is used simply for convenience in our examples and is not as rigorously tested or optimized as the CUDA libraries and compilers themselves. I recommend you extract things like cutilSafeCall() into your own libraries and headers.