how fast is thrust::sort and what is the fastest r

2020-07-22 17:07发布

I'm a newbie to GPU programming. Recently, I'm trying to implement the gpu bvh construction algorithm based on an tutorial: http://devblogs.nvidia.com/parallelforall/thinking-parallel-part-iii-tree-construction-gpu/. In the first step of this algorithm, the morton code(unsigned int) of every primitive is computed and sorted. The tutorial gives a reference time cost of the morton code computing and sorting for 12K objects:

  1. 0.02 ms, one thread per object: Calculate bounding box and assign Morton code.
  2. 0.18 ms, parallel radix sort: Sort the objects according to their Morton codes. ...

In my implementation, the first step cost 0.1ms and the sorting step costs 1.8ms. I'm using thrust to do the sorting. So what is the fastest implementation of radix sort on GPU?

I'm using a Geforce Titan GPU which should faster than the Geforce GTX690 used by the author of the tutorial. Here is my test code for sorting, it costs about 1.5ms even when the size is 10.

void testSort()
{
    int sz = 10;
    thrust::host_vector<unsigned int> h_keys(sz);
    for(int i=0; i<sz; i++)
        h_keys[i] = rand();
    thrust::device_ptr<unsigned int> keys = thrust::device_malloc<unsigned int>(sz);
    thrust::copy(h_keys.begin(),h_keys.end(),keys);
    cudaEvent_t estart, estop;
    cudaEventCreate( &estart );
    cudaEventCreate( &estop );
    cudaEventRecord( estart, 0 );
    thrust::stable_sort(keys,keys+sz);
    cudaEventRecord( estop, 0 ) ;
    cudaEventSynchronize( estop );
    float elapsedTime;
    cudaEventElapsedTime( &elapsedTime,
        estart, estop ) ;
    printf( "Time to sort: %3.1f ms\n", elapsedTime );
    cudaEventDestroy( estart ) ;
    cudaEventDestroy( estop ) ;
}

1条回答
家丑人穷心不美
2楼-- · 2020-07-22 17:47

There is a Radix sort implementation for GPGPU by back40computing. They provide a performance comparison chart with which they claim that their implementation is the fastest.

查看更多
登录 后发表回答