I'm evaluating CUDA and currently using the Thrust library to sort numbers.
I'd like to create my own comparer for thrust::sort, but it slows down dramatically! I created my own less implementation by just copying the code from functional.h. However, it seems to be compiled in some other way and runs very slowly.
- default comparer: thrust::less() - 94ms
- my own comparer: less() - 906ms
I'm using Visual Studio 2010. What should I do to get the same performance as with option 1?
Complete code:
#include <stdio.h>
#include <stdlib.h>   // rand, srand
#include <time.h>     // time, clock
#include <cuda.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>

// Returns a pseudo-random int, reseeding every 10000 calls.
int myRand()
{
    static int counter = 0;
    if ( counter++ % 10000 == 0 )
        srand(time(NULL)+counter);
    return (rand()<<16) | rand();
}

// Copy of thrust::less from functional.h.
template<typename T>
struct less : public thrust::binary_function<T,T,bool>
{
    __host__ __device__ bool operator()(const T &lhs, const T &rhs) const {
        return lhs < rhs;
    }
};

int main()
{
    thrust::host_vector<int> h_vec(10 * 1000 * 1000);
    thrust::generate(h_vec.begin(), h_vec.end(), myRand);
    thrust::device_vector<int> d_vec = h_vec;   // copy to the device

    clock_t clc = clock();
    thrust::sort(d_vec.begin(), d_vec.end(), less<int>());
    printf("%dms\n", (int)((clock()-clc) * 1000 / CLOCKS_PER_SEC));
    return 0;
}
The reason you are observing a difference in performance is that Thrust implements the sort with different algorithms depending on the arguments provided to thrust::sort.
In case 1, Thrust can prove that the sort can be implemented in linear time with a radix sort. This is because the type of the data to sort is a built-in numeric type (int) and the comparison function is the built-in less-than operation: Thrust recognizes that thrust::less<int> produces the same result as x < y.
In case 2, Thrust knows nothing about your user-provided less<int> and has to use a more conservative algorithm based on a comparison sort, which has different asymptotic complexity, even though in truth your less<int> is equivalent to thrust::less<int>.
In general, user-defined comparison operators can't be used with more restrictive, faster sorts that manipulate the binary representation of the data, such as radix sort. In these cases, Thrust falls back on a more general, but slower, sort. To get the performance of case 1 for plain int keys, pass thrust::less<int>() rather than a copy of it.
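To make the dispatch idea concrete, here is a minimal host-only C++ sketch of how a library could pick an algorithm from the comparator's type. It is not Thrust's actual implementation: the names is_builtin_less, my_less, radix_sort, my_sort and Algo are hypothetical, and std::less stands in for thrust::less. The point is only that a trait can recognize one specific comparator type, while a textually identical user-defined functor is a different type and takes the generic path.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <type_traits>
#include <vector>

// Hypothetical trait: true only for comparators known to mean plain `x < y`
// on an arithmetic type. Every other comparator type is treated as opaque.
template <typename Compare, typename T>
struct is_builtin_less : std::false_type {};

template <typename T>
struct is_builtin_less<std::less<T>, T> : std::is_arithmetic<T> {};

// A user-defined comparator with the same body as std::less.
// It is a distinct type, so the trait above does not match it.
template <typename T>
struct my_less {
    bool operator()(const T& lhs, const T& rhs) const { return lhs < rhs; }
};

// LSD radix sort for 32-bit ints: four stable counting passes over 8-bit digits.
// It never calls a comparator; it only inspects the bits of each key.
inline void radix_sort(std::vector<int>& v) {
    std::vector<std::uint32_t> keys(v.size()), tmp(v.size());
    // Flip the sign bit so unsigned order matches signed order.
    for (std::size_t i = 0; i < v.size(); ++i)
        keys[i] = static_cast<std::uint32_t>(v[i]) ^ 0x80000000u;
    for (int shift = 0; shift < 32; shift += 8) {
        std::size_t count[257] = {0};
        for (std::uint32_t k : keys) ++count[((k >> shift) & 0xFFu) + 1];
        for (int b = 0; b < 256; ++b) count[b + 1] += count[b];  // start offsets
        for (std::uint32_t k : keys) tmp[count[(k >> shift) & 0xFFu]++] = k;
        keys.swap(tmp);
    }
    for (std::size_t i = 0; i < v.size(); ++i)
        v[i] = static_cast<int>(keys[i] ^ 0x80000000u);
}

enum class Algo { Radix, Comparison };

// Chosen when the comparator is recognized as the built-in less-than.
template <typename Compare>
Algo sort_impl(std::vector<int>& v, Compare, std::true_type) {
    radix_sort(v);
    return Algo::Radix;
}

// Fallback for any comparator the library cannot reason about.
template <typename Compare>
Algo sort_impl(std::vector<int>& v, Compare comp, std::false_type) {
    std::sort(v.begin(), v.end(), comp);
    return Algo::Comparison;
}

// Entry point: the algorithm is selected at compile time from the type of comp.
template <typename Compare>
Algo my_sort(std::vector<int>& v, Compare comp) {
    Algo used = sort_impl(v, comp, typename is_builtin_less<Compare, int>::type());
    assert(std::is_sorted(v.begin(), v.end(), comp));
    return used;
}
```

With this sketch, my_sort(v, std::less<int>()) takes the radix path and my_sort(v, my_less<int>()) takes the comparison path, even though both comparators order the data identically. That mirrors the behaviour described above for thrust::less<int> versus a copied less<int>.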