Is there an efficient way to take a sorted key/value array pair and ensure that each key has an equal number of elements using the CUDA Thrust library?
For instance, assume we have the following pair of arrays:
ID: 1 2 2 3 3 3
VN: 6 7 8 5 7 8
If we want to have two of each key appear, this would be the result:
ID: 2 2 3 3
VN: 7 8 5 7
The actual arrays will be much larger, containing millions of elements or more. I'm able to do this using nested for-loops easily, but I'm interested in knowing whether or not there's a more efficient way to convert the arrays using a GPU. Thrust seems as though it may be useful, but I don't see any obvious functions to use.
Thank you for your help!
Caveat: If this is the only operation you plan to do on the GPU, I would not recommend it. The cost to copy the data to/from the GPU will likely outweigh any possible efficiency/performance benefit from using the GPU.
EDIT: based on the comments that the sequence threshold is likely to be much longer than 2, I'll suggest an alternate method (method 2) that should be more efficient than a for-loop or brute-force method (method 1).
In general I would place this problem in a category called stream compaction. Stream compaction generally refers to taking a sequence of data and reducing it to a smaller sequence of data.
If we look in the thrust stream compaction area, an algorithm that could be made to work for this problem is thrust::copy_if() (in particular, for convenience, the version that takes a stencil array).
method 1:
To think about this problem in parallel, we must ask ourselves under what condition should a given element be copied from the input to the output? If we can formalize this logic, we can construct a thrust functor which we can pass to thrust::copy_if
to instruct it as to which elements to copy.
For a given element, for the sequence length = 2 case, we can construct a complete logic if we know:
- the element
- the element one place to the right
- the element one place to the left
- the element two places to the left
Based on the above, we will need to come up with "special case" logic for those elements for which any of the items 2,3, or 4 above are undefined.
Ignoring the special cases, if we know the above 4 items, then we can construct the necessary logic as follows:
If the element to my left is the same as me, but the element two places to the left is different, then I belong in the output
If the element to my left is different than me, but the element to my right is the same as me, I belong in the output
Otherwise, I don't belong in the output
I'll leave it to you to construct the necessary logic for the special cases. (Or reverse-engineer it from the code I've provided).
method 2:
For long sequences, method 1 or a for-loop variant of the logic in method 1 will generate at least 1 read of the data set per element of the sequence length. For a long sequence (e.g. 2000) this will be inefficient. Therefore another possible approach would be as follows:
Generate an exclusive_scan_by_key in both forward and reverse directions, using the ID values as the key, and a thrust::constant_iterator
(value=1) as the values for the scan. For the given data set, that creates intermediate results like this:
ID: 1 2 2 3 3 3
VN: 6 7 8 5 7 8
FS: 0 0 1 0 1 2
RS: 0 1 0 2 1 0
where FS and RS are the results of the forward and reverse scan-by-key. We generate the reverse scan (RS) using .rbegin()
and .rend()
reverse iterators. Note that this has to be done both for the reverse scan input and output, in order to generate the RS sequence as above.
- The logic for our
thrust::copy_if
functor then becomes fairly simple. For a given element, if the sum of the RS and FS value for that element is greater than or equal to the desired minimum sequence length (-1 to account for exclusive scan operation) and the FS value is less than the desired minimum sequence length, then that element belongs in the output.
Here's a fully worked example of both methods, using the given data, for sequence length 2:
$ cat t1095.cu
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <iostream>
#include <thrust/scan.h>
#include <thrust/iterator/constant_iterator.h>
struct copy_func
{
int *d;
int dsize, r, l, m, l2;
copy_func(int *_d, int _dsize) : d(_d),dsize(_dsize) {};
__host__ __device__
bool operator()(int idx)
{
m = d[idx];
// handle typical case
// this logic could be replaced by a for-loop for sequences of arbitrary length
if ((idx > 1) && (idx < dsize-1)){
r = d[idx+1];
l = d[idx-1];
l2 = d[idx-2];
if ((r == m) && (m != l)) return true;
if ((l == m) && (m != l2)) return true;
return false;}
// handle special cases
if (idx == 0){
r = d[idx+1];
return (r == m);}
if (idx == 1){
r = d[idx+1];
l = d[idx-1];
if (l == m) return true;
else if (r == m) return true;
return false;}
if (idx == dsize-1){
l = d[idx-1];
l2 = d[idx-2];
if ((m == l) && (m != l2)) return true;
return false;}
// could put assert(0) here, should never get here
return false;
}
};
struct copy_func2
{
int thresh;
copy_func2(int _thresh) : thresh(_thresh) {};
template <typename T>
__host__ __device__
bool operator()(T t){
return (((thrust::get<0>(t) + thrust::get<1>(t))>=(thresh-1)) && (thrust::get<0>(t) < thresh));
}
};
int main(){
const int length_threshold = 2;
int ID[] = {1,2,2,3,3,3};
int VN[] = {6,7,8,5,7,8};
int dsize = sizeof(ID)/sizeof(int);
// we assume dsize > 3
thrust::device_vector<int> id(ID, ID+dsize);
thrust::device_vector<int> vn(VN, VN+dsize);
thrust::device_vector<int> res_id(dsize);
thrust::device_vector<int> res_vn(dsize);
thrust::counting_iterator<int> idx(0);
//method 1: sequence length threshold of 2
int rsize = thrust::copy_if(thrust::make_zip_iterator(thrust::make_tuple(id.begin(), vn.begin())), thrust::make_zip_iterator(thrust::make_tuple(id.end(), vn.end())), idx, thrust::make_zip_iterator(thrust::make_tuple(res_id.begin(), res_vn.begin())), copy_func(thrust::raw_pointer_cast(id.data()), dsize)) - thrust::make_zip_iterator(thrust::make_tuple(res_id.begin(), res_vn.begin()));
std::cout << "ID: ";
thrust::copy_n(res_id.begin(), rsize, std::ostream_iterator<int>(std::cout, " "));
std::cout << std::endl << "VN: ";
thrust::copy_n(res_vn.begin(), rsize, std::ostream_iterator<int>(std::cout, " "));
std::cout << std::endl;
//method 2: for arbitrary sequence length threshold
thrust::device_vector<int> res_fs(dsize);
thrust::device_vector<int> res_rs(dsize);
thrust::exclusive_scan_by_key(id.begin(), id.end(), thrust::constant_iterator<int>(1), res_fs.begin());
thrust::exclusive_scan_by_key(id.rbegin(), id.rend(), thrust::constant_iterator<int>(1), res_rs.begin());
rsize = thrust::copy_if(thrust::make_zip_iterator(thrust::make_tuple(id.begin(), vn.begin())), thrust::make_zip_iterator(thrust::make_tuple(id.end(), vn.end())), thrust::make_zip_iterator(thrust::make_tuple(res_fs.begin(), res_rs.rbegin())), thrust::make_zip_iterator(thrust::make_tuple(res_id.begin(), res_vn.begin())), copy_func2(length_threshold)) - thrust::make_zip_iterator(thrust::make_tuple(res_id.begin(), res_vn.begin()));
std::cout << "ID: ";
thrust::copy_n(res_id.begin(), rsize, std::ostream_iterator<int>(std::cout, " "));
std::cout << std::endl << "VN: ";
thrust::copy_n(res_vn.begin(), rsize, std::ostream_iterator<int>(std::cout, " "));
std::cout << std::endl;
return 0;
}
$ nvcc -o t1095 t1095.cu
$ ./t1095
ID: 2 2 3 3
VN: 7 8 5 7
ID: 2 2 3 3
VN: 7 8 5 7
Notes:
the copy_func
implements the test logic for a given element for method 1. It receives the index of that element (via the stencil) as well as a pointer to the ID
data on the device, and the size of the data, via functor initialization parameters. The variables r
, m
, l
, and l2
refer to the element to my right, myself, the element to my left, and the element two places to my left, respectively.
we are passing a pointer to the ID
data to the functor. This allows the functor to retrieve the (up to) 4 necessary elements for the test logic. This avoids a messy construction of a thrust::zip_iterator to provide all these values. Note that the reads of these elements in the functor should coalesce nicely, and therefore be fairly efficient, and also benefit from the cache.
I don't claim that this is defect-free. I think I got the test logic right, but it's possible I didn't. You should verify the logical correctness of that portion of the code, at least. My purpose is not to give you a black-box piece of code, but to demonstrate how to think your way through the problem.
This approach may get cumbersome for key sequences longer than 2. In that case I would suggest method 2. (If you already have a sequential for-loop that implements the necessary logic, you may be able to drop a modified version of that into the method 1 functor for longer key sequences. Such a for-loop should probably still benefit from coalesced access and adjacent accesses from the cache.)