Thrust - accessing neighbors

2019-08-11 18:28发布

I would like to use Thrust's stream compaction functionality (copy_if) for distilling indices of elements from a vector if the elements adhere to a number of constraints. One of these constraints depends on the values of neighboring elements (8 in 2D and 26 in 3D). My question is: how can I obtain the neighbors of an element in Thrust?

The function call operator of the functor for the 'copy_if' basically looks like:

__host__ __device__ bool operator()(float x) {
    bool mark = x < 0.0f;
    if (mark) {
        if (left neighbor of x > 1.0f) return false;
        if (right neighbor of x > 1.0f) return false;
        if (top neighbor of x > 1.0f) return false;
        //etc.
    }
    return mark;
}

Currently I use a work-around by first launching a CUDA kernel (in which it is easy to access neighbors) to appropriately mark the elements. After that, I pass the marked elements to Thrust's copy_if to distill the indices of the marked elements.


I came across counting_iterator as a sort of substitute for directly using threadIdx and blockIdx to acquire the index of the processed element. I tried the solution below, but when compiling it, it gives me a "/usr/include/cuda/thrust/detail/device/cuda/copy_if.inl(151): Error: Unaligned memory accesses not supported". As far as I know I'm not trying to access memory in an unaligned fashion. Anybody knows what's going on and/or how to fix this?

struct IsEmpty2 {
    float* xi;

    IsEmpty2(float* pXi) { xi = pXi; }

    __host__ __device__ bool operator()(thrust::tuple<float, int> t) {
        bool mark = thrust::get<0>(t) < -0.01f;
        if (mark) {
            int countindex = thrust::get<1>(t);
            if (xi[countindex] > 1.01f) return false;
            //etc.
        }
        return mark;
    }
};


thrust::copy_if(indices.begin(),
                indices.end(),
                thrust::make_zip_iterator(thrust::make_tuple(xi, thrust::counting_iterator<int>())),
                indicesEmptied.begin(),
                IsEmpty2(rawXi));

标签: thrust
1条回答
趁早两清
2楼-- · 2019-08-11 19:03

@phoad: you're right about the shared mem, it struck me after I already posted my reply, subsequently thinking that the cache probably will help me. But you beat me with your quick response. The if-statement however is executed in less than 5% of all cases, so either using shared mem or relying on the cache will probably have negligible impact on performance.

Tuples only support 10 values, so that would mean I would require tuples of tuples for the 26 values in the 3D case. Working with tuples and zip_iterator was already quite cumbersome, so I'll pass for this option (also from a code readability stand point). I tried your suggestion by directly using threadIdx.x etc. in the device function, but Thrust doesn't like that. I seem to be getting some unexplainable results and sometimes I end up with an Thrust error. The following program for example generates a 'thrust::system::system_error' with an 'unspecified launch failure', although it first correctly prints "Processing 10" to "Processing 41":

struct printf_functor {
    __host__ __device__ void operator()(int e) {
        printf("Processing %d\n", threadIdx.x);
    }
};

int main() {
    thrust::device_vector<int> dVec(32);
    for (int i = 0; i < 32; ++i)
        dVec[i] = i + 10;

    thrust::for_each(dVec.begin(), dVec.end(), printf_functor());

    return 0;
}

Same applies to printing blockIdx.x Printing blockDim.x however generates no error. I was hoping for a clean solution, but I guess I am stuck with my current work-around solution.

查看更多
登录 后发表回答