OpenCL select/delete points from large array

Posted 2019-02-25 00:42

I have an array of 2M+ points (planned to be increased to 20M in due course) that I am running calculations on via OpenCL. I'd like to delete any points that fall within a random triangle geometry.

How can I do this within an OpenCL kernel process?

I can already:

  • identify those points that fall outside the triangle (simple point in poly algorithm in the kernel)

  • pass their coordinates to a global output array.

But:

  • An OpenCL global output array cannot be variable-sized, so I initialise it to the same size as the input array of points

  • As a result, (0,0) points occur in the final output wherever a point falls within the triangle

  • The output array is therefore no smaller than the input, so nothing is actually reduced.

Can the (0,0) points be deleted within the OpenCL context?

N.B. I am coding in OpenFrameworks, so the C++ implementation links to .cl files.

Tags: opencl
3 answers
家丑人穷心不美
#2 · 2019-02-25 00:58

There are alternatives, each working better or worse depending on what the data looks like. Here is one.

Deleting the identified points can also be done by registering them in a separate array per work-group. You need the same atomic_inc as in Moises's answer (see my remark there about doing this at work-group level). The end result is a list of start points and end points of the parts that do not need to be deleted. Different threads can then copy those parts of the array. This is less effective if you have clusters of points that need to be deleted.
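The run-list idea described above can be sketched on the CPU in C++, the host language mentioned in the question. This is only a minimal sketch under assumptions, not the answerer's implementation: `Point`, `keepRuns` and `copyRuns` are hypothetical names, and the keep mask is assumed to come from the point-in-triangle test.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

struct Point { float x, y; };  // hypothetical point type

using Run = std::pair<std::size_t, std::size_t>;  // half-open range [start, end)

// Collapse a keep/delete mask into runs of consecutive points to keep.
std::vector<Run> keepRuns(const std::vector<bool>& keep) {
    std::vector<Run> runs;
    std::size_t i = 0;
    while (i < keep.size()) {
        if (!keep[i]) { ++i; continue; }
        std::size_t start = i;
        while (i < keep.size() && keep[i]) ++i;
        runs.emplace_back(start, i);
    }
    return runs;
}

// Copy every run into a densely packed output. Each run's destination offset
// is the total length of the runs before it, so the destination ranges never
// overlap and each run could be copied by a different thread.
std::vector<Point> copyRuns(const std::vector<Point>& in,
                            const std::vector<Run>& runs) {
    std::vector<std::size_t> offset(runs.size());
    std::size_t total = 0;
    for (std::size_t r = 0; r < runs.size(); ++r) {
        offset[r] = total;
        total += runs[r].second - runs[r].first;
    }
    std::vector<Point> out(total);
    for (std::size_t r = 0; r < runs.size(); ++r) {
        std::copy(in.begin() + static_cast<std::ptrdiff_t>(runs[r].first),
                  in.begin() + static_cast<std::ptrdiff_t>(runs[r].second),
                  out.begin() + static_cast<std::ptrdiff_t>(offset[r]));
    }
    return out;
}
```

Calling `copyRuns(points, keepRuns(mask))` returns only the kept points, in their original order.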

甜甜的少女心
#3 · 2019-02-25 01:00

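If I understand your problem correctly, you can do the following.

In your kernel, identify the points in the triangle and then:

// Assumes float2 points; number_of_elems is a __global int* that the host sets to 0 first
if (any(element[idx] != (float2)(0.0f, 0.0f)))
      output_array[atomic_inc(number_of_elems)] = element[idx];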

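Afterwards, on the host, the first number_of_elems entries of output_array hold the points that passed the test. Their order is not deterministic, because the work-items reach the atomic in arbitrary order.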
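The atomic-counter compaction can be tried out on the CPU first. The following C++ is a minimal sketch under assumptions, not the kernel itself: `Point`, `insideTriangle` and `compactOutside` are hypothetical names, the cross-product sign test stands in for whatever point-in-polygon routine the kernel already uses, and a sequential loop plays the role of the work-items.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

struct Point { float x, y; };  // hypothetical point type

// Sign tells on which side of the edge a->b the point p lies.
static float edgeSide(Point a, Point b, Point p) {
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

// True if p lies inside or on the triangle (a, b, c), for either winding.
bool insideTriangle(Point a, Point b, Point c, Point p) {
    float d0 = edgeSide(a, b, p);
    float d1 = edgeSide(b, c, p);
    float d2 = edgeSide(c, a, p);
    bool hasNeg = d0 < 0 || d1 < 0 || d2 < 0;
    bool hasPos = d0 > 0 || d1 > 0 || d2 > 0;
    return !(hasNeg && hasPos);
}

// Keep only the points outside the triangle; returns how many were kept.
std::size_t compactOutside(const std::vector<Point>& in, std::vector<Point>& out,
                           Point a, Point b, Point c) {
    out.resize(in.size());       // worst case: nothing is deleted
    std::atomic<int> count{0};   // plays the role of number_of_elems
    for (std::size_t idx = 0; idx < in.size(); ++idx) {  // one iteration per work-item
        if (!insideTriangle(a, b, c, in[idx])) {
            // fetch_add returns the old value, like atomic_inc in OpenCL
            out[static_cast<std::size_t>(count.fetch_add(1))] = in[idx];
        }
    }
    out.resize(static_cast<std::size_t>(count.load()));
    return out.size();
}
```

Testing each point against the triangle directly, rather than comparing it with (0,0) afterwards, also avoids dropping a genuine input point that happens to sit at the origin.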

I hope this helps. Best

神经病院院长
#4 · 2019-02-25 01:07

An alternative for the case where most of the points satisfy the condition that guards the atomic:

You can keep a local counter and increment it with a local atomic. To merge that counter into the global value, use atomic_add(), which returns the "previous" global value. You then copy the entries to that offset and upwards.

This should be a noticeable speed-up, since the threads synchronize locally and touch the global counter only once per work-group. The global copy can run in parallel because the address ranges never overlap.

For example:

__kernel void mykernel(__global MyType * global_out, __global int * global_count, __global MyType * global_in){
   int lid = get_local_id(0);
   int lws = get_local_size(0);
   int idx = get_global_id(0);

   __local int local_count;   //Number of elements kept by this work-group
   __local int global_val;    //Start offset of this work-group in global_out
   //I am using a local container, but a local array of pointers to global is possible as well
   __local MyType local_out[WG_SIZE]; //Ensure WG_SIZE is at least your work-group size
   if(lid==0){
      local_count = 0; global_val = -1;
   }
   barrier(CLK_LOCAL_MEM_FENCE);

   //Classify them
   if(global_in[idx] == ....)
       local_out[atomic_inc(&local_count)] = global_in[idx];

   barrier(CLK_LOCAL_MEM_FENCE);

   //If nothing was kept, we are done (local_count is the same for the whole work-group)
   if(local_count > 0){
      //Only the first local ID does the atomic to global
      if(lid == 0)
         global_val = atomic_add(global_count,local_count);

      //Resync all the local workers here
      barrier(CLK_LOCAL_MEM_FENCE);

      //Copy all the data, each work-item taking every lws-th element
      for(int i=lid; i<local_count; i+=lws)
          global_out[global_val+i] = local_out[i];
   }
}

NOTE: I have not compiled this, but it should more or less work.
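The two-level reservation can be checked on the CPU before running the kernel. The C++ below is a minimal sketch under assumptions: `compactByGroup` is a hypothetical name, each chunk of `wgSize` inputs stands in for one work-group, and the `keep` predicate stands in for the classification test elided in the kernel.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Emulates the kernel on the CPU, one chunk of wgSize inputs per work-group.
template <typename T, typename Pred>
std::vector<T> compactByGroup(const std::vector<T>& in, std::size_t wgSize, Pred keep) {
    assert(wgSize > 0);
    std::vector<T> out(in.size());               // worst case: everything is kept
    std::atomic<std::size_t> globalCount{0};     // global_count in the kernel
    std::vector<T> localOut(wgSize);             // local_out in the kernel

    for (std::size_t base = 0; base < in.size(); base += wgSize) {
        std::size_t end = std::min(base + wgSize, in.size());

        // Classification step: gather this group's survivors locally.
        std::size_t localCount = 0;              // local_count in the kernel
        for (std::size_t i = base; i < end; ++i) {
            if (keep(in[i])) localOut[localCount++] = in[i];
        }
        if (localCount == 0) continue;           // nothing to publish

        // One atomic per group reserves a contiguous block in the output.
        std::size_t globalVal = globalCount.fetch_add(localCount);

        // In the kernel the work-items share this copy with a stride of lws.
        std::copy(localOut.begin(),
                  localOut.begin() + static_cast<std::ptrdiff_t>(localCount),
                  out.begin() + static_cast<std::ptrdiff_t>(globalVal));
    }
    out.resize(globalCount.load());
    return out;
}
```

In this sequential emulation the groups publish in order, so the output keeps the input order. On a device the groups reach atomic_add in arbitrary order, so only the order within each group's block can be relied on.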
