I am trying to detect a circle in a binary image using the Hough transform.
When I use OpenCV's built-in function for the circular Hough transform, it works and I can find the circle.
Now I am trying to write my own kernel code for the Hough transform, but it is very slow:
kernel void hough_circle(read_only image2d_t imageIn, global int* in, const int w_hough, __global int* circle)
{
    const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
    int gid0 = get_global_id(0);
    int gid1 = get_global_id(1);
    uint4 pixel;
    int x0 = 0, y0 = 0, r;
    int maxval = 0;

    pixel = read_imageui(imageIn, sampler, (int2)(gid0, gid1));
    if (pixel.x == 255)                                     // edge pixel: cast votes for all candidate circles
    {
        for (int r = 20; r < 150; r += 2)                   // candidate radii
        {
            // int r=100;
            for (int theta = 0; theta < 360; theta += 2)    // candidate angles
            {
                x0 = (int) round(gid0 - r * cos((float) radians((float) theta)));
                y0 = (int) round(gid1 - r * sin((float) radians((float) theta)));
                if ((x0 > 0) && (x0 < get_global_size(0)) && (y0 > 0) && (y0 < get_global_size(1)))
                    atom_inc(&in[w_hough * y0 + x0]);       // vote in the accumulator
            }
            if (maxval < in[w_hough * y0 + x0])             // remember the best vote seen so far
            {
                maxval = in[w_hough * y0 + x0];
                circle[0] = gid0;
                circle[1] = gid1;
                circle[2] = r;
            }
        }
    }
}
There is source code for the Hough OpenCL implementation that ships with OpenCV, but it is hard for me to extract the specific function that would help.
Can anyone offer a better source code example, or help me understand why my kernel is so inefficient? The code (main.cpp and kernel.cl) is compressed in a rar file at http://www.files.com/set/527152684017e; it uses the OpenCV library to read and display the image.
Making repeated calls to sin() and cos() is computationally expensive. Since you only ever call these functions with the same 180 values of theta, you could speed things up by precalculating those values and storing them in an array. A more robust approach would be to use the midpoint circle algorithm, which finds the perimeters of these circles with simple integer arithmetic.
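For example, here is a minimal sketch of the lookup-table idea, assuming the host precomputes the 180 cosines and sines (e.g. cos_table[t] = cosf(2.0f * t * (float)M_PI / 180.0f)) and uploads them with clEnqueueWriteBuffer. The kernel name hough_circle_lut and the cos_table/sin_table buffers are illustrative names, not part of your project:

// Hypothetical sketch: the host precomputes cos/sin for theta = 0, 2, ..., 358 degrees
// and passes them as read-only buffers, so the kernel never calls sin()/cos() itself.
kernel void hough_circle_lut(read_only image2d_t imageIn,
                             global int* in,
                             const int w_hough,
                             constant float* cos_table,   // 180 precomputed cosines
                             constant float* sin_table)   // 180 precomputed sines
{
    const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
    int gid0 = get_global_id(0);
    int gid1 = get_global_id(1);
    uint4 pixel = read_imageui(imageIn, sampler, (int2)(gid0, gid1));

    if (pixel.x == 255)
    {
        for (int r = 20; r < 150; r += 2)
        {
            for (int t = 0; t < 180; t++)                  // t indexes the 2-degree steps
            {
                int x0 = (int) round(gid0 - r * cos_table[t]);
                int y0 = (int) round(gid1 - r * sin_table[t]);
                if (x0 > 0 && x0 < get_global_size(0) && y0 > 0 && y0 < get_global_size(1))
                    atom_inc(&in[w_hough * y0 + x0]);      // vote in the accumulator
            }
        }
    }
}

The midpoint circle algorithm mentioned above would go further and remove the floating-point work entirely, at the cost of a larger rewrite.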
What you are doing is running a huge, CPU-style block of code inside a single work-item; the result, as expected, is a very slow kernel.
Detailed answer: the only place you use the work-item ID is to read the pixel value; if that condition is met, you then run a big chunk of code. Some work-items trigger it and some don't, but the ones that do indirectly force the whole work-group to execute that path, and this slows you down.
In addition, the work-items that never enter the condition sit idle. Depending on the image, maybe 99% of them are idle.
I would rewrite the algorithm to use one work-group per pixel. If the condition is met, the work-group runs the algorithm; if not, the whole work-group skips that pixel. And when a work-group does enter the condition, you have many work-items to play with, which allows the code to be redesigned so that the inner for loops run in parallel.
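A minimal sketch of that restructuring, assuming a launch with local size (64, 1) and global size (width * 64, height) so that each work-group owns one pixel; the kernel name hough_circle_wg and the img_w/img_h parameters are illustrative assumptions, not code from your project:

// Hypothetical sketch: one work-group per pixel.
// Launched with local size (64, 1) and global size (width * 64, height),
// so get_group_id(0)/get_group_id(1) give the pixel coordinates
// and the 64 work-items of each group split the (r, theta) votes between them.
kernel void hough_circle_wg(read_only image2d_t imageIn,
                            global int* accum,
                            const int w_hough,
                            const int img_w,
                            const int img_h)
{
    const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
    int px  = get_group_id(0);       // pixel x handled by this work-group
    int py  = get_group_id(1);       // pixel y handled by this work-group
    int lid = get_local_id(0);       // 0 .. 63
    int lsz = get_local_size(0);     // 64

    uint4 pixel = read_imageui(imageIn, sampler, (int2)(px, py));
    if (pixel.x != 255)
        return;                      // every work-item reads the same pixel, so the whole group exits together

    // 65 radii (20, 22, ..., 148) x 180 angles = 11700 votes, split across the group
    for (int idx = lid; idx < 65 * 180; idx += lsz)
    {
        int r       = 20 + 2 * (idx / 180);
        float theta = radians((float)(2 * (idx % 180)));
        int x0 = (int) round(px - r * cos(theta));
        int y0 = (int) round(py - r * sin(theta));
        if (x0 > 0 && x0 < img_w && y0 > 0 && y0 < img_h)
            atom_inc(&accum[w_hough * y0 + x0]);
    }
}

Note that finding the maximum of the accumulator (what the maxval/circle[] part of your kernel attempts) is better done afterwards, in a separate reduction pass or on the host, once all the votes are in; inside the voting kernel the counters are still being incremented by other work-items, so the values read there are not final.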