Open CL Kernel - every workitem overwrites global

2019-08-02 02:36发布

问题:

I'm trying to write a kernel to get the character frequencies of a string.

First, here is the code I have for kernel right now:

_kernel void readParallel(__global char * indata, __global int * outdata)
{
        int startId = get_global_id(0) * 8;
        int maxId = startId + 7;

        for (int i = startId; i < maxId; i++)
        {
            ++outdata[indata[i]];
        }
 }

The variable inData holds the string in the global memory, and outdata is an array of 256 int values in the global memory. Every workitem reads 8 symbols from the string and should increase the appropriate ASCII-code in the array. The code compiles and executes, but outdata contains less occurrences overall than the number of characters in inData. I think the problem is that workitems overwrites the global memory. It would be nice if you can give me some tips to solve this.

By the way,. I am a rookie in OpenCL ;-) and, yes, I looked for solutions in other questions.

回答1:

You are experiencing the effects of your uses of global memory not being atomic (C++-oriented description of what those are or another description by the Intel TBB folks). What happens, chronologically, is:

Some workgroup "thread" loads outData[123] into some register r1

... lots of work, reading and writing, happens, including on outData[123]...

The same workgroup "thread" increments r1

... lots of work, reading and writing, happens, including on outData[123]...

The same workgroup "thread" writes r1 to outData[123]

So, the value written to outData[123] "throws away" the updates during the time period between the read and the write (I'm ignoring the possibility of parallel writes corrupting each other rather than one of them winning out).

What you need to do is either:

  • Use atomic operations - the least amount of modifications to your code, but very inefficient, since it serializes your work to a great extent, or
  • Use work-item-specific, warp-specific and/or work-group-specific partial results, which require less/cheaper synchronization, and combine them eventually after having done a lot of work on them.

On an unrelated note, and as @huseyintugrulbuyukisik correctly points out, your code uses signed char values to index the array. To fix that, do one of the following:

  • reinterpret those char's as unsigned chars for array indices (and reinterpret back when reading the array.
  • upcast the char values to a larger integral type and add 128 to get an offset into the outArray.
  • Define your kernel to only support ASCII characters (no higher than 127), in which case you can ignore this issue (although that will be a potential crasher if you get invalid input.
  • If you only care about the frequency of printable characters (but can also have non-printing characters in the input), you could perform a run-time check before counting a character.