I am trying to build a kernel to do parallel string search. To this end I intend to use a finite state machine (FSM). The transition table of the FSM is passed in the kernel argument states. The code:
__kernel void Find ( __constant char *text,
                     const int offset,
                     const int tlenght,
                     __constant char *characters,
                     const int clength,
                     const int maxlength,
                     __constant int *states,
                     const int statesdim){
    private char c;
    private int state;
    private const int id = get_global_id(0);
    if (id<(tlenght-maxlength)) {
        private int cIndex,sd,s,k;
        for (int i=0; i<maxlength; i++) {
            c = text[i+offset];
            cIndex = -1;
            for (int j=0; j<clength; j++) {
                if (characters[j]==c) {
                    cIndex = j;
                }
            }
            if (cIndex==-1) {
                state = 0;
                break;
            } else {
                s = states[state+cIndex*statesdim];
            }
            if (state<=0) break;
        }
    }
}
If I compile this kernel using iocgui, I get the result:
Using default instruction set architecture.
Intel OpenCL CPU device was found!
Device name: Pentium(R) Dual-Core CPU T4400 @ 2.20GHz
Device version: OpenCL 1.1 (Build 31360.31426)
Device vendor: Intel(R) Corporation
Device profile: FULL_PROFILE
Build started
Kernel <Find> was successfully vectorized
Done.
Build succeeded!
When I change the line where the new state is determined to:
state = states[state+cIndex*statesdim];
The result is:
Using default instruction set architecture.
Intel OpenCL CPU device was found!
Device name: Pentium(R) Dual-Core CPU T4400 @ 2.20GHz
Device version: OpenCL 1.1 (Build 31360.31426)
Device vendor: Intel(R) Corporation
Device profile: FULL_PROFILE
Build started
Kernel <Find> was not vectorized
Done.
Build succeeded!
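The statement
state = states[state+cIndex*statesdim];
cannot be vectorized, because the index does not necessarily evaluate to accesses to consecutive elements across threads. It depends on each thread's current state and character, so neighbouring threads may read scattered entries of the table.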
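To make the access pattern concrete, here is a small host-side sketch in plain C (not OpenCL, and not code from the question). The helper names table_index, gather_indices and is_consecutive are hypothetical, and the "consecutive run" check is only a simplified stand-in for whatever the Intel vectorizer actually tests. It assumes each simulated work-item holds its own state and character index, then computes the table index each one would read.

```c
#include <stddef.h>

/* Index into the flattened transition table, laid out as in the kernel:
   states[state + cIndex * statesdim]. */
static int table_index(int state, int cIndex, int statesdim) {
    return state + cIndex * statesdim;
}

/* Fills idx with the table index each of n simulated work-items would read.
   state[k] and cIndex[k] are the per-work-item values (an assumption made
   for illustration). */
static void gather_indices(const int *state, const int *cIndex,
                           int statesdim, int *idx, size_t n) {
    for (size_t k = 0; k < n; k++) {
        idx[k] = table_index(state[k], cIndex[k], statesdim);
    }
}

/* Returns 1 if idx[k] == idx[0] + k for every k, i.e. the indices form a
   consecutive ascending run that one packed load could serve; 0 otherwise. */
static int is_consecutive(const int *idx, size_t n) {
    for (size_t k = 1; k < n; k++) {
        if (idx[k] != idx[0] + (int)k) {
            return 0;
        }
    }
    return 1;
}
```

With four work-items in states {0, 2, 1, 0} that have just read characters with indices {1, 0, 3, 2} and statesdim = 4, the indices come out as {4, 2, 13, 8}. That is a scattered read (a gather) rather than one contiguous load.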
Notice that in your first kernel, the destination variable s is never written back to global memory. The compiler may therefore optimize the code and remove the s = states[state+cIndex*statesdim]; statement entirely. So it looks as if your statement has been vectorized, but it has not.
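One way to test this explanation would be to make the result of the lookup observable, so that the compiler cannot discard it. The sketch below is plain host-side C rather than OpenCL, so it can be compiled and run anywhere. The names run_fsm, find_all and result are hypothetical; in a real kernel, result would presumably be an extra __global int * argument indexed by get_global_id(0). It also assumes a start state of 1 and that each work-item begins reading at offset + id, neither of which appears in the original kernel.

```c
/* Runs the FSM over up to maxlength characters of text beginning at start.
   Returns 0 if a character is not in the alphabet; otherwise returns the
   last state read from the table (the loop stops early once that state
   is <= 0). */
static int run_fsm(const char *text, int start, int maxlength,
                   const char *characters, int clength,
                   const int *states, int statesdim) {
    int state = 1; /* assumed start state; the original leaves it uninitialised */
    for (int i = 0; i < maxlength; i++) {
        char c = text[start + i];
        int cIndex = -1;
        for (int j = 0; j < clength; j++) {
            if (characters[j] == c) {
                cIndex = j;
            }
        }
        if (cIndex == -1) {
            return 0;
        }
        state = states[state + cIndex * statesdim];
        if (state <= 0) {
            break;
        }
    }
    return state;
}

/* Host-side stand-in for the kernel: one loop iteration per work-item.
   Storing into result[id] is what makes the lookup observable. */
static void find_all(const char *text, int offset, int tlength,
                     const char *characters, int clength, int maxlength,
                     const int *states, int statesdim, int *result) {
    for (int id = 0; id < tlength - maxlength; id++) {
        result[id] = run_fsm(text, offset + id, maxlength,
                             characters, clength, states, statesdim);
    }
}
```

If the first kernel were changed in the same spirit, storing s into an output buffer, the build log would show whether the vectorizer still accepts it once the lookup can no longer be removed.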