I am given a array of lowercase characters (up to 1.5Gb) and a character c. And I want to find how many occurrences are of the character c using AVX instructions.
unsigned long long char_count_AVX2(char * vector, int size, char c){
unsigned long long sum =0;
int i, j;
const int con=3;
__m256i ans[con];
for(i=0; i<con; i++)
ans[i]=_mm256_setzero_si256();
__m256i Zer=_mm256_setzero_si256();
__m256i C=_mm256_set1_epi8(c);
__m256i Assos=_mm256_set1_epi8(0x01);
__m256i FF=_mm256_set1_epi8(0xFF);
__m256i shield=_mm256_set1_epi8(0xFF);
__m256i temp;
int couter=0;
for(i=0; i<size; i+=32){
couter++;
shield=_mm256_xor_si256(_mm256_cmpeq_epi8(ans[0], Zer), FF);
temp=_mm256_cmpeq_epi8(C, *((__m256i*)(vector+i)));
temp=_mm256_xor_si256(temp, FF);
temp=_mm256_add_epi8(temp, Assos);
ans[0]=_mm256_add_epi8(temp, ans[0]);
for(j=1; j<con; j++){
temp=_mm256_cmpeq_epi8(ans[j-1], Zer);
shield=_mm256_and_si256(shield, temp);
temp=_mm256_xor_si256(shield, FF);
temp=_mm256_add_epi8(temp, Assos);
ans[j]=_mm256_add_epi8(temp, ans[j]);
}
}
for(j=con-1; j>=0; j--){
sum<<=8;
unsigned char *ptr = (unsigned char*)&(ans[j]);
for(i=0; i<32; i++){
sum+=*(ptr+i);
}
}
return sum;
}
Probably the fastest: memcount_avx2 and memcount_sse2
Benchmark skylake i7-6700 - 3.4GHz - gcc 8.3:
memcount_avx2 : 28 GB/s
memcount_sse: 23 GB/s
char_count_AVX2 : 23 GB/s (from post)
If you don't insist on using only SIMD instructions, you can make use
of the VPMOVMSKB instruction in combination with the POPCNT instruction. The former combines the highest bits of each byte into a 32-bit integer mask and the latter counts the
1
bits in this integer (=the count of char matches).I haven't tested this solution, but you should get the gist.
I'm intentionally leaving out some parts, which you need to figure out yourself (e.g. handling lengths that aren't a multiple of
4*255*32
bytes), but your most inner loop should look something like the one starting withfor(int i...)
:_mm256_cmpeq_epi8
will get you a -1 in each byte, which you can use as an integer. If you subtract that from a counter (using_mm256_sub_epi8
) you can directly count up to 255 or 128. The inner loop contains just these two intrinsics. You have to stop andGodbolt link: https://godbolt.org/z/do5e3- (clang is slightly better than gcc at unrolling the most inner loop: gcc includes some useless
vmovdqa
instructions that will bottleneck the front-end if the data is hot in L1d cache, preventing us from running close to 2x 32-byte loads per clock)