I should count the number of set bits of a __m128i register. In particular, I should write two functions that are able to count the number of bits of the register, using the following ways.
- The total number of set bits of the register.
- The number of set bits for each byte of the register.
Are there intrinsic functions that can perform, wholly or partially, the above operations?
Here is a version base on Bit Twiddling Hacks - Counting Set Bits in Parallel with naming similar to other intrinsic functions as well as some extra functions for 16 32 and 64 bit vectors
Here are some codes I used in an old project (there is a research paper about it). The function
popcnt8
below computes the number of bits set in each byte.SSE2-only version (based on Algorithm 3 in Hacker's Delight book):
SSSE3 version (due to Wojciech Mula):
XOP version (equivalent to SSSE3, but uses XOP instructions which are faster on AMD Bulldozer)
Function
popcnt64
below counts the number of bits in the low and high 64-bit parts of the SSE register:SSE2 version:
XOP version:
Finally, the function
popcnt128
below count the number of bits in the whole 128-bit register:However, a more efficient way to implement
popcnt128
is to use hardware POPCNT instruction (on processors which support it):Edit: I guess I didn't understand what the OP was looking for, but I am keeping my answer up in case it is useful to anyone else stumbling across this.
C provides some nice bitwise operations.
Here is code to count the number of bits set in an integer:
Explanation:
Returns the last bit in our integer. (By dividing by two and checking the remainder). We add this to our total count, and then shift the bits of our toCount value by one. This operation should be continued until there are no more bits set in toCount (when toCount is equal to 0)
To count the number of bits in a specific byte, you will want to use a mask. Here is an example:
Lets say that in our system, we consider byte 0 the least significant byte in a little endian system. We want to create a new toCount to pass to our earlier countBitsSet function by masking out the bits that are set to 0. We do this by shifting a byte full of ones (denoted by the letter F) to the position we want (byteNumber * 8 for 8 bits in a byte) and performing a bitwise AND operation with our toCount variable.
As said in the first comment, gcc 3.4+ offers an easy access to a (hopefully optimal) built-in via
As stated here: http://gcc.gnu.org/onlinedocs/gcc-3.4.3/gcc/Other-Builtins.html#Other%20Builtins
Does not exactly answer the question for 128bits, but give a nice answer to the question I had when I landed here :)