I'm trying to count how many number 1, are in the numbers of an array.
First I have a code in C lenguaje(work ok):
int popcount2(int* array, int len){
int i;
unsigned x;
int result=0;
for (i=0; i<len; i++){
x = array[i];
do{
result+= x & 0x1;
x>>= 1;
} while(x);
}
return result;
}
Now I need to translate the do-while loop into Assembly using 3-6 lines of code. I have write some code but the result is not correct.( I'm new in the assembly world )
int popcount3(int* array, int len){
int i;
unsigned x;
int result=0;
for (i=0; i<len; i++){
x = array[i];
asm(
"ini3: \n"
"adc $0,%[r] \n"
"shr %[x] \n"
"jnz ini3 \n"
: [r]"+r" (result)
: [x] "r" (x) );
}
}
I'm using GCC ( on linux) with an Intel processor.
This example uses a 4 instruction loop:
This example uses a 3 instruction loop, but it would be slower than the 4 instruction loop version on most processors.
This is an alternative non-looping example:
When I needed to create a popcount, I ended up using the 5's and 3's method from the Bit Twiddling Hacks @PaulR mentioned. But if I wanted to do this with a loop, maybe something like this:
adc
is almost always more efficient than branching on CF."=r" (junk)
is a dummy output operand that is in the same register asv
(the"1"
constraint). We're using this to tell the compiler that the asm statement destroys thev
input. We could have used[v] "+r"(v)
to get a read-write operand, but we don't want the C variablev
to be updated.Note that the loop trip-count for this implementation is the position of the highest set bit. (
bsr
, or32 - clz(v)
). @rcgldr's implementation which clears the lowest set bit every iteration will typically be faster when the number of set bits is low but they're not all near the bottom of the integer.You are starting out with a really inefficient algorithm - if you use a better algorithm then you may not need to waste time with assembler. See Hacker's Delight and/or Bit Twiddling Hacks for much more efficient methods.
Note also that newer x86 CPUs have a POPCNT instruction which does all of the above in one instruction (and you can call it via an intrinsic too, so no need for asm).
And finally gcc has a builtin:
__builtin_popcount
, which again does all you need - it will usePOPCNT
on newer CPUs and equivalent asm on older CPUs.The nicest think you can do is using built in
popcount
function as suggested by Paul R, but since you need to write it in assembly, this worked for me:At first two lines you just check the contents of
x
(and go to end if it's zeroJump Zero
). Than you shiftx
one bit to right and:If there's no
CF
set, just go to start (Jump Not Carry
) else increment result and then go to start.And the beautiful think about assembly is that you can do things in so many ways...
Here you again use
SHift Right
instruction, if noCF
is present just go to loop condition.Otherwise again increment result and call binary
AND
(INC
does modifyZF
).Using
LOOP
andECX
I was curious how to do this in 3 instruction (I figured your teacher wouldn't give you bottom limit of 3 if it wasn't possible) and I realized x86 also has
LOOP
instruction:And you can add input argument using GCCs input constrain:
Just to make sure it compiles to proper assembly:
I think the first one should have a bit better performance, especially if there's just 1 bit set, this approach always does
k*8
iterations.SSE4 and single instruction
I know you have to use a loop, but just for fun... With SSE4 extension you could do this by just one instruction
POPCNT
:I think (I have a quite old CPU on my notebook, so I can't test this for you) you should be able to do this with just one simple instruction:
(If you try this and you do have SSE4 extension, please let me know if it works)
Performance
I've measured times required to do 100,000,000 popcounts comparing my first and second methods with David Wohlferd's. [Raw data]
If anyone can compare these 3 with SSE4's
POPCNT
instruction I'd be glad.