可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
So I had an interview question before regarding bit manipulation. The company is a well known GPU company. I had very little background in assembly language (weird despite being a phd student in computer architecture) and as this narrative would indicate, I botched it. The question was a simple:
"Write a fast code that will count the number of 1's in a 32-bit register."
Now I am in the process of studying arm assembly. So naturally I revisited this problem again and have come up with this code just by studying the ISA.
For you arm experts out there, is this correct? Is there a faster way to do this? Being a beginner, I naturally think this is incomplete. The AND instruction in "xx" feels redundant but there is no other way to shift a register in ARM isa...
R1 will contain the number of bits at the end while R2 is the register with bits we want to count. r6 is just a dummy register. Comments are enclosed in ()
MOV R1, #0 (initialize R1 and R6 to zero)
MOV R6, #0
xx: AND R6, R6, R2, LSR #1 (Right shift by 1, right most bit is in carry flag)
ADDCS R1, #1 (Add #1 to R1 if carry flag is set)
CMP R2, #0 (update the status flags if R2 == 0 or not)
BEQ xx (branch back to xx until R2==0)
回答1:
You could use a precomputed look up table and reduce the number of iterations to 2 or 4.
You could also use the logarithmic approach.
For more info see this Wikipedia article.
回答2:
If this code is fast or not depends on the processor. For sure it will be not very fast on Cortex-A8 but may run very fast on Cortex-A9 and newer CPU.
It is however a very short solution.
Expects input in r0, and returns output in r0
vmov.32 d0[0], r0
vcnt.8 d0, d0
vmov.32 r0, d0[0]
add r0, r0, r0, lsr #16
add r0, r0, r0, lsr #8
and r0, r0, #31
The main work is done in the vcnt.8 instruction which counts the bits of each byte in a NEON register and stores the bitcount back into the bytes of D0.
There is no vcnt.32
form, only .8
, so you need to horizontally add the 4 bytes together, which is what the rest of the code is doing.
回答3:
Since this is tagged ARM, the clz
instruction is most helpful. The problem is also described as a population count. gcc
has __builtin_popcount() for this. As does the ARM tools. There is this link (don't feel bad about your solution, some one made a web page with nearly the same) and also there is Dave Seal's version with six instruction for non-clz
ARMs. The clz
is advantageous and can be used to produce a faster algorithm, depending on the input.
As well as auselen's good reading suggestion,Hacker's Delight this bit twiddling blog maybe helpful which is talking about such things in a graphic context. At least I found it useful to understand some of Qt's blitting code. However, it has some usefulness in coding a population count routine.
The carry add
unit is useful in a divide and conquer sense, making the problem O(ln n)
. clz
is more useful if the data has runs of ones or zeros.
The Hacker's Delight entry has more background on Dave Seal's ARM code.
回答4:
Best references for bit hacks are
- Hacker's Delight (print)
- Bit Twiddling Hacks (online)
Bit Twiddling Hacks
page says
The best method for counting bits in a 32-bit
integer v is the following:
v = v - ((v >> 1) & 0x55555555); // reuse input as temporary
v = (v & 0x33333333) + ((v >> 2) & 0x33333333); // temp
c = ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24; // count
Then I would suggest you to use gcc
and objdump
(or this great online gcc tool) to see how this high level snippet would look like as arm instructions.
00000000 <popcount>:
0: 1043 asrs r3, r0, #1
2: f003 3355 and.w r3, r3, #1431655765 ; 0x55555555
6: 1ac0 subs r0, r0, r3
8: 1083 asrs r3, r0, #2
a: f000 3033 and.w r0, r0, #858993459 ; 0x33333333
e: f003 3333 and.w r3, r3, #858993459 ; 0x33333333
12: 18c0 adds r0, r0, r3
14: eb00 1010 add.w r0, r0, r0, lsr #4
18: f000 300f and.w r0, r0, #252645135 ; 0xf0f0f0f
1c: eb00 2000 add.w r0, r0, r0, lsl #8
20: eb00 4000 add.w r0, r0, r0, lsl #16
24: 1600 asrs r0, r0, #24
26: 4770 bx lr
So it looks like this gives you result in 12
instructions, which roughly can translate to same amount of cycles.
Comparing integer twiddling above to look up table
approach used by libgcc, look up table should be even slower considering extra memory accesses.
00000028 <__popcountSI2>:
28: b410 push {r4}
2a: 2200 movs r2, #0
2c: 4c06 ldr r4, [pc, #24] ; (48 <__popcountSI2+0x20>)
2e: 4613 mov r3, r2
30: fa40 f103 asr.w r1, r0, r3
34: 3308 adds r3, #8
36: 2b20 cmp r3, #32
38: b2c9 uxtb r1, r1
3a: 5c61 ldrb r1, [r4, r1]
3c: 440a add r2, r1
3e: d1f7 bne.n 30 <__popcountSI2+0x8>
40: 4610 mov r0, r2
42: bc10 pop {r4}
44: 4770 bx lr
46: bf00 nop
48: 00000000 andeq r0, r0, r0
<.. snipped ..>
回答5:
long count_bits_long (long);
vmov.32 d0[0], r0 // R0 --> SIMD
vcnt.8 d0, d0 // count bits in bytes
vpaddl.u8 d0, d0 // add adjacent pairs of bytes and put into 16b words
vpaddl.u16 d0, d0 // add adjacent pairs of 16b words and put into 32b word
vmov.32 r0, d0[0] // SIMD --> R0
mov pc, lr // return
回答6:
LDR r0, = 0x000000FF;
MOV r1, #0;
MOV r3, #0; this will always be zero
MOV r2,r0;
rep MOVS r2, r2, LSR #1;
ADC r1,r1, r3; this adds r1 with zero plus the carry bit
CMP r2, #0;
BNE rep
This will do it, r3 is just a dummy register with 0 to make ADC work properly.