I writing a fast "8 bit reverse"-routine for an avr-project with an ATmega2560 processor. I'm using
- GNU C (WinAVR 20100110) version 4.3.3 (avr) / compiled by GNU C version 3.4.5 (mingw-vista special r3), GMP version 4.2.3, MPFR version 2.4.1.
First I created a global lookup-table of reversed bytes (size: 0x100):
uint8_t BitReverseTable[]
__attribute__((__progmem__, aligned(0x100))) = {
0x00,0x80,0x40,0xC0,0x20,0xA0,0x60,0xE0,
0x10,0x90,0x50,0xD0,0x30,0xB0,0x70,0xF0,
[...]
0x1F,0x9F,0x5F,0xDF,0x3F,0xBF,0x7F,0xFF
};
This works as expected. That is the macro I intend to use, which should cost me only 5 cylces:
#define BITREVERSE(x) (__extension__({ \
register uint8_t b=(uint8_t)x; \
__asm__ __volatile__ ( \
"ldi r31, hi8(table)" "\n\t" \
"mov r30, ioRegister" "\n\t" \
"lpm ioRegister, z" "\n\t" \
:[ioRegister] "+r" (b) \
:[table] "g" (BitReverseTable) \
:"r30", "r31" \
); \
}))
The code to get it compiled (or not).
int main() /// Test for bitreverse
{
BITREVERSE(25);
return 0;
}
That's the error I get from the compiler:
c:/winavr-20100110/bin/../lib/gcc/avr/4.3.3/../../../../avr/bin/as.exe -mmcu=atmega2560 -o bitreverse.o C:\Users\xxx\AppData\Local\Temp/ccCefE75.s
C:\Users\xxx\AppData\Local\Temp/ccCefE75.s: Assembler messages:
C:\Users\xxx\AppData\Local\Temp/ccCefE75.s:349: Error: constant value required
C:\Users\xxx\AppData\Local\Temp/ccCefE75.s:350: Error: constant value required
I guess the problem is here:
:[table] "g" (BitReverseTable) \
From my point of view BitReverseTable is the memory position of the array, which is fixed and known at compile time. Therefor it is constant. Maybe I need to cast BitReverseTable into something (i tried anything I could think of). Maybe I need another constraint ("g" was my last test). I'm sure I used anything possible and impossible. I coded an assembler version, which works fine, but instead of being an inline assembly code, this is a proper function which adds another 6 cycles (for call and ret).
Any advice or suggestions are very welcome!
Full source of bitreverse.c on pastebin. Verbose compiler output also on pastebin
The following does seem to work on avr-gcc (GCC) 4.8.2, but it does have a distinct hacky aftertaste to me.
Edited to fix the issues pointed out by the OP (Thomas) in the comments:
Z
register isr31
(I hadr30
andr31
swapped)lpm r,Z
(older AVRs onlylpm r0,Z
)Thanks for the fixes, Thomas! I do have an ATmega2560 board, but I prefer Teensies (in part because of the native USB), so I only compile-tested the code, didn't run it to verify. I should have mentioned that; apologies.
For older AVRs that only support
lpm r0,Z
, useThe idea is that we use a local reg var
r31
, to keep the high byte of theZ
register pair. TheUSING_REVERSE_BITS;
macro defines it in the current scope, using inline assembly for two purposes: to avoid an unnecessary load of the low part of the table address into a register, and to make sure GCC knows we have stored a value into it (because it is an output operand) without having any way of knowing what the value should be, thus hopefully retaining it throughout the scope.The
REVERSE_BITS()
macro yields the result, telling the compiler it needs the argument in registerr30
, and the table address high byte set byUSING_REVERSE_BITS;
inr31
.Sounds a bit complicated, but that's just because I don't know how to explain it better. It really is quite simple.
Compiling the above with
avr-gcc-4.8.2 -O2 -fomit-frame-pointer -mmcu=atmega2560 -S
yields the assembly source. (I do recommend using-O2 -fomit-frame-pointer
.) Omitting comments and the normal directives:In case you are wondering, on ATmega2560 GCC puts the first 8-bit parameter and the 8-bit function result both in register
r24
.The first function is optimal, as far as I can tell. (On older AVRs that only support
lpm r0,Z
, you get an added move to copy the result fromr0
tor24
.)For the second function, the setup part might not be exactly optimal (for one, you could do the
tst r22
breq .L2
first thing to speed up the zero-length-array check), but I'm not sure if I could write a faster/shorter one myself; it's certainly acceptable to me.The loop in the second function looks optimal to me. The way it uses
r30
I found strange and scary at first, but then I realized it makes perfect sense -- fewer registers used, and there is no harm in reusingr30
this way (even if it is low part ofZ
register too), because it will be loaded with a new value fromstring
at the start of the next iteration.Note that in my previous edit, I mentioned that swapping the order of the function parameters yielded better code, but with Thomas's additions, that is no longer the case. The registers change, that's it.
If you are sure you always supply a larger-than-zero
length
, usingyields
which starts to look downright impressive to me: ten cycles per byte, four cycles for setup, and three cycles cleanup (
brne
takes just one cycle if no jump). The cycle counts I listed off the top of my head, so there are likely small errors in 'em (a cycle here or there).r26:r27
isX
, and the first pointer parameter to the function is supplied inr24:r25
, with length inr22
.The
reverse_bits_table
is in the correct section, and correctly aligned. (.p2align 8
does align to 256 bytes; it specifies an alignment where the low 8 bits are zero.)Although GCC is notorious for superfluous register moves, I really like the code it generates above. Sure, there is always room for finessing; for the important code sequences I recommend trying different variants, even changing the order of function parameters (or declaring loop variables in local scopes), and so on, then compile using
-S
to see the generated code. The AVR instruction timings are simple, so it is pretty easy to compare code sequences, to see if one is clearly better. I like to remove the directives and comments first; it makes it easier to read the assembly.The reason for the hacky aftertaste is that the GCC documentation explicitly says that "Defining such a register variable does not reserve the register; it remains available for other uses in places where flow control determines the variable's value is not live", and I just don't trust that this means the same to the GCC developers as it means to me. Even if it did right now, it might not in the future; there is no standard GCC developers ought to adhere to here, since this is a GCC-specific feature.
On the other hand, I do only rely on documented GCC behaviour, above, and although "hacky", it does generate efficient assembly from straightforward C code.
Personally, I would recommend recompiling the above test code, and looking at the generated assembly (perhaps use sed to strip out the comments and labels, and compare to a known good version?), whenever you update avr-gcc.
Questions?