This answer ended up a lot longer than I was planning, kind of a tutorial on writing efficient asm. i.e. how to make a simple problem complicated.
In case anyone's interested in a code-review of the attempted implementation, and a version with a lot of asm tricks:
There's so many small ways this could be better, e.g. keep 5
in bh
and 3
in bl
. You don't always have to use div bl
. AMD64 has 20 one-byte registers. (al/ah, bl/bh, cl/ch, dl/dh (no REX), and sil, dil, ... r15b (REX required)).
Using a 16bit counter is at least a waste of bytes (operand-size prefixes), and can cause slowdowns. Using mov reg,0
is bad. Put the the conditional branch at the bottom of the loop whenever possible.
mov rax, 1
is a waste of instruction bytes compared to mov eax, 1
, and this is tagged yasm, which doesn't optimize that for you at assemble time. (nasm
itself does.) Setting up 64bit registers and then using the int 0x80
32bit compatibility ABI is even more silly.
Storing the 16bit counter into memory is silly in the first place, but storing it to an address where you've only reserved one byte is asking for trouble.
Besides the small stuff, FizzBuzz(3,5)
is small enough to unroll and avoid some div
s entirely. With assembler macros, you could easily produce a fully-unrolled loop with LCM(fizz,buzz) outputs per loop (i.e. 15 in this case); enough for the pattern to repeat so you don't need any conditionals.
You can avoid div
without unrolling by using down-counters to detect count%5==0
and count%3==0
. @anatolyg's 16bit DOS code-golf FizzBuzz does that. This is a really common technique to do something every N times something else happens. e.g. performance counter events work this way.
Here's my attempt at an efficient FizzBuzz (for AMD64 Linux), using no libraries. only write(2)
and exit_group(2)
There's no compiler, so if you want good code, you have to write it yourself. You can't just hope the compiler will do something good with i%3
in a loop (which it doesn't anyway, for most compilers).
The code evolved a lot as I was writing it. As usual, starting to implement one way gives you better ideas when you see that your first idea takes more or slower instructions than you hoped.
I unrolled by 3 (Fizz) to remove all checks of counter%3
. I handled counter%5
checks with a countdown from 5 instead of division. This still takes a fair bit of logic that would go away with a full unroll to the point where the pattern repeats (LCM(3,5)). The integer to ASCII digits code could be in a function, or inlined into the unrolled loop for very bloated code.
I keep everything in registers (including the constants fizz\n
and buzz\n
). There are no loads, and stores only into the buffer. Many of the registers are set once outside the loop, rather than with a mov
-immediate right before use. This requires good comments to keep track of what you put where.
I append characters to a buffer which we write(2)
after every fizzbuzz\n
line. This is the longest cycle that occurs naturally in the logic of the program, and means we only need the syscall
code in one place.
In a real program that might write to a file or pipe, it would be better to use C stdio's strategy of using a much larger buffer in that case. (Many ~100 byte writes are much worse than fewer 4096B writes.) Still, I thought this was an interesting choice between the traditional printf every iteration or accumulating the entire string into one big buffer. I used a static buffer instead of reserving stack space because I'm writing a whole program, not a function that should avoid wasting memory after returning. Also, it let me use 32bit operand-size for pointer increments, to save code bytes (REX prefixes).
It would be pretty easy to accumulate multiple blocks, until you get to a point where the next group might go past the end of the buffer. i.e. compare the current position against buffer_end - BUZZMOD*FIZZMOD*9
. Optimizing I/O system calls is obviously a broad topic, and this version is sufficient to demonstrate accumulating a string in a buffer.
; for (count=1..100):
; if(count%3 == 0) { print_fizz(); }
; if(count%5 == 0) { print_buzz(); } else {
; if(count%3 && count%5) print(count);
;; }
; print(newline)
; We don't need pointers to these strings at all; The strings are immediate data for a couple mov instructions
;SECTION .rodata ; put constants in .rodata.
; fizz: db "fizz" ; No idea what the trailing 4 was for
; buzz: db "buzz"
FIZZMOD equ 3 ; only 3 works, but it would be easy to use a loop
BUZZMOD equ 5 ; any value works
LASTCOUNT equ 100 ; max 100: we only handle two decimal digits.
; TODO: cleanup that can handle LASTCOUNT%FIZZMOD != 1 and LASTCOUNT%BUZZMOD != 0
SECTION .bss
;;; generate a string in this buffer. (flush it with write(2) on "fizzbuzz" lines)
; buf: resb 4096
buf: resb FIZZMOD * BUZZMOD * 9 ; (worst case: every line is "fizzbuzz\n")
SECTION .text
global _start
_start:
; args for write(2). (syscall clobbers rcx/r11, and rax with the return value)
mov edi, 1 ; STDOUT_FILENO. also happens to be __NR_write in the AMD64 Linux ABI
mov esi, buf ; static data lives in the low 2G of address space, so we don't need a 64bit mov
;; edx = count. ; calculated each iteration
;; mov eax, edi ; also needed every time. saves 3B vs mov eax, imm32
; 'fizz' is only used once, so we could just store with an immediate there. That wouldn't micro-fuse, and we'd have to do the newline separately
mov r10b, 10 ; base 10
;;mov r14d, BUZZMOD ; not needed, we don't div for this
mov r12, 'fizz' | 10<<32 ; `fizz\n`, but YASM doesn't support NASM's backquotes for \-escapes
mov r13, 'buzz' | 10<<32 ; `buzz\n`. When buzz appears, it's always the end of a line
;;;;;;;; Set up for first iteration
mov ebp, BUZZMOD ; detect count%BUZZMOD == 0 with a down-counter instead of dividing
mov ebx, 1 ; counter starts at 1
mov edx, esi ; current output position = front of buf
ALIGN 16
main_loop:
;; TODO: loop FIZZMOD-1 times inside buzz_or_number, or here
;; It doesn't make much sense to unroll this loop but not inline buzz_or_number :/
call buzz_or_number
inc ebx
call buzz_or_number
add ebx, 2 ; counter is never printed on Fizz iterations, so just set up for next main_loop
;; Fizz, and maybe also Buzz
mov qword [rdx], r12 ; Fizz with a newline
add edx, 5 ; TODO: move this after the branch; adjust the offsets in .fizzbuzz
dec ebp
jz .fizzbuzz
;;.done_buzz: ; .fizzbuzz duplicates the main_loop branch instead of jumping back here
cmp ebx, LASTCOUNT-FIZZMOD
jbe main_loop
;;;;;;;;;; END OF main_loop
.cleanup:
;;;;;;;;;;;;;;;;;;;;; Cleanup after the loop
; hard-code the fact that 100 % FIZZMOD = 1 more line to print,
; and that 100 % BUZZMOD = 0, so the line is "buzz\n"
mov eax, edi ; __NR_write
mov [rdx], r13 ; the final "buzz\n".
sub edx, buf - 5 ; write_count = current_pos+5 - buf.
syscall ; write(1, buf, p - buf).
;; if buf isn't static, then use add edx, 5 / sub edx, esi
xor edi, edi
mov eax, 231 ; exit_group(0). same as eax=60: exit() for a single-threaded program
syscall
;;;;; The fizzbuzz case from the loop
.fizzbuzz:
;; count%BUZZMOD == 0: rdx points after the \n at the end of fizz\n, which we need to overwrite
;; this is a macro so we can use it in buzz_or_number, too, where we don't need to back up and overwrite a \n
%macro BUZZ_HIT 1
mov [rdx - %1], r13 ; buzz\n. Next line will overwrite the last 3 bytes of the 64b store.
add edx, 5 - %1
mov ebp, BUZZMOD ; reset the count%BUZZMOD down-counter
%endmacro
BUZZ_HIT 1 ; arg=1 to back up and overwrite the \n from "fizz\n"
sub edx, esi ; write_count = current_pos - buf
mov eax, edi ; __NR_write
syscall ; write(1, buf, p - buf). clobbers only rax (return value), and rcx,r11
mov edx, esi ; restart at the front of the buffer
;;; tail-duplication of the main loop, instead of jmp back to the cmp/jbe
;;; could just be a jmp main_loop, if we check at assemble time that LASTCOUNT % FIZZMOD != 0 || LASTCOUNT % BUZZMOD != 0
cmp ebx, LASTCOUNT-FIZZMOD
jbe main_loop
jmp .cleanup
;;;;;;;;;;;;;;;;;;;;;;; buzz_or_number: called for non-fizz cases
; special calling convention: uses (without clobbering) the same regs as the loop
;; modifies: BUZZMOD down-counter, output position pointer
;; clobbers: rax, rcx
ALIGN 32
buzz_or_number:
dec ebp
jnz .no_buzz ; could make this part of the macro, but flow-control inside macros is probably worse than duplication
;; count%BUZZMOD == 0: append "buzz\n" to the buffer and reset the down-counter
BUZZ_HIT 0 ; back up 0 bytes before appending
ret
.no_buzz: ;; get count as a 1 or 2-digit ASCII number
;; assert(ebx < 10); We don't handle 3-digit numbers
mov eax, ebx
div r10b ; al = count/10 (first (high) decimal digit), ah = count%10 (second (low) decimal digit).
;; x86 is little-endian, so this is in printing-order already for storing eax
;movzx eax, ax ; avoid partial-reg stalls on pre-Haswell
;; convert integer digits to ASCII by adding '0' to al and ah at the same time, and set the 3rd byte to `\n`.
cmp ebx, 9 ; compare against the original counter instead of the div result, for more ILP and earlier detection of branch misprediction
jbe .1digit ; most numbers from 1..100 are 2-digit, so make this the not-taken case
add eax, 0x0a3030 ;; `00\n`: converts 2 integer digits -> ASCII
;; eax now holds the number + newline as a 3-byte ASCII string
mov [rdx], eax
add edx, 3
ret
.1digit:
;; Could use a 16bit operand-size here to avoid partial-reg stalls, but an imm16 would LCP-stall on Intel.
shr eax, 8 ; Shift out the leading 0
add eax, 0x000a30 ;; 1-digit numbers
;; eax now holds the number + newline as a 2-byte ASCII string
mov [rdx], ax
add edx, 2
ret
This is how it runs:
$ strace ./fizzbuzz > /dev/null
execve("./fizzbuzz", ["./fizzbuzz"], [/* 69 vars */]) = 0
write(1, "1\n2\nfizz\n4\nbuzz\nfizz\n7\n8\nfizz\nbu"..., 58) = 58
write(1, "16\n17\nfizz\n19\nbuzz\nfizz\n22\n23\nfi"..., 63) = 63
write(1, "31\n32\nfizz\n34\nbuzz\nfizz\n37\n38\nfi"..., 63) = 63
write(1, "46\n47\nfizz\n49\nbuzz\nfizz\n52\n53\nfi"..., 63) = 63
write(1, "61\n62\nfizz\n64\nbuzz\nfizz\n67\n68\nfi"..., 63) = 63
write(1, "76\n77\nfizz\n79\nbuzz\nfizz\n82\n83\nfi"..., 63) = 63
write(1, "91\n92\nfizz\n94\nbuzz\nfizz\n97\n98\nfi"..., 40) = 40
exit_group(0) = ?
Correctness check:
./fizzbuzz | diff - <(perl -E'say((fizz)[$_%3].(buzz)[$_%5]or$_)for+1..100')
# no output = no difference
Unrolling over Buzz (5) and using a down-counter for Fizz would probably be worse. My version has a 64bit store of fizz\n\0\0\0
and then a branch to decide whether to store buzz\n\0\0\0
overlapping to produce fizzbuzz\n
. The other way would have a branch to decide whether to store fizz
(no newline needed, so it can be a 32bit store). Then it would unconditionally store buzz\n\0\0\0
. However, with FIZZMOD being smaller than BUZZMOD, it means more frequent resets of the down-counter, and more checks to see if we need to print a number instead of a string this iteration. Having every third line known to be fizz\n
or fizzbuzz\n
means the simpler code for that runs more often.
If overlapping stores are a problem, this whole algorithm is screwed, and this is just one of many. Also, we could just branch before storing fizz\n
and adding 5. Then in the fizzbuzz\n
case, we do two stores and add 9. This also separates the dec/jcc from the cmp/jcc at the bottom of main_loop
, so they can hopefully both macro-fuse on pre-Haswell. IIRC, some CPUs have branch predictors that really don't like multiple branches really close to each other.
Further improvements, left as an exercise for the reader:
Inline buzz_or_number
, and maybe turn it into a loop (FIZZMOD-1 iterations)
Besides that, it could probably branch less, and other minor improvements. This is sort of a version 1.1: working, tested, with some comments and observations added while writing up this answer, but not actually improving the code much from what I initially decided was good enough to see if it worked.
Make it more flexible by writing a cleanup loop (or assembler macros) for the last LASTCOUNT % FIZZMOD
lines, instead of assuming that it's 1 line. Cleanup code is the downside to unrolling.
I used a div
by 10 to convert the counter to a string. A better implementation would use a multiplicative inverse, like compilers generate for small constant divisors (implemented in this case with LEA).
Another strategy worth considering is a strength-reduction to incrementing a sequence of ASCII digits (stored in a register). This technique would be harder to extend to numbers with more digits. Storing them in printing order (most significant digit in the low byte) makes carry between digits work against us instead of for us. (e.g. if they were in natural order, you could add eax, 256-10
to correct the low digit and increment the high digit via carry.) It might be worth it to keep it this way, but BSWAP to store. Keeping a \n
embedded in the register so it only takes one store might not be worth it. Detecting and handling a 1 digit number becoming a 2 digit number is bad enough.
In 32bit mode, we could maybe use the AAA
instruction to do decimal carry after incrementing. However, despite the mnemonic, it works on BCD (0-9
), not ASCII ('0'-'9'
), and doesn't appear to make it easy to propagate the carry to a 3rd digit. No wonder AMD removed it for AMD64. It checks the AF
flag to detect carry out of the low 4 bits, but that's only useful for DAA
, where you have two BCD digits packed into a single byte, and when you're adding unknown values, not incrementing. In this case, you just check al >= 10
.
My first version of this almost worked the first time (after fixing a couple syntax errors so it would assemble, and a silly crash that took a couple minutes to debug IIRC): It printed fizz\nbuzz\n
in the fizzbuzz\n
case, and it reversed the digits. I keep forgetting that digit-strings need to be stored with the most-significant digit first, not like the bytes in a little-endian binary integer.
Alternate ways to do things
I decided not to use a branchless version of the 1 digit vs. 2 digit ASCII conversion code, since it took a lot of instructions. Also, the branch should predict very well.
;; Untested
buzz_or_number:
...
.no_buzz:
...
div r10b
DECIMAL_TO_ASCII_NEWLINE_2DIGIT equ 0x0a3030 ; add '0' to two unpacked decimal digits, and a newline
DECIMAL_TO_ASCII_NEWLINE_1DIGIT equ 0x000a30
;; hoist this out of the loop: mov r15d, DECIMAL_TO_ASCII_NEWLINE_2DIGIT - DECIMAL_TO_ASCII_NEWLINE_1DIGIT
xor ecx,ecx
cmp ah, 1 ; set CF if ah=0 (1 digit number), otherwise clear it. This allows sbb for a conditional add, instead of setcc
cmovae ecx, r15d ; 0 or the difference from 1digit to 2digit
lea eax, [rax+rcx + DECIMAL_TO_ASCII_NEWLINE_1DIGIT] ; rax+=0x0a3030 or 0x000a30, without clobbering flags
mov [rdx], eax
sbb edx, -3 ; add 2 (-(-3) - 1) or 3.
ret
In 32bit (and 16bit) mode, there's a div
instruction that takes an immediate operand, and uses AL
as the dividend, not AX
. It's called AAM
, and was removed for AMD64 along with the other BCD/ASCII instructions. It's convenient for testing divisibility by 5 without tying up a register for the divisor or wasting an instruction inside the loop. It's slightly faster than div r/m8
, and sets flags according to the remainder (in al
: it has its outputs reversed, compared to div
).
Anatolyg's golfed FizzBuzz uses AAM in a loop with shr ax, 8
to generate one digit at a time in reverse order, storing and decrementing a pointer.
This version is way more complicated because it's using AAM
to check for count%5 and then processing that into count%10, instead of doing a separate division to get ASCII digits.
;; Untested
buzz_or_number_div:
mov eax, ebx
aam 5 ; al = al%5 ah = al/5. (opposite locations from div), and sets flags according to the remainder.
jz print_buzz ; tailcall
; fall through into print_counter
;print_counter:
; maybe use the result of div by 5 to get division by 10?
; shifting the low bit of the quotient into bit 4 of the remainder should be faster than dividing again.
;; after AAM: ah = 5bit quotient (qqqqQ), al = 3bit remainder(RRR)
;; starting point: ; AX = [ 000qqqqQ 00000RRR ]
;; desired = byte swapped as well: [ 0000QRRR 0000qqqq ]
shl al, 5 ; AX = [ 000qqqqQ RRR00000 ]
shr ax, 1 ; AX = [ 0000qqqq QRRR0000 ]
ror ax, 8 ; AX = [ QRRR0000 0000qqqq ] ; simple byte-swap
shr ah, 4 ; AX = [ 0000QRRR 0000qqqq ]
add eax, ...; convert to ascii
...
ret
; those instructions are all single-uop 1c latency on SnB-family, but pre-Haswell will insert extra merging uops. (And stall while doing so, on pre-SnB).
; and there's another partial-reg stall when we read eax
; It might be possible to do this bit manipulation with fewer operations, or maybe different ones. (maybe copy ax to cx, so we can move from cl or ch to al or ah?)
; shr ah, 1 ; AX = [ 0000qqqq 00000RRR ] CF=Q ; then what? setc/shift/or? rcl is slow, too.
; rorx eax, eax, 32-4 ; AX = [ qqqq0000 0RRR0000 ] CF=Q
; nope, seems a dead end
; shl ah, 3 ; AX = [ qqqqQ000 00000RRR ]
; ror ax, 7 ; AX = [ 0000RRRq qqqQ0000 ]
; shr al, 4 ; AX = [ 0000RRRq 0000qqqQ ]
; oops, no, shifts the wrong way.
; shl ah, 3 ; AX = [ qqqqQ000 00000RRR ]
; or ah, al ; AX = [ qqqqQRRR 00000RRR ]
; xor al,al ; AX = [ qqqqQRRR 00000000 ]
; rol ax, 4 ; AX = [ QRRR0000 0000qqqq ]
; shr ah, 4 ; AX = [ QRRR0000 qqqq0000 ]
; only 3 shifts, but still partial-reg heavy. Interesting on Haswell
; ror ax, 9 ; AX = [ Q00000RR R000qqqq ] CF=Q