I am trying to write FizzBuzz in Assembly and I am seeing segmentation fault all the time. So far I have determined that it is not my printing routines (because I have removed their contents and the problem persists) and the error hides somewhere in the main function.
I was getting this output when I run the program:
fizzSegmentation fault
Leading me to believe that it's not the problem with using division and looking up the remainders. But I could be wrong, I haven't done Assembly in two years...
SECTION .data
global _start
fizz: db "fizz", 4
buzz: db "buzz", 4
SECTION .bss
counter: resb 1
SECTION .text
_start:
mov ax,0
mov [counter],ax
main_loop:
cmp ax,100 ;from 0 to 100
je exit ;
mov bl,3 ;divisor
mov ah,0 ;here will be a remainder
div bl ;divide
cmp ah,0 ;compare the remainder with 0
je print_fizz ;print fizz if they equal
mov bl,5 ;new divisor
mov ah,0 ;do I have to do it every time?
div bl ;divide
cmp ah,0 ;compare the remainder with 0
je print_buzz ;print buzz if they equal
jmp print_ax ;print contents of ax if not
inc ax ;increment ax
jmp main_loop ;jump to label
print_ax:
ret
print_fizz:
ret
print_buzz:
ret
exit:
mov rax,1
mov rbx,0
int 80h
ret
I am compiling using:
yasm -f elf64 -o fizzbuzz.o fizzbuzz.asm
ld -d -o fizzbuzz fizzbuzz.o
This answer ended up a lot longer than I was planning, kind of a tutorial on writing efficient asm. i.e. how to make a simple problem complicated.
In case anyone's interested in a code-review of the attempted implementation, and a version with a lot of asm tricks:
There's so many small ways this could be better, e.g. keep
5
inbh
and3
inbl
. You don't always have to usediv bl
. AMD64 has 20 one-byte registers. (al/ah, bl/bh, cl/ch, dl/dh (no REX), and sil, dil, ... r15b (REX required)).Using a 16bit counter is at least a waste of bytes (operand-size prefixes), and can cause slowdowns. Using
mov reg,0
is bad. Put the the conditional branch at the bottom of the loop whenever possible.mov rax, 1
is a waste of instruction bytes compared tomov eax, 1
, and this is tagged yasm, which doesn't optimize that for you at assemble time. (nasm
itself does.) Setting up 64bit registers and then using theint 0x80
32bit compatibility ABI is even more silly.Storing the 16bit counter into memory is silly in the first place, but storing it to an address where you've only reserved one byte is asking for trouble.
Besides the small stuff,
FizzBuzz(3,5)
is small enough to unroll and avoid somediv
s entirely. With assembler macros, you could easily produce a fully-unrolled loop with LCM(fizz,buzz) outputs per loop (i.e. 15 in this case); enough for the pattern to repeat so you don't need any conditionals.You can avoid
div
without unrolling by using down-counters to detectcount%5==0
andcount%3==0
. @anatolyg's 16bit DOS code-golf FizzBuzz does that. This is a really common technique to do something every N times something else happens. e.g. performance counter events work this way.Here's my attempt at an efficient FizzBuzz (for AMD64 Linux), using no libraries. only
write(2)
andexit_group(2)
There's no compiler, so if you want good code, you have to write it yourself. You can't just hope the compiler will do something good with
i%3
in a loop (which it doesn't anyway, for most compilers).The code evolved a lot as I was writing it. As usual, starting to implement one way gives you better ideas when you see that your first idea takes more or slower instructions than you hoped.
I unrolled by 3 (Fizz) to remove all checks of
counter%3
. I handledcounter%5
checks with a countdown from 5 instead of division. This still takes a fair bit of logic that would go away with a full unroll to the point where the pattern repeats (LCM(3,5)). The integer to ASCII digits code could be in a function, or inlined into the unrolled loop for very bloated code.I keep everything in registers (including the constants
fizz\n
andbuzz\n
). There are no loads, and stores only into the buffer. Many of the registers are set once outside the loop, rather than with amov
-immediate right before use. This requires good comments to keep track of what you put where.I append characters to a buffer which we
write(2)
after everyfizzbuzz\n
line. This is the longest cycle that occurs naturally in the logic of the program, and means we only need thesyscall
code in one place.In a real program that might write to a file or pipe, it would be better to use C stdio's strategy of using a much larger buffer in that case. (Many ~100 byte writes are much worse than fewer 4096B writes.) Still, I thought this was an interesting choice between the traditional printf every iteration or accumulating the entire string into one big buffer. I used a static buffer instead of reserving stack space because I'm writing a whole program, not a function that should avoid wasting memory after returning. Also, it let me use 32bit operand-size for pointer increments, to save code bytes (REX prefixes).
It would be pretty easy to accumulate multiple blocks, until you get to a point where the next group might go past the end of the buffer. i.e. compare the current position against
buffer_end - BUZZMOD*FIZZMOD*9
. Optimizing I/O system calls is obviously a broad topic, and this version is sufficient to demonstrate accumulating a string in a buffer.This is how it runs:
Correctness check:
Unrolling over Buzz (5) and using a down-counter for Fizz would probably be worse. My version has a 64bit store of
fizz\n\0\0\0
and then a branch to decide whether to storebuzz\n\0\0\0
overlapping to producefizzbuzz\n
. The other way would have a branch to decide whether to storefizz
(no newline needed, so it can be a 32bit store). Then it would unconditionally storebuzz\n\0\0\0
. However, with FIZZMOD being smaller than BUZZMOD, it means more frequent resets of the down-counter, and more checks to see if we need to print a number instead of a string this iteration. Having every third line known to befizz\n
orfizzbuzz\n
means the simpler code for that runs more often.If overlapping stores are a problem, this whole algorithm is screwed, and this is just one of many. Also, we could just branch before storing
fizz\n
and adding 5. Then in thefizzbuzz\n
case, we do two stores and add 9. This also separates the dec/jcc from the cmp/jcc at the bottom ofmain_loop
, so they can hopefully both macro-fuse on pre-Haswell. IIRC, some CPUs have branch predictors that really don't like multiple branches really close to each other.Further improvements, left as an exercise for the reader:
Inline
buzz_or_number
, and maybe turn it into a loop (FIZZMOD-1 iterations)Besides that, it could probably branch less, and other minor improvements. This is sort of a version 1.1: working, tested, with some comments and observations added while writing up this answer, but not actually improving the code much from what I initially decided was good enough to see if it worked.
Make it more flexible by writing a cleanup loop (or assembler macros) for the last
LASTCOUNT % FIZZMOD
lines, instead of assuming that it's 1 line. Cleanup code is the downside to unrolling.I used a
div
by 10 to convert the counter to a string. A better implementation would use a multiplicative inverse, like compilers generate for small constant divisors (implemented in this case with LEA).Another strategy worth considering is a strength-reduction to incrementing a sequence of ASCII digits (stored in a register). This technique would be harder to extend to numbers with more digits. Storing them in printing order (most significant digit in the low byte) makes carry between digits work against us instead of for us. (e.g. if they were in natural order, you could
add eax, 256-10
to correct the low digit and increment the high digit via carry.) It might be worth it to keep it this way, but BSWAP to store. Keeping a\n
embedded in the register so it only takes one store might not be worth it. Detecting and handling a 1 digit number becoming a 2 digit number is bad enough.In 32bit mode, we could maybe use the
AAA
instruction to do decimal carry after incrementing. However, despite the mnemonic, it works on BCD (0-9
), not ASCII ('0'-'9'
), and doesn't appear to make it easy to propagate the carry to a 3rd digit. No wonder AMD removed it for AMD64. It checks theAF
flag to detect carry out of the low 4 bits, but that's only useful forDAA
, where you have two BCD digits packed into a single byte, and when you're adding unknown values, not incrementing. In this case, you just checkal >= 10
.My first version of this almost worked the first time (after fixing a couple syntax errors so it would assemble, and a silly crash that took a couple minutes to debug IIRC): It printed
fizz\nbuzz\n
in thefizzbuzz\n
case, and it reversed the digits. I keep forgetting that digit-strings need to be stored with the most-significant digit first, not like the bytes in a little-endian binary integer.Alternate ways to do things
I decided not to use a branchless version of the 1 digit vs. 2 digit ASCII conversion code, since it took a lot of instructions. Also, the branch should predict very well.
In 32bit (and 16bit) mode, there's a
div
instruction that takes an immediate operand, and usesAL
as the dividend, notAX
. It's calledAAM
, and was removed for AMD64 along with the other BCD/ASCII instructions. It's convenient for testing divisibility by 5 without tying up a register for the divisor or wasting an instruction inside the loop. It's slightly faster thandiv r/m8
, and sets flags according to the remainder (inal
: it has its outputs reversed, compared todiv
).Anatolyg's golfed FizzBuzz uses AAM in a loop with
shr ax, 8
to generate one digit at a time in reverse order, storing and decrementing a pointer.This version is way more complicated because it's using
AAM
to check for count%5 and then processing that into count%10, instead of doing a separate division to get ASCII digits.This is causing the segmentation fault:
Since you jump to the functions, the
ret
gets no return address and will return anywhere. Change it to acall/ret
-pair:This will cause an infinite loop:
AX
changes its values and is unfit to hold the loop-counter. I suggest:Use a debugger to single step your code and see where it goes wrong.
From a quick glance it's already obvious you are destroying
ax
(maybe you don't know thatax
is made up ofah
andal
?). Also you are jumping to functions instead of calling them, this is probably the cause of the faults.