FizzBuzz in assembly - segmentation fault

I am trying to write FizzBuzz in Assembly and I am seeing segmentation fault all the time. So far I have determined that it is not my printing routines (because I have removed their contents and the problem persists) and the error hides somewhere in the main function.

I was getting this output when I run the program:

fizzSegmentation fault

Leading me to believe that it's not the problem with using division and looking up the remainders. But I could be wrong, I haven't done Assembly in two years...

SECTION .data
global _start
    fizz: db "fizz", 4
    buzz: db "buzz", 4

SECTION .bss
    counter: resb    1

SECTION .text
_start:

    mov ax,0
    mov [counter],ax

main_loop:

    cmp ax,100          ;from 0 to 100
    je  exit            ;
    mov bl,3            ;divisor
    mov ah,0            ;here will be a remainder
    div bl              ;divide
    cmp ah,0            ;compare the remainder with 0
    je  print_fizz      ;print fizz if they equal
    mov bl,5            ;new divisor
    mov ah,0            ;do I have to do it every time?
    div bl              ;divide
    cmp ah,0            ;compare the remainder with 0
    je  print_buzz      ;print buzz if they equal
    jmp print_ax        ;print contents of ax if not
    inc ax              ;increment ax
    jmp main_loop       ;jump to label

print_ax:
    ret

print_fizz:
    ret

print_buzz:
    ret

exit:
    mov rax,1
    mov rbx,0
    int 80h
    ret

I am compiling using:

yasm -f elf64 -o fizzbuzz.o fizzbuzz.asm
ld -d -o fizzbuzz fizzbuzz.o

回答1:

This is causing the segmentation fault:

...
    je  print_fizz      ;print fizz if they equal
...
    je  print_buzz      ;print buzz if they equal
    jmp print_ax        ;print contents of ax if not
...

print_ax:
    ret

print_fizz:
    ret

print_buzz:
    ret
...

Since you jump to the functions, the ret gets no return address and will return anywhere. Change it to a call/ret-pair:

...
;   je  print_fizz      ;print fizz if they equal
    jne .1              ;skip if not equal
    call print_fizz
    .1:
...

;   je  print_buzz      ;print buzz if they equal
    jne .2              ;skip if not equal
    call print_buzz
    .2:

;   jmp print_ax        ;print contents of ax if not
    call print_ax
...

This will cause an infinite loop:

mov ax,0
mov [counter],ax

main_loop:

    cmp ax,100          ;from 0 to 100
    je  exit
    ...
    mov ah,0            ;here will be a remainder
    div bl              ;divide
    ...
    mov ah,0            ;do I have to do it every time?
    div bl              ;divide
    ...
    inc ax              ;increment ax
    jmp main_loop       ;jump to label

AX changes its values and is unfit to hold the loop-counter. I suggest:

...
main_loop:

;   cmp ax,100          ;from 0 to 100
    cmp byte [counter], 100
...
;   inc ax              ;increment ax
    inc byte [counter]
    jmp main_loop       ;jump to label
...

回答2:

This answer ended up a lot longer than I was planning, kind of a tutorial on writing efficient asm. i.e. how to make a simple problem complicated.

In case anyone's interested in a code-review of the attempted implementation, and a version with a lot of asm tricks:

There's so many small ways this could be better, e.g. keep 5 in bh and 3 in bl. You don't always have to use div bl. AMD64 has 20 one-byte registers. (al/ah, bl/bh, cl/ch, dl/dh (no REX), and sil, dil, ... r15b (REX required)).

Using a 16bit counter is at least a waste of bytes (operand-size prefixes), and can cause slowdowns. Using mov reg,0 is bad. Put the the conditional branch at the bottom of the loop whenever possible.

mov rax, 1 is a waste of instruction bytes compared to mov eax, 1, and this is tagged yasm, which doesn't optimize that for you at assemble time. (nasm itself does.) Setting up 64bit registers and then using the int 0x80 32bit compatibility ABI is even more silly.

Storing the 16bit counter into memory is silly in the first place, but storing it to an address where you've only reserved one byte is asking for trouble.

Besides the small stuff, FizzBuzz(3,5) is small enough to unroll and avoid some divs entirely. With assembler macros, you could easily produce a fully-unrolled loop with LCM(fizz,buzz) outputs per loop (i.e. 15 in this case); enough for the pattern to repeat so you don't need any conditionals.

You can avoid div without unrolling by using down-counters to detect count%5==0 and count%3==0. @anatolyg's 16bit DOS code-golf FizzBuzz does that. This is a really common technique to do something every N times something else happens. e.g. performance counter events work this way.

Here's my attempt at an efficient FizzBuzz (for AMD64 Linux), using no libraries. only `write(2)` and `exit_group(2)`

There's no compiler, so if you want good code, you have to write it yourself. You can't just hope the compiler will do something good with i%3 in a loop (which it doesn't anyway, for most compilers).

The code evolved a lot as I was writing it. As usual, starting to implement one way gives you better ideas when you see that your first idea takes more or slower instructions than you hoped.

I unrolled by 3 (Fizz) to remove all checks of counter%3. I handled counter%5 checks with a countdown from 5 instead of division. This still takes a fair bit of logic that would go away with a full unroll to the point where the pattern repeats (LCM(3,5)). The integer to ASCII digits code could be in a function, or inlined into the unrolled loop for very bloated code.

I keep everything in registers (including the constants fizz\n and buzz\n). There are no loads, and stores only into the buffer. Many of the registers are set once outside the loop, rather than with a mov-immediate right before use. This requires good comments to keep track of what you put where.

I append characters to a buffer which we write(2) after every fizzbuzz\n line. This is the longest cycle that occurs naturally in the logic of the program, and means we only need the syscall code in one place.

In a real program that might write to a file or pipe, it would be better to use C stdio's strategy of using a much larger buffer in that case. (Many ~100 byte writes are much worse than fewer 4096B writes.) Still, I thought this was an interesting choice between the traditional printf every iteration or accumulating the entire string into one big buffer. I used a static buffer instead of reserving stack space because I'm writing a whole program, not a function that should avoid wasting memory after returning. Also, it let me use 32bit operand-size for pointer increments, to save code bytes (REX prefixes).

It would be pretty easy to accumulate multiple blocks, until you get to a point where the next group might go past the end of the buffer. i.e. compare the current position against buffer_end - BUZZMOD*FIZZMOD*9. Optimizing I/O system calls is obviously a broad topic, and this version is sufficient to demonstrate accumulating a string in a buffer.

;  for (count=1..100):
;  if(count%3 == 0) { print_fizz(); }
;  if(count%5 == 0) { print_buzz(); } else {
;       if(count%3 && count%5) print(count);
;; }
;  print(newline)

; We don't need pointers to these strings at all;  The strings are immediate data for a couple mov instructions
;SECTION .rodata        ; put constants in .rodata.
;    fizz: db "fizz"    ; No idea what the trailing  4  was for
;    buzz: db "buzz"

FIZZMOD equ 3                   ; only 3 works, but it would be easy to use a loop
BUZZMOD equ 5                   ; any value works
LASTCOUNT equ 100    ; max 100: we only handle two decimal digits.
; TODO: cleanup that can handle LASTCOUNT%FIZZMOD != 1 and LASTCOUNT%BUZZMOD != 0


SECTION .bss
;;; generate a string in this buffer.  (flush it with write(2) on "fizzbuzz" lines)
;    buf: resb    4096
buf: resb    FIZZMOD * BUZZMOD * 9     ; (worst case: every line is "fizzbuzz\n")

SECTION .text
global _start
_start:

    ; args for write(2).  (syscall clobbers rcx/r11,  and rax with the return value)
    mov   edi, 1                ; STDOUT_FILENO.  also happens to be __NR_write in the AMD64 Linux ABI
    mov   esi, buf              ; static data lives in the low 2G of address space, so we don't need a 64bit mov
    ;; edx = count.             ; calculated each iteration
    ;; mov eax, edi             ; also needed every time.   saves 3B vs  mov eax, imm32

    ; 'fizz' is only used once, so we could just store with an immediate there.  That wouldn't micro-fuse, and we'd have to do the newline separately
    mov   r10b, 10      ; base 10
    ;;mov   r14d, BUZZMOD  ; not needed, we don't div for this
    mov   r12, 'fizz' | 10<<32      ; `fizz\n`, but YASM doesn't support NASM's backquotes for \-escapes
    mov   r13, 'buzz' | 10<<32      ; `buzz\n`.  When buzz appears, it's always the end of a line


;;;;;;;; Set up for first iteration
    mov   ebp, BUZZMOD          ; detect count%BUZZMOD == 0 with a down-counter instead of dividing
    mov   ebx, 1                ; counter starts at 1
    mov   edx, esi              ; current output position = front of buf
ALIGN 16
main_loop:

    ;; TODO: loop FIZZMOD-1 times inside buzz_or_number, or here
    ;; It doesn't make much sense to unroll this loop but not inline buzz_or_number :/
    call  buzz_or_number
    inc   ebx

    call  buzz_or_number
    add   ebx, 2                ; counter is never printed on Fizz iterations, so just set up for next main_loop

    ;; Fizz, and maybe also Buzz
    mov   qword [rdx], r12      ; Fizz with a newline
    add   edx, 5                ; TODO: move this after the branch; adjust the offsets in .fizzbuzz

    dec   ebp
    jz   .fizzbuzz

;;.done_buzz:   ; .fizzbuzz duplicates the main_loop branch instead of jumping back here
    cmp   ebx, LASTCOUNT-FIZZMOD
    jbe   main_loop
;;;;;;;;;; END OF main_loop


.cleanup:
;;;;;;;;;;;;;;;;;;;;;  Cleanup after the loop
    ; hard-code the fact that 100 % FIZZMOD = 1 more line to print,
    ; and that 100 % BUZZMOD = 0, so the line is "buzz\n"

    mov   eax, edi              ; __NR_write
    mov   [rdx], r13            ; the final "buzz\n".
    sub   edx, buf - 5          ; write_count = current_pos+5 - buf.
    syscall                     ; write(1, buf, p - buf).
    ;; if buf isn't static, then use  add   edx, 5 / sub   edx, esi

    xor edi, edi
    mov eax, 231    ;  exit_group(0).  same as eax=60: exit() for a single-threaded program
    syscall


;;;;; The fizzbuzz case from the loop
.fizzbuzz:
;; count%BUZZMOD == 0:   rdx points after the \n at the end of fizz\n, which we need to overwrite

;; this is a macro so we can use it in buzz_or_number, too, where we don't need to back up and overwrite a \n
%macro  BUZZ_HIT 1
    mov   [rdx - %1], r13       ; buzz\n.  Next line will overwrite the last 3 bytes of the 64b store.
    add   edx, 5 - %1
    mov   ebp, BUZZMOD          ; reset the count%BUZZMOD down-counter
%endmacro

    BUZZ_HIT 1                  ; arg=1 to back up and overwrite the \n from "fizz\n"

    sub   edx, esi              ; write_count = current_pos - buf
    mov   eax, edi              ; __NR_write
    syscall                     ; write(1, buf, p - buf).  clobbers only rax (return value), and rcx,r11
    mov   edx, esi              ; restart at the front of the buffer

;;; tail-duplication of the main loop, instead of jmp back to the cmp/jbe
;;; could just be a jmp main_loop, if we check at assemble time that  LASTCOUNT % FIZZMOD != 0 || LASTCOUNT % BUZZMOD != 0
    cmp   ebx, LASTCOUNT-FIZZMOD
    jbe   main_loop
    jmp   .cleanup

;;;;;;;;;;;;;;;;;;;;;;; buzz_or_number: called for non-fizz cases
; special calling convention: uses (without clobbering) the same regs as the loop
;; modifies: BUZZMOD down-counter, output position pointer
;; clobbers: rax, rcx
ALIGN 32
buzz_or_number:
    dec   ebp
    jnz  .no_buzz              ; could make this part of the macro, but flow-control inside macros is probably worse than duplication

;; count%BUZZMOD == 0:  append "buzz\n" to the buffer and reset the down-counter
    BUZZ_HIT  0                 ; back up 0 bytes before appending
    ret

.no_buzz:             ;; get count as a 1 or 2-digit ASCII number
    ;; assert(ebx < 10);   We don't handle 3-digit numbers

    mov   eax, ebx
    div   r10b                  ; al = count/10 (first (high) decimal digit), ah = count%10 (second (low) decimal digit).
    ;; x86 is little-endian, so this is in printing-order already for storing eax

    ;movzx eax, ax            ; avoid partial-reg stalls on pre-Haswell
    ;; convert integer digits to ASCII by adding '0' to al and ah at the same time, and set the 3rd byte to `\n`.
    cmp   ebx, 9                ; compare against the original counter instead of the div result, for more ILP and earlier detection of branch misprediction
    jbe   .1digit               ; most numbers from 1..100 are 2-digit, so make this the not-taken case
    add   eax, 0x0a3030   ;;  `00\n`: converts 2 integer digits -> ASCII
    ;; eax now holds the number + newline as a 3-byte ASCII string
    mov   [rdx], eax
    add   edx, 3
    ret

.1digit:
;; Could use a 16bit operand-size here to avoid partial-reg stalls, but an imm16 would LCP-stall on Intel.
    shr   eax, 8                ; Shift out the leading 0
    add   eax, 0x000a30   ;; 1-digit numbers
    ;; eax now holds the number + newline as a 2-byte ASCII string
    mov   [rdx], ax
    add   edx, 2
    ret

This is how it runs:

$ strace ./fizzbuzz > /dev/null
execve("./fizzbuzz", ["./fizzbuzz"], [/* 69 vars */]) = 0
write(1, "1\n2\nfizz\n4\nbuzz\nfizz\n7\n8\nfizz\nbu"..., 58) = 58
write(1, "16\n17\nfizz\n19\nbuzz\nfizz\n22\n23\nfi"..., 63) = 63
write(1, "31\n32\nfizz\n34\nbuzz\nfizz\n37\n38\nfi"..., 63) = 63
write(1, "46\n47\nfizz\n49\nbuzz\nfizz\n52\n53\nfi"..., 63) = 63
write(1, "61\n62\nfizz\n64\nbuzz\nfizz\n67\n68\nfi"..., 63) = 63
write(1, "76\n77\nfizz\n79\nbuzz\nfizz\n82\n83\nfi"..., 63) = 63
write(1, "91\n92\nfizz\n94\nbuzz\nfizz\n97\n98\nfi"..., 40) = 40
exit_group(0)                           = ?

Correctness check:

./fizzbuzz | diff - <(perl -E'say((fizz)[$_%3].(buzz)[$_%5]or$_)for+1..100')
# no output = no difference

Unrolling over Buzz (5) and using a down-counter for Fizz would probably be worse. My version has a 64bit store of fizz\n\0\0\0 and then a branch to decide whether to store buzz\n\0\0\0 overlapping to produce fizzbuzz\n. The other way would have a branch to decide whether to store fizz (no newline needed, so it can be a 32bit store). Then it would unconditionally store buzz\n\0\0\0. However, with FIZZMOD being smaller than BUZZMOD, it means more frequent resets of the down-counter, and more checks to see if we need to print a number instead of a string this iteration. Having every third line known to be fizz\n or fizzbuzz\n means the simpler code for that runs more often.

If overlapping stores are a problem, this whole algorithm is screwed, and this is just one of many. Also, we could just branch before storing fizz\n and adding 5. Then in the fizzbuzz\n case, we do two stores and add 9. This also separates the dec/jcc from the cmp/jcc at the bottom of main_loop, so they can hopefully both macro-fuse on pre-Haswell. IIRC, some CPUs have branch predictors that really don't like multiple branches really close to each other.

Further improvements, left as an exercise for the reader:

Inline buzz_or_number, and maybe turn it into a loop (FIZZMOD-1 iterations)
Besides that, it could probably branch less, and other minor improvements. This is sort of a version 1.1: working, tested, with some comments and observations added while writing up this answer, but not actually improving the code much from what I initially decided was good enough to see if it worked.
Make it more flexible by writing a cleanup loop (or assembler macros) for the last LASTCOUNT % FIZZMOD lines, instead of assuming that it's 1 line. Cleanup code is the downside to unrolling.
I used a div by 10 to convert the counter to a string. A better implementation would use a multiplicative inverse, like compilers generate for small constant divisors (implemented in this case with LEA).

Another strategy worth considering is a strength-reduction to incrementing a sequence of ASCII digits (stored in a register). This technique would be harder to extend to numbers with more digits. Storing them in printing order (most significant digit in the low byte) makes carry between digits work against us instead of for us. (e.g. if they were in natural order, you could add eax, 256-10 to correct the low digit and increment the high digit via carry.) It might be worth it to keep it this way, but BSWAP to store. Keeping a \n embedded in the register so it only takes one store might not be worth it. Detecting and handling a 1 digit number becoming a 2 digit number is bad enough.

In 32bit mode, we could maybe use the AAA instruction to do decimal carry after incrementing. However, despite the mnemonic, it works on BCD (0-9), not ASCII ('0'-'9'), and doesn't appear to make it easy to propagate the carry to a 3rd digit. No wonder AMD removed it for AMD64. It checks the AF flag to detect carry out of the low 4 bits, but that's only useful for DAA, where you have two BCD digits packed into a single byte, and when you're adding unknown values, not incrementing. In this case, you just check al >= 10.

My first version of this almost worked the first time (after fixing a couple syntax errors so it would assemble, and a silly crash that took a couple minutes to debug IIRC): It printed fizz\nbuzz\n in the fizzbuzz\n case, and it reversed the digits. I keep forgetting that digit-strings need to be stored with the most-significant digit first, not like the bytes in a little-endian binary integer.

Alternate ways to do things

I decided not to use a branchless version of the 1 digit vs. 2 digit ASCII conversion code, since it took a lot of instructions. Also, the branch should predict very well.

  ;; Untested
  buzz_or_number:
  ...
  .no_buzz:
    ... 
    div   r10b
 DECIMAL_TO_ASCII_NEWLINE_2DIGIT equ 0x0a3030   ; add '0' to two unpacked decimal digits, and a newline
 DECIMAL_TO_ASCII_NEWLINE_1DIGIT equ 0x000a30

    ;; hoist this out of the loop: mov   r15d, DECIMAL_TO_ASCII_NEWLINE_2DIGIT - DECIMAL_TO_ASCII_NEWLINE_1DIGIT
    xor   ecx,ecx
    cmp   ah, 1            ; set CF if ah=0 (1 digit number), otherwise clear it.  This allows sbb for a conditional add, instead of setcc
    cmovae ecx, r15d       ; 0 or the difference from 1digit to 2digit

    lea   eax, [rax+rcx + DECIMAL_TO_ASCII_NEWLINE_1DIGIT]  ; rax+=0x0a3030 or 0x000a30, without clobbering flags
    mov   [rdx], eax
    sbb   edx, -3          ; add 2 (-(-3) - 1) or 3.
    ret

In 32bit (and 16bit) mode, there's a div instruction that takes an immediate operand, and uses AL as the dividend, not AX. It's called AAM, and was removed for AMD64 along with the other BCD/ASCII instructions. It's convenient for testing divisibility by 5 without tying up a register for the divisor or wasting an instruction inside the loop. It's slightly faster than div r/m8, and sets flags according to the remainder (in al: it has its outputs reversed, compared to div).

Anatolyg's golfed FizzBuzz uses AAM in a loop with shr ax, 8 to generate one digit at a time in reverse order, storing and decrementing a pointer.

This version is way more complicated because it's using AAM to check for count%5 and then processing that into count%10, instead of doing a separate division to get ASCII digits.

 ;; Untested
buzz_or_number_div:
    mov   eax, ebx

    aam   5               ; al = al%5  ah = al/5.  (opposite locations from div), and sets flags according to the remainder.
    jz    print_buzz      ; tailcall


    ; fall through into print_counter
;print_counter:
    ; maybe use the result of div by 5 to get division by 10?  
    ; shifting the low bit of the quotient into bit 4 of the remainder should be faster than dividing again.
    ;; after AAM: ah = 5bit quotient (qqqqQ), al = 3bit remainder(RRR)
    ;; starting point:     ; AX = [ 000qqqqQ 00000RRR ]
    ;; desired = byte swapped as well: [ 0000QRRR 0000qqqq ]
    shl   al, 5            ; AX = [ 000qqqqQ RRR00000 ]
    shr   ax, 1            ; AX = [ 0000qqqq QRRR0000 ]
    ror   ax, 8            ; AX = [ QRRR0000 0000qqqq ]  ; simple byte-swap
    shr   ah, 4            ; AX = [ 0000QRRR 0000qqqq ]

    add   eax, ...;  convert to ascii
    ...
    ret

    ; those instructions are all single-uop 1c latency on SnB-family, but pre-Haswell will insert extra merging uops.  (And stall while doing so, on pre-SnB).
    ; and there's another partial-reg stall when we read eax

    ; It might be possible to do this bit manipulation with fewer operations, or maybe different ones.  (maybe copy ax to cx, so we can move from cl or ch to al or ah?)

;    shr   ah, 1           ; AX = [ 0000qqqq 00000RRR ]  CF=Q   ; then what? setc/shift/or?  rcl is slow, too.
;    rorx  eax, eax, 32-4  ; AX = [ qqqq0000 0RRR0000 ]  CF=Q
;  nope, seems a dead end

;    shl   ah, 3           ; AX = [ qqqqQ000 00000RRR ]
;    ror   ax, 7           ; AX = [ 0000RRRq qqqQ0000 ]
;    shr   al, 4           ; AX = [ 0000RRRq 0000qqqQ ]
;  oops, no, shifts the wrong way.

;    shl   ah, 3           ; AX = [ qqqqQ000 00000RRR ]
;    or    ah, al          ; AX = [ qqqqQRRR 00000RRR ]
;    xor   al,al           ; AX = [ qqqqQRRR 00000000 ]
;    rol   ax, 4           ; AX = [ QRRR0000 0000qqqq ]
;    shr   ah, 4           ; AX = [ QRRR0000 qqqq0000 ]
; only 3 shifts, but still partial-reg heavy.  Interesting on Haswell

;    ror   ax, 9           ; AX = [ Q00000RR R000qqqq ]  CF=Q

回答3:

Use a debugger to single step your code and see where it goes wrong.

From a quick glance it's already obvious you are destroying ax (maybe you don't know that ax is made up of ah and al?). Also you are jumping to functions instead of calling them, this is probably the cause of the faults.