I am programming in assembly language (x86) in MASM using Visual Studio 2013 Ultimate. I am trying to use an array to calculate a Fibonacci sequence for n elements. In other words, I am trying to go to an array element, obtain the two elements before it, add those up, and store the result in another array.
I am having trouble setting up the index registers to make this work.
I have my program set up like this:
TITLE fibonacci.asm
INCLUDE Irvine32.inc
.data
fibInitial BYTE 0, 1, 2, 3, 4, 5, 6
fibComputed BYTE 5 DUP(0)
.code
main PROC
MOVZX si, fibInitial
MOVZX di, fibComputed
MOV cl, LENGTHOF fibInitial
L1:
MOV ax, [si - 1]
MOV dx, [si - 2]
MOV bp, ax + dx
MOV dl, TYPE fibInitial
MOVZX si, dl
MOV [edi], bp
MOV dh, TYPE fibComputed
MOVZX di, dl
loop L1
exit
main ENDP
END main
I cannot compile this because of an error message that says "error A2031: must be index or base register" for the line `MOV ebp, ax + dx`. However, I'm certain that there are other logic errors I am overlooking.
Considering that fib(93) = 12200160415121876738 is the largest value that fits into a 64 bit unsigned integer, there may not be much point in trying to optimize this, unless you're calculating fib(n) modulo some (usually prime) number for large n.
There is a way to directly calculate fib(n) in log2(n) iterations, using either a Lucas sequence method or a matrix exponentiation method for Fibonacci numbers. The Lucas sequence method is faster and is shown below. Either could be modified to perform the math modulo some number.
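For reference, one standard formulation of the doubling identities behind the Lucas-sequence method (the actual code may organize the math differently; an implementation applies these while walking the bits of n from high to low):

```
L(n)    = F(n-1) + F(n+1)            ; companion Lucas numbers: 2, 1, 3, 4, 7, 11, ...
F(2n)   = F(n) * L(n)                ; doubling step
L(2n)   = L(n)^2 - 2*(-1)^n
F(n+1)  = (F(n) + L(n)) / 2          ; advance by one when the current bit of n is set
L(n+1)  = (5*F(n) + L(n)) / 2
```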
related: Code-golf print the first 1000 digits of Fib(10**9): my x86 asm answer using an extended-precision `adc` loop, and converting binary to strings. The inner loop is optimized for speed, other parts for size.

Computing a Fibonacci sequence only requires keeping two pieces of state: the current and previous element. I have no idea what you're trying to do with `fibInitial`, other than counting its length. This isn't perl where you do `for $n (0..5)`.

I know you're just learning asm, but I'm still going to talk about performance. There's not much reason to learn asm without knowing what's fast and what's not. If you don't need performance, let a compiler make the asm for you, from C sources. Also see the other links at https://stackoverflow.com/tags/x86/info
Using registers for your state simplifies the problem of needing to look at `a[-1]` while calculating `a[1]`. You start with `curr=1`, `prev=0`, and store `a[0] = curr`. To produce the "modern" starting-with-zero sequence of Fibonacci numbers, start with `curr=0`, `prev=1`.

Lucky for you, I was just thinking about an efficient loop for fibonacci code recently, so I took the time to write up a complete function. See below for an unrolled and a vectorized version (saves on store instructions, but also makes 64bit ints fast even when compiling for a 32bit CPU):
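First, a minimal non-unrolled sketch of that register-based loop (my illustration in NASM syntax for x86-64 SysV, a sketch of the idea rather than the full tested function; the `fib` name and the `void fib(uint32_t *dest, uint32_t count)` signature are assumptions):

```
; void fib(uint32_t *dest, uint32_t count)    ; hypothetical signature
; x86-64 SysV: rdi = dest, esi = count
global fib
fib:
    test   esi, esi
    jz     .done                ; count == 0: store nothing
    mov    esi, esi             ; zero-extend count (upper half of rsi isn't guaranteed)

    mov    eax, 1               ; curr = 1
    xor    edx, edx             ; prev = 0
    lea    rsi, [rdi + rsi*4]   ; end pointer = dest + count

.loop:                          ; do {
    mov    [rdi], eax           ;   *dest = curr
    add    rdi, 4               ;   dest++
    lea    ecx, [rax + rdx]     ;   tmp  = curr + prev
    mov    edx, eax             ;   prev = curr
    mov    eax, ecx             ;   curr = tmp
    cmp    rdi, rsi
    jb     .loop                ; } while (dest < end)
.done:
    ret
```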
An alternate loop condition could be a counted loop that runs `esi` down to zero instead of comparing a pointer against the end.
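For example (a sketch, assuming `esi` is initialized to `count` and the pointer increment and Fibonacci updates stay the same as above):

```
    dec    esi                  ; count--
    jnz    .loop                ; keep looping while count != 0
```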
AMD CPUs can fuse cmp/branch, but not dec/branch. Intel CPUs can also macro-fuse `dec/jnz`. (Or signed less than zero / greater than zero). `dec/inc` don't update the Carry flag, so you can't use them with above/below unsigned `ja/jb`. I think the idea is that you could do an `adc` (add with carry) in a loop, using `inc/dec` for the loop counter to not disturb the carry flag, but partial-flags slowdowns make this bad on modern CPUs.

`lea ecx, [eax + edx]` needs an extra byte (address-size prefix), which is why I used a 32bit dest and a 64bit address. (Those are the default operand sizes for `lea` in 64bit mode). No direct impact on speed, just indirect through code size.

An alternate loop body could be:
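One possibility (my sketch, using `add` plus a register swap instead of the `lea`/`mov` pair; the surrounding setup is the same as the loop above):

```
.loop:
    mov    [rdi], eax           ; *dest = curr
    add    rdi, 4
    add    edx, eax             ; prev += curr: edx now holds the next element
    xchg   eax, edx             ; swap roles (xchg reg,reg isn't free: ~3 uops on Intel)
    cmp    rdi, rsi
    jb     .loop
```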
Unrolling the loop to do more iterations would mean less shuffling. Instead of `mov` instructions, you just keep track of which register is holding which variable. i.e. you handle assignments with a sort of register renaming.

The thing with unrolling is that you need to clean up any odd iterations that are left over. Power-of-two unroll factors can make the cleanup loop slightly easier, but adding 12 isn't any faster than adding 16. (See the previous revision of this post for a silly unroll-by-3 version using `lea` to produce `curr + prev` in a 3rd register, because I failed to realize that you don't actually need a temp. Thanks to rcgldr for catching that.)

See below for a complete working unrolled version which handles any count.
Test frontend (new in this version: a canary element to detect asm bugs writing past the end of the buffer.)
This code is fully working and tested (unless I missed copying a change in my local file back into the answer >.<):
unrolled version
Thanks again to rcgldr for getting me thinking about how to handle odd vs. even count in the loop setup, rather than with a cleanup iteration at the end.
I went for branchless setup code, which adds `4 * (count % 2)` to the starting pointer. That can be zero, but adding zero is cheaper than branching to see if we should or not. The Fibonacci sequence overflows a register very quickly, so keeping the prologue code tight and efficient is important, not just the code inside the loop. (If we're optimizing at all, we'd want to optimize for many calls with short length).
To produce the starting-with-zero sequence, start with `curr=0` / `prev=1` instead of the current `curr=1` / `prev=0` initialization. We can also save a `mov` instruction in both versions by using `esi` to hold `prev`, now that `prev` depends on `count`.
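A sketch of how those pieces can fit together (my own illustration in NASM syntax for x86-64 SysV; the `fib_unrolled` name and the exact register assignments are assumptions, and it produces the 1, 1, 2, 3, ... sequence):

```
; void fib_unrolled(uint32_t *dest, uint32_t count)   ; hypothetical signature
; rdi = dest, esi = count
global fib_unrolled
fib_unrolled:
    test   esi, esi
    jz     .done                ; count == 0: nothing to store

    mov    esi, esi             ; zero-extend count (upper half of rsi isn't guaranteed)
    mov    dword [rdi], 1       ; dest[0] = 1 (redundant but harmless when count is even)

    mov    edx, esi
    and    edx, 1               ; edx = count % 2: also the right initial prev
    lea    rsi, [rdi + rsi*4]   ; end pointer = dest + count
    lea    rdi, [rdi + rdx*4]   ; branchless: skip one slot if count is odd
    mov    eax, 1               ; curr = 1

    cmp    rdi, rsi
    jae    .done                ; count == 1: the single element is already stored

.loop:                          ; two elements per iteration, no register shuffling
    mov    [rdi], eax           ; store curr
    add    edx, eax             ; prev += curr -> next element
    mov    [rdi + 4], edx       ; store it
    add    eax, edx             ; curr += that -> the element after
    add    rdi, 8
    cmp    rdi, rsi
    jb     .loop
.done:
    ret
```

Here `count % 2` doubles as the initial `prev`, which is the "prev depends on count" point made above.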
vectorized:
The Fibonacci sequence isn't particularly parallelizable. There's no simple way to get F(i+4) from F(i) and F(i-4), or anything like that. What we can do with vectors is fewer stores to memory. Start with:
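For illustration (my layout: two adjacent 64-bit sequence elements per 128b vector, with the lower element being the older value and `a` trailing `b` by one position):

```
a = [f1 f0]
b = [f2 f1]
```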
Then `a+=b; b+=a; a+=b; b+=a;` produces the pairs shown in the block below. This is less silly when working with two 64bit ints packed into a 128b vector. Even in 32bit code, you can use SSE to do 64bit integer math.
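Continuing the illustration above (these follow directly from the adds; storing only `a` after each `a+=b` writes two new elements per 16B store, in order):

```
a = [f3 f2]        ; after the first  a+=b
b = [f4 f3]        ; after the first  b+=a
a = [f5 f4]        ; after the second a+=b
b = [f6 f5]        ; after the second b+=a
```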
A previous version of this answer has an unfinished packed-32bit vector version that doesn't properly handle `count%4 != 0`. To load the first 4 values of the sequence, I used `pmovzxbd` so I didn't need 16B of data when I could use only 4B. Getting the first -1 .. 1 values of the sequence into vector registers is a lot easier, because there's only one non-zero value to load and shuffle around.

No point unrolling this further: the dep chain latency limits throughput, so we can always average storing one element per cycle. Reducing the loop overhead in uops can help for hyperthreading, but that's pretty minor.
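A sketch of the 64-bit SSE2 loop (my illustration in NASM syntax for x86-64 SysV; `fib64_vec` is a hypothetical name, `count` is assumed even here so the odd-count cleanup discussed next is skipped, and it produces the starting-with-zero sequence to match the vectors shown above):

```
default rel

section .rodata
align 16
fib_init_a: dq 0, 1              ; a = [f1 f0] = [1 0]
fib_init_b: dq 1, 1              ; b = [f2 f1] = [1 1]

section .text
; void fib64_vec(uint64_t *dest, size_t count)   ; count assumed even
global fib64_vec
fib64_vec:
    test    rsi, rsi
    jz      .done
    movdqa  xmm0, [fib_init_a]
    movdqa  xmm1, [fib_init_b]
    lea     rsi, [rdi + rsi*8]   ; end pointer
.loop:
    movdqu  [rdi], xmm0          ; store two elements with one 16B store
    paddq   xmm0, xmm1           ; a = [f(i+3) f(i+2)]
    paddq   xmm1, xmm0           ; b = [f(i+4) f(i+3)]
    add     rdi, 16
    cmp     rdi, rsi
    jb      .loop
.done:
    ret
```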
As you can see, handling all the corner cases even when unrolling by two is quite complex to keep track of. It requires extra startup overhead, even when you're trying to optimize that to keep it to a minimum. It's easy to end up with a lot of conditional branches.
updated main:
Performance
For `count` just below 8192, the unrolled-by-two non-vector version runs near its theoretical-max throughput of 1 store per cycle (3.5 instructions per cycle), on my Sandybridge i5. 8192 * 4B/int = 32768 = L1 cache size. In practice, I see ~3.3 to ~3.4 insn / cycle. I'm counting the entire program with Linux `perf`, though, not just the tight loop.

Anyway, there's not really any point unrolling further. And obviously this stopped being a Fibonacci sequence after count=47, since we use `uint32_t`. However, for large `count`, the throughput is limited by memory bandwidth, down to ~2.6 insn / cycle. At this point we're basically looking at how to optimize memset.

The 64bit vector version runs at 3 insns per cycle (one 128b store per two clocks) up to an array size of about 1.5 times L2 cache size (i.e. `./fib64 49152`). As the array size goes up to larger fractions of L3 cache size, performance decreases down to ~2 insn per cycle (one store per 3 clocks) at 3/4 of L3 cache size. It levels out to 1 store per 6 cycles at sizes > L3 cache.

So storing with vectors does better when we fit in L2 but not L1 cache.