So I have this problem I'm supposed to solve and I've spent hours trying to figure out the best way to do this, google hasn't been of much help.
The problem is to create a subroutine that is given a list of words that you then add with another list that becomes the output. Its basically a method of working with large numbers.
My code works fine for carry flags within words, but for carry flag from one full word to another it does not work. The first 16 bit word (0005 in the example below) is a flag used to tell my subroutine how many words there are.
For instance, given the following input,
//si 0005 0000 EEEE DDDD CCCC BBBB
//di 0005 0000 1111 2222 3333 4445
when the output expected is:
0005 0001 0000 0000 0000 0000
My code produces:
0005 0000 FFFF FFFF FFFF 0000
I believe I understand why this is happening for the most part, but am unsure of the best way to solve this issue. I need a low cost method of carrying over a 1 between different chunks of data.
;---------------------------------------
; ADD Subroutine
;---------------------------------------
.data
bxx dw 0000h ;
cxx dw 0000h ;
.code
;---------------------------------------
addx: ;
mov bxx, bx ;save incoming register
mov cxx, cx ;save incoming register
mov bx, si ;move n to bl - acts as a cursor
loopAdd: ;loop point
mov cx, [si+bx] ;move word at point si+bx into register cx
ADC [di+bx], cx ;add with carry
sub bx, 0002h; ;decrement cursor by a full word
cmp bx, 0000h ;bx == 0?
jg loopAdd ;no? jump to loop point
end: ;
mov bx, bxx ;return register to original state
mov cx, cxx ;return register to original state
ret ;return
;---------------------------------------
You have to save the carry flag from the previous iteration.
Try this:
Note that
cmp bx, 0000h
isn't needed becausecmp
issub
except forcmp
only modify flags and don't store calculated value, so you can check the result ofsub
directly.OP says he wants a low cost solution to preserve the carry between iterations. @MikeCAT had a solution; @PeterCordes suggested some improvements.
There's one more really nice improvement you can get when doing multiprecision arithmetic, under the assumption that your multiprecison value is "big" (contains many word values), and that's unrolling the inner loop N times, avoiding loop counting/carry flag damage inside the unrolled section. (If your multiprecision arithmetic is not very multi, you don't need a lot of optimization).
I've revised @MikeCAT's answer here, with the assumption that the unrolling should be 8 iterations.
The code has 3 parts: determining how to handle a fragment of 8 words, handling the fragment in an unrolled way, and then handling multiple 8 word chunks efficiently in the main unrolled loop.
For OPs' example of 5 words, this routine never gets to the full unrolled loop. For large multiword values, it does, and I'd expect this routine is probably pretty fast.
[The following code is untested.]
The sequence
suggests perhaps using the single-word block move instructions as an alternative:
with appropriate adjustments to the code, left to the reader.
Whether the loop I wrote or the LODS/STOS version is faster is something to measure carefully.
If you want fast multi-precision addition, use 64bit code if at all possible. Doing 4x the width with every instruction directly gives a 4x speedup. On 386-compatible CPUs, you can use 32bit instructions and registers even in 16bit mode, which will give you a 2x speedup.
To take Ira's unroll suggestion a step further
lea
adc
with a memory destination, which is slow on Intel.This should run at nearly one
adc
per clock (minus loop overhead) on Intel Broadwell, and on AMD K8 through Bulldozer. (I forget if k8/k10 can do two loads per clock. That'll be the bottleneck if it can't). Usingadc
with a memory destination is not good on Intel, but fine on AMD, according to Agner Fog's tables. Intel Haswell and earlier will be limited by the 2c latency ofadc
. (Broadwell madeadc
andcmov
single-uop instructions, making use of the 3-dependency uop support added in Haswell so FMA can be a single uop instruction).A
loop
insn might be a win on older CPUs where a partial-reg stall is really bad, but other ways to avoid a partial-flag stall might be even better than the slowloop
instruction.Using the dest-source trick reduces the loop overhead of decrementing the pointer. 2-register addressing modes in loads don't need to micro-fuse, because pure
mov
loads are a single uop anyway. stores do need to micro-fuse, so you should prefer one-register addressing modes for stores. Haswell's extra store-address unit on port7 can only handle simple addressing modes, and 2-register addressing modes can't micro-fuse.See also Problems with ADC/SBB and INC/DEC in tight loops on some CPUs for info about multiprecision adc loops and some experiments on Core2 vs. SnB CPUs for partial-flag stall performance.
Another way to loop here would be
lea si, -1[si]
/mov cx, si
/jcxz
. 16bit sucks, and you can't use[cx]
in an effective address.