How to calculate the sum of a sequence of powers o

2019-08-16 20:35发布

I want to calculate a sequence of exponents of base 2. for example if given 3 the program would calculate (2^3 + 2^2 + 2^1 + 2^0). The problem is I cant calculate the exponent to being with.

I have tried shl to calculate the exponent and I have tried a loop. Neither produce the correct result, my latest attempt is below.

.586
.MODEL FLAT
INCLUDE io.h                            
.STACK 4096

.DATA

        Exponent    DWORD   ?           
        Prompt  BYTE    "Enter an exponent", 0

        Base        DWORD  2

        string      BYTE    40 DUP (?)

        resultLbl   BYTE    "The solution is", 0
        Solution    DWORD   20 DUP (?), 0

.CODE
_MainProc PROC

    input   Prompt, string, 40      ; read ASCII characters (N)
        atod    string              ; ascii to double
        mov     exponent, eax       ; store in memory     


        push Exponent
        push Base

        call function
        add esp, 8

        dtoa   Solution, eax         ; convert to ASCII characters
        output  resultLbl, Solution   ; output label and sum



                mov     eax, 0      ; exit with return code 0
        ret

_MainProc ENDP


;@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
; Procedure takes base and exponent from main

Function PROC
    push ebp                      ; stores base pointer
    mov  ebp, esp                 ; set base pointer to current position

    mov ebx, [ebp + 12]   ; read base (2)
    mov ecx,  [ebp + 8]       ; read N



    MainLoop:           ; loop to calculate expoent
          mul ebx           ; multiply base by its self
          add eax, ebx           ; save answer in eax
          dec ecx                ; decrement the exponent
          cmp ecx, 0             ; check if exponent is 0 yet
          je exit                ; Exit if exponent is 0
          jg mainloop            ; stay in loop if exponent is above 0

    Exit: 

        pop ebp                   ; restore base pointer 
        ret

function ENDP



END                  

I want the program to produce the correct exponent result, and if you can calculate the sequence above, for example if given 3 the program would calculate (2^3 + 2^2 + 2^1 + 2^0).

1条回答
姐就是有狂的资本
2楼-- · 2019-08-16 21:12

Base 2 is an extremely special case because computers use binary integers.

2^n = 1 << n i.e. 2n = 1 left-shifted n times. i.e. that bit-position is set.

You're asking for 2^0 + 2^1 + ... 2^n. A number with all bits set up to an including bit n.
That's 1 less than 2^(n+1), so the number you want is 2^(n+1) - 1.


x86 has an instruction to set a bit at a given position in a register: bts reg, reg = Bit Test and Set. On Intel CPUs, the reg-reg form decodes to a single uop, with 1-cycle latency. (https://agner.org/optimize/) It also puts the old value of the bit in CF, but normally we don't care about that when using it for this. Normally we just zero a destination register and use BTS to set a single bit.

On Ryzen, bts reg,reg is 2 uops and variable-count shifts are single-uop, so mov edx, 1 / shl edx, cl is good, too. But that's worse for code-size, and more uops on Intel CPUs.

On Intel Sandybridge-family CPUs, shl reg, cl decodes to 3 uops. BMI2 shlx reg, reg, reg is 1 uop, but BMI2 support is still far from being something you can assume, unfortunately.

(Memory-destination BTS with a register source is very slow, treating the source as an indexing into an arbitrary length bit-string, not just the addressed dword, so normally avoid it for performance.)


Calculating 2^(n+1) - 1 without overflow

(C version of this answer on another question, with compiler output.)

If we just added 1 to the input number before feeding it to BTS, it could wrap around and fail to work for 31 (which should set all bits, but would instead set bit 32%32 = 0)

Since we need one extra instruction after reading the input anyway, we can use x86's shift-and-add instruction (LEA) to do one more shift as we subtract 1. So with n=31 we start with the high bit set, and shift it out to get zero. Subtracting 1 then sets all the bits, as desired

xor    edx,edx                 ; edx = 0
bts    edx, eax                ; edx |=  1<<eax
lea    edx, [edx + edx - 1]    ; edx = (edx<<1) - 1

The logic for this goes as follows

n      BTS result (2^n)    *2 - 1
0    ->       1         ->   1        = 2 - 1
1    ->       2         ->   3        = 4 - 1
2    ->       4         ->   7        = 8 - 1
3    ->       8         ->  15        = 16 - 1
...
30   ->  0x40000000     ->  0x7FFFFFFF  = 0x80000000 - 1
31   ->  0x80000000     ->  0xFFFFFFFF  = 0 - 1

It's not a coincidence that the last column is a running total of the 2nd column.

LEA with 3 "components" to the addressing mode has extra latency vs. simpler LEAs, e.g. 3-cycle latency on Sandybridge-family vs. 1 cycle. But it's still a single uop so it's a great choice for throughput.

If we really wanted to optimize, and didn't have to worry about the n=31 case overflowing, we'd write the ASCII -> int loop by hand but start with a total of 1 instead of 0, to fold the n+1 into finding n. Then bts would give us our 2^(n+1), and we could simply dec that.


There's no need to store exponent to memory and reload it again; you already have it in a register where you want it.

Your comment on the atod line is wrong, and your code wouldn't make sense if it was right. ASCII to double (like the C function strtod) would return in x87 st0 or SSE2 xmm0, not EAX. atod actually stands for ASCII (decimal) to integer. Perhaps the d stands for DWORD.

    input   Prompt, string, 40      ; read ASCII characters (N)
    atod    string                 ; ascii (decimal) to integer in EAX

    xor    edx,edx                 ; edx = 0
    bts    edx, eax                ; edx |=  1<<eax
    lea    edx, [edx + edx - 1]    ; edx = (edx<<1) - 1

    dtoa    Solution, edx

It takes 2 instructions to implement 1<<n so it's silly to put it into its own function. Just passing args + calling the function would have take as much work as simply using bts.

0 -> 1 -> 1
1 -> 2 -> 3
2 -> 4 -> 7

Compilers typically use mov edx,1 + mov ecx, eax + shl edx, cl. But that's a missed optimization, especially on Intel CPUs.

BMI2 shlx edx, edx, ecx would help (avoiding the mov), but is worse code-size than xor-zero + bts. And BMI2 isn't always available.

If you can't get your compiler to use BTS (or you have BMI2 for SHLX), another good implementation is (2ULL << n) - 1 so you bake the extra shift into the number you start shifting. So a count of 31 can shift the bit out and produce 0.

查看更多
登录 后发表回答