I want to calculate a sequence of exponents of base 2. for example if given 3 the program would calculate (2^3 + 2^2 + 2^1 + 2^0). The problem is I cant calculate the exponent to being with.
I have tried shl
to calculate the exponent and I have tried a loop. Neither produce the correct result, my latest attempt is below.
.586
.MODEL FLAT
INCLUDE io.h
.STACK 4096
.DATA
Exponent DWORD ?
Prompt BYTE "Enter an exponent", 0
Base DWORD 2
string BYTE 40 DUP (?)
resultLbl BYTE "The solution is", 0
Solution DWORD 20 DUP (?), 0
.CODE
_MainProc PROC
input Prompt, string, 40 ; read ASCII characters (N)
atod string ; ascii to double
mov exponent, eax ; store in memory
push Exponent
push Base
call function
add esp, 8
dtoa Solution, eax ; convert to ASCII characters
output resultLbl, Solution ; output label and sum
mov eax, 0 ; exit with return code 0
ret
_MainProc ENDP
;@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
; Procedure takes base and exponent from main
Function PROC
push ebp ; stores base pointer
mov ebp, esp ; set base pointer to current position
mov ebx, [ebp + 12] ; read base (2)
mov ecx, [ebp + 8] ; read N
MainLoop: ; loop to calculate expoent
mul ebx ; multiply base by its self
add eax, ebx ; save answer in eax
dec ecx ; decrement the exponent
cmp ecx, 0 ; check if exponent is 0 yet
je exit ; Exit if exponent is 0
jg mainloop ; stay in loop if exponent is above 0
Exit:
pop ebp ; restore base pointer
ret
function ENDP
END
I want the program to produce the correct exponent result, and if you can calculate the sequence above, for example if given 3 the program would calculate (2^3 + 2^2 + 2^1 + 2^0).
Base 2 is an extremely special case because computers use binary integers.
2^n = 1 << n
i.e. 2n = 1 left-shiftedn
times. i.e. that bit-position is set.You're asking for
2^0 + 2^1 + ... 2^n
. A number with all bits set up to an including bitn
.That's 1 less than
2^(n+1)
, so the number you want is2^(n+1) - 1
.x86 has an instruction to set a bit at a given position in a register:
bts reg, reg
= Bit Test and Set. On Intel CPUs, the reg-reg form decodes to a single uop, with 1-cycle latency. (https://agner.org/optimize/) It also puts the old value of the bit in CF, but normally we don't care about that when using it for this. Normally we just zero a destination register and use BTS to set a single bit.On Ryzen,
bts reg,reg
is 2 uops and variable-count shifts are single-uop, somov edx, 1
/shl edx, cl
is good, too. But that's worse for code-size, and more uops on Intel CPUs.On Intel Sandybridge-family CPUs,
shl reg, cl
decodes to 3 uops. BMI2shlx reg, reg, reg
is 1 uop, but BMI2 support is still far from being something you can assume, unfortunately.(Memory-destination BTS with a register source is very slow, treating the source as an indexing into an arbitrary length bit-string, not just the addressed dword, so normally avoid it for performance.)
Calculating
2^(n+1) - 1
without overflow(C version of this answer on another question, with compiler output.)
If we just added 1 to the input number before feeding it to BTS, it could wrap around and fail to work for
31
(which should set all bits, but would instead set bit32%32 = 0
)Since we need one extra instruction after reading the input anyway, we can use x86's shift-and-add instruction (LEA) to do one more shift as we subtract 1. So with
n=31
we start with the high bit set, and shift it out to get zero. Subtracting 1 then sets all the bits, as desiredThe logic for this goes as follows
It's not a coincidence that the last column is a running total of the 2nd column.
LEA with 3 "components" to the addressing mode has extra latency vs. simpler LEAs, e.g. 3-cycle latency on Sandybridge-family vs. 1 cycle. But it's still a single uop so it's a great choice for throughput.
If we really wanted to optimize, and didn't have to worry about the
n=31
case overflowing, we'd write the ASCII -> int loop by hand but start with a total of1
instead of0
, to fold then+1
into findingn
. Thenbts
would give us our2^(n+1)
, and we could simplydec
that.There's no need to store
exponent
to memory and reload it again; you already have it in a register where you want it.Your comment on the
atod
line is wrong, and your code wouldn't make sense if it was right. ASCII todouble
(like the C functionstrtod
) would return in x87st0
or SSE2xmm0
, not EAX.atod
actually stands for ASCII (decimal) to integer. Perhaps thed
stands for DWORD.It takes 2 instructions to implement
1<<n
so it's silly to put it into its own function. Just passing args + calling the function would have take as much work as simply usingbts
.Compilers typically use
mov edx,1
+mov ecx, eax
+shl edx, cl
. But that's a missed optimization, especially on Intel CPUs.BMI2
shlx edx, edx, ecx
would help (avoiding themov
), but is worse code-size thanxor
-zero +bts
. And BMI2 isn't always available.If you can't get your compiler to use BTS (or you have BMI2 for SHLX), another good implementation is
(2ULL << n) - 1
so you bake the extra shift into the number you start shifting. So a count of31
can shift the bit out and produce 0.