What is the complexity / real cost of exp in cmath

Posted 2019-07-25 16:48

Question:

[I edited the whole question to make it clearer and more "useful"]

I was wondering about the complexity of the implementation of the function exp in cmath.

By complexity, I mean algorithmic complexity if possible; otherwise, its cost compared to a basic floating-point operation (an addition, for example).

The following lines:

double x = 3;
double y = std::exp(x);

compile to :

...
       movq    %rax, -40(%rbp)
       movsd   -40(%rbp), %xmm0
       call    exp
       movsd   %xmm0, -40(%rbp)
       movq    -40(%rbp), %rax
...

exp has to be dynamically loaded at runtime, but I can't find much information on the algorithmic complexity of its implementation. There seems to be no call to a special processor instruction (at least on my x86_64 platform with gcc), so there must be an implementation somewhere that I can't find. My guess is that the algorithm exploits the binary representation of the input to achieve very low complexity, but I haven't been able to find a solid reference on this topic.

Maybe speaking of algorithmic complexity is not really possible in this case, and all we can do is run tests (cf. the answers below) — but then how can we objectively quantify the difference between a floating-point operation and a call to exp?

Answer 1:

It seems that the complexity is actually constant, since the MSVC9 runtime does some bit magic involving lookup tables, bitmasks and biases. As there are few branches, the instruction pipeline should help a lot. Below is what it actually does.

unpcklpd    xmm0,xmm0 
movapd      xmm1,xmmword ptr [cv] 
movapd      xmm6,xmmword ptr [Shifter] 
movapd      xmm2,xmmword ptr [cv+10h] 
movapd      xmm3,xmmword ptr [cv+20h] 
pextrw      eax,xmm0,3 
and         eax,7FFFh 
mov         edx,408Fh 
sub         edx,eax 
sub         eax,3C90h 
or          edx,eax 
cmp         edx,80000000h 
jae         RETURN_ONE 
mulpd       xmm1,xmm0 
addpd       xmm1,xmm6 
movapd      xmm7,xmm1 
subpd       xmm1,xmm6 
mulpd       xmm2,xmm1 
movapd      xmm4,xmmword ptr [cv+30h] 
mulpd       xmm3,xmm1 
movapd      xmm5,xmmword ptr [cv+40h] 
subpd       xmm0,xmm2 
movd        eax,xmm7 
mov         ecx,eax 
and         ecx,3Fh 
shl         ecx,4 
sar         eax,6 
mov         edx,eax 
subpd       xmm0,xmm3 
movapd      xmm2,xmmword ptr Tbl_addr[ecx] 
mulpd       xmm4,xmm0 
movapd      xmm1,xmm0 
mulpd       xmm0,xmm0 
addpd       xmm5,xmm4 
mulsd       xmm0,xmm0 
addsd       xmm1,xmm2 
unpckhpd    xmm2,xmm2 
movdqa      xmm6,xmmword ptr [mmask] 
pand        xmm7,xmm6 
movdqa      xmm6,xmmword ptr [bias] 
paddq       xmm7,xmm6 
psllq       xmm7,2Eh 
mulpd       xmm0,xmm5 
addsd       xmm1,xmm0 
orpd        xmm2,xmm7 
unpckhpd    xmm0,xmm0 
addsd       xmm0,xmm1 
add         edx,37Eh 
cmp         edx,77Ch 
ja          ADJUST 
mulsd       xmm0,xmm2 
sub         esp,10h 
addsd       xmm0,xmm2 
movlpd      qword ptr [esp+4],xmm0 
fld         qword ptr [esp+4] 
add         esp,10h 
ret              
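To make the structure of that assembly easier to follow, here is a hypothetical scalar sketch (not a transcription of the MSVC code above) of the same constant-time strategy: range reduction, a 64-entry table of 2^(j/64), and a short polynomial for the tiny remainder. The function and constant names are my own.

```cpp
#include <cmath>

// Hypothetical sketch of a table-driven exp: constant work per call,
// no loops whose length depends on x.
double exp_sketch(double x) {
    static const double LN2_64 = 0.6931471805599453 / 64.0;  // ln(2)/64
    static double tbl[64];
    static bool init = false;
    if (!init) {
        for (int j = 0; j < 64; ++j) tbl[j] = std::pow(2.0, j / 64.0);
        init = true;
    }
    // Range reduction: x = k*(ln2/64) + r, with |r| <= ln2/128.
    long k = std::lround(x / LN2_64);
    double r = x - (double)k * LN2_64;
    // A degree-4 Taylor polynomial for e^r is ample on that tiny interval.
    double p = 1.0 + r * (1.0 + r * (0.5 + r * (1.0 / 6.0 + r * (1.0 / 24.0))));
    long j = ((k % 64) + 64) % 64;  // table index in [0, 64)
    long m = (k - j) / 64;          // remaining integer power of two
    // e^x = 2^(j/64) * 2^m * e^r
    return std::ldexp(tbl[j] * p, (int)m);
}
```

The table lookup and exponent assembly correspond roughly to the `Tbl_addr[ecx]`, `pand`/`paddq`/`psllq` sequence in the disassembly; real library code folds the table into precomputed constants rather than building it at first call.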


Answer 2:

Generally speaking, exponentiation for primitive types should be very fast. As the commenters mention, there are sometimes dedicated instructions for it, and if not, there are well-known fast algorithms; Knuth has a good section on such things.

The usual implementation for integer exponentiation is square-and-multiply, which uses the observation that you can break any exponentiation into a sequence of squarings plus at most one extra multiply per bit of the exponent. The basic algorithm for n**k runs in O(lg k) multiplications.
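Square-and-multiply can be sketched in a few lines. Note that this applies to n**k with an integer exponent k; a library exp(x) for real x relies on range reduction and polynomials instead.

```cpp
#include <cstdint>

// Binary (square-and-multiply) exponentiation: one squaring per bit
// of k, plus one extra multiply for each set bit — O(lg k) multiplies.
double ipow(double n, std::uint64_t k) {
    double result = 1.0;
    while (k > 0) {
        if (k & 1) result *= n;  // fold in this bit's factor
        n *= n;                  // square the base for the next bit
        k >>= 1;
    }
    return result;
}
```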



Answer 3:

Here one can find a fast exp implementation that uses SSE instructions.
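The core bit trick behind such fast approximations can be shown in scalar form. The sketch below is a variant of Schraudolph's well-known approximation — not the vectorized SSE code the answer links to — and the constants follow his published scheme; it trades several percent of accuracy for a handful of cheap operations.

```cpp
#include <cstdint>
#include <cstring>

// Scalar sketch of Schraudolph's approximate exp: scale and bias x so
// it lands in the exponent field of an IEEE-754 double, then let the
// hardware's 2^e interpretation do the work. Error is a few percent;
// illustrative only.
double fast_exp(double x) {
    const double a = 1048576.0 / 0.6931471805599453;  // 2^20 / ln 2
    const std::int32_t b = 1072693248 - 60801;        // exponent bias - shift
    std::int64_t bits = (std::int64_t)(std::int32_t)(a * x + b) << 32;
    double result;
    std::memcpy(&result, &bits, sizeof result);       // reinterpret the bits
    return result;
}
```

An SSE version does the same arithmetic on packed doubles; the scalar form just makes the exponent-field manipulation visible.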



Answer 4:

Are you interested in the time taken by exponentiation compared to the time taken by other floating-point operations? That will vary from implementation to implementation, and from computer to computer (one may have a different math processor), so there's no single answer we can give.

The right approach, if you want to know, is to write test functions and time them. Loop through a million floating-point assignments and time it, then loop through a million assignments of exponentials and time that, then subtract. Watch out for the optimizer here: if you don't use the result of the assignments, it is allowed to remove the whole loop. You'll recognize that by extremely fast runtimes that don't vary with the size of the loop.
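A minimal timing harness along those lines might look like this. The `volatile` sink is one way to keep the optimizer from deleting the loops; the function and struct names are my own, and the reported numbers will of course be machine-specific.

```cpp
#include <chrono>
#include <cmath>

struct LoopTimes { long long add_ns; long long exp_ns; };

// Time n plain additions, then n exp() calls, on a volatile sink so
// neither loop can be optimized away. Subtracting the first number
// from the second estimates the cost of exp itself.
LoopTimes time_loops(int n) {
    volatile double sink = 0.0;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i) sink = sink + 1.0e-9;
    auto t1 = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i) sink = std::exp(sink * 1.0e-9);
    auto t2 = std::chrono::steady_clock::now();
    using std::chrono::duration_cast;
    using std::chrono::nanoseconds;
    return { duration_cast<nanoseconds>(t1 - t0).count(),
             duration_cast<nanoseconds>(t2 - t1).count() };
}
```

Run it with a couple of different loop sizes: if the measured time doesn't grow with `n`, the compiler has outsmarted the harness and the numbers are meaningless.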