According to the Intel documentation, this is what FPTAN does:
Replace ST(0) with its approximate tangent and push 1 onto the FPU stack.
And this is the code I wrote in NASM:
section .data
fVal: dd 4
fSt0: dq 0.0
fSt1: dq 0.0
section .text
fldpi
fdiv dword[fVal] ; divide pi by 4 and store result in ST(0).
fptan
fstp qword[fSt0] ; store ST(0)
fstp qword[fSt1] ; store ST(1)
At this point, I find that the values of fSt0 and fSt1 are:
fSt0 = 5.60479e+044
fSt1 = -1.#IND
But shouldn't fSt0 and fSt1 both be 1?
As Michael Petch has already pointed out in a comment, you have a simple typo. Instead of declaring fVal as a floating-point value (as intended), you declared it as a 32-bit integer. Change:
fVal: dd 4
to:
fVal: dd 4.0
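Then your code will work as intended; it is otherwise correctly written. The typo also explains the values you saw. With dd 4, FDIV reinterprets the bit pattern 0x00000004 as a single-precision denormal (about 5.6e-45), so the quotient is about 5.6e+44. That is far outside FPTAN's valid range of ±2^63, so FPTAN sets the C2 flag, leaves ST(0) unchanged, and pushes nothing. The first FSTP therefore stores the unchanged quotient, and the second pops an empty stack and stores the indefinite NaN (-1.#IND).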
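FPTAN reports an operand outside ±2^63 through the C2 bit of the FPU status word, so a check along the following lines would catch a bad input before the stores. This is only a minimal sketch, assuming the same 32-bit NASM setup and data labels as the question; out_of_range and done are hypothetical labels, and discarding the operand is just one possible way to handle the error.

```nasm
section .data
fVal: dd 4.0                ; single-precision 4.0 (the corrected declaration)
fSt0: dq 0.0
fSt1: dq 0.0

section .text
    fldpi
    fdiv  dword [fVal]      ; st0 = pi / 4
    fptan                   ; in range: st1 = tan(x), st0 = 1.0
    fnstsw ax               ; copy the FPU status word into AX
    test  ah, 04h           ; C2 is bit 10 of the status word, i.e. bit 2 of AH
    jnz   out_of_range      ; C2 set: operand out of range, nothing was pushed
    fstp  qword [fSt0]      ; fSt0 = 1.0 (the value FPTAN pushed)
    fstp  qword [fSt1]      ; fSt1 = tan(pi/4), approximately 1.0
    jmp   done
out_of_range:               ; hypothetical label, not in the original code
    fstp  st0               ; assumed handling: discard the unchanged operand
done:                       ; hypothetical label, not in the original code
```

With the original dd 4 declaration, the same sequence would take the out_of_range branch instead of storing garbage.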
If you wanted to take an integer input, you could do it by changing your code to use the FIDIV instruction. This instruction first converts the integer operand to the FPU's internal double extended-precision (80-bit) floating-point format, and then does the divide:
fldpi
fidiv dword [fVal] ; st(0) = pi / fVal
fptan ; st(0) = tan(st(0)), then push 1.0:
; st(0) = 1.0, st(1) = tan
fstp qword [fSt0] ; fSt0 = 1.0
fstp qword [fSt1] ; fSt1 = tan(pi / fVal)
But because the conversion is required, this is slightly less efficient than if you had just given the input as a floating-point value.
Note that, if you were going to do this, it would be more efficient on certain older CPUs to break up the load so that it was done separately from the division—e.g.,
fldpi
fild dword [fVal]
fdivp st1, st0 ; st(0) = pi / fVal (NASM names the registers st0..st7)
fptan ; st(0) = tan(st(0)), then push 1.0:
; st(0) = 1.0, st(1) = tan
fstp qword [fSt0] ; fSt0 = 1.0
fstp qword [fSt1] ; fSt1 = tan(pi / fVal)
In other words, we break the FIDIV instruction apart into separate FILD (integer load) and FDIVP (divide-and-pop) instructions. This improves overlapping, and thus shaves a couple of clock cycles off the execution time of the code. (On newer CPUs, from AMD Family 15h [Bulldozer] and Intel Pentium II onward, there's no real advantage to breaking up FIDIV into FILD+FDIVP; written either way, the code should perform equally well.)
Of course, since everything you have here is a constant, and tan(pi/4) == 1, your code is equivalent to:
fld1
fld1
…which is what an optimizing compiler would generate. :-)