FPTAN Example x86

According to Intel documentation, this is what FPTAN does:

Replace ST(0) with its approximate tangent and push 1 onto the FPU stack.

And this is a code I wrote in NASM:

section .data
    fVal: dd 4
    fSt0: dq 0.0
    fSt1: dq 0.0

section .text
    fldpi
    fdiv  dword[fVal]  ; divide pi by 4 and store result in ST(0).
    fptan
    fstp  qword[fSt0]  ; store ST(0)
    fstp  qword[fSt1]  ; store ST(1)

At this point the values of fSt0 and fSt1, I find are:

fSt0 = 5.60479e+044
fSt1 = -1.#IND

But, shouldn't fSt0 and fSt1 be both 1?

As Michael Petch has already pointed out in a comment, you have a simple typo. Instead of declaring fVal as a floating-point value (as intended), you declared it as a 32-bit integer. Change:

fVal: dd 4

to:

fVal: dd 4.0

Then your code will work as intended. It is correctly written.

If you wanted to take an integer input, you could do it by changing your code to use the FIDIV instruction. This instruction will first convert an integer to a double-precision floating-point value, and then do the divide:

fldpi
fidiv  dword [fVal]    ; st(0) = pi / fVal
fptan                  ; st(0) = tan(st(0))
                       ; st(1) = 1.0
fstp   qword [fSt0]
fstp   qword [fSt1]

But because the conversion is required, this is slightly less efficient than if you had just given the input as a floating-point value.

Note that, if you were going to do this, it would be more efficient on certain older CPUs to break up the load so that it was done separately from the division—e.g.,

fldpi
fild   dword [fVal]
fdivp  st(1), st(0)    ; st(0) = pi / fVal
fptan                  ; st(0) = tan(st(0))
                       ; st(1) = 1.0
fstp   qword [fSt0]
fstp   qword [fSt1]

In other words, we break the FIDIV instruction apart into separate FILD (integer load) and FDIVP (divide-and-pop) instructions. This improves overlapping, and thus shaves off a couple of clock cycles from the execution speed of the code. (On newer CPUs, from AMD Family 15h [Bulldozer] and Intel Pentium II and later—there's no real advantage to breaking up FIDIV into FILD+FDIV; either way you write it should be equally performant.)

Of course, since everything you have here is a constant, and tan(pi/4) == 1, your code is equivalent to: