FPTAN Example x86

According to Intel documentation, this is what FPTAN does:

Replace ST(0) with its approximate tangent and push 1 onto the FPU stack.

And this is the code I wrote in NASM:

section .data
    fVal: dd 4
    fSt0: dq 0.0
    fSt1: dq 0.0

section .text
    fldpi
    fdiv  dword[fVal]  ; divide pi by 4 and store result in ST(0).
    fptan
    fstp  qword[fSt0]  ; store ST(0)
    fstp  qword[fSt1]  ; store ST(1)

At this point, the values I find in fSt0 and fSt1 are:

fSt0 = 5.60479e+044
fSt1 = -1.#IND

But shouldn't fSt0 and fSt1 both be 1?

Tags: assembly x86 x87

1 Answer

Melony?

Reply #2 · 2019-07-23 13:15

As Michael Petch has already pointed out in a comment, you have a simple typo. Instead of declaring fVal as a floating-point value (as intended), you declared it as a 32-bit integer. Change:

fVal: dd 4

to:

fVal: dd 4.0

Then your code will work as intended. It is otherwise correctly written.
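
For the curious, the values in the question can be accounted for. FDIV reads the four bytes stored by "dd 4" as a single-precision float, which is a tiny denormal, so the quotient is enormous. FPTAN only accepts operands whose magnitude is below 2^63; for anything larger it sets the C2 flag, leaves ST(0) unchanged, and does not push 1.0. The first FSTP therefore stores the unchanged quotient, and the second one pops an empty stack, which (with exceptions masked) stores the indefinite NaN that is printed as -1.#IND. The Python sketch below reproduces only the arithmetic. It runs in double precision rather than on the x87, and the variable names are my own.

```python
import math
import struct

# Reinterpret the 32-bit integer 4 as an IEEE 754 single-precision float.
# This mimics what "fdiv dword [fVal]" does with the bytes from "dd 4".
raw = struct.pack("<i", 4)
as_float = struct.unpack("<f", raw)[0]   # denormal, about 5.6e-45

# What ST(0) holds after the division, just before FPTAN executes.
quotient = math.pi / as_float            # about 5.60479e+44, matching fSt0

# FPTAN requires |ST(0)| < 2**63; outside that range it does nothing
# except set C2, so no 1.0 is pushed.
in_range = abs(quotient) < 2.0 ** 63

print(f"bits of 4 read as float: {as_float:.6g}")
print(f"pi divided by that:      {quotient:.6g}")
print(f"acceptable to FPTAN:     {in_range}")
```

With "dd 4.0" the divisor is a genuine 4.0, the quotient is pi/4, and FPTAN behaves as documented.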

If you wanted to take an integer input, you could do it by changing your code to use the FIDIV instruction. This instruction first converts its integer operand to double extended-precision floating-point format (the x87's internal 80-bit format), and then does the divide:

fldpi
fidiv  dword [fVal]    ; st(0) = pi / fVal
fptan                  ; st(0) = tan(st(0))
                       ; st(1) = 1.0
fstp   qword [fSt0]
fstp   qword [fSt1]

But because the conversion is required, this is slightly less efficient than if you had just given the input as a floating-point value.

Note that, if you were going to do this, it would be more efficient on certain older CPUs to break up the load so that it was done separately from the division, e.g.:

fldpi
fild   dword [fVal]
fdivp  st(1), st(0)    ; st(0) = pi / fVal
fptan                  ; st(0) = tan(st(0))
                       ; st(1) = 1.0
fstp   qword [fSt0]
fstp   qword [fSt1]

In other words, we break the FIDIV instruction apart into separate FILD (integer load) and FDIVP (divide-and-pop) instructions. This improves overlapping, and thus shaves a couple of clock cycles off the execution time of the code. (On newer CPUs, meaning AMD Family 15h [Bulldozer] and later and Intel Pentium II and later, there's no real advantage to breaking up FIDIV into FILD+FDIVP; either way you write it should be equally performant.)

Of course, since everything you have here is a constant, and tan(pi/4) == 1, your code is equivalent to:

fld1
fld1

…which is what an optimizing compiler would generate. :-)
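
As a quick sanity check of that identity, here is a small Python sketch. It uses double precision rather than the x87's extended precision, which is an assumption of this illustration, and because pi itself is rounded the computed tangent is only guaranteed to be very close to 1.0.

```python
import math

# tan(pi/4) evaluated in IEEE 754 double precision. The rounded value of pi
# means the result is extremely close to 1.0 but need not equal it exactly.
t = math.tan(math.pi / 4)

close_to_one = math.isclose(t, 1.0, rel_tol=1e-12)
print(close_to_one)
```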
