I believe integer addition or subtraction always takes the same time no matter how big the operands are. The time needed for the ALU output to stabilize may vary with the input operands, but the CPU component that consumes the ALU output will wait long enough that any integer operation is processed in the SAME number of cycles.
(The cycle counts for ADD, SUB, MUL, and DIV will differ, but ADD will take the same number of cycles regardless of its input operands, I think.)
Is this true for floating point operations, too?
I'm trying to implement a program that involves extensive floating point operations, and I wonder whether it would help to scale the numbers I'm dealing with to get a faster running time.
TL:DR: avoid denormal numbers and you're fine. If you don't need gradual underflow, set the Denormals Are Zero and Flush To Zero bits in the x86 MXCSR, or the equivalent for other architectures. In most CPUs, producing a denormal result traps to microcode, so it takes hundreds of cycles instead of 5.
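On x86 with SSE, one way to set those bits is via the standard intrinsics headers. A minimal sketch (the `_MM_SET_*` macros below are the usual ones from `xmmintrin.h`/`pmmintrin.h`; note that MXCSR is per-thread state, so do this in every thread that runs FP-heavy code):

```c
#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE
#include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE

// Call once per thread, before the FP-heavy work.
static void enable_ftz_daz(void)
{
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);          // denormal results flushed to 0
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);  // denormal inputs treated as 0
}
```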
See Agner Fog's insn tables for x86 CPU details, and also the x86 tag wiki.
It depends on your CPU, but typical modern FPUs are all similar in this respect.
Other than denormal operands, the latency/throughput of add/sub/mul is not data-dependent on typical modern FPUs (including x86, ARM, and others). They're usually fully pipelined but with multi-cycle latency (i.e. a new MUL can begin execution every cycle, if its inputs are ready), which makes variable latency inconvenient for out-of-order scheduling.
Variable latency would mean that two outputs would be ready in the same cycle, defeating the purpose of fully pipelining it, and making it impossible for the scheduler to reliably avoid conflicts like it does normally when dealing with known but mixed latency instructions / uops. (These lecture notes about in-order pipelines show how that's a structural hazard for write-back (WB), but the same idea applies for the ALU itself needing an extra buffer until it can hand off all the results it has ready.)
As an example on the high-performance end of the spectrum, Intel Haswell:

- `mulpd` (scalar, 128b or 256b vector of double-precision): 5c latency, two per 1c throughput (two separate ALUs).
- FMA: 5c latency, two per 1c throughput.
- `addpd`/`subpd`: 3c latency, one per 1c throughput. (But the add unit is on the same port as one of the mul/FMA units.)
- `divpd` (scalar or 128b-vectors): 10-20c latency, one per 8-14c throughput. (Also on the same port as one of the mul/FMA units.) Slower for 256b vectors (the div ALU isn't full-width). Somewhat faster for `float`s, unlike add/sub/mul.
- `sqrtpd`: 16c latency, one per 8-14c throughput. Again not full width, and faster for `float`.
- `rsqrtps` (fast, very approximate, only available for `float`): 5c latency, one per 1c throughput.
div/sqrt are the exception: their throughput and latency are data-dependent.
There are no fast parallel algorithms for div or sqrt, even in hardware. Some kind of iterative calculation is required, so fully pipelining would require duplicating lots of very similar hardware for each pipeline stage. Still, modern Intel x86 CPUs have partially-pipelined div and sqrt, with reciprocal throughput less than latency.
Compared to mul, div/sqrt have much lower throughput (~1/10th or worse), and significantly higher latency (~2x to 4x). The not-fully-pipelined nature of the div/sqrt unit in modern FPUs means that it can be variable latency without causing too many collisions at the ALU output port.
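One practical consequence of that throughput gap (a generic optimization sketch, not tied to any particular CPU): if you divide many values by the same divisor, hoist the division out of the loop and multiply by the reciprocal instead, accepting up to 1 ulp of extra rounding error:

```c
#include <stddef.h>

// Divide every element by the same divisor: one actual division up front,
// then only fully-pipelined multiplies inside the loop.
void scale_array(double *x, size_t n, double divisor)
{
    const double inv = 1.0 / divisor;   // single div, outside the loop
    for (size_t i = 0; i < n; i++)
        x[i] *= inv;                    // mul: high throughput, no div in the loop
}
```

Compilers generally won't make this transformation for you under strict IEEE semantics (except for exact cases like powers of 2), because it changes the rounding; options like `-ffast-math` allow it.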
SSE/AVX doesn't implement sin/cos/exp/log as single instructions; math libraries should code their own.
Many good math libraries didn't use x87 `fsin` either, even before SSE existed; it's micro-coded on all existing implementations, so the internal implementation uses the same 80-bit add/sub/mul/div/sqrt hardware that you can program with simple instructions; there's no dedicated `fsin` hardware (or at least not much; maybe a lookup table). Same for most of the other trig / transcendental x87 functions like `fyl2x`.
It would be nice if there was some dedicated `fsin` hardware, because range-reduction to +/- Pi/2 could really benefit from higher precision for inputs very near multiples of Pi/2. `fsin` uses the same 80-bit Pi constant (with a 64-bit mantissa) that you get from `fldpi`. This is the nearest representable `long double` to the exact value of Pi, and by chance the next two binary digits are zero, so it's actually accurate to 66 bits. But it still leads to a worst-case maximum error of 1.37 quintillion units in the last place, leaving fewer than four bits correct. (Bruce Dawson's series of articles about floating point are excellent, and you should definitely read them if you're about to write some floating point code. Index in this one.)
Intel couldn't improve the range-reduction precision of x87 `fsin` without breaking numerical compatibility with existing CPUs. It's definitely useful for different x86 CPUs to give numerically identical results when running the same instructions with the same inputs. In software, you could do range-reduction yourself with extended-precision floating point, like so-called double double to get quad precision (but still only the exponent range of `double`). double double can be implemented fairly efficiently with SSE2 packed-double instructions. An SSE2 library implementation of `fsin` might go for speed over precision and make the same trade-off as the x87 hardware: using only a regular `double` Pi constant for range reduction, leading to large errors in the worst cases. That would be a valid choice for some use-cases, and that's a big advantage of software: you can choose the right software implementation for your use case.
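As a sketch of the double-double building blocks (these are the classic error-free transforms, Knuth's TwoSum and an FMA-based TwoProd; the helper names are mine, not from any particular library):

```c
#include <math.h>   // fma

// Error-free transforms: each returns the rounded result and writes the
// exact rounding error to *err. A double-double value is just a (hi, lo)
// pair built from these, giving roughly 106 bits of significand.

// s + *err == a + b exactly (Knuth's TwoSum, valid for any a and b).
static double two_sum(double a, double b, double *err)
{
    double s  = a + b;
    double bv = s - a;
    *err = (a - (s - bv)) + (b - bv);
    return s;
}

// p + *err == a * b exactly, using a hardware FMA for the error term.
static double two_prod(double a, double b, double *err)
{
    double p = a * b;
    *err = fma(a, b, -p);   // exact residual of the product
    return p;
}
```

Carrying Pi as such a (hi, lo) pair is what lets a software range reduction stay accurate for inputs near large multiples of Pi/2.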
IDK about the x87 exp or log instructions, like `fyl2x`. They're microcoded, so they're nothing special for speed, but might be ok for accuracy. Still, a modern math library wouldn't copy a value from an xmm register to x87 just for that instruction. The x87 instruction is probably slower than what you can do with normal SSE math instructions. (And almost certainly not faster.)
For more about fast reciprocal and fast reciprocal sqrt, see Why is SSE scalar sqrt(x) slower than rsqrt(x) * x?
`rsqrtps` with a Newton-Raphson iteration is slightly less accurate than normal `sqrtps`. On Intel Haswell/Skylake, it's about the same latency IIRC, but may have better throughput. Without a NR iteration, it's too inaccurate for most uses.
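A sketch of that pattern with SSE intrinsics (the refinement is the standard one-step Newton-Raphson for 1/sqrt(x); illustrative only, not tuned for any specific microarchitecture):

```c
#include <xmmintrin.h>

// Approximate 1/sqrt(x) for 4 floats: rsqrtps gives ~12 bits of precision,
// and one Newton-Raphson step brings that to roughly 22-23 bits.
static __m128 rsqrt_nr(__m128 x)
{
    const __m128 half  = _mm_set1_ps(0.5f);
    const __m128 three = _mm_set1_ps(3.0f);
    __m128 y = _mm_rsqrt_ps(x);                 // initial approximation
    __m128 xyy = _mm_mul_ps(x, _mm_mul_ps(y, y));
    // y = 0.5 * y * (3 - x*y*y)
    return _mm_mul_ps(_mm_mul_ps(half, y), _mm_sub_ps(three, xyy));
}
```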
Anyway, this has gotten quite x86-specific. The relative performance of mul vs. sqrt depends heavily on CPU microarchitecture, but even across x86 vs. ARM vs. most other modern CPUs with hardware FPUs, you should find that `mul` and `add` performance aren't data-dependent.