I am working on a very low level part of the application in which performance is critical.
While investigating the generated assembly, I noticed the following instruction:
lea eax,[edx*8+8]
I am used to seeing additions when using memory references (e.g. [edx+4]), but this is the first time I have seen a multiplication.
- Does this mean that the x86 processor can perform simple multiplications in the lea instruction?
- Does this multiplication have an impact on the number of cycles needed to execute the instruction?
- Is the multiplication limited to powers of 2 (I would assume this is the case)?
Thanks in advance.
To expand on my comment and to answer the rest of the question...
Yes, it's limited to powers of two (2, 4, and 8 specifically), so no multiplier is needed since it's just a shift. The point of it is to quickly generate an address from an index variable and a pointer, where the datatype is a simple 2-, 4-, or 8-byte word. (Though it's often abused for other uses as well.)
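To make the index-plus-pointer case concrete, here is a minimal C sketch of the arithmetic involved. The function names are made up for illustration, and whether a compiler actually emits lea or a scaled memory operand for code like this depends on the compiler and its options.

```c
#include <stddef.h>
#include <stdint.h>

/* Address of element `index` in an array of 8-byte elements:
   base + index * 8, i.e. the [base + index*8] addressing form.
   (Hypothetical helper, for illustration only.) */
static uintptr_t element_address(const uint64_t *base, size_t index)
{
    return (uintptr_t)base + index * sizeof(uint64_t);
}

/* Pure-arithmetic model of `lea eax,[edx*8+8]` from the question:
   no memory is read, the "address" is simply computed and returned. */
static uint32_t lea_scaled_by_8_plus_8(uint32_t edx)
{
    return edx * 8u + 8u;
}
```

For example, element_address(a, 3) gives the same value as (uintptr_t)&a[3] for an array of uint64_t on a flat address space, and lea_scaled_by_8_plus_8(5) evaluates to 48.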
As for the number of cycles that are needed: according to Agner Fog's tables, it looks like the lea instruction is constant on some machines and variable on others.
On Sandy Bridge there's a 2-cycle penalty if it's "complex or rip relative". But it doesn't say what "complex" means... So we can only guess unless you do a benchmark.
Actually, this is not something specific to the lea instruction.
This type of addressing is called Scaled Addressing Mode. The multiplication is achieved by a bit shift, which is trivial.
You could do 'scaled addressing' with a mov too, for example (note that this is not the same operation; the only similarity is the fact that ebx*4 represents an address multiplication):
mov edx, [esi+4*ebx]
(source: http://www.cs.virginia.edu/~evans/cs216/guides/x86.html#memory)
For a more complete listing, see this Intel document. Table 2-3 shows that a scaling of 2, 4, or 8 is allowed. Nothing else.
Latency (in terms of number of cycles): I don't think this should be affected at all. A shift is a matter of connections, and selecting between three possible shifts is the matter of 1 multiplexer worth of delay.
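The "scale is just a shift" point can be modelled in a few lines of C. This is a sketch, not a description of the actual hardware: the function name is hypothetical, and it only mirrors the fact that the scale is stored as a two-bit field (0 to 3) in the SIB byte, selecting a factor of 1, 2, 4, or 8.

```c
#include <assert.h>
#include <stdint.h>

/* Model of a 32-bit effective address computation:
   base + (index << scale_bits) + disp, wrapping modulo 2^32.
   scale_bits is the two-bit scale field: 0, 1, 2, 3 select a
   scale of 1, 2, 4, 8. (Hypothetical helper, for illustration.) */
static uint32_t effective_address(uint32_t base, uint32_t index,
                                  unsigned scale_bits, uint32_t disp)
{
    assert(scale_bits <= 3);  /* only four scales can be encoded */
    return base + (index << scale_bits) + disp;
}
```

With this model, effective_address(0, 5, 3, 8) corresponds to lea eax,[edx*8+8] with edx = 5 and yields 48. A scale such as 3 or 16 has no encoding, which is why it cannot appear in an address expression.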
To expand on your last question:
Is the multiplication limited to powers of 2 (I would assume this is the case)?
Note that you get the result of base + scale * index, so while scale has to be 1, 2, 4 or 8 (the size of x86 integer datatypes), you can get the equivalent of a multiplication by some different constants by using the same register as base and index, e.g.:
lea eax, [eax*4 + eax] ; multiply by 5
This is used by the compiler to do strength reduction. For example, for a multiplication by 100, depending on compiler options (target CPU model, optimization options), you may get:
lea (%edx,%edx,4),%eax # eax = orig_edx * 5
lea (%eax,%eax,4),%eax # eax = eax * 5 = orig_edx * 25
shl $0x2,%eax # eax = eax * 4 = orig_edx * 100
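The same three steps can be written out in C to check the arithmetic. This is only a sketch of what the instruction sequence above computes; the function names are invented for illustration, and a compiler may well choose a different sequence (or a plain imul) for the same multiplication.

```c
#include <stdint.h>

/* Equivalent of lea (%reg,%reg,4),%dst: x + x*4 = x*5.
   (Hypothetical helper, for illustration only.) */
static uint32_t lea_times_5(uint32_t x)
{
    return x + (x << 2);
}

/* Multiply by 100 using two lea-style steps and one shift. */
static uint32_t mul100_strength_reduced(uint32_t x)
{
    uint32_t r = lea_times_5(x);  /* r = x * 5   */
    r = lea_times_5(r);           /* r = x * 25  */
    return r << 2;                /* r = x * 100 */
}
```

Because everything is unsigned 32-bit arithmetic, the result matches x * 100u for every input, including values where the product wraps around.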