Say %edi contains x and I want to end up with 37*x using only 2 consecutive leal instructions, how would I go about this?
For example to get 45x you would do
leal (%edi, %edi, 8), %edi
leal (%edi, %edi, 4), %eax (to be returned)
I cannot for the life of me figure out what numbers to put in place of the 8 and 4 so that the result (%eax) will be 37x
At
-O3
, gcc will emit (Godbolt compiler explorer):That's using
37 = 9*4 + 1
, not destroying the originala
value with the firstlea
so it can use both in the 2nd.You're in good company in not spotting this one, though: recent clang (3.8 and newer) will normally use 2
lea
instructions instead of animul
(e.g. for*15
), but it misses this one and uses:It does do
*21
with the same pattern as gcc uses, as5*4 + 1
. (clang3.6 and earlier always usedimul
unless there was a single-instruction alternativeshl
orlea
)ICC and MSVC also use imul, but they don't seem to like using 2
lea
instructions, so theimul
is "on purpose" there.See the godbolt link for a variety of multipliers with gcc7.2 vs. clang5.0. It's interesting to try
gcc -m32 -mtune=pentium
or evenpentium3
to see how many more instructions gcc was wiling to use back then. Although P2/P3 has 4-cycle latency forimul r, r, i
, so that's kinda crazy. Pentium has 9 cycleimul
and no OOO to hide the latency, so it makes sense to try hard to avoid it.mtune=silvermont
should probably only be willing to replace 32-bitimul
with a single instruction, because it has 3-cycle latency / 1c throughput multiply, but decode is often the bottleneck (according to Agner Fog, http://agner.org/optimize/). You could even considerimul $64, %edi, %eax
(or other powers of 2) instead ofmov
/shl
, because imul-immediate is a copy-and-multiply.Ironically,
gcc
misses the* 45
case, and usesimul
, while clang uses 2lea
s. Guess it's time to file some missed-optimization bug reports. If 2 LEAs are better than 1 IMUL, they should be used wherever possible.Older clang (3.7 and older) uses
imul
unless a singlelea
will do the trick. I haven't looked up the changelog to see if they did benchmarks to decide to favour latency over throughput.Related: Using LEA on values that aren't addresses / pointers? canonical answer about why LEA uses memory-operand syntax and machine encoding, even though it's a shift+add instruction (and runs on an ALU, not AGU, in most modern microarchitectures.)