Which is faster: x<<1 or x<<10?

Posted 2019-01-21 12:45

Question:

I don't want to optimize anything, I swear, I just want to ask this question out of curiosity. I know that on most hardware there's a bit-shift assembly instruction (e.g. shl, shr), which is a single instruction. But does it matter (nanosecond-wise, or CPU-cycle-wise) how many bits you shift? In other words, is either of the following faster on any CPU?

x << 1;

and

x << 10;

And please don't hate me for this question. :)

Answer 1:

Potentially depends on the CPU.

However, all modern CPUs (x86, ARM) use a "barrel shifter" -- a hardware module specifically designed to perform arbitrary shifts in constant time.

So the bottom line is... no. No difference.



Answer 2:

Some embedded processors only have a "shift-by-one" instruction. On such processors, the compiler would change x << 3 into ((x << 1) << 1) << 1.

I think the Motorola MC68HCxx was one of the more popular families with this limitation. Fortunately, such architectures are now quite rare; most now include a barrel shifter with a variable shift size.

The Intel 8051, which has many modern derivatives, also cannot shift an arbitrary number of bits.
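As an illustration, here is a minimal C sketch (the helper name is made up) of what a compiler for such a target must do for a variable shift count: a loop of single-bit shifts, whose cost grows linearly with the count. For a constant count like x << 3, it simply unrolls into three shift-by-one instructions.

    /* Hypothetical: variable left shift on a CPU whose only shift
       instruction shifts by one bit. Each iteration corresponds to
       one hardware shift-by-one. */
    unsigned shl_by_ones(unsigned x, unsigned n) {
        while (n--)
            x <<= 1;
        return x;
    }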



Answer 3:

There are several cases here.

  1. Many high-speed MPUs have a barrel shifter, a multiplexer-like electronic circuit that performs any shift in constant time.

  2. If the MPU has only a 1-bit shift, x << 10 would normally be slower, since it is mostly done either as 10 single-bit shifts or as a byte copy plus 2 shifts.

  3. But there is a known common case where x << 10 would be even faster than x << 1. If x is 16-bit, only its lower 6 bits matter (all the others will be shifted out), so the MPU only needs to load the lower byte, making a single access cycle to 8-bit memory, while x << 1 needs two access cycles. If an access cycle is slower than a shift (plus clearing the lower byte), x << 10 will be faster. This may apply to microcontrollers with fast on-board program ROM but slow external data RAM.

  4. In addition to case 3, the compiler may take the number of significant bits in x << 10 into account and narrow further operations, e.g. replacing a 16x16 multiplication with a 16x8 one (since the lower byte is always zero); a sketch of this follows below.

Note that some microcontrollers have no shift-left instruction at all; they use add x,x instead.
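Here is a rough C sketch of case 4 (an assumed transformation, not the output of any particular compiler; the function name is invented for this example). After x << 10, the low byte of the 16-bit result is known to be zero, so a subsequent multiply only has to touch the high byte:

    #include <stdint.h>

    /* Sketch: y's low byte is always zero after x << 10, so y * k can
       be computed with a narrower multiply on y's high byte alone. */
    uint16_t mul_after_shift(uint16_t x, uint8_t k) {
        uint16_t y = (uint16_t)(x << 10);        /* low 8 bits are zero  */
        return (uint16_t)(((y >> 8) * k) << 8);  /* 8-bit multiply, same
                                                    16-bit result        */
    }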



Answer 4:

On ARM, a shift can be done as a side effect of another instruction, so potentially there's no latency at all for either of them.
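For example (a minimal sketch; the instruction in the comment is what a typical ARM32 compiler can be expected to emit), the shift folds into the add at no extra cost:

    /* Sketch: ARM's barrel shifter can be applied to an operand of a
       data-processing instruction, so the shift costs no extra
       instruction here. */
    int scaled_add(int a, int x) {
        return a + (x << 1);    /* e.g. ADD r0, r0, r1, LSL #1 */
    }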



Answer 5:

Here's my favorite CPU, in which x<<2 takes twice as long as x<<1 :)



Answer 6:

That depends both on the CPU and compiler. Even if the underlying CPU has arbitrary bit shift with a barrel shifter, this will only happen if the compiler takes advantage of that resource.

Keep in mind that shifting by a count greater than or equal to the width in bits of the (promoted) type is undefined behavior in C and C++, and right-shifting negative signed values is implementation-defined. Rather than worrying too much about speed, be concerned that you are getting the same answer on different implementations.

Quoting from ANSI C section 3.3.7:

3.3.7 Bitwise shift operators

Syntax

      shift-expression:
              additive-expression
              shift-expression <<  additive-expression
              shift-expression >>  additive-expression

Constraints

Each of the operands shall have integral type.

Semantics

The integral promotions are performed on each of the operands. The type of the result is that of the promoted left operand. If the value of the right operand is negative or is greater than or equal to the width in bits of the promoted left operand, the behavior is undefined.

The result of E1 << E2 is E1 left-shifted E2 bit positions; vacated bits are filled with zeros. If E1 has an unsigned type, the value of the result is E1 multiplied by the quantity, 2 raised to the power E2, reduced modulo ULONG_MAX+1 if E1 has type unsigned long, UINT_MAX+1 otherwise. (The constants ULONG_MAX and UINT_MAX are defined in the header <limits.h>.)

The result of E1 >> E2 is E1 right-shifted E2 bit positions. If E1 has an unsigned type or if E1 has a signed type and a nonnegative value, the value of the result is the integral part of the quotient of E1 divided by the quantity, 2 raised to the power E2 . If E1 has a signed type and a negative value, the resulting value is implementation-defined.

So:

x = y << z;

"<<": y × 2z (undefined if an overflow occurs);

x = y >> z;

">>": implementation-defined for signed (most often the result of the arithmetic shift: y / 2z).



Answer 7:

It is conceivable that, on an 8-bit processor, x<<1 could actually be much slower than x<<10 for a 16-bit value.

For example, a reasonable translation of x<<1 might be (with byte1 the high byte of x and byte2 the low byte):

byte1 = (byte1 << 1) | (byte2 >> 7)
byte2 = (byte2 << 1)

whereas x<<10 would be simpler:

byte1 = (byte2 << 2)
byte2 = 0

Notice how x<<1 has to shift more often, and farther, than x<<10. Furthermore, the result of x<<10 doesn't depend on the content of byte1 at all, which could speed up the operation further.
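Here is the same decomposition as compilable C, modeling the two halves of a 16-bit x on an 8-bit machine (a sketch; the parameter and function names are invented for this example):

    #include <stdint.h>

    /* x << 1: both halves must be read and written, and a bit has to
       be carried from the low byte into the high byte. */
    uint16_t shl1_8bit(uint8_t hi, uint8_t lo) {
        uint8_t new_hi = (uint8_t)((hi << 1) | (lo >> 7));
        uint8_t new_lo = (uint8_t)(lo << 1);
        return (uint16_t)((new_hi << 8) | new_lo);
    }

    /* x << 10: the high byte is discarded entirely and the result's
       low byte is a constant zero. */
    uint16_t shl10_8bit(uint8_t hi, uint8_t lo) {
        (void)hi;                               /* never even loaded */
        return (uint16_t)((uint8_t)(lo << 2) << 8);
    }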



Answer 8:

On some generations of Intel CPUs (P2 or P3? Not AMD though, if I remember right), the bitshift operations were ridiculously slow. A bitshift by 1 bit should always be fast, though, since it can just use addition. Another question to consider is whether bitshifts by a constant number of bits are faster than variable-length shifts. Even if the opcodes are the same speed, on x86 the non-constant right-hand operand of a bitshift must occupy the CL register, which imposes additional constraints on register allocation and may slow the program down that way too.
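To see the CL constraint, compare a constant-count and a variable-count shift (a sketch; the assembly in the comments is typical gcc/clang -O2 x86-64 output without BMI2, assumed rather than quoted):

    unsigned shl_const(unsigned x) {
        return x << 10;          /* shl with an immediate count: no CL needed */
    }

    unsigned shl_var(unsigned x, unsigned n) {
        return x << n;
        /* mov  eax, edi
           mov  ecx, esi         ; the variable count must live in CL
           shl  eax, cl
           ret                   ; (with -mbmi2, shlx avoids tying up CL) */
    }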



Answer 9:

As always, it depends on the surrounding code context: e.g. are you using x<<1 as an array index, or adding it to something else? Either way, small shift counts (1 or 2) can often be optimized even further than if the compiler ends up having to just shift. Not to mention the whole throughput vs. latency vs. front-end-bottleneck tradeoff: performance of a tiny fragment is not one-dimensional.

A hardware shift instruction is not a compiler's only option for compiling x<<1, but the other answers mostly assume that it is.


x << 1 is exactly equivalent to x+x for unsigned, and for 2's complement signed integers. Compilers always know what hardware they're targeting while they're compiling, so they can take advantage of tricks like this.

On Intel Haswell, add has 4 per clock throughput, but shl with an immediate count has only 2 per clock throughput. (See http://agner.org/optimize/ for instruction tables, and other links in the x86 tag wiki). SIMD vector shifts are 1 per clock (2 in Skylake), but SIMD vector integer adds are 2 per clock (3 in Skylake). Latency is the same, though: 1 cycle.

There's also a special shift-by-one encoding of shl where the count is implicit in the opcode (8086 didn't have immediate-count shifts, only by-one and by the cl register). This is mostly relevant for right-shifts, because for left shifts you can just add, unless you're shifting a memory operand; and if the value is needed later, it's better to load it into a register first anyway. Either way, shl eax,1 or add eax,eax is one byte shorter than shl eax,10, and code size can directly (decode / front-end bottlenecks) or indirectly (L1I code-cache misses) affect performance.

More generally, small shift counts can sometimes be optimized into a scaled index in an addressing mode on x86. Most other architectures in common use these days are RISC and don't have scaled-index addressing modes, but x86 is a common enough architecture for this to be worth mentioning (e.g. if you're indexing an array of 4-byte elements, there's room to increase the scale factor by 1 for int arr[]; arr[x<<1]).
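Concretely (a sketch; the assembly comment shows typical gcc/clang -O2 output, assumed rather than quoted), the <<1 and the 4-byte element size fold into a single *8 scale factor:

    int load2x(const int *arr, int x) {
        return arr[x << 1];
        /* movsx rsi, esi             ; sign-extend the index
           mov   eax, [rdi + rsi*8]   ; <<1 folded into the scale factor
           ret */
    }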


Needing to copy+shift is common in situations where the original value of x is still needed. But most x86 integer instructions operate in-place. (The destination is one of the sources for instructions like add or shl.) The x86-64 System V calling convention passes args in registers, with the first arg in edi and return value in eax, so a function that returns x<<10 also makes the compiler emit copy+shift code.

The LEA instruction lets you shift-and-add (with a shift count of 0 to 3, because it uses addressing-mode machine-encoding). It puts the result in a separate register.

gcc and clang both optimize these functions the same way, as you can see on the Godbolt compiler explorer:

int shl1(int x) { return x<<1; }
    lea     eax, [rdi+rdi]   # 1 cycle latency, 1 uop
    ret

int shl2(int x) { return x<<2; }
    lea     eax, [4*rdi]    # longer encoding: needs a disp32 of 0 because there's no base register, only scaled-index.
    ret

int times5(int x) { return x * 5; }
    lea     eax, [rdi + 4*rdi]
    ret

int shl10(int x) { return x<<10; }
    mov     eax, edi         # 1 uop, 0 or 1 cycle latency
    shl     eax, 10          # 1 uop, 1 cycle latency
    ret

LEA with 2 components has 1 cycle latency and 2-per-clock throughput on recent Intel and AMD CPUs (Sandybridge-family and Bulldozer/Ryzen). On Intel, a 3-component LEA like lea eax, [rdi + rsi + 123] has only 1-per-clock throughput and 3c latency. (Related: Why is this C++ code faster than my hand-written assembly for testing the Collatz conjecture? goes into this in detail.)

Anyway, copy+shift by 10 needs a separate mov instruction. It might be zero latency on many recent CPUs, but it still takes front-end bandwidth and code size. (Can x86's MOV really be "free"? Why can't I reproduce this at all?)

Also related: How to multiply a register by 37 using only 2 consecutive leal instructions in x86?.


The compiler is also free to transform the surrounding code so there isn't an actual shift, or it's combined with other operations.

For example, if(x<<1) { } could use an and to check all bits except the high bit. On x86 you'd use a test instruction, like test eax, 0x7fffffff / jz .false, instead of shl eax,1 / jz. This optimization works for any shift count, and it also works on machines where large-count shifts are slow (like Pentium 4) or non-existent (some microcontrollers).
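In C, that transformation looks like this (a sketch of the equivalence, not any particular compiler's output):

    /* Testing x << 1 for non-zero is the same as testing every bit
       except the top one, so the shift can disappear entirely. */
    int nonzero_after_shl1(unsigned x) {
        return (x << 1) != 0;    /* == ((x & 0x7fffffffu) != 0) */
    }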

Many ISAs have bit-manipulation instructions beyond just shifting: e.g. PowerPC has a lot of bit-field extract / insert instructions, and ARM can shift a source operand as part of its data-processing instructions (so shift/rotate instructions are just a special form of move, using a shifted source).

Remember, C is not assembly language. Always look at optimized compiler output when you're tuning your source code to compile efficiently.