VS: unexpected optimization behavior with _BitScan

The following code works fine in debug mode, since _BitScanReverse64 is defined to return 0 if no bit is set. Citing MSDN: (The return value is) "Nonzero if Index was set, or 0 if no set bits were found."

If I compile this code in release mode it still works, but if I enable compiler optimizations such as /O1 or /O2, the index is not zero and the assert() fails.

#include <iostream>
#include <cassert>
#include <intrin.h>

using namespace std;

int main()
{
  unsigned long index = 0;
  _BitScanReverse64(&index, 0x0ull);

  cout << index << endl;

  assert(index == 0);

  return 0;
}

Is this the intended behaviour? I am using Visual Studio Community 2015, Version 14.0.25431.01 Update 3. (I left cout in so that the variable index is not deleted during optimization.) Also, is there an efficient workaround, or should I just not use this compiler intrinsic directly?

AFAICT, the intrinsic leaves garbage in index when the input is zero, which is a weaker guarantee than the behaviour of the asm instruction. This is why it has a separate boolean return value and integer output operand.

unsigned char _BitScanReverse64 (unsigned __int32* index, unsigned __int64 mask)

Intel's intrinsics guide documentation for the same intrinsic seems clearer than the Microsoft docs you linked, and sheds some light on what the MS docs are trying to say. But on careful reading, they do both seem to say the same thing, and describe a thin wrapper around the bsr instruction.

Intel documents the BSR instruction as producing an "undefined value" when the input is 0, but setting ZF in that case. AMD, on the other hand, documents it as leaving the destination unchanged.

On current Intel hardware, the actual behaviour matches AMD's documentation: it leaves the destination register unmodified when the src operand is 0. Perhaps this is why MS describes it as only setting Index when the input is non-zero (and the intrinsic's return value is non-zero).

IDK why Intel still hasn't documented it. Perhaps a really old x86 CPU (like original 386?) implements it differently? Intel and AMD frequently go above and beyond what's documented in the x86 manuals in order to not break existing code (e.g. Windows), which might be how this started. At this point it seems unlikely that they'd ever drop that output dependency and leave it actually garbage or -1 or 32 for input=0, but the lack of documentation leaves that option open.

Of course, since MSVC optimized away your index = 0 initialization, presumably it just uses whatever destination register it wants, not necessarily the register that held the previous value of the C variable. So even if you wanted to, I don't think you could take advantage of the dst-unmodified behaviour even though it's guaranteed on AMD.

So in C++ terms, the intrinsic has no input dependency on index. But in asm, the instruction does have an input dependency on the dst register, like an add dst, src instruction. This can cause unexpected performance issues if compilers aren't careful.

Unfortunately on Intel hardware, the popcnt / lzcnt / tzcnt asm instructions also have a false dependency on their destination, even though the result never depends on it. Compilers work around this now that it's known, though, so you don't have to worry about it when using intrinsics (unless you have a compiler more than a couple years old, since it was only recently discovered).

You need to check the intrinsic's return value to make sure index is valid, unless you know the input was non-zero. e.g.

if(_BitScanReverse64(&idx, input)) {
    // idx is valid.
    // (MS docs say "Index was set")
} else {
    // input was zero, idx holds garbage.
    // (MS docs don't say Index was even set)
    idx = -1;     // might make sense, one lower than the result for bsr(1)
}
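
If you use this in more than one place, the check can be wrapped up once. This is only a sketch: highest_set_bit is a name I made up, and the non-MSVC branches are my own assumption, added so the same code builds with GCC/Clang or anything else. The return type is signed so that -1 really is one below the result for an input of 1.

```cpp
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>
#endif

// Hypothetical helper (the name is made up for this example):
// index of the highest set bit of x, or -1 if x == 0.
inline int highest_set_bit(std::uint64_t x)
{
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_ARM64))
    unsigned long idx;
    // Only trust idx when the intrinsic's return value is non-zero.
    return _BitScanReverse64(&idx, x) ? static_cast<int>(idx) : -1;
#elif defined(__GNUC__)
    // __builtin_clzll is undefined for 0, so test the input first.
    return x ? 63 - __builtin_clzll(x) : -1;
#else
    // Portable fallback: shift right until nothing is left.
    int idx = -1;
    while (x) { x >>= 1; ++idx; }
    return idx;
#endif
}
```

Note that idx is deliberately left uninitialized in the MSVC branch: nothing reads it unless the intrinsic reports that it was set.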

If you want to avoid this extra check and branch, you can use the lzcnt instruction via different intrinsics if you're targeting new enough hardware (e.g. Intel Haswell or AMD Bulldozer IIRC). It "works" even when the input is all-zero (the result is then the operand size, e.g. 64), and actually counts leading zeros instead of returning the index of the highest set bit.
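
A rough sketch of that lzcnt route, under some assumptions: leading_zeros64 and highest_set_bit_lzcnt are hypothetical names, __lzcnt64 is the MSVC intrinsic from <intrin.h>, and the other branches are only there so the example also builds elsewhere. Since lzcnt gives 64 for a zero input, 63 - lzcnt(x) is the bsr index for non-zero x and comes out as -1 for zero, with no branch.

```cpp
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>
#endif

// Hypothetical helper: number of leading zero bits in x, 64 for x == 0.
inline unsigned leading_zeros64(std::uint64_t x)
{
#if defined(_MSC_VER) && defined(_M_X64)
    // __lzcnt64 emits lzcnt unconditionally. On a CPU without LZCNT the same
    // bytes decode as bsr and give different results, so check CPUID first
    // if you can't assume the target hardware.
    return static_cast<unsigned>(__lzcnt64(x));
#elif defined(__GNUC__)
    // __builtin_clzll(0) is undefined, hence the explicit zero case.
    return x ? static_cast<unsigned>(__builtin_clzll(x)) : 64u;
#else
    // Portable fallback: count zero bits from the top down.
    unsigned n = 0;
    for (std::uint64_t bit = 1ull << 63; bit && !(x & bit); bit >>= 1) ++n;
    return n;
#endif
}

// 63 - lzcnt(x) equals the bsr index for non-zero x, and -1 for x == 0.
inline int highest_set_bit_lzcnt(std::uint64_t x)
{
    return 63 - static_cast<int>(leading_zeros64(x));
}
```

If you actually want the leading-zero count rather than the bit index, use leading_zeros64 directly and skip the subtraction.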