Why is it faster to print a number in binary using arithmetic instead of _bittest?

Posted 2019-09-20 00:06

Question:

The purpose of the next two code sections is to print a number in binary.
The first one does this with two instructions (_bittest), while the second does it with pure arithmetic instructions, which is three instructions.

The first code section:

#include <intrin.h>
#include <stdio.h>  
#include <Windows.h>

long num = 78002;
int main()
{
    unsigned char bits[32];
    long nBit;
    LARGE_INTEGER a, b, f;
    QueryPerformanceCounter(&a);
    for (size_t i = 0; i < 100000000; i++)
    {
        for (nBit = 0; nBit < 31; nBit++)
        {
            bits[nBit] = _bittest(&num, nBit);
        }
    }
    QueryPerformanceCounter(&b);
    QueryPerformanceFrequency(&f);
    printf_s("time is: %f\n", ((float)b.QuadPart - (float)a.QuadPart) / (float)f.QuadPart);

    printf_s("Binary representation:\n");
    while (nBit--)
    {
        if (bits[nBit])
            printf_s("1");
        else
            printf_s("0");
    }
    return 0;
}

The inner loop compiles to the instructions bt and setb.
The second code section:

#include <intrin.h>
#include <stdio.h>  
#include <Windows.h>
long num = 78002;
int main()
{
    unsigned char bits[32];
    long nBit;

    LARGE_INTEGER a, b, f;
    QueryPerformanceCounter(&a);
    for (size_t i = 0; i < 100000000; i++)
    {
        long curBit = 1;
        for (nBit = 0; nBit < 31; nBit++)
        {
            bits[nBit] = (num&curBit) >> nBit;
            curBit <<= 1;
        }
    }
    QueryPerformanceCounter(&b);
    QueryPerformanceFrequency(&f);
    printf_s("time is: %f\n", ((float)b.QuadPart - (float)a.QuadPart) / (float)f.QuadPart);

    printf_s("Binary representation:\n");
    while (nBit--)
    {
        if (bits[nBit])
            printf_s("1");
        else
            printf_s("0");
    }
    return 0;
}

The inner loop compiles to and, add (as shift left), and sar.
The second code section runs three times faster than the first one.


Why do three CPU instructions run faster than two?

Answer 1:

Not an answer (Bo gave that), but the second inner-loop version can be simplified a bit:

    long numCopy = num;
    for (nBit = 0; nBit < 31; nBit++) {
        bits[nBit] = numCopy & 1;
        numCopy >>= 1;
    }

This has a subtle difference (one instruction less) with gcc 7.2 targeting 32-bit.

(I'm assuming a 32-bit target, as you convert a long into a 32-element bit array, which makes sense only on a 32-bit target ... and I assume x86, as it includes <windows.h>, so it's clearly for an obsolete OS target, although I think Windows now has a 64-bit version too? (I don't care.))

Answer:

Why do three CPU instructions run faster than two?

Because the count of instructions only correlates with performance (usually fewer is better), but a modern x86 CPU is a much more complex machine. It translates the actual x86 instructions into micro-ops before execution, transforms them further with things like out-of-order execution and register renaming (to break false dependency chains), and then executes the resulting micro-ops, with different units of the CPU capable of executing only certain micro-ops. In the ideal case you may get 2-3 micro-ops executed in parallel by 2-3 units in a single cycle, and in the worst case you may be executing a complete microcode loop implementing some complex x86 instruction that takes several cycles to finish, blocking most of the CPU units.

Another factor is the availability of data from memory, and memory writes: a single cache miss, where the data must be fetched from a higher-level cache or even from memory itself, creates a stall of tens to hundreds of cycles. Having compact data structures favouring predictable access patterns, and not exhausting all the cache lines, is paramount for exploiting maximum CPU performance.

If you are at the stage of asking "why are 3 instructions faster than 2 instructions", you can pretty much start with any x86 optimization article/book and keep reading for a few months or year(s); it's quite a complex topic.

You may want to check this answer https://gamedev.stackexchange.com/q/27196 for further reading...



Answer 2:

I'm assuming you're using x86-64 MSVC CL19 (or something that makes similar code).

_bittest is slower because MSVC does a horrible job and keeps the value in memory and bt [mem], reg is much slower than bt reg,reg. This is a compiler missed-optimization. It happens even if you make num a local variable instead of a global, even when the initializer is still constant!

I included some perf analysis for Intel Sandybridge-family CPUs because they're common; you didn't say which CPU you have, and yes, it matters: bt [mem], reg has one per 3 cycle throughput on Ryzen, one per 5 cycle throughput on Haswell. And other perf characteristics differ...

(For just looking at the asm, it's usually a good idea to make a function with args to get code the compiler can't do constant-propagation on. It can't in this case because it doesn't know if anything modifies num before main runs, because it's not static.)
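For instance, a minimal sketch of such a wrapper (the function name is mine, not from the answer):

    #include <intrin.h>

    // Taking num as an argument stops the compiler from constant-propagating
    // it, so the asm for the loop itself is easier to inspect.
    void test_bits(long num, unsigned char bits[32])
    {
        for (long nBit = 0; nBit < 31; nBit++)
            bits[nBit] = _bittest(&num, nBit);
    }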


Your instruction-counting didn't include the whole loop so your counts are wrong, but more importantly you didn't consider the different costs of different instructions. (See Agner Fog's instruction tables and optimization manual.)

This is your whole inner loop with the _bittest intrinsic, with uop counts for Haswell / Skylake:

    for (nBit = 0; nBit < 31; nBit++) {
        bits[nBit] = _bittest(&num, nBit);
        //bits[nBit] = (bool)(num & (1UL << nBit));   // much more efficient
    }

Asm output from MSVC CL19 -Ox on the Godbolt compiler explorer

$LL7@main:
    bt       DWORD PTR num, ebx          ; 10 uops (microcoded), one per 5 cycle throughput
    lea      rcx, QWORD PTR [rcx+1]      ; 1 uop
    setb     al                          ; 1 uop
    inc      ebx                         ; 1 uop
    mov      BYTE PTR [rcx-1], al        ; 1 uop (micro-fused store-address and store-data)
    cmp      ebx, 31
    jb       SHORT $LL7@main             ; 1 uop (macro-fused with cmp)

That's 15 fused-domain uops, so it can issue (at 4 per clock) in 3.75 cycles. But that's not the bottleneck: Agner Fog's testing found that bt [mem], reg has a throughput of one per 5 clock cycles.

IDK why it's 3x slower than your other loop. Maybe the other ALU instructions compete for the same port as the bt, or the data dependency it's part of causes a problem, or just being a micro-coded instruction is a problem, or maybe the outer loop is less efficient?

Anyway, using bt [mem], reg instead of bt reg, reg is a major missed optimization. This loop would have been faster than your other loop with a 1 uop, 1c latency, 2-per-clock throughput bt r9d, ebx.


The inner loop compiles to and, add (as shift left), and sar.

Huh? Those are the instructions MSVC associates with the curBit <<= 1; source line, even though that line is fully implemented by the add self,self, and the variable-count arithmetic right shift is part of a different line.

But the whole loop is this clunky mess:

    long curBit = 1;
    for (nBit = 0; nBit < 31; nBit++)  {
        bits[nBit] = (num&curBit) >> nBit;
        curBit <<= 1;
    }

$LL18@main:               # MSVC CL19  -Ox
    mov      ecx, ebx                  ; 1 uop
    lea      r8, QWORD PTR [r8+1]      ; 1 uop   pointer-increment for bits
    mov      eax, r9d                  ; 1 uop.  r9d holds num
    inc      ebx                       ; 1 uop
    and      eax, edx                  ; 1 uop
       # MSVC says all the rest of these instructions are from curBit <<= 1; but they're obviously not.
    add      edx, edx                  ; 1 uop
    sar      eax, cl                   ; 3 uops (variable-count shifts suck)
    mov      BYTE PTR [r8-1], al       ; 1 uop (micro-fused)
    cmp      ebx, 31
    jb       SHORT $LL18@main         ; 1 uop (macro-fused with cmp)

So this is 11 fused-domain uops, and takes 2.75 clock cycles per iteration to issue from the front-end.

I don't see any loop-carried dep chains longer than that front-end bottleneck, so it probably runs about that fast.

Copying ebx to ecx every iteration instead of just using ecx as the loop counter (nBit) is an obvious missed optimization. The shift-count is needed in cl for a variable-count shift (unless you enable BMI2 instructions, if MSVC can even do that.)

There are major missed optimizations here (in the "fast" version), so you should probably write your source differently to hand-hold your compiler into making less bad code. It implements this fairly literally instead of transforming it into something the CPU can do efficiently, or using bt reg,reg / setc.
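For instance, a sketch of friendlier source (essentially Answer 1's simplification, and the commented-out line earlier): shifting the value right instead of shifting a mask left leaves only a constant-count shift and an AND per bit.

    long numCopy = num;
    for (nBit = 0; nBit < 31; nBit++) {
        bits[nBit] = numCopy & 1;   // low bit is already 0 or 1, no shift by nBit needed
        numCopy >>= 1;              // constant-count shift: cheap (1 uop)
    }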


How to do this fast in asm or with intrinsics

Use SSE2 / AVX. Get the right byte (containing the corresponding bit) into each byte element of a vector, and PANDN (to invert your vector) with a mask that has the right bit set for that element. PCMPEQB against zero. That gives you 0 / -1. To get ASCII digits, use _mm_sub_epi8(set1('0'), mask) to subtract 0 or -1 (i.e. add 0 or 1) to ASCII '0', conditionally turning it into '1'. (A sketch follows the links below.)

The first step of this (getting a vector of 0 / -1 from a bitmask) is covered in How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?.

  • Fastest way to unpack 32 bits to a 32 byte SIMD vector (has a 128b version). Without SSSE3 (pshufb), I think punpcklbw / punpcklwd (and maybe pshufd) is what you need to repeat each byte of num 8 times and make two 16-byte vectors.
  • is there an inverse instruction to the movemask instruction in intel avx2?.
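Putting the recipe together, here is a sketch for the low 16 bits using SSE2 intrinsics (the function name and exact layout are mine; the answer above only describes the steps):

    #include <emmintrin.h>   // SSE2

    // out[0] gets the ASCII digit for bit 0 (LSB-first), so print it in
    // reverse for the usual MSB-first representation.
    void low16_to_ascii(unsigned num, char out[16])
    {
        char lo = (char)num, hi = (char)(num >> 8);
        // Each byte element holds the byte of num containing its bit.
        __m128i v = _mm_set_epi8(hi, hi, hi, hi, hi, hi, hi, hi,
                                 lo, lo, lo, lo, lo, lo, lo, lo);
        // Element i selects bit (i % 8) of its byte.
        const __m128i mask = _mm_set_epi8((char)0x80, 0x40, 0x20, 0x10, 8, 4, 2, 1,
                                          (char)0x80, 0x40, 0x20, 0x10, 8, 4, 2, 1);
        __m128i notset = _mm_andnot_si128(v, mask);                   // PANDN: nonzero where the bit is 0
        __m128i isset  = _mm_cmpeq_epi8(notset, _mm_setzero_si128()); // PCMPEQB: 0xFF where the bit is 1
        __m128i ascii  = _mm_sub_epi8(_mm_set1_epi8('0'), isset);     // '0' - (-1) = '1'
        _mm_storeu_si128((__m128i *)out, ascii);
    }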

In scalar code, this is one way that runs at 1 bit->byte per clock. There are probably ways to do better without using SSE2 (storing multiple bytes at once to get around the 1 store per clock bottleneck that exists on all current CPUs), but why bother? Just use SSE2.

  mov    eax, [num]
  lea    rdi, [rsp + xxx]  ; bits[]
.loop:
    shr   eax, 1     ; constant-count shift is efficient (1 uop).  CF = last bit shifted out
    setc  [rdi]      ; 2 uops, but just as efficient as setc reg / mov [mem], reg

    shr   eax, 1
    setc  [rdi+1]

    add   rdi, 2
    cmp   rdi, end_pointer    ; compare against an end pointer instead of a separate counter.
    jb   .loop

Unrolled by two to avoid bottlenecking on the front-end, so this can run at 1 bit per clock.
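In C, using the question's num and bits[], the same bit-to-byte loop looks roughly like this (a sketch; a compiler won't necessarily emit the shr / setc sequence above from it):

    unsigned n = num;
    for (size_t i = 0; i < 32; i += 2) {   // unrolled by two, like the asm
        bits[i]     = n & 1;  n >>= 1;     // shr + setc equivalent
        bits[i + 1] = n & 1;  n >>= 1;
    }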



Answer 3:

The difference is that the code _bittest(&num, nBit); uses a pointer to num, which makes the compiler store it in memory. And the memory access makes the code a lot slower.

        bits[nBit] = _bittest(&num, nBit);
00007FF6D25110A0  bt          dword ptr [num (07FF6D2513034h)],ebx     ; <-----
00007FF6D25110A7  lea         rcx,[rcx+1]  
00007FF6D25110AB  setb        al  
00007FF6D25110AE  inc         ebx  
00007FF6D25110B0  mov         byte ptr [rcx-1],al  

The other version stores all the variables in registers, and uses very fast register shifts and adds. No memory accesses.