The purpose of the next two code sections is to print a number in binary. The first one does this with two instructions (_bittest), while the second one does it with pure arithmetic instructions, three of them.
The first code section:
#include <intrin.h>
#include <stdio.h>
#include <Windows.h>

long num = 78002;

int main()
{
    unsigned char bits[32];
    long nBit;
    LARGE_INTEGER a, b, f;
    QueryPerformanceCounter(&a);
    for (size_t i = 0; i < 100000000; i++)
    {
        for (nBit = 0; nBit < 31; nBit++)
        {
            bits[nBit] = _bittest(&num, nBit);
        }
    }
    QueryPerformanceCounter(&b);
    QueryPerformanceFrequency(&f);
    printf_s("time is: %f\n", ((float)b.QuadPart - (float)a.QuadPart) / (float)f.QuadPart);
    printf_s("Binary representation:\n");
    while (nBit--)
    {
        if (bits[nBit])
            printf_s("1");
        else
            printf_s("0");
    }
    return 0;
}
The inner loop compiles to the instructions bt and setb.
The second code section:
#include <intrin.h>
#include <stdio.h>
#include <Windows.h>

long num = 78002;

int main()
{
    unsigned char bits[32];
    long nBit;
    LARGE_INTEGER a, b, f;
    QueryPerformanceCounter(&a);
    for (size_t i = 0; i < 100000000; i++)
    {
        long curBit = 1;
        for (nBit = 0; nBit < 31; nBit++)
        {
            bits[nBit] = (num & curBit) >> nBit;
            curBit <<= 1;
        }
    }
    QueryPerformanceCounter(&b);
    QueryPerformanceFrequency(&f);
    printf_s("time is: %f\n", ((float)b.QuadPart - (float)a.QuadPart) / (float)f.QuadPart);
    printf_s("Binary representation:\n");
    while (nBit--)
    {
        if (bits[nBit])
            printf_s("1");
        else
            printf_s("0");
    }
    return 0;
}
The inner loop compiles to and, add (as shift left) and sar.
The second code section runs three times faster than the first one.
Why do three CPU instructions run faster than two?
I'm assuming you're using x86-64 MSVC CL19 (or something that makes similar code).
_bittest is slower because MSVC does a horrible job and keeps the value in memory, and bt [mem], reg is much slower than bt reg, reg. This is a compiler missed-optimization. It happens even if you make num a local variable instead of a global, even when the initializer is still constant!

I included some perf analysis for Intel Sandybridge-family CPUs because they're common; you didn't say which CPU you have, and yes, it matters: bt [mem], reg has one per 3 cycle throughput on Ryzen, one per 5 cycle throughput on Haswell. And other perf characteristics differ...

(For just looking at the asm, it's usually a good idea to make a function with args to get code the compiler can't do constant-propagation on. It can't in this case because it doesn't know if anything modifies num before main runs, because it's not static.)
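For example, a minimal wrapper of that sort might look like this (the name and signature are mine, just for illustration; compile it on its own and read the asm):

    // Hypothetical wrapper: num arrives as an argument, so the compiler
    // can't constant-propagate it when compiling this function in isolation.
    void test_bits(long num, unsigned char bits[32])
    {
        for (long nBit = 0; nBit < 31; nBit++)
            bits[nBit] = _bittest(&num, nBit);
    }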
Your instruction-counting didn't include the whole loop, so your counts are wrong. But more importantly, you didn't consider the different costs of different instructions. (See Agner Fog's instruction tables and optimization manual.)
This is your whole inner loop with the _bittest intrinsic, with uop counts for Haswell / Skylake (see the asm output from MSVC CL19 -Ox on the Godbolt compiler explorer). That's 15 fused-domain uops, so it can issue (at 4 per clock) in 3.75 cycles. But that's not the bottleneck: Agner Fog's testing found that bt [mem], reg has a throughput of one per 5 clock cycles.

IDK why it's 3x slower than your other loop. Maybe the other ALU instructions compete for the same port as the bt, or the data dependency it's part of causes a problem, or just being a micro-coded instruction is a problem, or maybe the outer loop is less efficient?

Anyway, using bt [mem], reg instead of bt reg, reg is a major missed optimization. This loop would have been faster than your other loop with a 1 uop, 1c latency, 2-per-clock-throughput bt r9d, ebx.
Huh? Those are the instructions MSVC associates with the curBit <<= 1; source line (even though that line is fully implemented by the add self,self, and the variable-count arithmetic right shift is part of a different line).

But the whole loop is a clunky mess (again, see the asm on the Godbolt compiler explorer): 11 fused-domain uops, which take 2.75 clock cycles per iteration to issue from the front-end.
I don't see any loop-carried dep chains longer than that front-end bottleneck, so it probably runs about that fast.
Copying ebx to ecx every iteration instead of just using ecx as the loop counter (nBit) is an obvious missed optimization. The shift count is needed in cl for a variable-count shift (unless you enable BMI2 instructions, if MSVC can even do that).

There are major missed optimizations here (in the "fast" version), so you should probably write your source differently to hand-hold your compiler into making less bad code. It implements the source fairly literally instead of transforming it into something the CPU can do efficiently, or using bt reg, reg / setc. One possible rewrite is sketched below.
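For instance, a hedged sketch (my code, not anything MSVC emits) that avoids the variable-count shift entirely by shifting a copy of num right by one each iteration:

    // Sketch: constant-count shifts only; no curBit, no shift count in cl.
    long tmp = num;
    for (nBit = 0; nBit < 31; nBit++)
    {
        bits[nBit] = tmp & 1;  // grab the current lowest bit
        tmp >>= 1;             // shift it out (shift count is a constant 1)
    }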
How to do this fast in asm or with intrinsics
Use SSE2 / AVX. Get the right byte (containing the corresponding bit) into each byte element of a vector, and PANDN (to invert your vector) with a mask that has the right bit for that element. PCMPEQB against zero. That gives you 0 / -1. To get ASCII digits, use _mm_sub_epi8(set1('0'), mask) to subtract 0 or -1 (add 0 or 1) to ASCII '0', conditionally turning it into '1'.

The first steps of this (getting a vector of 0 / -1 from a bitmask) are covered by How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?.
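As a concrete illustration, here is a minimal sketch of that idea for the low 16 bits of a value (my own code with a made-up name; it builds the repeated-byte and per-element bit-mask vectors with _mm_set_epi8 rather than the shuffles a real loop would use):

    #include <emmintrin.h>

    // Sketch: turn the low 16 bits of n into 16 ASCII '0'/'1' bytes, MSB first.
    void bits16_to_ascii(unsigned n, char out[16])
    {
        char hi = (char)(n >> 8), lo = (char)n;
        // Each byte element holds the byte of n that contains its bit.
        __m128i bytes = _mm_set_epi8(lo, lo, lo, lo, lo, lo, lo, lo,
                                     hi, hi, hi, hi, hi, hi, hi, hi);
        // Each byte element selects one bit within that byte.
        __m128i mask  = _mm_set_epi8(1, 2, 4, 8, 16, 32, 64, (char)0x80,
                                     1, 2, 4, 8, 16, 32, 64, (char)0x80);
        __m128i inv   = _mm_andnot_si128(bytes, mask);             // PANDN: nonzero where the bit is clear
        __m128i isset = _mm_cmpeq_epi8(inv, _mm_setzero_si128());  // PCMPEQB: -1 where the bit is set
        __m128i ascii = _mm_sub_epi8(_mm_set1_epi8('0'), isset);   // '0' - (-1) = '1'
        _mm_storeu_si128((__m128i *)out, ascii);
    }

A full 32-bit version would do the same across two vectors, replicating the four source bytes with shuffles as described next.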
Without SSSE3 (pshufb), I think punpcklbw / punpcklwd (and maybe pshufd) is what you need to repeat each byte of num 8 times and make two 16-byte vectors.

In scalar code, there's a way that runs at 1 bit->byte per clock: shift num right by one each iteration and store out each shifted-out bit, unrolled by two to avoid bottlenecking on the front-end. There are probably ways to do better without using SSE2 (storing multiple bytes at once to get around the 1-store-per-clock bottleneck that exists on all current CPUs), but why bother? Just use SSE2.
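In C, the shape of that scalar loop would be roughly this (my sketch; the bits_scalar name is made up, and whether the compiler emits shr plus a setcc store is up to it):

    // Sketch: one bit -> one byte per iteration, unrolled by two so the
    // single store per clock, not the front-end, sets the pace.
    void bits_scalar(unsigned n, unsigned char bits[32])
    {
        for (int i = 0; i < 32; i += 2)
        {
            bits[i]     = n & 1;  n >>= 1;  // lowest bit out, shift by one
            bits[i + 1] = n & 1;  n >>= 1;
        }
    }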
Not an answer (Bo did that), but the second inner-loop version can be simplified a bit:
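A sketch of that simplification (the exact original snippet may have differed; this keeps the same bits[] layout and drops curBit in favour of a single shift-and-mask):

    for (nBit = 0; nBit < 31; nBit++)
    {
        bits[nBit] = (num >> nBit) & 1;
    }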
It has a subtle difference (1 instruction less) with gcc 7.2 targeting 32b.
(I'm assuming a 32b target, as you convert long into a 32-bit array, which makes sense only on a 32b target... and I assume x86, as it includes <windows.h>, so it's clearly for an obsolete OS target, although I think Windows now has even a 64b version? (I don't care.))

Answer:
Because the count of instructions only correlates with performance (usually fewer is better). A modern x86 CPU is a much more complex machine: it translates the actual x86 instructions into micro-code before execution, transforms that further by things like out-of-order execution and register renaming (to break false dependency chains), and then executes the resulting micro-code, with different units of the CPU capable of executing only some micro-ops. In the ideal case you may get 2-3 micro-ops executed in parallel by the 2-3 units in a single cycle, and in the worst case you may be executing a complete micro-code loop implementing some complex x86 instruction that takes several cycles to finish, blocking most of the CPU units.
Another factor is the availability of data from memory, and memory writes: a single cache miss, when the data must be fetched from a higher-level cache, or even from memory itself, creates a stall of tens to hundreds of cycles. Having compact data structures that favour predictable access patterns and don't exhaust all the cache lines is paramount for exploiting maximum CPU performance.
If you are at the stage of "why are 3 instructions faster than 2 instructions", you can pretty much start with any x86 optimization article/book and keep reading for a few months or year(s); it's quite a complex topic.
You may want to check this answer https://gamedev.stackexchange.com/q/27196 for further reading...
The difference is that the code _bittest(&num, nBit); uses a pointer to num, which makes the compiler store it in memory. And the memory access makes the code a lot slower.

The other version stores all the variables in registers, and uses very fast register shifts and adds. No memory accesses.
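To see the contrast in source terms (my illustration, not code from the question):

    // The intrinsic needs num to be addressable, so MSVC keeps it in memory:
    bits[nBit] = _bittest(&num, nBit);    // compiles to bt [mem], reg here

    // Pure arithmetic lets the compiler keep everything in registers:
    bits[nBit] = (num >> nBit) & 1;       // register shift + and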