(Related: How to quickly count bits into separate bins in a series of ints on Sandy Bridge? is an earlier duplicate of this, with some different answers. Editor's note: the answers here are probably better.

Also, an AVX2 version of a similar problem, with many bins for a whole row of bits much wider than one uint64_t: Improve column population count algorithm)
I am working on a project in C where I need to go through tens of millions of masks (of type ulong (64-bit)) and update an array (called target) of 64 short integers (uint16) based on a simple rule:
// for any given mask, do the following loop
for (i = 0; i < 64; i++) {
    if (mask & (1ull << i)) {
        target[i]++;
    }
}
The problem is that I need to do the above loop on tens of millions of masks and I need to finish in less than a second. I wonder if there is any way to speed it up, like using some sort of special assembly instruction that represents the above loop.
Currently I use gcc 4.8.4 on Ubuntu 14.04 (i7-2670QM, supporting AVX, not AVX2) to compile and run the following code, and it took about 2 seconds. I would love to make it run under 200ms.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/stat.h>

double getTS() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1000000.0;
}

unsigned int target[64];

int main(int argc, char *argv[]) {
    int i, j;
    unsigned long x = 123;
    unsigned long m = 1;
    char *p = malloc(8 * 10000000);
    if (!p) {
        printf("failed to allocate\n");
        exit(0);
    }
    memset(p, 0xff, 80000000);
    printf("p=%p\n", p);
    unsigned long *pLong = (unsigned long*)p;
    double start = getTS();
    for (j = 0; j < 10000000; j++) {
        m = 1;
        for (i = 0; i < 64; i++) {
            if ((pLong[j] & m) == m) {
                target[i]++;
            }
            m = (m << 1);
        }
    }
    printf("took %f secs\n", getTS() - start);
    return 0;
}
Thanks in advance!
One way of speeding this up significantly, even without AVX, is to split the data into blocks of up to 255 elements, and accumulate the bit counts byte-wise in ordinary uint64_t variables. Since the source data has 64 bits, we need an array of 8 byte-wise accumulators. The first accumulator counts bits in positions 0, 8, 16, ..., 56; the second accumulator counts bits in positions 1, 9, 17, ..., 57; and so on. After we are finished processing a block of data, we transfer the counts from the byte-wise accumulators into the target counts. A function to update the target counts for a block of up to 255 numbers can be coded in a straightforward fashion according to the description above, where BITS is the number of bits in the source data.
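A sketch of such a block-update function for 64-bit source data (a reconstruction following the description above, not the original code from this answer; the name update_target is invented, and BITS is the element width mentioned in the text):

#include <stdint.h>

#define BITS 64   /* number of bits in the source data */

/* Update the target counts from a block of up to 255 source words. */
static void update_target(unsigned int *target, const uint64_t *data, int count /* <= 255 */)
{
    uint64_t accum[BITS / 8] = {0};   /* 8 byte-wise accumulators, 8 byte lanes each */
    for (int i = 0; i < count; i++) {
        uint64_t x = data[i];
        for (int k = 0; k < BITS / 8; k++) {
            /* byte lane j of accum[k] counts bit position 8*j + k */
            accum[k] += (x >> k) & 0x0101010101010101ULL;
        }
    }
    for (int k = 0; k < BITS / 8; k++) {
        for (int j = 0; j < BITS / 8; j++) {
            target[8 * j + k] += (accum[k] >> (8 * j)) & 0xFF;
        }
    }
}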
The complete ISO-C99 test program, which should be able to run on at least Windows and Linux platforms, initializes the source data with a PRNG, performs a correctness check against the asker's reference implementation, and benchmarks both the reference code and the accelerated version. On my machine (Intel Xeon E3-1270 v2 @ 3.50 GHz), when compiled with MSVS 2010 at full optimization (/Ox), it reports a speed-up of about a factor of 74x over the asker's original solution (labeled ref in its output). Different speed-ups will be observed with other (and especially newer) compilers.
Related: https://github.com/mklarqvist/positional-popcount has SSE blend, various AVX2, various AVX512 including Harley-Seal which is great for large arrays, and various other algorithms for positional popcount. Possibly only for uint16_t, but most could be adapted for other word widths. I think the algorithm I propose below is what they call adder_forest.

Your best bet is SIMD, using AVX1 on your Sandybridge CPU. Compilers aren't smart enough to auto-vectorize your loop-over-bits for you, even if you write it branchlessly to give them a better chance.
And unfortunately not smart enough to auto-vectorize the fast version that gradually widens and adds.
See is there an inverse instruction to the movemask instruction in intel avx2? for a summary of bitmap -> vector unpack methods for different sizes. Ext3h's suggestion in another answer is good: unpacking bits to something narrower than the final count array gives you more elements per instruction. Bytes are efficient with SIMD, and then you can do up to 255 vertical paddb without overflow, before unpacking to accumulate into the 32-bit counter array.

It only takes 4x 16-byte __m128i vectors to hold all 64 uint8_t elements, so those accumulators can stay in registers, only adding to memory when widening out to 32-bit counters in an outer loop.

The unpack doesn't have to be in-order: you can always shuffle target[] once at the very end, after accumulating all the results.

The inner loop could be unrolled to start with a 64 or 128-bit vector load, and unpack 4 or 8 different ways using pshufb (_mm_shuffle_epi8).
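For illustration, a sketch of that unpack-and-accumulate scheme with SSE intrinsics (my own framing and helper names, not code from this answer; it uses the per-byte bit-selector constant trick from the linked Q&A, and accumulates the 0xFF/0x00 compare masks with psubb, which adds 1 per set bit):

#include <stdint.h>
#include <stddef.h>
#include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 */

/* Unpack 16 bits of the mask into 16 byte lanes: 0xFF where the bit is set, 0x00 otherwise. */
static inline __m128i bits_to_bytes(uint64_t mask, int chunk /* 0..3: which 16-bit slice */)
{
    __m128i v = _mm_set1_epi64x((long long)mask);
    __m128i sel = _mm_set_epi8(2*chunk+1, 2*chunk+1, 2*chunk+1, 2*chunk+1,
                               2*chunk+1, 2*chunk+1, 2*chunk+1, 2*chunk+1,
                               2*chunk,   2*chunk,   2*chunk,   2*chunk,
                               2*chunk,   2*chunk,   2*chunk,   2*chunk);
    v = _mm_shuffle_epi8(v, sel);                         /* byte lane i <- mask byte 2*chunk + i/8 */
    const __m128i bit = _mm_set1_epi64x((long long)0x8040201008040201ULL);
    return _mm_cmpeq_epi8(_mm_and_si128(v, bit), bit);    /* lane i tests bit i%8 of that byte */
}

/* Accumulate up to 255 masks in 4 vectors of byte counters, then widen into the 32-bit array. */
static void count_block(const uint64_t *masks, size_t n /* <= 255 */, unsigned int target[64])
{
    __m128i acc[4] = { _mm_setzero_si128(), _mm_setzero_si128(),
                       _mm_setzero_si128(), _mm_setzero_si128() };
    for (size_t i = 0; i < n; i++)
        for (int c = 0; c < 4; c++)
            acc[c] = _mm_sub_epi8(acc[c], bits_to_bytes(masks[i], c));   /* subtract -1 == add 1 */

    uint8_t tmp[64];
    for (int c = 0; c < 4; c++)
        _mm_storeu_si128((__m128i *)(tmp + 16*c), acc[c]);
    for (int b = 0; b < 64; b++)
        target[b] += tmp[b];
}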
An even better strategy is to widen gradually

Starting with 2-bit accumulators, then mask/shift to widen those to 4-bit. So in the inner-most loop most of the operations are working with "dense" data, not "diluting" it too much right away. Higher information / entropy density means that each instruction does more useful work.
Using SWAR techniques for 32x 2-bit adds inside scalar or SIMD registers is easy / cheap because we need to avoid the possibility of carry out of the top of an element anyway. With proper SIMD, we'd lose those counts; with SWAR, we'd corrupt the next element.

Then you repeat up to 4 vectors of 4-bit elements, then 8 vectors of 8-bit elements, then you should widen all the way to 32-bit and accumulate into the array in memory, because you'll run out of registers anyway, and this outer-outer-loop work is infrequent enough that we don't need to bother with going to 16-bit. (Especially if we manually vectorize.)
Biggest downside: this doesn't auto-vectorize, unlike @njuffa's version. But with gcc -O3 -march=sandybridge for AVX1 (then running the code on Skylake), this scalar 64-bit version is actually still slightly faster than the 128-bit AVX auto-vectorized asm from @njuffa's code.

But that's timing on Skylake, which has 4 scalar ALU ports (and mov-elimination), while Sandybridge lacks mov-elimination and only has 3 ALU ports, so the scalar code will probably hit back-end execution-port bottlenecks. (But SIMD code may be nearly as fast, because there's plenty of AND / ADD mixed with the shifts, and SnB does have SIMD execution units on all 3 of its ports that have any ALUs on them. Haswell just added port 6, for scalar-only work including shifts and branches.)

With good manual vectorization, this should be a factor of almost 2 or 4 faster.

But if you have to choose between this scalar version or @njuffa's with AVX2 auto-vectorization, @njuffa's is faster on Skylake with -march=native.

If building on a 32-bit target is possible/required, this suffers a lot (without vectorization, because of using uint64_t in 32-bit registers), while vectorized code barely suffers at all (because all the work happens in vector regs of the same width).
We don't care about order, so accum4[0] has 4-bit accumulators for every 4th bit, for example. The final fixup needed (but not yet implemented) at the very end is an 8x8 transpose of the uint32_t target[64] array, which can be done efficiently using unpck and vshufps with only AVX1 (Transpose an 8x8 float using AVX/AVX2). And also a cleanup loop for the last up to 251 masks.

We can use any SIMD element width to implement these shifts; we have to mask anyway for widths lower than 16-bit (SSE/AVX doesn't have byte-granularity shifts, only 16-bit minimum).
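A scalar sketch of this gradual widening (my own reconstruction with invented names; not the exact code whose timings are reported below, and unlike that code this version keeps the lane bookkeeping so target[] ends up in order):

#include <stdint.h>
#include <stddef.h>

static void count_gradual(const uint64_t *masks, size_t n, unsigned int target[64])
{
    size_t i = 0;
    while (n - i >= 3*4*21) {                       /* 252-mask blocks: byte counters reach at most 252 */
        uint64_t acc8[8] = {0};                     /* 8-bit lanes */
        for (int k = 0; k < 21; k++) {
            uint64_t acc4[4] = {0};                 /* 4-bit lanes */
            for (int j = 0; j < 4; j++) {
                uint64_t even = 0, odd = 0;         /* 2-bit lanes */
                for (int b = 0; b < 3; b++) {       /* 3 adds max: a 2-bit lane holds 0..3 */
                    uint64_t m = masks[i++];
                    even += m        & 0x5555555555555555ULL;
                    odd  += (m >> 1) & 0x5555555555555555ULL;
                }
                acc4[0] += even        & 0x3333333333333333ULL;   /* widen 2-bit -> 4-bit (max 12) */
                acc4[2] += (even >> 2) & 0x3333333333333333ULL;
                acc4[1] += odd         & 0x3333333333333333ULL;
                acc4[3] += (odd  >> 2) & 0x3333333333333333ULL;
            }
            for (int j = 0; j < 4; j++) {           /* widen 4-bit -> 8-bit (max 252) */
                acc8[j]     += acc4[j]        & 0x0F0F0F0F0F0F0F0FULL;
                acc8[j + 4] += (acc4[j] >> 4) & 0x0F0F0F0F0F0F0F0FULL;
            }
        }
        for (int k = 0; k < 8; k++)                 /* widen bytes into the 32-bit counters */
            for (int b = 0; b < 8; b++)
                target[8*b + k] += (acc8[k] >> (8*b)) & 0xFF;   /* byte b of acc8[k] counts bit 8*b + k */
    }
    for (; i < n; i++)                              /* cleanup loop for the last up-to-251 masks */
        for (int b = 0; b < 64; b++)
            target[b] += (masks[i] >> b) & 1;
}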
Benchmark results on Arch Linux i7-6700k from @njuffa's test harness, with this added (Godbolt).

N = (10000000 / (3*4*21)) * 3*4*21 = 9999864 (i.e. 10000000 rounded down to a multiple of the 252-iteration "unroll" factor, so my simplistic implementation is doing the same amount of work, not counting re-ordering target[], which it doesn't do, so it does print mismatch results. But the printed counts match another position of the reference array.)

I ran the program 4x in a row (to make sure the CPU was warmed up to max turbo) and took one of the runs that looked good (none of the 3 timings abnormally high).
ref: the best bit-loop (next section)
fast: @njuffa's code. (auto-vectorized with 128-bit AVX integer instructions).
gradual: my version (not auto-vectorized by gcc or clang, at least not in the inner loop.) gcc and clang fully unroll the inner 12 iterations.
gcc8.2 -O3 -march=sandybridge -fpie -no-pie
ref: 0.331373 secs, fast: 0.011387 secs, gradual: 0.009966 secs
gcc8.2 -O3 -march=sandybridge -fno-pie -no-pie
ref: 0.397175 secs, fast: 0.011255 secs, gradual: 0.010018 secs
clang7.0 -O3 -march=sandybridge -fpie -no-pie
ref: 0.352381 secs, fast: 0.011926 secs, gradual: 0.009269 secs (very low counts for port 7 uops, clang used indexed addressing for stores)
clang7.0 -O3 -march=sandybridge -fno-pie -no-pie
ref: 0.293014 secs, fast: 0.011777 secs, gradual: 0.009235 secs
-march=skylake (allowing AVX2 for 256-bit integer vectors) helps both, but @njuffa's most because more of it vectorizes (including its inner-most loop):
gcc8.2 -O3 -march=skylake -fpie -no-pie
ref: 0.328725 secs, fast: 0.007621 secs, gradual: 0.010054 secs (gcc shows no gain for "gradual", only "fast")
gcc8.2 -O3 -march=skylake -fno-pie -no-pie
ref: 0.333922 secs, fast: 0.007620 secs, gradual: 0.009866 secs
clang7.0 -O3 -march=skylake -fpie -no-pie
ref: 0.260616 secs, fast: 0.007521 secs, gradual: 0.008535 secs (IDK why gradual is faster than with -march=sandybridge; it's not using BMI1 andn. I guess because it's using 256-bit AVX2 for the k=0..20 outer loop with vpaddq)
clang7.0 -O3 -march=skylake -fno-pie -no-pie
ref: 0.259159 secs, fast: 0.007496 secs, gradual: 0.008671 secs
Without AVX, just SSE4.2 (-march=nehalem): bizarrely, clang's gradual is faster than with AVX / tune=sandybridge. "fast" is only barely slower than with AVX.
gcc8.2 -O3 -march=nehalem -fno-pie -no-pie
ref: 0.337178 secs, fast: 0.011983 secs, gradual: 0.010587 secs
clang7.0 -O3 -march=nehalem -fno-pie -no-pie
ref: 0.293555 secs, fast: 0.012549 secs, gradual: 0.008697 secs
-fprofile-generate / -fprofile-use help some for GCC, especially for the "ref" version where it doesn't unroll at all by default.

I highlighted the best, but often they're within measurement-noise margin of each other. It's unsurprising the -fno-pie -no-pie was sometimes faster: indexing static arrays with [disp32 + reg] is not an indexed addressing mode, just base + disp32, so it doesn't ever unlaminate on Sandybridge-family CPUs.

But with gcc, sometimes -fpie was faster; I didn't check, but I assume gcc just shot itself in the foot somehow when 32-bit absolute addressing was possible. Or just innocent-looking differences in code-gen happened to cause alignment or uop-cache problems; I didn't check in detail.
For SIMD, we can simply do 2 or 4x uint64_t in parallel, only accumulating horizontally in the final step where we widen bytes to 32-bit elements. (Perhaps by shuffling in-lane and then using pmaddubsw with a multiplier of _mm256_set1_epi8(1) to add horizontal byte pairs into 16-bit elements.)
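For illustration, that widening idiom could look like this (a sketch; the function name hsum_pairs is invented, and a second pmaddwd against all-ones takes the 16-bit pair sums on to 32-bit):

#include <immintrin.h>

/* Sum adjacent byte counters (same bit position, after an in-lane shuffle) into wider elements. */
static inline __m256i hsum_pairs(__m256i byte_counts)
{
    __m256i w16 = _mm256_maddubs_epi16(byte_counts, _mm256_set1_epi8(1));  /* u8 pairs -> 16-bit sums */
    return _mm256_madd_epi16(w16, _mm256_set1_epi16(1));                   /* 16-bit pairs -> 32-bit sums */
}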
TODO: manually-vectorized __m128i and __m256i (and __m512i) versions of this. Should be close to 2x, 4x, or even 8x faster than the "gradual" times above. Probably HW prefetch can still keep up with it, except maybe an AVX512 version with data coming from DRAM, especially if there's contention from other threads. We do a significant amount of work per qword we read.

Obsolete code: improvements to the bit-loop
Your portable scalar version can be improved, too, speeding it up from ~1.92 seconds (with a 34% branch mispredict rate overall, with the fast loops commented out!) to ~0.35 sec (clang7.0 -O3 -march=sandybridge) with a properly random input on 3.9GHz Skylake. Or 1.83 sec for the branchy version with != 0 instead of == m, because compilers fail to prove that m always has exactly 1 bit set and/or optimize accordingly.

(vs. 0.01 sec for @njuffa's or my fast version above, so this is pretty useless in an absolute sense, but it's worth mentioning as a general optimization example of when to use branchless code.)
If you expect a random mix of zeros and ones, you want something branchless that won't mispredict. Doing += 0 for elements that were zero avoids that, and also means that the C abstract machine definitely touches that memory regardless of the data.

Compilers aren't allowed to invent writes, so if they wanted to auto-vectorize your if() target[i]++ version, they'd have to use a masked store like x86 vmaskmovps to avoid a non-atomic read / rewrite of unmodified elements of target. So some hypothetical future compiler that can auto-vectorize the plain scalar code would have an easier time with this.

Anyway, one way to write this is
target[i] += ((pLong[j] & m) != 0);, using bool->int conversion to get a 0 / 1 integer.

But we get better asm for x86 (and probably for most other architectures) if we just shift the data and isolate the low bit with &1. Compilers are kinda dumb and don't seem to spot this optimization. They do nicely optimize away the extra loop counter, and turn m <<= 1 into add same,same to efficiently left shift, but they still use xor-zero / test / setne to create a 0 / 1 integer.

An inner loop like this compiles slightly more efficiently (but still much, much worse than we can do with SSE2 or AVX, or even scalar using @chqrlie's lookup table, which will stay hot in L1d when used repeatedly like this, allowing SWAR in uint64_t):
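A minimal sketch of such an inner loop (reconstructed from the description; the temporary name tmp is invented):

    uint64_t tmp = pLong[j];
    for (i = 0; i < 64; i++) {
        target[i] += tmp & 1;   /* branchless 0/1 add */
        tmp >>= 1;              /* shift the data instead of moving a mask */
    }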
Note that unsigned long is not guaranteed to be a 64-bit type, and isn't 64-bit in x86-64 System V x32 (ILP32 in 64-bit mode), on Windows x64, or in 32-bit ABIs like i386 System V.

Compiled on the Godbolt compiler explorer by gcc, clang, and ICC, it's 1 fewer uop in the loop with gcc. But all of them are just plain scalar, with clang and ICC unrolling by 2.
This is slightly better than we get from test / setnz. Without unrolling, bt / setc might have been equal, but compilers are bad at using bt to implement bool (x & (1ULL << n)), or bts to implement x |= 1ULL << n.

If many words have their highest set bit far below bit 63, looping on while(tmp) could be a win. Branch mispredicts make it not worth it if it only saves ~0 to 4 iterations most of the time, but if it often saves 32 iterations, that could really be worth it. Maybe unroll in the source so the loop only tests tmp every 2 iterations (because compilers won't do that transformation for you), but then the loop branch can be shr rdx, 2 / jnz.
On Sandybridge-family, this is 11 fused-domain uops for the front end per 2 bits of input. (add [mem], reg with a non-indexed addressing mode micro-fuses the load+ALU, and the store-address+store-data; everything else is single-uop. add/jcc macro-fuses. See Agner Fog's guide, and https://stackoverflow.com/tags/x86/info). So it should run at something like 3 cycles per 2 bits = one uint64_t per 96 cycles. (Sandybridge doesn't "unroll" internally in its loop buffer, so non-multiple-of-4 uop counts basically round up, unlike on Haswell and later.)

vs. gcc's not-unrolled version being 7 uops per 1 bit = 2 cycles per bit. If you compiled with gcc -O3 -march=native -fprofile-generate / test-run / gcc -O3 -march=native -fprofile-use, profile-guided optimization would enable loop unrolling.

This is probably slower than a branchy version on perfectly predictable data like you get from memset with any repeating byte pattern. I'd suggest filling your array with randomly-generated data from a fast PRNG like an SSE2 xorshift+, or if you're just timing the count loop then use anything you want, like rand().

See:
Efficient Computation of Positional Population Counts Using SIMD Instructions by Marcus D. R. Klarqvist, Wojciech Muła, Daniel Lemire (7 Nov 2019)
Faster Population Counts using AVX2 Instructions by Wojciech Muła, Nathan Kurz, Daniel Lemire (23 Nov 2016).
Basically, each full adder compresses 3 inputs to 2 outputs. So one can eliminate an entire 256-bit word for the price of 5 logic instructions. The full adder operation could be repeated until registers become exhausted. Then results in the registers are accumulated (as seen in most of the other answers).
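As an illustration of that full-adder (carry-save) step with AVX2 intrinsics (a sketch; the helper name csa256 is mine), compressing three words into a sum word and a carry word costs exactly 5 logic instructions:

#include <immintrin.h>

/* carry-save adder: 3 inputs -> (carry, sum); 2 XOR + 2 AND + 1 OR */
static inline void csa256(__m256i *carry, __m256i *sum, __m256i a, __m256i b, __m256i c)
{
    __m256i u = _mm256_xor_si256(a, b);
    *sum   = _mm256_xor_si256(u, c);
    *carry = _mm256_or_si256(_mm256_and_si256(a, b), _mm256_and_si256(u, c));
}

The sum word keeps the weight-1 bits and the carry word the weight-2 bits, so chaining these builds the adder tree whose outputs are then accumulated as described.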
Positional popcnt for 16-bit subwords is implemented here: https://github.com/mklarqvist/positional-popcount
Note: the accumulate step for positional popcount is more expensive than for normal SIMD popcount, which I believe makes it feasible to add a couple of half-adders to the end of the CSU; it might pay to go all the way up to 256 words before accumulating.
On my system, a 4-year-old MacBook (2.7 GHz Intel Core i5) with clang-900.0.39.2 -O3, your code runs in 500ms.

Just changing the inner test to if ((pLong[j] & m) != 0) saves 30%, running in 350ms.

Further simplifying the inner part to target[i] += (pLong[j] >> i) & 1; without a test brings it down to 280ms.

Further improvements seem to require more advanced techniques such as unpacking the bits into blocks of 8 ulongs and adding those in parallel, handling 255 ulongs at a time.

Here is an improved version using this method. It runs in 45ms on my system.
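A sketch of that approach (a reconstruction of the idea, not necessarily the exact 45ms version; the function name and the locally built inflate table are invented here — inflate[b] spreads the 8 bits of b into 8 bytes, so one 64-bit addition bumps 8 byte-wide counters at once):

#include <stdint.h>
#include <stddef.h>

static void count_bits_inflate(const uint64_t *pLong, size_t n, unsigned int target[64])
{
    uint64_t inflate[256];                      /* inflate[b]: byte i = bit i of b */
    for (int b = 0; b < 256; b++) {
        uint64_t v = 0;
        for (int i = 0; i < 8; i++)
            v |= (uint64_t)((b >> i) & 1) << (8 * i);
        inflate[b] = v;
    }
    size_t j = 0;
    while (j < n) {
        size_t chunk = (n - j < 255) ? (n - j) : 255;   /* keep the byte counters below 256 */
        uint64_t acc[8] = {0};                  /* byte i of acc[k] counts bit 8*k + i */
        for (size_t e = 0; e < chunk; e++, j++) {
            uint64_t x = pLong[j];
            for (int k = 0; k < 8; k++)
                acc[k] += inflate[(x >> (8 * k)) & 0xFF];
        }
        for (int k = 0; k < 8; k++)
            for (int i = 0; i < 8; i++)
                target[8 * k + i] += (acc[k] >> (8 * i)) & 0xFF;
    }
}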
The technique for inflating a byte to a 64-bit long is investigated and explained in this answer: https://stackoverflow.com/a/55059914/4593267 . I made the target array a local variable, as well as the inflate array, and I print the results to ensure the compiler will not optimize the computations away. In a production version you would compute the inflate array separately.

Using SIMD directly might provide further improvements at the expense of portability and readability. This kind of optimisation is often better left to the compiler as it can generate specific code for the target architecture. Unless performance is critical and benchmarking proves this to be a bottleneck, I would always favor a generic solution.
A different solution by njuffa provides similar performance without the need for a precomputed array. Depending on your compiler and hardware specifics, it might be faster.
For starters, consider the problem of unpacking the bits, because seriously, you do not want to test each bit individually.

So just follow this strategy for unpacking the bits into bytes of a vector: https://stackoverflow.com/a/24242696/2879325
Now that you have padded each bit to 8 bits, you can just do this for blocks of up to 255 bitmasks at a time, and accumulate them into vector registers. After that, you would have to expect potential overflows, so you need to transfer the counts.

After each block of 255, unpack again to 32-bit, and add into the array. (You don't have to do exactly 255, just some convenient number less than 256 to avoid overflow of the byte accumulators.)

At 8 instructions per bitmask (4 for each of the lower and higher 32 bits with AVX2) - or half that if you have AVX512 available - you should be able to achieve a throughput of about half a billion bitmasks per second per core on a recent CPU.
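A sketch of the per-mask step with AVX2 intrinsics (my own helper name and constants, not the code the numbers below were measured with): each 32-bit half costs a shuffle, an and, a compare, and a subtract, plus one shared broadcast of the mask, in line with the instruction count estimated above.

#include <stdint.h>
#include <immintrin.h>

/* One 64-bit mask into two vectors of 32 byte-sized counters (byte i of acc[half] counts bit 32*half + i). */
static inline void accumulate_mask(__m256i acc[2], uint64_t mask)
{
    const __m256i sel_lo = _mm256_setr_epi8(0,0,0,0,0,0,0,0, 1,1,1,1,1,1,1,1,
                                            2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3);
    const __m256i sel_hi = _mm256_setr_epi8(4,4,4,4,4,4,4,4, 5,5,5,5,5,5,5,5,
                                            6,6,6,6,6,6,6,6, 7,7,7,7,7,7,7,7);
    const __m256i bit = _mm256_set1_epi64x((long long)0x8040201008040201ULL);
    __m256i v  = _mm256_set1_epi64x((long long)mask);          /* one broadcast for both halves */
    __m256i lo = _mm256_shuffle_epi8(v, sel_lo);               /* byte lane i <- mask byte i/8     */
    __m256i hi = _mm256_shuffle_epi8(v, sel_hi);               /* byte lane i <- mask byte 4 + i/8 */
    lo = _mm256_cmpeq_epi8(_mm256_and_si256(lo, bit), bit);    /* 0xFF where the bit is set */
    hi = _mm256_cmpeq_epi8(_mm256_and_si256(hi, bit), bit);
    acc[0] = _mm256_sub_epi8(acc[0], lo);                      /* subtracting -1 adds 1 */
    acc[1] = _mm256_sub_epi8(acc[1], hi);
}

Start with acc[0] = acc[1] = _mm256_setzero_si256(); after at most 255 calls, add the 64 bytes of the two accumulators into the 32-bit array and zero them again, as described above.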
Depending on the chosen compiler, the vectorized form achieved roughly a factor-25 speedup over the naive one.
On a Ryzen 5 1600X, the vectorized form roughly achieved the predicted throughput of ~600,000,000 elements per second.
Surprisingly, this is actually still 50% slower than the solution proposed by @njuffa.