Question:
My goal is the following:
Generate successive values such that each new one was never generated before, until all possible values are generated. At that point, the counter starts the same sequence again. The main point is that all possible values are generated without repetition (until the period is exhausted). It does not matter whether the sequence is simply 0, 1, 2, 3, ..., or in some other order.
For example, if the range can be represented simply by an unsigned, then
void increment (unsigned &n) {++n;}
is enough. However, the integer range is larger than 64 bits. For example, in one place I need to generate a 256-bit sequence. A simple implementation is like the following, just to illustrate what I am trying to do:
typedef std::array<uint64_t, 4> ctr_type;
static constexpr uint64_t max = ~((uint64_t) 0);
void increment (ctr_type &ctr)
{
if (ctr[0] < max) {++ctr[0]; return;}
if (ctr[1] < max) {++ctr[1]; return;}
if (ctr[2] < max) {++ctr[2]; return;}
if (ctr[3] < max) {++ctr[3]; return;}
ctr[0] = ctr[1] = ctr[2] = ctr[3] = 0;
}
So if ctr starts with all zeros, then first ctr[0] is increased one by one until it reaches max, then ctr[1], and so on. If all 256 bits are set, then we reset it to all zeros and start again.
The problem is that such an implementation is surprisingly slow. My current improved version is roughly equivalent to the following:
void increment (ctr_type &ctr)
{
std::size_t k = (!(~ctr[0])) + (!(~ctr[1])) + (!(~ctr[2])) + (!(~ctr[3]));
if (k < 4)
++ctr[k];
else
memset(ctr.data(), 0, 32);
}
If the counter is only manipulated with the above increment function, and always starts from zero, then ctr[k] == 0 if ctr[k - 1] == 0, and thus the value k will be the index of the first element that is less than the maximum.
I expected the first to be faster, since a branch misprediction should happen only once in every 2^64 iterations. For the second, though a misprediction only happens once every 2^256 iterations, that should not make a difference; and apart from the branching, it needs four bitwise negations, four boolean negations, and three additions, which might cost much more than the first.
However, with clang, gcc, and Intel icpc, the generated binaries show that the second is much faster.
My main question is: does anyone know of a faster way to implement such a counter? It does not matter whether the counter starts by increasing the first integer, or whether it is implemented as an array of integers at all, as long as the algorithm generates all 2^256 combinations of the 256 bits.
What makes things more complicated is that I also need non-uniform increments. For example, each time the counter is incremented by K, where K > 1 but almost always remains a constant. My current implementation is similar to the above.
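For illustration, an increment-by-K with ordinary carry propagation might look like the sketch below (using the ctr_type above; this is just an illustration, not my exact code):
#include <array>
#include <cstddef>
#include <cstdint>
typedef std::array<std::uint64_t, 4> ctr_type;
// Add k to the 256-bit counter, propagating the carry through the words.
void increment_by (ctr_type &ctr, std::uint64_t k)
{
    std::uint64_t old = ctr[0];
    ctr[0] += k;
    std::uint64_t carry = (ctr[0] < old) ? 1 : 0; // unsigned wraparound signals a carry
    for (std::size_t i = 1; i < ctr.size() && carry; ++i) {
        ++ctr[i];
        carry = (ctr[i] == 0) ? 1 : 0;
    }
}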
To provide some more context, one place where I am using the counters is as input to AES-NI aesenc instructions. So a distinct 128-bit integer (loaded into __m128i), after going through 10 (or 12 or 14, depending on the key size) rounds of the instruction, produces a distinct 128-bit integer. If I generate one __m128i integer at a time, then the cost of increment matters little. However, since aesenc has quite a bit of latency, I generate integers in blocks. For example, I might have 4 blocks, ctr_type block[4], initialized equivalently to the following:
block[0]; // initialized to zero
block[1] = block[0]; increment(block[1]);
block[2] = block[1]; increment(block[2]);
block[3] = block[2]; increment(block[3]);
Each time I need new output, I increment each block[i] by 4 and generate 4 __m128i outputs at once. By interleaving instructions, I was able to increase the overall throughput and reduce the cycles per byte of output (cpB) from 6 to 0.9 when using 2 64-bit integers as the counter and 8 blocks. However, if I instead use 4 32-bit integers as the counter, the throughput, measured in bytes per second, is cut in half. I know for a fact that on x86-64, 64-bit integers can be faster than 32-bit ones in some situations, but I did not expect such a simple increment operation to make such a big difference. I have carefully benchmarked the application, and the increment is indeed what slows the program down. Since loading into __m128i and storing the __m128i output into usable 32-bit or 64-bit integers are both done through aligned pointers, the only difference between the 32-bit and 64-bit versions is how the counter is incremented. I expected the AES-NI instructions, after loading the integers into __m128i, to dominate the performance, but with 4 or 8 blocks that was clearly not the case.
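For what it's worth, a rough sketch of the block scheme described above might look like the following. This is only an illustration, not my actual code; it assumes AES-128 (10 rounds), a 128-bit counter made of two uint64_t words, and already expanded round keys.
#include <array>
#include <cstdint>
#include <immintrin.h> // AES-NI intrinsics; compile with -maes
using ctr128 = std::array<std::uint64_t, 2>;
// Generate 4 outputs at once from 4 independent counter streams.
void generate_block_of_4 (ctr128 block[4], __m128i out[4],
                          const __m128i round_keys[11])
{
    __m128i x[4];
    for (int i = 0; i < 4; ++i) {
        // advance this stream's 128-bit counter by 4 so the 4 streams never collide
        std::uint64_t old = block[i][0];
        block[i][0] += 4;
        if (block[i][0] < old) ++block[i][1];
        x[i] = _mm_xor_si128(
            _mm_loadu_si128(reinterpret_cast<const __m128i *>(block[i].data())),
            round_keys[0]);
    }
    // Walking round by round over the 4 independent blocks lets the aesenc
    // latencies overlap; this is the interleaving mentioned above.
    for (int r = 1; r < 10; ++r)
        for (int i = 0; i < 4; ++i)
            x[i] = _mm_aesenc_si128(x[i], round_keys[r]);
    for (int i = 0; i < 4; ++i)
        out[i] = _mm_aesenclast_si128(x[i], round_keys[10]);
}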
So to summarize, my main question is whether anyone knows a way to improve the above counter implementation.
Answer 1:
It's not only slow, but impossible. The total energy of the universe is insufficient for 2^256 bit changes, and even that would require a Gray counter.
The next thing to do, before any optimization, is to fix the original implementation:
void increment (ctr_type &ctr)
{
if (++ctr[0] != 0) return;
if (++ctr[1] != 0) return;
if (++ctr[2] != 0) return;
++ctr[3];
}
If each ctr[i] was not allowed to overflow to zero, the period would be just 4*(2^64), as in 0-9, then 19, 29, 39, 49, ..., 99, then 199, 299, ..., 999, and 1999, 2999, 3999, ..., 9999.
As a reply to the comment: it takes 2^64 iterations to have the first overflow. Being generous, up to 2^32 iterations could take place in a second, meaning that the program would have to run for 2^32 seconds to see the first carry out. That's about 136 years.
EDIT
If the original implementation with its 2^66 states is really what is wanted, then I'd suggest changing the interface and the functionality to something like:
(*counter) += 1;
while (*counter == 0)               // the word wrapped around: carry out
{
    counter++;                      // move to next word
    if (counter > tail_of_array) {  // carried past the most significant word
        counter = head_of_array;
        memset(counter, 0, 16);     // reset the whole counter
        break;
    }
    (*counter) += 1;                // propagate the carry into this word
}
The point being that the overflow is still very infrequent; almost always there is just one word to be incremented.
Answer 2:
If you're using GCC or compilers with __int128 like Clang or ICC:
unsigned __int128 H = 0, L = 0;
L++;
if (L == 0) H++;
On systems where __int128 isn't available:
std::array<uint64_t, 4> c{};
c[0]++;
if (c[0] == 0)
{
c[1]++;
if (c[1] == 0)
{
c[2]++;
if (c[2] == 0)
{
c[3]++;
}
}
}
In inline assembly it's much easier to do this using the carry flag. Unfortunately, most high-level languages don't have means to access it directly. Some compilers do have intrinsics for adding with carry, like __builtin_uaddll_overflow in GCC and __builtin_addcll in Clang.
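For example, a 256-bit increment written with Clang's __builtin_addcll might look like the following sketch (assumptions: Clang, and unsigned long long being 64 bits; the function name is just illustrative):
#include <array>
using ctr_type = std::array<unsigned long long, 4>;
// Add 'step' to the low word and let the builtin chain the carry upward.
void increment (ctr_type &c, unsigned long long step = 1)
{
    unsigned long long carry = 0;
    c[0] = __builtin_addcll(c[0], step, 0, &carry);
    c[1] = __builtin_addcll(c[1], 0, carry, &carry);
    c[2] = __builtin_addcll(c[2], 0, carry, &carry);
    c[3] = __builtin_addcll(c[3], 0, carry, &carry);
}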
Anyway, this is rather a waste of time, since the total number of particles in the universe is only about 10^80, and you cannot even run a 64-bit counter to its maximum in your lifetime.
Answer 3:
Neither of your counter versions increments correctly. Instead of counting up to UINT256_MAX, you are actually just counting up to UINT64_MAX 4 times and then starting back at 0 again. This is apparent from the fact that you do not bother to clear any of the indices that have reached the max value until all of them have reached the max value. If you are measuring performance based on how often the counter reaches all bits 0, then this is why. Thus your algorithms do not generate all combinations of 256 bits, which is a stated requirement.
Answer 4:
You mention "Generate successive values, such that each new one was never generated before".
To generate a set of such values, look at linear congruential generators:
- the sequence x = (x*1 + 1) % (power_of_2): you already thought about this one; these are simply the sequential numbers.
- the sequence x = (x*13 + 137) % (power_of_2): this generates unique numbers with a predictable full period (the whole power_of_2), and the numbers look more "random", kind of pseudo-random. You need to resort to arbitrary-precision arithmetic to get it working at 256 bits, and also all the trickery of multiplications by constants; see the sketch after this list. This will give you a nice way to start.
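As an illustration on a single 64-bit word (an assumption for brevity: with modulus 2^64 the "% (power of 2)" happens for free through unsigned wraparound; the constants are the ones above):
#include <cstdint>
// One step of the LCG x -> (13*x + 137) mod 2^64. With an odd increment and a
// multiplier congruent to 1 mod 4 this has the full period 2^64 (Hull-Dobell).
inline std::uint64_t lcg_next (std::uint64_t x)
{
    return x * 13u + 137u;
}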
You also complain that your simple code is "slow".
At a 4.2 GHz frequency, running 4 instructions per cycle and using AVX512 vectorization, on a 64-core computer with a multithreaded version of your program doing nothing but increments, you get only 64 * 8 * 4 * 2^32 = 8796093022208 increments per second; that is, 2^64 increments are reached in about 25 days. This post is old; you might have reached 841632698362998292480 by now, running such a program on such a machine, and you will gloriously reach 1683265396725996584960 in 2 years' time.
You also require "until all possible values are generated".
You can only generate a finite number of values, depending on how much you are willing to pay for the energy to power your computers. As mentioned in the other responses, with 128- or 256-bit numbers, even being the richest man in the world, you will never wrap around before the first of these conditions occurs:
- getting out of money
- end of humankind (nobody will get the outcome of your software)
- burning the energy from the last particles of the universe
Answer 5:
Multi-word addition can easily be accomplished in portable fashion by using three macros that mimic three types of addition instructions found on many processors:
- ADDcc: adds two words, and sets the carry if there was unsigned overflow
- ADDC: adds two words plus carry (from a previous addition)
- ADDCcc: adds two words plus carry, and sets the carry if there was unsigned overflow
A multi-word addition of two words uses ADDcc on the least significant words followed by ADDC on the most significant words. A multi-word addition of more than two words forms the sequence ADDcc, ADDCcc, ..., ADDC. The MIPS architecture is a processor architecture without condition codes and therefore without a carry flag. The macro implementations shown below basically follow the techniques used on MIPS processors for multi-word additions.
The ISO-C99 code below shows the operation of a 32-bit counter and a 64-bit counter based on 16-bit "words". I chose arrays as the underlying data structure, but one might also use a struct, for example. Use of a struct will be significantly faster if each operand comprises only a few words, as the overhead of array indexing is eliminated. One would want to use the widest available integer type for each "word" for best performance. In the example from the question that would likely be a 256-bit counter comprising four uint64_t components.
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#define ADDCcc(a,b,cy,t0,t1) \
(t0=(b)+cy, t1=(a), cy=t0<cy, t0=t0+t1, t1=t0<t1, cy=cy+t1, t0=t0)
#define ADDcc(a,b,cy,t0,t1) \
(t0=(b), t1=(a), t0=t0+t1, cy=t0<t1, t0=t0)
#define ADDC(a,b,cy,t0,t1) \
(t0=(b)+cy, t1=(a), t0+t1)
typedef uint16_t T;
/* increment a multi-word counter comprising n words */
void inc_array (T *counter, const T *increment, int n)
{
T cy, t0, t1;
counter [0] = ADDcc (counter [0], increment [0], cy, t0, t1);
for (int i = 1; i < (n - 1); i++) {
counter [i] = ADDCcc (counter [i], increment [i], cy, t0, t1);
}
counter [n-1] = ADDC (counter [n-1], increment [n-1], cy, t0, t1);
}
#define INCREMENT (10)
#define UINT32_ARRAY_LEN (2)
#define UINT64_ARRAY_LEN (4)
int main (void)
{
uint32_t count32 = 0, incr32 = INCREMENT;
T count_arr2 [UINT32_ARRAY_LEN] = {0};
T incr_arr2 [UINT32_ARRAY_LEN] = {INCREMENT};
do {
count32 = count32 + incr32;
inc_array (count_arr2, incr_arr2, UINT32_ARRAY_LEN);
} while (count32 < (0U - INCREMENT - 1));
printf ("count32 = %08x arr_count = %08x\n",
count32, (((uint32_t)count_arr2 [1] << 16) +
((uint32_t)count_arr2 [0] << 0)));
uint64_t count64 = 0, incr64 = INCREMENT;
T count_arr4 [UINT64_ARRAY_LEN] = {0};
T incr_arr4 [UINT64_ARRAY_LEN] = {INCREMENT};
do {
count64 = count64 + incr64;
inc_array (count_arr4, incr_arr4, UINT64_ARRAY_LEN);
} while (count64 < 0xa987654321ULL);
printf ("count64 = %016llx arr_count = %016llx\n",
count64, (((uint64_t)count_arr4 [3] << 48) +
((uint64_t)count_arr4 [2] << 32) +
((uint64_t)count_arr4 [1] << 16) +
((uint64_t)count_arr4 [0] << 0)));
return EXIT_SUCCESS;
}
Compiled with full optimization, the 32-bit example executes in about a second, while the 64-bit example runs for about a minute on a modern PC. The output of the program should look like so:
count32 = fffffffa arr_count = fffffffa
count64 = 000000a987654326 arr_count = 000000a987654326
Non-portable code that is based on inline assembly or proprietary extensions for wide integer types may execute about two to three times as fast as the portable solution presented here.
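As a rough illustration of that non-portable direction (an assumption, not part of the answer above: GCC or Clang, where the unsigned __int128 extension lets the compiler handle the carry between two 64-bit words):
#include <cstdint>
typedef unsigned __int128 u128;
// Add 'step' to a 128-bit counter stored as two 64-bit words (low word first).
void inc128_by (std::uint64_t c[2], std::uint64_t step)
{
    u128 sum = (((u128)c[1] << 64) | c[0]) + step;
    c[0] = (std::uint64_t)sum;
    c[1] = (std::uint64_t)(sum >> 64);
}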