Is worth to use mod to replace if statement for ci

2019-09-16 06:10发布

问题:

I need a variable to point the array index, and like circle to return 0 when it arrive the end of array. I know that I can use if statement to judge, but I'm not sure it's will more quickly or not to use mod to achieve same function, can anyone give me some advise?

int p=0;
int arr[10];
void add_index(){   
   if(p==9) p=0;
   else     p++;
}

Or

int p=0;
int arr[10];
void add_index(){
   p=(p+1)%10;
}

回答1:

Once upon a time, definitely yes. These days, possibly no!

I'll take the Intel Skylake as an example. The DIV instruction (which produces both the quotient and remainder, and is used for this sort of thing), operating on a 32-bit dividend and divisor, has a latency of 23 cycles and a reciprocal throughput of 6 cycles. That is, depending on how it's pipelined with other operations, the "cost" is 6-23 cycles. (Okay, it's a little more complicated than that due to execution ports, but work with me here.) A correctly predicted jump is 0.5-2 cycles depending on whether it was taken or not taken, and a mispredicted jump has a penalty of 16-17 cycles. (All hail Agner Fog for timings.)

Intel branch prediction hardware is really, really good. It's maybe too much to expect it to correctly predict that every ninth branch will be taken, but in an inner loop I'd at least expect it to predict it correctly the other 8 times. That means around 3.5 cycles average for the if-statement (not counting the sundry integer ops, which add maybe 1-2 cycles). Oh, and that's assuming that the compiler is being particularly derpy and not just using CMOV like it should.

The thing to keep in mind is that integer division is one of the slowest "normal" things you can do with a modern CPU. For modulus by a known divisor, though, you can instead use a special sequence of adds/muls/shifts. So in the case of the above code, where the divisor is a compile-time constant rather than taken from a variable, you might actually beat the DIV. Those sequences can be difficult to pipeline, so it's hard to say whether it'd actually be a win. In any case, modern compilers absolutely know tricks like that.

Bottom line: It's hard to say. If you're doing the operation a huge number of times in an inner loop, it might actually be worth trying it both ways and timing. Likely, though, you're not going to see a meaningful difference, and not one that would justify spending optimization time on it. But I often write code that needs to be extremely high-performance, and I used to default to modulus for PPC, and now I default to if/else for x64. (Well, ternary.)



回答2:

I wrote a little test and compile it with gcc -O4 optimization.

Here is add_index_mod and add_index_if implementations from this test:

void add_index_mod(int *p) {
    *p = (*p + 1) % 10;
}

void add_index_if(int *p) {
    if (*p == 9)
        *p = 0;
    else
        (*p)++;
}

And that's what I got for add_index_mod:

mov eax, dword [rdi]
mov edx, 0x66666667
lea ecx, dword [rax + 1]
mov eax, ecx
imul edx
mov eax, ecx
sar eax, 0x1f
sar edx, 2
sub edx, eax
lea eax, dword [rdx + rdx*4]
add eax, eax
sub ecx, eax
mov dword [rdi], ecx
ret

Here we can see that the compiler replaced div with sequence of mul, shifts and subs. This trick is well described here.

And that's what I got for add_index_if:

mov edx, dword [rdi]            
lea eax, dword [rdx + 1]        
cmp edx, 9                      
mov edx, 0                      
cmove eax, edx                  
mov dword [rdi], eax            
ret

Nothing special here just cmp and conditional mov.

So now you can try to calculate the efficiency of assembly code of both this functions using this table. But this is not the best way to go because of out of order execution, branch prediction and etc.

So as I mentioned above I just wrote a little test:

#include <stdio.h>
#include <stdint.h>

#define REPEATS (1 << 30)

static inline uint64_t rdtsc() {
  unsigned int hi, lo;
  __asm__ volatile("rdtsc" : "=a" (lo), "=d" (hi));
  return ((uint64_t)hi << 32) | lo;
}

void add_index_mod(int *p) {
    *p = (*p + 1) % 10;
}

void add_index_if(int *p) {
    if (*p == 9)
        *p = 0;
    else
        (*p)++;
}

int main() {
    int p = 0;
    uint32_t i;
    uint64_t start, stop;
    double delta, ticks_per_call;

    // mod ================================

    start = rdtsc();

    for (i = 0; i < REPEATS; ++i) {
        add_index_mod(&p);
    }

    stop = rdtsc();

    // gcc with -O4 can remove above loop
    // if we don't use its result so print it
    printf("%d\n", p);

    delta = (double)(stop - start);
    ticks_per_call = delta / REPEATS;
    printf("add_index_mod: %f\n", ticks_per_call);


    // if ================================

    start = rdtsc();

    for (i = 0; i < REPEATS; ++i) {
        add_index_if(&p);
    }

    stop = rdtsc();

    printf("%d\n", p);

    delta = (double)(stop - start);
    ticks_per_call = delta / REPEATS;
    printf("add_index_if: %f\n", ticks_per_call);

    return 0;
}

And here is its output for my Intel core i5-6500:

add_index_mod: 9.643092
add_index_if: 2.063125

So for huge number of calls add_index_if 5 times faster than add_index_mod on my CPU.



回答3:

I'd rather use mod, without digging into the assembly of the situation there are a couple of things to take into account here.

1) When you branch (if statement/function call/etc), your processor may need to flush it's pipeline. What this means, is you have a bunch of instructions that were executed before knowing if they needed to be executed, and that "processing power" is just lost. I'm not saying this will always happen, but it can

2) Lets say you want to find the entry that happened 5 entries before your current, and do some math on it. Lets say you need the average between the two. Instead of doing the math and storing the result, having an extra variable, and all of that clumsiness, you can have a more elegant solution.

(array[index%10] + array[(index-5)%10])/2;

This can now wrap around your circular buffer.

I feel you get more used to writing code in that way if you do it that way, rather than have if/else statements to determine your index.

One thing to watch out for with this, though. If you take the modulus of a negative number, c is mathematically wrong. You will end up with a negative answer. So start your indexing off at your top index if you're going to do something like this (e.g. finding an entry before your current entry)

Hope this helps.