How is it possible that BITWISE AND operation to t

2020-02-15 07:28发布

站内文章 / C

82 0

仙女界的扛把子

女 | 书童

私信

可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I wanted to test if bitwise operations really are faster to execute than arithmetic operation. I thought they were.

I wrote a small C program to test this hypothesis and to my surprise the addition takes less on average than bitwise AND operation. This is surprising to me and I cannot understand why this is happening.

From what I know for addition the carry from the less significant bits should be carried to the next bits because the result depends on the carry too. It does not make sense to me that a logic operator is slower than addition.

My cod is below:

#include<stdio.h>
#include<time.h>

int main() 
{
   int x=10;
   int y=25;
   int z=x+y;
   printf("Sum of x+y = %i", z);
   time_t start = clock();
   for(int i=0;i<100000;i++)z=x+y;
   time_t stop = clock();

   printf("\n\nArithmetic instructions take: %d",stop-start);
   start = clock();
   for(int i=0;i<100000;i++)z=x&y;
   stop = clock();

   printf("\n\nLogic instructions take: %d",stop-start);
}

Some of the results:

Arithmetic instructions take: 327
Logic instructions take: 360

Arithmetic instructions take: 271
Logic instructions take: 271

Arithmetic instructions take: 287
Logic instructions take: 294

Arithmetic instructions take: 279
Logic instructions take: 266

Arithmetic instructions take: 265
Logic instructions take: 296

These results are taken from consecutive runs of the program.

As you can see on average the logic operator takes longer on average than the arithmetic operator.

回答1:

ok, let's take this "measuring" and blow it up, 100k is a bit few

#include<stdio.h>
#include<time.h>
#define limit 10000000000

int main() 
{
   int x=10, y=25, z;

   time_t start = clock();
   for(long long i=0;i<limit;i++)z=x+y;
   time_t stop = clock();
   printf("Arithmetic instructions take: %ld\n",stop-start);

   start = clock();
   for(long long i=0;i<limit;i++)z=x&y;
   stop = clock();
   printf("Logic instructions take: %ld\n",stop-start);
}

this will run a bit longer. First let's try with no optimization:

thomas@TS-VB:~/src$ g++ -o trash trash.c 
thomas@TS-VB:~/src$ ./trash 
Arithmetic instructions take: 21910636
Logic instructions take: 21890332

you see, both loops take about the same time.

compiling with -S reveals why (only relevant part of .s file shown here) :

// this is the assembly for the first loop
.L3:
    movl    32(%esp), %eax
    movl    28(%esp), %edx
    addl    %edx, %eax             // <<-- ADD
    movl    %eax, 40(%esp)
    addl    $1, 48(%esp)
    adcl    $0, 52(%esp)
.L2:
    cmpl    $2, 52(%esp)
    jl  .L3
    cmpl    $2, 52(%esp)
    jg  .L9
    cmpl    $1410065407, 48(%esp)
    jbe .L3

// this is the one for the second
.L9:
    movl    32(%esp), %eax
    movl    28(%esp), %edx
    andl    %edx, %eax             // <<--- AND
    movl    %eax, 40(%esp)
    addl    $1, 56(%esp)
    adcl    $0, 60(%esp)
.L5:
    cmpl    $2, 60(%esp)
    jl  .L6
    cmpl    $2, 60(%esp)
    jg  .L10
    cmpl    $1410065407, 56(%esp)
    jbe .L6
.L10:

looking into the CPUs instruction set tells us, both ADD and AND will take the same amount of cycles --> the 2 loops will run the same amount of time

Now with optimization:

thomas@TS-VB:~/src$ g++ -O3 -o trash trash.c 
thomas@TS-VB:~/src$ ./trash 
Arithmetic instructions take: 112
Logic instructions take: 74

The loop has been optimized away. The value calculated is never needed, so the compiler decided to not run it at all

conclusion: If you shoot 3 times into the forest, and hit 2 boars and 1 rabbit, it doesn't mean there a twice as much boars than rabbits inside

回答2:

Let's start by looking at your code. The loops don't actually do anything. Any reasonable compiler will see that you don't use the variable z after the first call to printf, so it is perfectly safe to throw away. Of course, the compiler doesn't have to do it, but any reasonable compiler with any reasonable levels of optimisations enabled will do so.

Let's look at what a compiler did to your code (standard clang with -O2 optimisation level):

    leaq    L_.str(%rip), %rdi
    movl    $35, %esi
    xorl    %eax, %eax
    callq   _printf

This is the first printf ("Sum of ..."), notice how the generated code didn't actually add anything, the compiler knows the values of x and y and just calculated its sum and calls printf with 35.

    callq   _clock
    movq    %rax, %rbx
    callq   _clock

Call clock, save its result in a temporary register, call clock again,

    movq    %rax, %rcx
    subq    %rbx, %rcx
    leaq    L_.str.1(%rip), %rdi
    xorl    %eax, %eax
    movq    %rcx, %rsi

Subtract start from end, set up arguments for printf,

    callq   _printf

Call printf.

The second loop is removed in an equivalent way. There are no loops because the compiler does the reasonable thing - it notices that z is not used after you modify it in the loop, so the compiler throws away any stores to it. And then since nothing is stored into it, it can also throw away x+y. And now, since the body of the loop is not doing anything, the loop can be thrown away. So your code essentially becomes:

printf("\n\nArithmetic instructions take: %d", clock() - clock());

Now, why is this relevant. It is relevant to understand some important concepts. The compiler doesn't slavishly translate one statement at a time into code. The compiler reads all (or as much as possible) of your code, tries to figure out what you actually mean and then generates code that behaves "as if" it executed all those statements. The language and compiler only care about preserving what we could call observable side effects. If computing a value is not observable, it doesn't need to be computed. The time to execute some code is not a side effect we care about, so the compiler doesn't care about preserving it, after all we want our code to be as fast as possible, so we'd like the time to execute something to be not observable at all.

The second part why this is relevant. It is pretty useless to measure how long something takes if you compiled it without optimisations. This is the loop in your code compiled without optimisations:

LBB0_1:
        cmpl    $100000, -28(%rbp)
        jge     LBB0_4
        movl    -8(%rbp), %eax
        addl    -12(%rbp), %eax
        movl    %eax, -16(%rbp)
        movl    -28(%rbp), %eax
        addl    $1, %eax
        movl    %eax, -28(%rbp)
        jmp     LBB0_1
LBB0_4:

You thought you were measuring the addl instruction here. But the whole loop contains so much more. In fact, most of the time in the loop is spent maintaining the loop, not actually executing your instruction. Most of the time is spent reading and writing values on the stack and calculating the loop variable. Any timing you happen to measure will be totally dominated by the infrastructure of the loop rather than operation you want to measure.

You loop very few times. I'm pretty certain that most of the time you actually measure is spent in clock() and not the code you're actually trying to measure. clock needs to do quite some work, reading time is often quite expensive.

Then we get to the question of the actual instructions you care about. They take the exact same amount of time. Here's the canonical source of everything related to instruction timing on x86.

But. It's very hard and almost pretty useless to reason about individual instructions. Pretty much every CPU in the past few decades is superscalar. This means that it will execute many instructions at the same time. What matters for how much time things take is more dependencies between instructions (can't start executing an instruction before its inputs are ready if those inputs are computed by previous instructions) rather than the actual instruction. While you can do dozens of calculations in registers per nanosecond, it can take hundreds of nanoseconds to fetch data from main memory. So even if we know that an instruction takes one cycle and your CPU does two cycles per nanosecond (it's usually around that), it can mean that the number of instructions we can finish in 100ns can be anywhere between 1 (if you need to wait for main memory) and 12800 (I don't know the real exact numbers, but I remember that Skylake can retire 64 floating point operations per cycle).

This is why microbenchmarks aren't seriously done anymore. If minor variations in how things are done can affect the outcome by a factor of twelve thousand, you can quickly see why measuring individual instructions is useless. Most measurements today are done on large parts of programs or whole programs. I do this a lot at work and I've had several situations where improving an algorithm changed memory access patterns that even though the algorithm could be mathematically proven faster, the behavior of the whole program suffered because of changed memory access patterns or such.

Sorry for such a rambling answer, but I'm trying to get to why even though there is a simple answer to your question: "your measurement method is bad" and also a real answer: "they are the same", there are actually interesting reasons why the question itself is unanswerable.

回答3:

This is just a few minutes worth of work, would rather also demonstrate bare metal and other such things but not worth the time right now.

Through testing of some functions to see what the calling convention is and also noting that for add it is generating

  400600:   8d 04 37                lea    (%rdi,%rsi,1),%eax
  400603:   c3                      retq

for and

  400610:   89 f8                   mov    %edi,%eax
  400612:   21 f0                   and    %esi,%eax
  400614:   c3                      retq

three instructions instead of two, five bytes instead of four, these bits if information both do and dont matter depending. But to make it more fair will do the same thing for each operation.

Also want the do it a zillion times loop closely coupled and not compiled as that may end up creating some variation. And lastly the alignment try to make that fair.

.balign 32
nop
.balign 256
.globl and_test
and_test:
    mov %edi,%eax
    and %esi,%eax
    sub $1,%edx
    jne and_test
    retq

.balign 32
nop
.balign 256
.globl add_test
add_test:
    mov %edi,%eax
    add %esi,%eax
    sub $1,%edx
    jne add_test
    retq

.balign 256
    nop

derived from yours

#include<stdio.h>
#include<time.h>
unsigned int add_test ( unsigned int a, unsigned int b, unsigned int x );
unsigned int and_test ( unsigned int a, unsigned int b, unsigned int x );
int main() 
{
   int x=10;
   int y=25;
   time_t start,stop;
   for(int j=0;j<10;j++)
   {
       start = clock();
       add_test(10,25,2000000000);
       stop = clock();
       printf("%u %u\n",j,(int)(stop-start));

   }
   for(int j=0;j<10;j++)
   {
       start = clock();
       and_test(10,25,2000000000);
       stop = clock();
       printf("%u %u\n",j,(int)(stop-start));

   }
   return(0);
}

first run as expected the first loop took longer as it wasnt in the cache? should not have taken that much longer so that doesnt make sense, perhaps other reasons...

but we stay fairly consistent. second run, the times stay somewhat consistent.

note that this is 2 billion loops. four instructions per. my clocks per second is 1000000 at 3.4ghz 0.8772 clocks per loop or 0.2193 clocks per instruction, how is that possible? superscaler processor.

A LOT more work could be done, here, this was just a few minutes worth and hopefully it is just enough to demonstrate (as others have already as well) that you cant really see the difference with a test like this.

I could do a demo with something more linear like an arm and something we could read the clock/timer register as part of the code under test, as the calling of the clock code are all part of the code under test and can vary here. Hopefully that is not necessary, the results are much more consistent though using sram, controlling all the instructions under test, etc. and with that you can see alignment differences you can see the cost of the cache read on the first loop but not the remaining, etc...(a few clocks total although 10ms as we see here, hmm, might be on par for an x86 system dont know benchmarking x86 is a near complete waste of time, no fun in it and the results dont translate to other x86 computers that well)

As pointed out in your other question that was closed as a duplicate, and I hate using links here, should learn how to cut and paste pictures (TODO).

https://en.wikipedia.org/wiki/AND_gate
https://en.wikipedia.org/wiki/Adder_(electronics)

Assuming feeding the math/logic operation for add and and is the same and we are only trying to measure the difference between them you are correct the AND is faster not getting into further detail you can see an and only has one stage/gate. Where a full adder takes three levels, back of the envelope math, three times as much time to settle the signals once the inputs change than the AND....BUT....Although there are some exceptions, chips are not designed to take advantage of this (well multiply and divide vs add/and/xor/etc, yes they are or are more likely to be). One would design these simple operations to be one clock cycle, on a clock the inputs to the combinational logic (the actual AND or ADD) are latched, on the next clock cycle the result is latched from the other end and begins its journey to the register file or out the core to memory, etc...At some point in the design you do synthesis into the gates available for the foundry/process you are using then do a timing analysis/closure on that and look for long poles in the tent. Extremely unlikely (impossible) that the add is a long pole, both add and and are very very short poles, but you determine at that point what your max clock rate is, if you wanted a 4ghz procerssor but the result is 2.7 well you need to take those long poles and turn them into two or more clock operations. the time it takes to do an add vs and which should vary add should be longer, is so fast and in the noise that it is all within a clock cycle so even if you did a functional simulation of the logic design you would not see the difference you need to implement an and and a full adder in say pspice using transistors and other components, then feed step changes into the inputs and watch how long it takes to settle, that or build them from discrete components from radio shack and try, although the results might be too fast for your scope, so use pspice or other.

think of writing equations to solve something you can write maybe a long equation or you can break that into multiple smaller ones with intermediate variables

this
a = b+c+d+e+f+g;
vs
x=b+c;
y=d+e;
z=f+g;
a=x+y;
a=a+z;

one clock vs 5 clocks, but each of the 5 clocks can be faster if this was the longest pole in the tent. all the other logic is that much faster too feeding this. (actually x,y,z can be one clock, then either a=x+y+z in the next or make it two more)

multiply and divide are different simply because the logic explodes exponentially, there is no magic to multiply or divide they have to work the same way we do things on pencil and paper. you can take shortcuts with binary if you think about it. as you can only multiply by 0 or 1 before shifing and adding to the accumulator. the logic equations for one clock still explode exponentially and then you can do parallel things. it burns a ton of chip realestate, so you may choose to make multiply as well as divide take more than one clock and hide those in the pipeline. or you can choose to burn a significant amount of chip realestate...Look at the documentation for some of the arm cores you can at compile time (when you compile/synthesize the core) choose a one clock or multi-clock multiply to balance chip size vs performance. x86 we dont buy the IP and make the chips ourselves so it is up to intel how they do it, and very likely microcoded so by just microcoding you can tweak how things happen or do it in an alu type operation.

So you might detect multiply or divide performance vs add/and with a test like this but either they did it in one clock and you cant ever see it, or they may have buried it in two or more steps in the pipe so that it averages out right and to see it you would need access to a chip sim. using timers and running something a billion times is fun, but to actually see instruction performance you need a chip sim and need to tweak the code to keep the code not under test from affecting the results.