Why is my program so slow?

Posted 2020-02-05 08:48

Question:

Someone decided to run a quick test to see how Native Client compares to JavaScript in terms of speed. They did that by running 10,000,000 sqrt calculations and measuring the time it took. The result with JavaScript: 0.096 seconds; with NaCl: 4.241 seconds... How can that be? Isn't speed one of the reasons to use NaCl in the first place? Or am I missing some compiler flags or something?

Here's the code that was run:

clock_t t = clock();
float result = 0;
for(int i = 0; i < 10000000; ++i) {
    result += sqrt(i);
}
t = clock() - t;      
float tt = ((float)t)/CLOCKS_PER_SEC;
pp::Var var_reply = pp::Var(tt);
PostMessage(var_reply);

PS: This question is an edited version of something that appeared on the Native Client mailing list.

Answer 1:

NOTE: This answer is an edited version of something that appeared on the Native Client mailing list.

Microbenchmarks are tricky: unless you understand what you are doing VERY well, it's easy to produce apples-to-oranges comparisons that are not relevant at all to the behavior you want to observe and measure.

I'll elaborate a bit using your own example (I'll exclude NaCl and stick to the existing, "tried and true" technologies).

Here is your test as a native C program (note that the loop below runs one billion iterations, not the ten million from the question):

$ cat test1.c
#include <math.h>
#include <time.h>
#include <stdio.h>

int main() {
  clock_t t = clock();
  float result = 0;
  for(int i = 0; i < 1000000000; ++i) {
      result += sqrt(i);
  }
  t = clock() - t;
  float tt = ((float)t)/CLOCKS_PER_SEC;
  printf("%g %g\n", result, tt);

}
$ gcc -std=c99 -O2 test1.c -lm -o test1
$ ./test1
5.49756e+11 25.43

OK. We can do a billion iterations in 25.43 seconds. But let's see where the time goes: let's replace "result += sqrt(i);" with "result += i;"

$ cat test2.c
#include <math.h>
#include <time.h>
#include <stdio.h>

int main() {
  clock_t t = clock();
  float result = 0;
  for(int i = 0; i < 1000000000; ++i) {
      result += i;
  }
  t = clock() - t;
  float tt = ((float)t)/CLOCKS_PER_SEC;
  printf("%g %g\n", result, tt);
}
$ gcc -std=c99 -O2 test2.c -lm -o test2
$ ./test2
1.80144e+16 1.21

Wow! About 95% of the time ((25.43 - 1.21) / 25.43 ≈ 95%) was actually spent in the CPU-provided sqrt; everything else took less than 5%. But what if we change the code just a bit: replace "printf("%g %g\n", result, tt);" with "printf("%g\n", tt);"?

$ cat test3.c
#include <math.h>
#include <time.h>
#include <stdio.h>

int main() {
  clock_t t = clock();
  float result = 0;
  for(int i = 0; i < 1000000000; ++i) {
      result += sqrt(i);
  }
  t = clock() - t;
  float tt = ((float)t)/CLOCKS_PER_SEC;
  printf("%g\n", tt);
}
$ gcc -std=c99 -O2 test3.c -lm -o test3
$ ./test3
1.44

Hmm... It looks like "sqrt" is now almost as fast as "+". How can this be? How can printf affect the preceding loop AT ALL?

Let's see:

$ gcc -std=c99 -O2 test1.c -S -o -
...
.L3:
        cvtsi2sd        %ebp, %xmm1
        sqrtsd  %xmm1, %xmm0
        ucomisd %xmm0, %xmm0
        jp      .L7
        je      .L2
.L7:
        movapd  %xmm1, %xmm0
        movss   %xmm2, (%rsp)
        call    sqrt
        movss   (%rsp), %xmm2
.L2:
        unpcklps        %xmm2, %xmm2
        addl    $1, %ebp
        cmpl    $1000000000, %ebp
        cvtps2pd        %xmm2, %xmm2
        addsd   %xmm0, %xmm2
        unpcklpd        %xmm2, %xmm2
        cvtpd2ps        %xmm2, %xmm2
        jne     .L3
 ...
$ gcc -std=c99 -O2 test3.c -S -o -
...
        xorpd   %xmm1, %xmm1
...
.L5:
        cvtsi2sd        %ebp, %xmm0
        ucomisd %xmm0, %xmm1
        ja      .L14
.L10:
        addl    $1, %ebp
        cmpl    $1000000000, %ebp
        jne     .L5
...
.L14:
        sqrtsd  %xmm0, %xmm2
        ucomisd %xmm2, %xmm2
        jp      .L12
        .p2align 4,,2
        je      .L10
.L12:
        movsd   %xmm1, (%rsp)
        .p2align 4,,5
        call    sqrt
        movsd   (%rsp), %xmm1
        .p2align 4,,4
        jmp     .L10
...

The first version actually computes sqrt a billion times, but the second one does not do that at all! Instead it checks whether the number is negative and calls the sqrt library function only in that case! Why? What is the compiler (or, rather, what are the compiler authors) trying to do here?

Well, it's simple: since "result" is not used in this particular version, the compiler can safely omit the "sqrt" call... as long as the value is not negative, that is! If it is negative, then (depending on FPU flags) sqrt can do different things (set errno, return a nonsensical result, crash the program, etc.). That's why this version is dozens of times faster, but it does not calculate square roots at all!
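
As an aside (my sketch, not part of the original answer): one simple way to defeat this particular elision is to make "result" observable, for example through a volatile sink. The file name test3a.c is made up:

$ cat test3a.c
#include <math.h>
#include <time.h>
#include <stdio.h>

volatile float sink;    /* a store to a volatile object cannot be optimized away */

int main() {
  clock_t t = clock();
  float result = 0;
  for(int i = 0; i < 1000000000; ++i) {
      result += sqrt(i);
  }
  sink = result;        /* forces the compiler to actually compute the sum */
  t = clock() - t;
  float tt = ((float)t)/CLOCKS_PER_SEC;
  printf("%g\n", tt);
}

With the sink in place the compiler can no longer skip the square roots, even though only tt is printed.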

Here is a final example that shows just how wrong microbenchmarks can go:

$ cat test4.c
#include <math.h>
#include <time.h>
#include <stdio.h>

int main() {
  clock_t t = clock();
  int result = 0;
  for(int i = 0; i < 1000000000; ++i) {
      result += 2;
  }
  t = clock() - t;
  float tt = ((float)t)/CLOCKS_PER_SEC;
  printf("%d %g\n", result, tt);
}
$ gcc -std=c99 -O2 test4.c -lm -o test4
$ ./test4
2000000000 0

The execution time is... ZERO? How can that be? A billion calculations in less than the blink of an eye? Let's see:

$ gcc -std=c99 -O2 test4.c -S -o -
...
        call    clock
        movq    %rax, %rbx
        call    clock
        subq    %rbx, %rax
        movl    $2000000000, %edx
        movl    $.LC1, %esi
        cvtsi2ssq       %rax, %xmm0
        movl    $1, %edi
        movl    $1, %eax
        divss   .LC0(%rip), %xmm0
        unpcklps        %xmm0, %xmm0
        cvtps2pd        %xmm0, %xmm0
...

Uh oh, the loop is completely eliminated! All the calculations happened at compile time and, to add insult to injury, both "clock" calls are executed one right after the other, before what used to be the loop body!

What if we put the loop in a separate function?

$ cat test5.c
#include <math.h>
#include <time.h>
#include <stdio.h>

int testfunc(int num, int max) {
  int result = 0;
  for(int i = 0; i < max; ++i) {
      result += num;
  }
  return result;
}

int main() {
  clock_t t = clock();
  int result = testfunc(2, 1000000000);
  t = clock() - t;
  float tt = ((float)t)/CLOCKS_PER_SEC;
  printf("%d %g\n", result, tt);
}
$ gcc -std=c99 -O2 test5.c -lm -o test5
$ ./test5
2000000000 0

Still the same??? How can this be?

$ gcc -std=c99 -O2 test5.c -S -o -
...
.globl testfunc
        .type   testfunc, @function
testfunc:
.LFB16:
        .cfi_startproc
        xorl    %eax, %eax
        testl   %esi, %esi
        jle     .L3
        movl    %esi, %eax
        imull   %edi, %eax
.L3:
        rep
        ret
        .cfi_endproc
...

Uh oh: the compiler is clever enough to replace the loop with a single multiplication!
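
A side note of mine, not part of the original answer: one GCC/Clang-specific way to force such a loop to actually execute is an empty inline asm that claims to read and modify the accumulator on every iteration (the same idea behind the "DoNotOptimize" helpers in benchmark harnesses such as Google Benchmark):

int testfunc(int num, int max) {
  int result = 0;
  for(int i = 0; i < max; ++i) {
      result += num;
      /* empty asm with a "+r" constraint: the compiler must assume
         result is read and possibly changed here, so it cannot
         collapse the loop into a single multiplication */
      __asm__ volatile("" : "+r"(result));
  }
  return result;
}

With the barrier in place, the compiler has to emit a real loop instead of a single imull.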

Now put NaCl on one side and JavaScript on the other and you get such a complex system that the results are practically unpredictable.

The problem is this: in a microbenchmark you are trying to isolate a piece of code and then evaluate its properties, but the compiler (JIT or AOT, it does not matter) will thwart your efforts, because it tries to remove all useless calculations from your program!

Microbenchmarks are useful, sure, but they are a FORENSIC ANALYSIS tool, not something you want to use to compare the speed of two different systems! For that you need a "real" workload (in some sense of the word: something that cannot be optimized to pieces by an over-eager compiler); sorting algorithms are a popular choice, in particular.
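
For illustration, here is a sketch of mine (not from the original answer) of such a workload: qsort over pseudo-random data, with part of the result printed so the compiler cannot discard the work. The file name and sizes are made up:

$ cat sortbench.c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static int cmp(const void *a, const void *b) {
  int x = *(const int *)a, y = *(const int *)b;
  return (x > y) - (x < y);
}

int main() {
  enum { N = 1000000 };
  int *data = malloc(N * sizeof *data);
  if (!data) return 1;
  srand(42);                       /* fixed seed: reproducible input */
  for(int i = 0; i < N; ++i) data[i] = rand();
  clock_t t = clock();
  qsort(data, N, sizeof *data, cmp);
  t = clock() - t;
  /* print part of the result so the sort cannot be optimized away */
  printf("%d %d %g\n", data[0], data[N-1], ((float)t)/CLOCKS_PER_SEC);
  free(data);
}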

Benchmarks that use sqrt are especially nasty because, as we've seen, they usually spend over 90% of their time executing one single CPU instruction: sqrtsd (or fsqrt in 32-bit code), which is, of course, identical whether it is issued by JavaScript or by NaCl. Such benchmarks (if properly implemented) may serve as a litmus test (if the speed of some implementation differs too much from what a simple native version exhibits, then you are doing something wrong), but they are useless for comparing the speeds of NaCl, JavaScript, C#, or Visual Basic.