Someone decided to run a quick test to see how Native Client compares to JavaScript in terms of speed. They did that by running 10,000,000 sqrt calculations and measuring the time it took. The result with JavaScript: 0.096 seconds; with NaCl: 4.241 seconds... How can that be? Isn't speed one of the reasons to use NaCl in the first place? Or am I missing some compiler flags or something?
Here's the code that was run:
// Inside a pp::Instance subclass, which provides PostMessage():
clock_t t = clock();
float result = 0;
for (int i = 0; i < 10000000; ++i) {
  result += sqrt(i);
}
t = clock() - t;
float tt = ((float)t) / CLOCKS_PER_SEC;
pp::Var var_reply = pp::Var(tt);
PostMessage(var_reply);
PS: This question is an edited version of something that appeared in the native client mailing list
NOTE: This answer is an edited version of something that appeared in the native client mailing list
Microbenchmarks are tricky: unless you understand what you are doing VERY well, it's easy to produce apples-to-oranges comparisons that are not relevant to the behavior you actually want to observe or measure.
I'll elaborate a bit using your own example (I'll exclude NaCl and stick to plain, "tried and true" native tools).
Here is your test as native C program:
$ cat test1.c
#include <math.h>
#include <time.h>
#include <stdio.h>
int main() {
  clock_t t = clock();
  float result = 0;
  for (int i = 0; i < 1000000000; ++i) {
    result += sqrt(i);
  }
  t = clock() - t;
  float tt = ((float)t) / CLOCKS_PER_SEC;
  printf("%g %g\n", result, tt);
}
$ gcc -std=c99 -O2 test1.c -lm -o test1
$ ./test1
5.49756e+11 25.43
Ok. We can do a billion iterations in 25.43 seconds. But let's see what actually takes the time: let's replace "result += sqrt(i);" with "result += i;"
$ cat test2.c
#include <math.h>
#include <time.h>
#include <stdio.h>
int main() {
  clock_t t = clock();
  float result = 0;
  for (int i = 0; i < 1000000000; ++i) {
    result += i;
  }
  t = clock() - t;
  float tt = ((float)t) / CLOCKS_PER_SEC;
  printf("%g %g\n", result, tt);
}
$ gcc -std=c99 -O2 test2.c -lm -o test2
$ ./test2
1.80144e+16 1.21
Wow! About 95% of the time was actually spent in the CPU-provided sqrt ((25.43 - 1.21) / 25.43 ≈ 95%); everything else took less than 5%. But what if we change the code just a bit: replace "printf("%g %g\n", result, tt);" with "printf("%g\n", tt);"?
$ cat test3.c
#include <math.h>
#include <time.h>
#include <stdio.h>
int main() {
  clock_t t = clock();
  float result = 0;
  for (int i = 0; i < 1000000000; ++i) {
    result += sqrt(i);
  }
  t = clock() - t;
  float tt = ((float)t) / CLOCKS_PER_SEC;
  printf("%g\n", tt);
}
$ gcc -std=c99 -O2 test3.c -lm -o test3
$ ./test3
1.44
Hmm... Now "sqrt" looks almost as fast as "+". How can this be? How can printf affect the preceding loop AT ALL?
Let's see:
$ gcc -std=c99 -O2 test1.c -S -o -
...
.L3:
cvtsi2sd %ebp, %xmm1
sqrtsd %xmm1, %xmm0
ucomisd %xmm0, %xmm0
jp .L7
je .L2
.L7:
movapd %xmm1, %xmm0
movss %xmm2, (%rsp)
call sqrt
movss (%rsp), %xmm2
.L2:
unpcklps %xmm2, %xmm2
addl $1, %ebp
cmpl $1000000000, %ebp
cvtps2pd %xmm2, %xmm2
addsd %xmm0, %xmm2
unpcklpd %xmm2, %xmm2
cvtpd2ps %xmm2, %xmm2
jne .L3
...
$ gcc -std=c99 -O2 test3.c -S -o -
...
xorpd %xmm1, %xmm1
...
.L5:
cvtsi2sd %ebp, %xmm0
ucomisd %xmm0, %xmm1
ja .L14
.L10:
addl $1, %ebp
cmpl $1000000000, %ebp
jne .L5
...
.L14:
sqrtsd %xmm0, %xmm2
ucomisd %xmm2, %xmm2
jp .L12
.p2align 4,,2
je .L10
.L12:
movsd %xmm1, (%rsp)
.p2align 4,,5
call sqrt
movsd (%rsp), %xmm1
.p2align 4,,4
jmp .L10
...
The first version actually computes sqrt a billion times, but the second one does not do that at all! Instead it checks whether the number is negative and calls sqrt only in that case! Why? What are the compiler (or rather, the compiler authors) trying to do here?
Well, it's simple: since "result" is not used in this particular version, the compiler can safely omit the "sqrt" call... if the value is not negative, that is! If it's negative, then (depending on the floating-point environment) sqrt can do observable things: return a nonsensical result, set an error flag, even crash the program, so the call must be kept for those inputs. That's why this version is dozens of times faster, but it does not actually calculate the square roots at all!
Here is final example which shows how wrong microbenchmarks can go:
$ cat test4.c
#include <math.h>
#include <time.h>
#include <stdio.h>
int main() {
  clock_t t = clock();
  int result = 0;
  for (int i = 0; i < 1000000000; ++i) {
    result += 2;
  }
  t = clock() - t;
  float tt = ((float)t) / CLOCKS_PER_SEC;
  printf("%d %g\n", result, tt);
}
$ gcc -std=c99 -O2 test4.c -lm -o test4
$ ./test4
2000000000 0
Execution time is... ZERO? How can that be? A billion calculations in less than the blink of an eye? Let's see:
$ gcc -std=c99 -O2 test4.c -S -o -
...
call clock
movq %rax, %rbx
call clock
subq %rbx, %rax
movl $2000000000, %edx
movl $.LC1, %esi
cvtsi2ssq %rax, %xmm0
movl $1, %edi
movl $1, %eax
divss .LC0(%rip), %xmm0
unpcklps %xmm0, %xmm0
cvtps2pd %xmm0, %xmm0
...
Uh-oh, the loop is completely eliminated! The whole calculation happened at compile time, and to add insult to injury, both "clock" calls were scheduled back to back, before where the loop body would have been!
What if we put it in a separate function?
$ cat test5.c
#include <math.h>
#include <time.h>
#include <stdio.h>
int testfunc(int num, int max) {
  int result = 0;
  for (int i = 0; i < max; ++i) {
    result += num;
  }
  return result;
}
int main() {
  clock_t t = clock();
  int result = testfunc(2, 1000000000);
  t = clock() - t;
  float tt = ((float)t) / CLOCKS_PER_SEC;
  printf("%d %g\n", result, tt);
}
$ gcc -std=c99 -O2 test5.c -lm -o test5
$ ./test5
2000000000 0
Still the same??? How can this be?
$ gcc -std=c99 -O2 test5.c -S -o -
...
.globl testfunc
.type testfunc, @function
testfunc:
.LFB16:
.cfi_startproc
xorl %eax, %eax
testl %esi, %esi
jle .L3
movl %esi, %eax
imull %edi, %eax
.L3:
rep
ret
.cfi_endproc
...
Uh-oh: the compiler is clever enough to replace the loop with a single multiplication!
Now put NaCl on one side and a JavaScript JIT on the other, and you get such a complex system that the results are practically unpredictable.
The problem is that in a microbenchmark you are trying to isolate a piece of code and then evaluate its properties, but the compiler (whether JIT or AOT) is working against you: its job is to remove all useless calculations from your program!
Microbenchmarks are useful, sure, but they are a FORENSIC ANALYSIS tool, not something you want to use to compare the speed of two different systems! For that you need a "real" workload (in some sense of the word: something that cannot be optimized to pieces by an over-eager compiler); sorting algorithms are popular, in particular.
Benchmarks which use sqrt are especially nasty because, as we've seen, they usually spend over 90% of their time executing one single CPU instruction, sqrtsd (or fsqrt in 32-bit x87 code), which is, of course, identical for JavaScript and NaCl. Such benchmarks (if properly implemented) may serve as a litmus test (if the speed of some implementation differs too much from what a simple native version exhibits, you are doing something wrong), but they are useless for comparing the speeds of NaCl, JavaScript, C#, or Visual Basic.