Someone decided to do a quick test to see how Native Client compared to JavaScript in terms of speed. They did that by running 10,000,000 sqrt calculations and measuring the time it took. The result with JavaScript: 0.096 seconds; with NaCl: 4.241 seconds... How can that be? Isn't speed one of the reasons to use NaCl in the first place? Or am I missing some compiler flags or something?
Here's the code that was run:
// Inside the NaCl module (a pp::Instance method); needs <cmath> and <ctime>.
clock_t t = clock();                          // start timing
float result = 0;
for (int i = 0; i < 10000000; ++i) {          // 10,000,000 iterations
    result += sqrt(i);
}
t = clock() - t;
float tt = ((float)t) / CLOCKS_PER_SEC;       // elapsed time in seconds
pp::Var var_reply = pp::Var(tt);
PostMessage(var_reply);                       // send the timing back to the page
PS: This question is an edited version of something that appeared in the Native Client mailing list.
NOTE: This answer is an edited version of something that appeared in the Native Client mailing list.
Microbenchmarks are tricky: unless you understand what you are doing VERY well, it's easy to produce apples-to-oranges comparisons that are not relevant to the behavior you want to observe or measure at all.
I'll elaborate a bit using your own example (I'll exclude NaCl and stick to the existing, "tried and true" technologies).
Here is your test as a native C program:
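The exact listing from the mailing list isn't reproduced here; it was essentially your loop scaled up to a billion iterations and compiled as an ordinary C program with optimizations (something like gcc -O2 test.c -lm). A minimal sketch:

#include <math.h>
#include <stdio.h>
#include <time.h>

int main(void) {
    clock_t t = clock();
    float result = 0;
    /* Same test as in the question, scaled up to one billion iterations. */
    for (int i = 0; i < 1000000000; ++i) {
        result += sqrt(i);
    }
    t = clock() - t;
    float tt = ((float)t) / CLOCKS_PER_SEC;
    /* Print the result as well as the time; this detail matters, as we'll see. */
    printf("%g %g\n", result, tt);
    return 0;
}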
OK. We can do a billion iterations in 25.43 seconds. But let's see where the time goes: let's replace "result += sqrt(i);" with "result += i;".
Wow! 95% of the time was actually spent in the CPU-provided sqrt function; everything else took less than 5%. But what if we change the code just a bit: replace "printf("%g %g\n", result, tt);" with "printf("%g\n", tt);"?
Hmm... It looks like "sqrt" is now almost as fast as "+". How can this be? How can printf affect the previous loop AT ALL?
Let's see:
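The original assembly listings aren't included here, but in C terms the optimized second version behaves roughly like this (an illustrative sketch of the effect, not actual compiler output):

    /* "result" is never printed, so the compiler drops the additions and the
       sqrt computation itself. It only keeps a call for negative inputs,
       because that call could have side effects (set errno, raise an FP
       exception, and so on). */
    for (int i = 0; i < 1000000000; ++i) {
        if (i < 0) {        /* never true in this loop... */
            sqrt(i);        /* ...so sqrt is effectively never executed */
        }
    }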
The first version actually calls sqrt a billion times, but the second one does not do that at all! Instead, it checks whether the number is negative and calls sqrt only in that case! Why? What is the compiler (or, rather, what are the compiler authors) trying to do here?
Well, it's simple: since "result" is not used in this particular version, the compiler can safely omit the "sqrt" call... if the value is not negative, that is! If it is negative, then (depending on the FPU flags) sqrt can do different things (return a nonsensical result, crash the program, etc.). That's why this version is dozens of times faster, but it does not calculate square roots at all!
Here is a final example which shows just how wrong microbenchmarks can go:
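The original listing isn't reproduced here either; one reconstruction that produces the behavior described below drops the sqrt and uses a plain integer accumulator (the long long type is an assumption of this sketch), so that a compiler like gcc at -O2 can work out the whole sum ahead of time:

#include <stdio.h>
#include <time.h>

int main(void) {
    clock_t t = clock();
    long long result = 0;
    for (int i = 0; i < 1000000000; ++i) {
        result += i;   /* plain integer sum with a compile-time-constant bound */
    }
    t = clock() - t;
    float tt = ((float)t) / CLOCKS_PER_SEC;
    printf("%lld %g\n", result, tt);
    return 0;
}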
Execution time is... ZERO? How can that be? A billion calculations in less than the blink of an eye? Let's see:
Uh oh, the loop is completely eliminated! All the calculations happened at compile time, and to add insult to injury both "clock" calls were executed back to back, before what used to be the body of the loop!
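In C terms, the optimized binary behaves roughly as if we had written this instead (an illustration of the effect, not actual compiler output):

    clock_t t = clock();
    t = clock() - t;      /* the two clock() calls run back to back... */
    /* ...and the "billion additions" are just a constant baked in by the compiler */
    long long result = 499999999500000000LL;
    float tt = ((float)t) / CLOCKS_PER_SEC;
    printf("%lld %g\n", result, tt);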
What if we put it in a separate function?
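For example, like this (a reconstruction; the name sum_up_to is made up here), so that inside the function the loop bound is no longer a compile-time constant:

#include <stdio.h>
#include <time.h>

/* Hypothetical helper: the loop moved out of main, operating on a runtime value. */
long long sum_up_to(int n) {
    long long result = 0;
    for (int i = 0; i < n; ++i) {
        result += i;
    }
    return result;
}

int main(void) {
    clock_t t = clock();
    long long result = sum_up_to(1000000000);
    t = clock() - t;
    float tt = ((float)t) / CLOCKS_PER_SEC;
    printf("%lld %g\n", result, tt);
    return 0;
}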
Still the same??? How can this be?
Uh-oh: the compiler is clever enough to replace the loop with a multiplication!
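In C terms, the optimized function ends up roughly equivalent to the closed-form sum (again an illustration, not actual compiler output):

/* What the optimizer effectively produces for the loop above:
   0 + 1 + ... + (n - 1) == n * (n - 1) / 2, i.e. one multiplication and a
   shift instead of a billion additions. */
long long sum_up_to(int n) {
    if (n <= 0) {
        return 0;
    }
    return (long long)(n - 1) * n / 2;
}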
Now, if you add NaCl on one side and JavaScript on the other, you get such a complex system that the results are literally unpredictable.
The problem here is that in a microbenchmark you are trying to isolate a piece of code and then evaluate its properties, but the compiler (no matter whether JIT or AOT) will try to thwart your efforts, because it tries to remove all the useless calculations from your program!
Microbenchmarks are useful, sure, but they are a FORENSIC ANALYSIS tool, not something you want to use to compare the speed of two different systems! For that you need some "real" workload (in some sense of the word: something which cannot be optimized to pieces by an over-eager compiler); sorting algorithms are popular, in particular.
Benchmarks which use sqrt are especially nasty because, as we've seen, they usually spend over 90% of their time executing one single CPU instruction: sqrtsd (fsqrt in the 32-bit version), which is, of course, identical for JavaScript and NaCl. These benchmarks (if properly implemented) may serve as a litmus test (if the speed of some implementation differs too much from what a simple native version exhibits, then you are doing something wrong), but they are useless for comparing the speeds of NaCl, JavaScript, C#, or Visual Basic.