I'm trying to implement an algorithm to find the minimum and the maximum values among a set of longs in a file. My test file contains one billion longs.
The algorithm works as expected but performs no faster than the naive version. It should be significantly faster, as the naive version performs roughly 2n comparisons whereas this version performs only 3n/2.
$ time ./findminmax_naive somelongs
count: 1000000000
min: 0
max: 2147483647
real 0m24.156s
user 0m4.956s
sys 0m3.896s
$ time ./findminmax_faster somelongs
count: 1000000000
min: 0
max: 2147483647
real 0m25.048s
user 0m6.948s
sys 0m3.980s
Here is the "naive" version:
#include <alloca.h>
#include <stdio.h>
#include <stdlib.h>

int
main(int ac, char *av[])
{
    FILE *f;
    long count, readcount, i, min, max;
    size_t rlen;
    long *n;

    if (ac != 2 && ac != 3) {
        fprintf(stderr, "Usage: %s <file> [readcount]\n", av[0]);
        exit(1);
    }
    f = fopen(av[1], "r");
    if (f == NULL) {
        perror("fopen");
        exit(1);
    }
    readcount = 1024;
    if (ac == 3)
        readcount = atol(av[2]);
    n = alloca(sizeof (long) * readcount);
    /* seed min/max with the first value; bail out on an empty file */
    rlen = fread(n, sizeof (*n), 1, f);
    if (rlen != 1) {
        fprintf(stderr, "cannot read first value\n");
        exit(1);
    }
    min = max = n[0];
    count = 1;
    while (1) {
        rlen = fread(n, sizeof (*n), readcount, f);
        for (i = 0; i < (long)rlen; i++) {
            count++;
            if (n[i] < min)
                min = n[i];
            if (n[i] > max)
                max = n[i];
        }
        if (feof(f))
            break;
    }
    printf("count: %ld\n", count);
    printf("min: %ld\n", min);
    printf("max: %ld\n", max);
    exit(0);
}
Here is the code of the (should-be) "faster" version:
#include <alloca.h>
#include <stdio.h>
#include <stdlib.h>

int
main(int ac, char *av[])
{
    FILE *f;
    long count, readcount, i, min, max;
    size_t rlen;
    long *n;

    if (ac != 2 && ac != 3) {
        fprintf(stderr, "Usage: %s <file> [readcount]\n", av[0]);
        exit(1);
    }
    f = fopen(av[1], "r");
    if (f == NULL) {
        perror("fopen");
        exit(1);
    }
    readcount = 1024;
    if (ac == 3)
        readcount = atol(av[2]);
    /* round readcount up to an even number so full reads come in pairs */
    readcount = (readcount + 1) & ~1L;
    n = alloca(sizeof (long) * readcount);
    /* seed min/max with the first value; bail out on an empty file */
    rlen = fread(n, sizeof (*n), 1, f);
    if (rlen != 1) {
        fprintf(stderr, "cannot read first value\n");
        exit(1);
    }
    min = max = n[0];
    count = 1;
    while (1) {
        rlen = fread(n, sizeof (*n), readcount, f);
        /* stop one short of rlen so an odd-sized final read
           never makes us touch n[i + 1] past the data */
        for (i = 0; i + 1 < (long)rlen; i += 2) {
            count += 2;
            if (n[i] < n[i + 1]) {
                if (n[i] < min)
                    min = n[i];
                if (n[i + 1] > max)
                    max = n[i + 1];
            } else {
                if (n[i + 1] < min)
                    min = n[i + 1];
                if (n[i] > max)
                    max = n[i];
            }
        }
        if (feof(f))
            break;
    }
    /* the last read may leave one element unpaired */
    if (rlen % 2) {
        if (n[rlen - 1] < min)
            min = n[rlen - 1];
        if (n[rlen - 1] > max)
            max = n[rlen - 1];
        count++;
    }
    printf("count: %ld\n", count);
    printf("min: %ld\n", min);
    printf("max: %ld\n", max);
    exit(0);
}
Do you have any idea what I missed?
Thanks for your help.
-- Jeremie
First of all, excuse me for answering all the questions in a single answer. I know I should not do this on stackoverflow.com, but given that the different subjects are more or less intertwined, it is easier this way.
Introduction
So, here is the code I am now using to test the algorithms. The differences from the previous versions:
Here is the code:
Test cases
Here are the files I use as test cases:
I/O penalty
OK, first let's warm up the cache and verify that there are no major page faults that could screw up the results:
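For the record, the warm-up and the check look something like this (minor-faults and major-faults are perf(1) software events; the file and binary names are the ones from my examples):
$ cat somelongs > /dev/null
$ perf stat -e minor-faults,major-faults ./findminmax_naive somelongs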
As you can see, there were no major page faults. From now on, we can take it for granted that there will be no I/O impact.
Instruction count and branch misses
@H2CO3, @vvy, you are absolutely right that the other instructions count as well (I will assume here that each operation takes the same number of CPU cycles, and will leave CPU cache misses aside). I don't know why the literature I have read so far about algorithms focuses only on the number of comparisons (OK, I admit I haven't read that much, though ;)).
I've commented each step in the loop with my own reckoning of the average case (I think the worst case is not interesting here), and it differs slightly from yours.
If I'm not mistaken:
- For the trivial version: n * (c1 + 2*c2 + c4) + (x + y) * c3
- For the faster version: n/2 * (c1 + 3*c2 + c5) + (x + y) * c3
Now, in my understanding, it is difficult to go further and speculate on how many CPU cycles each cN takes, as that varies from CPU to CPU.
Let's check how many instructions, branches and branch misses happen on my computer, which is mostly idle, while each algorithm executes on each test case, with a warm cache (note that I tested each case 3 times to verify there is no significant deviation):
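Each measurement was collected with a command along these lines (instructions, branches and branch-misses are standard perf(1) hardware events):
$ perf stat -e instructions,branches,branch-misses ./findminmax_faster somelongs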
Random case
Note that we have fewer instructions for the faster version, but it takes much longer to run nonetheless, quite probably because there are more branch misses, by an order of magnitude!
Best case
Worst case
So, very interestingly, the "random" case is actually faster than the worst case, which doesn't show much difference. The only difference I see is that my worst case contains "small" numbers (which can be encoded in 32 bits) whereas the random case contains true 64-bit numbers.
Random case with "small" integers
So let's try with a set of "small" random numbers (still stored in 64-bit words):
So, as I guessed, small numbers are actually slower to process than large ones, even though they are all contained in 64-bit words. There are many more branch misses with "small" numbers, for a reason that probably only CPU designers would be able to tell :-).
Another noticeable thing is that the elapsed time measured by perf(1) is sometimes smaller than the one measured by the program itself. I think this is explained by the fact that the program itself uses real time, whereas perf(1) uses the process time (the time the process actually runs). I tried to use getrusage(2), but the times I get there do not match (for instance, I get 1.6s of user time and 1.4s of system time, whereas perf(1) measures 6.8s).
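For reference, here is roughly how I read the getrusage(2) numbers (a minimal sketch):

#include <stdio.h>
#include <sys/resource.h>

int
main(void)
{
    struct rusage ru;

    /* ... run the algorithm under test here ... */

    if (getrusage(RUSAGE_SELF, &ru) != 0) {
        perror("getrusage");
        return 1;
    }
    printf("user: %ld.%06ld s\n",
        (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec);
    printf("sys:  %ld.%06ld s\n",
        (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
    return 0;
}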
Conclusion
So, if you got this far, thank you :). I am sorry, though, that all this experimentation has led to only vague conclusions. Hopefully enlightened readers will shed some light on it :).
-- Jeremie
As @chr said, "The file I/O will dwarf any optimization in the algorithm itself".
Besides, fewer comparisons do not mean less running time. Both algorithms have time complexity O(n), and that notation ignores what each statement actually costs.
For example, consider rough frames of the two algorithms: the time consumption is the sum of what all the executed statements in your program cost.
For the 2n-comparison version:
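Roughly, a frame like this (execution counts in the comments; x and y count how often the two assignments fire):

for (i = 0; i < (long)rlen; i++) {  /* loop control: n passes over the file */
    count++;                        /* n times */
    if (n[i] < min)                 /* comparison: n times */
        min = n[i];                 /* assignment: x times */
    if (n[i] > max)                 /* comparison: n times */
        max = n[i];                 /* assignment: y times */
}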
the time cost:
T1 = 2n*c1 + 2n*c2 + x*c3 + y*c3 + 2n*c4 = 2n*(c1 + c2 + c4) + (x + y)*c3
where x and y depend on the order of your data (for instance, with ascending input the min assignment almost never fires while the max assignment fires every time).
And for the (3/2)n-comparison version:
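Again roughly (same conventions; x' and y' count the min and max assignments across both branches):

for (i = 0; i + 1 < (long)rlen; i += 2) {  /* loop control: n/2 passes */
    count += 2;                            /* n/2 times */
    if (n[i] < n[i + 1]) {                 /* comparison: n/2 times */
        /* whichever branch is taken does two more comparisons,
           so n more in total: 3n/2 comparisons overall */
        if (n[i] < min)
            min = n[i];                    /* min assignments: x' times */
        if (n[i + 1] > max)
            max = n[i + 1];                /* max assignments: y' times */
    } else {
        if (n[i + 1] < min)
            min = n[i + 1];
        if (n[i] > max)
            max = n[i];
    }
}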
the time cost:
T2 = n*c1 + n*c5 + 2n*c2 + (x' + y')*c3 + n*c6 = n*(c1 + c5 + c6) + 2n*c2 + (x' + y')*c3
If the max and the min are the first two elements of your data, then x = x' = 1 and y = y' = 1.
Then T1 - T2 = n*c1 + 2n*c4 - n*c5 - n*c6. With a different compiler, T1 - T2 may be different.
It gets more complex because x, y, x', and y' are variable, but I won't go further into that here. My purpose is just to show the real running-time cost.
If you are still confused after reading the above, you should read chapter 2.2 of Introduction to Algorithms.
The key is branch prediction. Unless the file is sorted in a pathological worst-case order, the naive version will perform 2n branches that are predicted correctly almost every single time. Your "clever" version performs n/2 branches that are almost never predicted correctly, and an additional n comparisons that are likely to be predicted correctly.
How much a wrongly predicted branch costs depends a lot on the CPU architecture and even the particular CPU model, but at the very least I would expect an incorrectly predicted branch to cost several times as much as a correctly predicted one. In an extreme case, correctly predicted branches might have an effective cost of zero cycles.
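Incidentally, if you want to dodge the predictor entirely, a branch-free inner loop is possible. This is just a sketch, and it relies on the compiler turning the ternaries into conditional-move instructions, which gcc and clang typically do at -O2:

/* no data-dependent branches, so there is nothing to mispredict */
for (i = 0; i < (long)rlen; i++) {
    min = n[i] < min ? n[i] : min;
    max = n[i] > max ? n[i] : max;
}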
As an interesting example, I recently experimented with optimizing strlen, and found that in isolation an extremely naive unrolled strlen - comparing and branching on one byte at a time - was faster than the clever vectorized approaches. This is almost surely because strlen has the special property that every branch until the last one will always be predicted correctly.

By the way, to test my hypothesis, try this input pattern:
999999999, 1000000001, 999999998, 1000000002, 999999997, 1000000003, ...
It will give worst-case branch prediction for the naive algorithm and best-case for the outer conditional on your clever version.
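A throwaway generator for that pattern might look like this (a sketch: the output file name is arbitrary, and with 500 million pairs it writes the same one-billion-long file size as your tests, about 8 GB assuming 64-bit longs):

#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
    FILE *f = fopen("worstpredict", "w");
    long i, pair[2];

    if (f == NULL) {
        perror("fopen");
        exit(1);
    }
    for (i = 0; i < 500000000; i++) {
        pair[0] = 999999999 - i;   /* strictly decreasing: new min every pair */
        pair[1] = 1000000001 + i;  /* strictly increasing: new max every pair */
        fwrite(pair, sizeof (long), 2, f);
    }
    fclose(f);
    return 0;
}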