Find both min and max together: algorithm should be faster but isn't

Posted 2019-04-15 04:29

Question:

I'm trying to implement an algorithm to find the minimum and the maximum values among a set of longs in a file. My test file contains one billion longs.

The algorithm works as expected but does not perform faster than the naive version. It should be significantly faster as the naive version performs roughly 2n comparisons, whereas this version performs 3n/2 comparisons.
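
The counting behind that claim: the naive scan does 2 comparisons per element (one against the running min, one against the running max), hence 2n in total; the pairwise scan does 3 comparisons per pair of elements (one inside the pair, then the smaller one against min and the larger one against max), hence 3n/2 in total.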

$ time ./findminmax_naive somelongs 
count: 1000000000
min: 0
max: 2147483647

real    0m24.156s
user    0m4.956s
sys     0m3.896s

$ time ./findminmax_faster somelongs 
count: 1000000000
min: 0
max: 2147483647

real    0m25.048s
user    0m6.948s
sys     0m3.980s

Here is the "naive" version:

#include <alloca.h>
#include <stdio.h>
#include <stdlib.h>

int
main(int ac, char *av[])
{
        FILE *f;
        long count, readcount, i, min, max;
        size_t rlen;
        long *n;

        if (ac != 2 && ac != 3) {
                fprintf(stderr, "Usage: %s <file> [readcount]\n", av[0]);
                exit(1);
        }

        f = fopen(av[1], "r");
        if (f == NULL)
            perror("fopen");
        readcount = 1024;
        if (ac == 3)
            readcount = atol(av[2]);
        n = alloca(sizeof (long) * readcount);
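        /* prime min/max with the first long in the file */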
        rlen = fread(n, sizeof (*n), 1, f);
        min = max = n[0];
        count = 1;
        while (1) {
                rlen = fread(n, sizeof (*n), readcount, f);
                for (i = 0; i < (long)rlen; i++) {
                        count++;
                        if (n[i] < min)
                            min = n[i];
                        if (n[i] > max)
                            max = n[i];
                }
                if (feof(f))
                        break;
        }
        printf("count: %ld\n", count);
        printf("min: %ld\n", min);
        printf("max: %ld\n", max);
        exit(0);
}

Here is the code of the supposedly "faster" version:

#include <alloca.h>
#include <stdio.h>
#include <stdlib.h>

int
main(int ac, char *av[])
{
        FILE *f;
        long count, readcount, i, min, max;
        size_t rlen;
        long *n;

        if (ac != 2 && ac != 3) {
                fprintf(stderr, "Usage: %s <file> [readcount]\n", av[0]);
                exit(1);
        }

        f = fopen(av[1], "r");
        if (f == NULL)
                perror("fopen");
        readcount = 1024;
        if (ac == 3)
                readcount = atol(av[2]);
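        /* round readcount up to an even value so elements can be processed in pairs */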
        readcount = (readcount + 1) & (-1 << 1);
        n = alloca(sizeof (long) * readcount);
        rlen = fread(n, sizeof (*n), 1, f);
        min = max = n[0];
        count = 1;
        while (1) {
                rlen = fread(n, sizeof (*n), readcount, f);
                for (i = 0; i < (long)rlen; i += 2) {
                        count += 2;
                        if (n[i] < n[i + 1]) {
                                if (n[i] < min)
                                        min = n[i];
                                if (n[i + 1] > max)
                                        max = n[i + 1];
                        } else {
                                if (n[i + 1] < min)
                                        min = n[i + 1];
                                if (n[i] > max)
                                        max = n[i];
                        }
                }
                if (feof(f))
                        break;
        }
        if (rlen % 2) {
                if (n[rlen - 1] < min)
                        min = n[rlen - 1];
                if (n[rlen - 1] > max)
                        max = n[rlen - 1];
                count++;
        }
        printf("count: %ld\n", count);
        printf("min: %ld\n", min);
        printf("max: %ld\n", max);
        exit(0);
}

Do you have any idea what I missed?

Thanks for your help.

-- Jeremie

Answer 1:

The key is branch prediction. Unless the file is sorted in a pathological worst-case order, the naive version will perform 2n branches that are predicted correctly almost every single time. Your "clever" version performs n/2 branches that are almost never predicted correctly, and an additional n comparisons that are likely to be predicted correctly.

How much a wrongly-predicted branch costs depends a lot on the CPU architecture and even the particular CPU model, but at the very least I would expect an incorrectly predicted branch to cost several times as much as a correctly predicted one. In an extreme case, correctly-predicted branches might have an effective cost of zero cycles.
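
As an aside (my own sketch, not something from the asker's code), one way to see how much the branches themselves matter is to write the scan branch-free; compilers will usually turn the ternaries below into conditional moves, whose cost does not depend on the input order:

static void
minmax_branchless(const long *n, long count, long *minp, long *maxp)
{
        long i, min, max;

        min = max = n[0];
        for (i = 1; i < count; i++) {
                min = n[i] < min ? n[i] : min;  /* typically a cmov */
                max = n[i] > max ? n[i] : max;  /* typically a cmov */
        }
        *minp = min;
        *maxp = max;
}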

As an interesting example, I recently experimented with optimizing strlen, and found that in isolation an extremely naive unrolled strlen - comparing and branching on one byte at a time - was faster than the clever vectorized approaches. This is almost surely because strlen has the special property that every branch until the last one will always be predicted correctly.

By the way, to test my hypothesis, try this input pattern:

999999999, 1000000001, 999999998, 1000000002, 999999997, 1000000003, ...

It will give worst-case branch prediction for the naive algorithm and best-case for the outer conditional on your clever version.
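
If it helps, here is a small generator for that pattern (my own sketch; it writes raw native-endian longs like the test files used in the question, and the name genpattern is just an example):

#include <stdio.h>
#include <stdlib.h>

int
main(int ac, char *av[])
{
        FILE *f;
        long i, count, lo, hi;

        if (ac != 3) {
                fprintf(stderr, "Usage: %s <count> <file>\n", av[0]);
                exit(1);
        }
        count = atol(av[1]);
        f = fopen(av[2], "w");
        if (f == NULL) {
                perror("fopen");
                exit(1);
        }
        lo = 999999999;
        hi = 1000000001;
        for (i = 0; i < count; i += 2) {
                /* each pair widens the range on both ends */
                fwrite(&lo, sizeof (lo), 1, f);
                fwrite(&hi, sizeof (hi), 1, f);
                lo--;
                hi++;
        }
        fclose(f);
        exit(0);
}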



Answer 2:

As @chr said, "The file I/O will dwarf any optimization in the algorithm itself".

Besides, fewer comparisons do not automatically mean less running time. Both algorithms have time complexity O(n), which ignores the actual cost of each statement and captures only the abstract growth.

As a rough sketch of the two algorithms, the total running time is the sum of the costs of all the statements your program executes. For example:

//max and min initialized as 0.
//c1, ... represent the time cost of each statement.
while (i < count) {   //c1
    if (a[i] > max)   //c2
        max = a[i];   //c3
    i++;              //c4
}
//the search for min is a similar loop

The time cost is:

T1 = 2n*c1 + 2n*c2 + x*c3 + y*c3 + 2n*c4 = 2n*(c1 + c2 + c4) + (x + y)*c3

where x and y depend on the order of your data.

And for the version with (3/2)n comparisons:

while (i < count) {        //c1
    if (a[i] < a[i+1]) {   //c5
        if (a[i] < min)    //c2
            min = a[i];    //c3
        if (a[i+1] > max)  //c2
            max = a[i+1];  //c3
    } else {
        //same as above, with i and i+1 swapped
    }
    i += 2;                //c6
}

The time cost is (the loop body now runs n/2 times):

T2 = (n/2)*c1 + (n/2)*c5 + n*c2 + (x' + y')*c3 + (n/2)*c6 = (n/2)*(c1 + c5 + c6) + n*c2 + (x' + y')*c3

where x' and y' again depend on the order of your data.

If the max and the min are among the first two elements of your data, x = x' = 1 and y = y' = 1, so

T1 - T2 = (3n/2)*c1 + n*c2 + 2n*c4 - (n/2)*c5 - (n/2)*c6,

and how that balance works out depends on what each statement costs with your compiler and CPU.

A further complication is that x, y, x' and y' vary with the data, but I won't discuss that further. My purpose is just to show what makes up the real running time.

If you are still confused after reading the above, you should read chapter 2.2 of Introduction to Algorithms.



Answer 3:

I can think of two things:

  1. It's playing hell with the branch predictor.
  2. The naive version is getting auto-vectorized by the compiler and the "clever" version isn't (see the sketch after this list).
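
For what it's worth, here is a rough sketch (mine, not the answerer's code) of the kind of shape that is friendlier to the optimizer: independent accumulators shorten the dependency chains and give the compiler a chance to auto-vectorize on targets that have 64-bit vector min/max.

/* Sketch only: scan with four independent min/max accumulators.
 * Assumes count >= 4. */
static void
minmax_unrolled(const long *n, long count, long *minp, long *maxp)
{
        long i, min0, min1, min2, min3, max0, max1, max2, max3;

        min0 = max0 = n[0];
        min1 = max1 = n[1];
        min2 = max2 = n[2];
        min3 = max3 = n[3];
        for (i = 4; i + 3 < count; i += 4) {
                if (n[i]     < min0) min0 = n[i];
                if (n[i]     > max0) max0 = n[i];
                if (n[i + 1] < min1) min1 = n[i + 1];
                if (n[i + 1] > max1) max1 = n[i + 1];
                if (n[i + 2] < min2) min2 = n[i + 2];
                if (n[i + 2] > max2) max2 = n[i + 2];
                if (n[i + 3] < min3) min3 = n[i + 3];
                if (n[i + 3] > max3) max3 = n[i + 3];
        }
        for (; i < count; i++) {        /* leftover tail */
                if (n[i] < min0) min0 = n[i];
                if (n[i] > max0) max0 = n[i];
        }
        min0 = min1 < min0 ? min1 : min0;
        min0 = min2 < min0 ? min2 : min0;
        min0 = min3 < min0 ? min3 : min0;
        max0 = max1 > max0 ? max1 : max0;
        max0 = max2 > max0 ? max2 : max0;
        max0 = max3 > max0 ? max3 : max0;
        *minp = min0;
        *maxp = max0;
}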


Answer 4:

First of all, excuse me for answering all the questions in a single answer. I know I should not do this on stackoverflow.com, but since the different subjects are more or less intertwined, it is easier that way.

Introduction

So, here is the code I am now using to test the algorithms. The differences from the previous versions are:

  • it includes both versions, which you choose through a command-line argument;
  • @chr, @Dukeling: it mmap(2)s the file to avoid syscalls and library calls during the scan;
  • @chr, @Dukeling: it has an optional "warmup" option to fault all the pages into memory before running the chosen algorithm;
  • @Dukeling: the program records the time taken by the algorithm itself only, using gettimeofday(2).

Here is the code:

#include <sys/mman.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define ROUNDUP(x, y)   ((((x) + ((y) - 1)) / (y)) * (y))
#define USCARRY(x)      ((x) < 0 ? (x) + 1000000 : (x))

int
main(int ac, char *av[])
{
        struct stat st;
        struct timeval start, end;
        long count, min, max, pgsiz;
        long *n, *p, *endp;
        int fd, warmup;

        if (ac != 3 && ac != 4) {
                fprintf(stderr, "Usage: %s <\"trivial\"|\"faster\"> "
                    "<file> [warmup]\n", av[0]);
                exit(1);
        }

        fd = open(av[2], O_RDONLY);
        if (fd == -1)
                perror("open");
        fstat(fd, &st);
        pgsiz = sysconf(_SC_PAGESIZE);
        n = mmap(NULL, ROUNDUP(st.st_size, pgsiz), PROT_READ,
            MAP_SHARED, fd, 0);
        if (n == MAP_FAILED)
                perror("mmap");
        warmup = 0;
        if (ac == 4)
                warmup = atoi(av[3]);
        count = st.st_size / sizeof (*n);
        endp = &n[count - 1];
        //printf("%zu\n", sizeof (*p));
        //printf("%zu\n", sizeof (count));
        //exit(0);
        // optionally warm up the filesystem cache first
        if (warmup) {
                for (p = n; p <= endp; p++) {
                        fwrite(p, sizeof (*p), 1, stdout);
                        min = *p;
                }
        }
        // start algorithm
        gettimeofday(&start, NULL);
        if (!strcmp(av[1], "trivial")) {
                min = max = n[0];
                p = &n[1];
                while (p <= endp) {                     // c1 * n
                        if (p[0] < min)                 // c2 * n
                                min = p[0];             // c3 * x
                        if (p[0] > max)                 // c2 * n
                                max = p[0];             // c3 * y
                        p++;                            // c4 * n
                }                       
        } else if (!strcmp(av[1], "faster")) {
                min = max = n[0];
                p = &n[1];
                while (p < endp) {                      // c1 * n/2
                        if (p[0] < p[1]) {              // c2 * n/2
                                if (p[0] < min)         // c2 * n/4
                                        min = p[0];     // c3 * x/2
                                if (p[1] > max)         // c2 * n/4
                                        max = p[1];     // c3 * y/2
                        } else {
                                if (p[1] < min)         // c2 * n/4
                                        min = p[1];     // c3 * x/2
                                if (p[0] > max)         // c2 * n/4
                                        max = p[0];     // c3 * y/2
                        }
                        p += 2;                         // c5 * n/2
                }                                       
                if (p == endp) {
                        if (*endp < min)
                                min = *endp;
                        else if (*endp > max)
                                max = *endp;
                }
        } else {
                printf("sorry\n");
                exit(1);
        }
        gettimeofday(&end, NULL);
        printf("time: %ld.%ld\n", end.tv_sec - start.tv_sec,
            USCARRY(end.tv_usec - start.tv_usec));
        printf("count: %ld\n", count);
        printf("min: %ld\n", min);
        printf("max: %ld\n", max);
        exit(0);
}

Test cases

Here are the files I use for test case:

$ ls -l _*
-rw-r--r-- 1 jlh jlh 2400000000 May 27 23:37 _bestcase
-rw-r--r-- 1 jlh jlh 2400000000 May 27 08:40 _random
-rw-r--r-- 1 jlh jlh 2400000000 May 27 23:38 _worstcase

$ od -N 64 -t dL _bestcase 
0000000                    0            300000000
0000020                    1            299999999
0000040                    2            299999998
0000060                    3            299999997
0000100

$ od -N 64 -t dL _random
0000000  8600270969454287374  8436406000964321037
0000020  7348498334162074664  2050332610730417325
0000040  8183771519386464726  4134520779774161130
0000060  2045475555785071180  2859007060406233926
0000100

$ od -N 64 -t dL _worstcase 
0000000            150000000            150000000
0000020            149999999            150000001
0000040            149999998            150000002
0000060            149999997            150000003
0000100

I/O penalty

Ok, first let's warm up the cache and verify that there are no missing pages left that could screw up the results:

$ ./findminmax trivial _random 
time: 3.543573
count: 300000000
min: 31499144704
max: 9223372004409096866

$ ./findminmax trivial _random 
time: 1.466323
count: 300000000
min: 31499144704
max: 9223372004409096866

$ perf stat -e  minor-faults,major-faults ./findminmax trivial _random 
time: 1.284729
count: 300000000
min: 31499144704
max: 9223372004409096866

 Performance counter stats for './findminmax trivial _random':

           586,066 minor-faults                                                
                 0 major-faults                                                

       1.350118552 seconds time elapsed

So as you can see, there were no major page faults. From now on, we can take it for granted that there will be no I/O impact.

Instruction count and branch misses

@H2CO3, @vvy, you are absolutely right that the other instructions count as well (I will assume here that each operation takes the same number of CPU cycles, and leave CPU cache misses aside). I don't know why the literature I have read so far about algorithms focuses only on the number of comparisons (ok, I admit I haven't read too much of it ;)).

I have commented each step of the loops with my own reckoning of the average case (I think the worst case is not interesting here), and it differs slightly from yours.

If I'm not mistaken:

  • for the trivial version: n * (c1 + 2*c2 + c4) + (x + y) * c3
  • for the faster version: n/2 * (c1 + 3*c2 + c5) + (x + y) * c3

Now, in my understanding, it is difficult to go further and speculate on how many CPU cycles each cN takes, as it varies from CPU to CPU.

Let's check how many instructions, branches and branch misses happened on my computer, which is mostly idle, while each algorithm executes on each test case, with a warm cache (note that I tested each case 3 times to verify that there is no significant deviation):

Random case

$ perf stat -e branches,branch-miss,instructions ./findminmax_stackoverflow trivial _random
time: 1.547087
count: 300000000
min: 31499144704
max: 9223372004409096866

 Performance counter stats for './findminmax_stackoverflow trivial _random':

     1,083,101,126 branches                                                    
            52,388 branch-miss

     4,335,175,257 instructions              #    0.00  insns per cycle        

       1.623851849 seconds time elapsed


$ perf stat -e branches,branch-miss,instructions ./findminmax_stackoverflow faster _random
time: 2.748967
count: 300000000
min: 31499144704
max: 9223372004409096866

 Performance counter stats for './findminmax_stackoverflow faster _random':

       783,120,927 branches                                                    
        75,063,008 branch-miss

     3,735,286,264 instructions              #    0.00  insns per cycle        

       1.824884443 seconds time elapsed

Note that we have fewer instructions for the faster version, but it takes much longer to run nonetheless, quite probably because there are more branch misses, by three orders of magnitude!
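
To put rough numbers on that (my own arithmetic from the counts above):

4,335,175,257 / 300,000,000 ≈ 14.5 instructions per element for the trivial version
3,735,286,264 / 300,000,000 ≈ 12.5 instructions per element for the "faster" version
75,063,008 misses / ~150,000,000 pairs ≈ 0.5, i.e. roughly one miss per pair

So the "faster" version really does execute fewer instructions, but its outer comparison seems to be mispredicted about half the time on random data, and each miss costs many cycles.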

Best case

$ perf stat -e branches,branch-miss,instructions ./findminmax_stackoverflow trivial _bestcase 
time: 1.267697
count: 300000000
min: 0
max: 300000000

 Performance counter stats for './findminmax_stackoverflow trivial _bestcase':

     1,082,801,759 branches                                                    
            49,802 branch-miss

     4,334,200,448 instructions              #    0.00  insns per cycle        

       1.343425753 seconds time elapsed


$ perf stat -e branches,branch-miss,instructions ./findminmax_stackoverflow faster _bestcase 
time: 0.957440
count: 300000000
min: 0
max: 300000000

 Performance counter stats for './findminmax_stackoverflow faster _bestcase':

       782,844,232 branches                                                    
            49,768 branch-miss

     3,734,103,167 instructions              #    0.00  insns per cycle        

       1.035189822 seconds time elapsed

Worst case

$ perf stat -e branches,branch-miss,instructions ./findminmax_stackoverflow trivial _worstcase 
time: 7.860047
count: 300000000
min: 1
max: 299999999

 Performance counter stats for './findminmax_stackoverflow trivial _worstcase':

     1,490,947,270 branches                                                    
         2,127,876 branch-miss

     7,159,600,607 instructions              #    0.00  insns per cycle        

       6.916856158 seconds time elapsed


$ perf stat -e branches,branch-miss,instructions ./findminmax_stackoverflow faster _worstcase 
time: 7.616476
count: 300000000
min: 1
max: 299999999

 Performance counter stats for './findminmax_stackoverflow faster _worstcase':

     1,196,744,455 branches                                                    
         2,024,250 branch-miss

     6,594,182,002 instructions              #    0.00  insns per cycle        

       6.675068846 seconds time elapsed

So, very interestingly, the "random" case is actually faster than the worst case, which does not show much difference between the two algorithms. The only difference I see is that my worst case contains "small" numbers (which fit in 32 bits) whereas the random case contains true 64-bit numbers.

Random case with "small" integers

So let's try with a set of "small" random numbers (still stored in 64-bit words):

$ od -N 64 -t dL _randomsmall 
0000000           1418331637           2076047555
0000020             22077878           1677970822
0000040           1845481621            609558726
0000060           1668260452            335112094
0000100

$ perf stat -e branches,branch-miss,instructions ./findminmax_stackoverflow trivial _randomsmall 
time: 7.682443
count: 300000000
min: 9
max: 2147483647

 Performance counter stats for './findminmax_stackoverflow trivial _randomsmall':

     1,481,062,942 branches                                                    
         2,564,853 branch-miss

     6,223,311,378 instructions              #    0.00  insns per cycle        

       6.739897078 seconds time elapsed


$ perf stat -e branches,branch-miss,instructions ./findminmax_stackoverflow faster _randomsmall 
time: 7.772994
count: 300000000
min: 9
max: 2147483647

 Performance counter stats for './findminmax_stackoverflow faster _randomsmall':

     1,177,042,675 branches                                                    
        77,686,346 branch-miss

     5,607,194,799 instructions              #    0.00  insns per cycle        

       6.834074812 seconds time elapsed

So, as I guessed, small numbers are actually slower to process than large ones, even though they are all stored in 64-bit words. There are many more branch misses with "small" numbers, for a reason that probably only CPU designers would be able to tell :-).

Another noticeable thing is that the elapsed time measured by perf(1) is sometimes smaller than the one measured by the program itself. I think this is explained by the fact that the program measures real (wall-clock) time, whereas perf(1) measures the process time (the time the process actually runs). I tried to use getrusage(2), but the times I get there do not match (for instance I get 1.6s of user time and 1.4s of system time, whereas perf(1) measures 6.8s).
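
For what it's worth, a monotonic clock would sidestep the real-time vs process-time confusion; here is a minimal sketch of how the timing part could be done with clock_gettime(2):

#include <stdio.h>
#include <time.h>

/* Minimal sketch: measure elapsed wall-clock time with a monotonic
 * clock instead of gettimeofday(2)/getrusage(2). */
int
main(void)
{
        struct timespec start, end;
        double elapsed;

        clock_gettime(CLOCK_MONOTONIC, &start);
        /* ... run the chosen algorithm here ... */
        clock_gettime(CLOCK_MONOTONIC, &end);
        elapsed = (end.tv_sec - start.tv_sec)
            + (end.tv_nsec - start.tv_nsec) / 1e9;
        printf("time: %.6f\n", elapsed);
        return 0;
}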

Conclusion

  • There is no big practical difference between the two algorithms in terms of execution time, although the trivial one has far fewer branch misses (by orders of magnitude) than the "faster" one; this seems to be balanced by the larger number of instructions it executes (10-20% more);
  • More specifically, this branch miss difference shows up only in the random case: the best and worst cases seem to be degenerate in this regard, as they lead to about the same number of branch misses for both algorithms; therefore the "faster" algorithm is indeed slightly faster than the trivial one in those cases, while in the "common" random case the trivial algorithm is a little bit faster;
  • Small integers generate far more branch misses on my CPU;
  • There is no real difference between the worst case and the random test case with small integers.

So, if you got this far, thank you :). I am sorry though that all this experimentation led only to vague conclusions. Hopefully enlightened readers will shed some light on this :).

-- Jeremie