C fprintf/fscanf optimising speed for big files

2019-07-15 19:46发布

问题:

I merge two big(near 8 GB every) files into one. I try to optimise it as good as it's only possible.

void merge() {

   char *array[17]= {"q.out","b.out"}; // names of input files
   FILE *finpt1 = fopen(array[0],"r"), *finpt2 = fopen (array[1],"r"), 
                                     *foutp = fopen("final_.out","w");

   u_int32_t a,b;

   fscanf(finpt1, "%u", &a);
   fscanf(finpt2, "%u", &b);
   int EOF1_my = 0, EOF2_my = 0;

   while (true) {
     if ( a>b ) {
         fprintf( foutp,"%u\n", b);
         if ( fscanf(finpt2, "%u", &b)  == EOF) { EOF2_my = EOF; break; }
      } else  {
        fprintf( foutp,"%u\n", a);
        if ( fscanf(finpt1, "%u", &a) == EOF) { EOF1_my = EOF; break; }
       }
   }

    if ( EOF1_my == EOF) {
      while ( fscanf(finpt2, "%u", &a) != EOF)
         fprintf(foutp, "%u\n", a);
     } else if ( EOF2_my == EOF) {
         while (fscanf(finpt1, "%u", &b) != EOF)
          fprintf( foutp,"%u\n", b);
    }

   fclose(finpt1); fclose(finpt2); fclose(foutp);
}

I suspect that calling printf many times consumes significant resources (I noticed that my programs with logging works drastically slower than without as a rule). And I think that most its time it spends formatting string (not writing to a file, because buffering is used).

So I wonder whether it would be better to form strings to output by myself in memory and write, e.g. 10000 symbols to a file for one appealing to fprintf function- like fprintf("%s", string);?

I have the same doubts about fscanf. Perhaps I should use some other functions?

Any thoughts are welcomed. Thanks in advance!

FIXING BUG
Thanks to sfstewman (noticed in comments to question). Cool, it's really valuable information that I wouldn't notice until I wouldn't have started writing of tests (or possible never)
Thanks you for your code, but anyway giving me ready code you leave me without fun.
It's my slice of the cake!
Idea is much more valuable, now I know what lexicographical comparison is for )

回答1:

Your inputs are all unsigned numbers. This means that you can use a lexicographic comparison instead of a numeric comparison.

To make the lexicographic comparison work for strings of unsigned numbers, first compare the lengths of the strings (shorter strings are smaller numbers). If the lengths are equal, strcmp will indicate which string which has the smaller number.

If you use newline as the delimiter between numbers, you can use fgets and fputs to read/write, eliminating the cost of the formatting in fscanf and fprintf. This eliminates all conversions between strings a numbers. The newline at the end of the string returned by fgets is constant across all numbers, and won't affect the lexicographic comparison.

I generated two 9.5M files of random unsigned numbers separated by a newline, and ran timing comparisons (merge1 is your above code, merge2 is below):

% time ./merge1
./merge1  0.89s user 0.08s system 99% cpu 0.974 total
% time ./merge2
./merge2  0.18s user 0.08s system 98% cpu 0.264 total

And, on a larger test set (two 537M files of random numbers between 0 and 2^30-1):

% time ./merge1
./merge1  51.22s user 4.57s system 54% cpu 1:41.68 total
% time ./merge2
./merge2  11.13s user 4.68s system 18% cpu 1:26.81 total

This suggests that numerical conversion is taking something like 75%-80% of your time. If this isn't fast enough, you can probably optimize this further by doing your own buffering and searching for the delimiters with strchr, and possibly by using memory-mapped files.

#include <stdio.h>
#include <string.h>

void merge()
{
   char *array[17]= {"q.out","b.out"}; // names of input files
   FILE *finpt1 = fopen(array[0],"r"), *finpt2 = fopen (array[1],"r"), 
                                     *foutp = fopen("final_.out","w");

   char buf1[32];
   char buf2[32];

   memset(buf1,0,sizeof(buf1));
   memset(buf2,0,sizeof(buf2));

   int EOF1_my = (fgets(buf1, sizeof(buf1), finpt1) == NULL);
   int EOF2_my = (fgets(buf2, sizeof(buf2), finpt2) == NULL);

   size_t l1 = strlen(buf1);
   size_t l2 = strlen(buf2);

   if (!EOF1_my && !EOF2_my)
   {
     for(;;)
     {
       /* unsigned numbers, so use a lexographic comparison */
       int diff = (l1 == l2) ? strcmp(buf1,buf2) : l1-l2;

       if (diff < 0)
       {
         fputs( buf1, foutp );
         memset(buf1,0,sizeof(buf1));
         EOF1_my = (fgets(buf1, sizeof(buf1), finpt1) == NULL);
         if (EOF1_my) break;
         l1 = strlen(buf1);
       }
       else
       {
         fputs( buf2, foutp );
         memset(buf2,0,sizeof(buf2));
         EOF2_my = (fgets(buf2, sizeof(buf2), finpt2) == NULL);
         if (EOF2_my) break;
         l2 = strlen(buf2);
       }
     }
   }

   FILE* frest = NULL;
   if (!EOF1_my || !EOF2_my)
   {
     if (!EOF1_my)
     {
       fputs(buf1, foutp);
       frest = finpt1;
     }
     else
     {
       fputs(buf2, foutp);
       frest = finpt2;
     }

     memset(buf1,0,sizeof(buf1));

     while(fgets(buf1,sizeof(buf1),frest) != NULL)
     {
       fputs(buf1,foutp);
     }
   }

   fclose(finpt1); fclose(finpt2); fclose(foutp);
}

int main()
{
  merge();
  return 0;
}