I merge two big(near 8 GB every) files into one. I try to optimise it as good as it's only possible.
void merge() {
char *array[17]= {"q.out","b.out"}; // names of input files
FILE *finpt1 = fopen(array[0],"r"), *finpt2 = fopen (array[1],"r"),
*foutp = fopen("final_.out","w");
u_int32_t a,b;
fscanf(finpt1, "%u", &a);
fscanf(finpt2, "%u", &b);
int EOF1_my = 0, EOF2_my = 0;
while (true) {
if ( a>b ) {
fprintf( foutp,"%u\n", b);
if ( fscanf(finpt2, "%u", &b) == EOF) { EOF2_my = EOF; break; }
} else {
fprintf( foutp,"%u\n", a);
if ( fscanf(finpt1, "%u", &a) == EOF) { EOF1_my = EOF; break; }
}
}
if ( EOF1_my == EOF) {
while ( fscanf(finpt2, "%u", &a) != EOF)
fprintf(foutp, "%u\n", a);
} else if ( EOF2_my == EOF) {
while (fscanf(finpt1, "%u", &b) != EOF)
fprintf( foutp,"%u\n", b);
}
fclose(finpt1); fclose(finpt2); fclose(foutp);
}
I suspect that calling printf many times consumes significant resources (I noticed that my programs with logging works drastically slower than without as a rule). And I think that most its time it spends formatting string (not writing to a file, because buffering is used).
So I wonder whether it would be better to form strings to output by myself in memory and write, e.g. 10000 symbols to a file for one appealing to fprintf function- like fprintf("%s", string);?
I have the same doubts about fscanf. Perhaps I should use some other functions?
Any thoughts are welcomed. Thanks in advance!
FIXING BUG
Thanks to sfstewman (noticed in comments to question).
Cool, it's really valuable information that I wouldn't notice until I wouldn't have started writing of tests (or possible never)
Thanks you for your code, but anyway giving me ready code you leave me without fun.
It's my slice of the cake!
Idea is much more valuable, now I know what lexicographical comparison is for )
Your inputs are all unsigned numbers. This means that you can use a lexicographic comparison instead of a numeric comparison.
To make the lexicographic comparison work for strings of unsigned numbers, first compare the lengths of the strings (shorter strings are smaller numbers). If the lengths are equal,
strcmp
will indicate which string which has the smaller number.If you use newline as the delimiter between numbers, you can use
fgets
andfputs
to read/write, eliminating the cost of the formatting infscanf
andfprintf
. This eliminates all conversions between strings a numbers. The newline at the end of the string returned byfgets
is constant across all numbers, and won't affect the lexicographic comparison.I generated two 9.5M files of random unsigned numbers separated by a newline, and ran timing comparisons (merge1 is your above code, merge2 is below):
And, on a larger test set (two 537M files of random numbers between 0 and 2^30-1):
This suggests that numerical conversion is taking something like 75%-80% of your time. If this isn't fast enough, you can probably optimize this further by doing your own buffering and searching for the delimiters with
strchr
, and possibly by using memory-mapped files.