Can't make sense of the varying results of my buffering experiments

Posted 2019-08-06 06:59

Question:

It all started with this question -> How to read blocks of data from a file and then read from that block into a vector?

With the aim of minimizing disk I/O operations, I ran a few experiments to see whether the size of the buffer has any effect on the time the program takes.

I used the following two programs, one more C-oriented and the other more C++-oriented (though both compiled with gcc):

The C-oriented code:

#include <stdio.h>

int main(int argc, char *argv[])
{
    int buffer_size = 1024;

    FILE *file = fopen(argv[1], "r");
    FILE *out_file = fopen("in", "w");

    /* the arrays handed to setvbuf must be at least buffer_size bytes */
    char buffer[2048];
    setvbuf(file, buffer, _IOFBF, buffer_size);
    char out_buffer[2048];
    setvbuf(out_file, out_buffer, _IOFBF, buffer_size);

    char sl[1000];
    /* loop on fgets itself: while (!feof(file)) only detects EOF after a
       read has already failed, so it would write the last line twice */
    while (fgets(sl, 140, file))
        fputs(sl, out_file);

    fclose(file);
    fclose(out_file);
    return 0;
}

The C code gave the following results (for a 14 MB file):

Buffer size (bytes)      Time
10                       18 sec
100                      2 sec
1024                     0.4 sec
10240                    0.3 sec

(for a 103 MB file)

Buffer size (bytes)      Time
1024                     ~8 sec
5120                     ~3 sec
10240                    ~3 sec
15360                    ~3 sec

It seems to reach a saturation point at a buffer size of about 5 KB. Any particular reason for this?

The C++-oriented code:

#include <fstream>
using namespace std;

int main(int argc, char *argv[])
{
    const int buffer_size = 1024;   // const, so the arrays below are not VLAs

    ifstream in_file(argv[1]);      // the stream is already open at this point
    char in_buffer[buffer_size];
    in_file.rdbuf()->pubsetbuf(in_buffer, sizeof(in_buffer));

    ofstream out_file("in");
    char out_buffer[buffer_size];
    out_file.rdbuf()->pubsetbuf(out_buffer, sizeof(out_buffer));

    char sl[1024];
    // operator>> extracts one whitespace-delimited word per iteration;
    // looping on the stream state avoids repeating the last word at EOF
    while (in_file >> sl)
        out_file << sl << endl;
    return 0;
}

My test input file was a 14 MB file with 1,000,000 lines.

Buffer size (bytes)      Time (~)
10                       6.5 sec
100                      6.5 sec
1024                     6.5 sec

C++ does not seem to care about the buffer size at all. Why?

Also, the C++ code is about 15 times slower (when the C buffer size is 1 KB)! Is ifstream usually slower than FILE (other answers on SO seem to suggest there is no difference), or is something else in the code causing the slowness?

Answer 1:

Fundamentally, the amount of time spent writing is estimated by a formula of the form:

T = C1*nsyscalls + C2*nbytes

In reality, C1 is a very large constant (cost per syscall) and C2 is a very small constant (cost per byte). The size of your buffer affects the magnitude of the ratio nsyscalls/nbytes; larger buffers make it smaller. The goal of buffering is to have nsyscalls be sufficiently small relative to nbytes that the second term dominates the first term and you're left with T = (C2+epsilon)*nbytes. Once the buffer is sufficiently large that the second term dominates, increasing the buffer size further will not get you any significant performance gains.
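
To see the model in action, here is a minimal sketch (in C++, reusing the stdio calls from the C code above; the file names are placeholders and the buffer sizes are taken from the question's tables) that times the same copy loop under several setvbuf sizes:

#include <cstdio>
#include <chrono>
#include <cstddef>

// Copy src to dst through stdio with a given buffer size and report the
// elapsed time. A larger buffer means fewer read()/write() syscalls
// (smaller nsyscalls) for the same nbytes, so T = C1*nsyscalls + C2*nbytes
// approaches C2*nbytes and stops improving.
static void timed_copy(const char *src, const char *dst, std::size_t bufsize)
{
    std::FILE *in = std::fopen(src, "r");
    std::FILE *out = std::fopen(dst, "w");
    if (!in || !out)
        return;

    static char in_buf[1 << 20], out_buf[1 << 20];  // large enough for any test size
    std::setvbuf(in, in_buf, _IOFBF, bufsize);
    std::setvbuf(out, out_buf, _IOFBF, bufsize);

    auto start = std::chrono::steady_clock::now();
    char line[1024];
    while (std::fgets(line, sizeof line, in))       // loop on fgets, not feof
        std::fputs(line, out);
    std::fclose(in);
    std::fclose(out);
    double secs = std::chrono::duration<double>(
                      std::chrono::steady_clock::now() - start).count();

    std::printf("buffer %zu bytes: %.3f s\n", bufsize, secs);
}

int main()
{
    // buffer sizes taken from the tables above; "input.txt" is a placeholder
    for (std::size_t bufsize : {10, 100, 1024, 10240})
        timed_copy("input.txt", "copy.txt", bufsize);
}

On a typical system the first few increases in buffer size should shrink the time sharply, and then the curve flattens once the C2*nbytes term dominates, which matches the saturation the question observes.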



Answer 2:

Formatted input with >> in iostreams is known to be pretty slow. But the bigger problem is that you are not comparing apples to apples: istream >> into a std::string or char * reads a single whitespace-separated word, which is not what fgets does. Use std::getline for std::string, or istream::getline() for char *, to get similar functionality; then your comparison will make more sense.
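
For instance, a sketch of the loop rewritten with std::getline (an illustration of the suggestion above, not the asker's code):

#include <fstream>
#include <string>

int main(int argc, char *argv[])
{
    std::ifstream in_file(argv[1]);
    std::ofstream out_file("in");

    std::string line;
    // std::getline reads a whole line, like fgets, so each iteration now
    // does comparable work; looping on the stream state also avoids the
    // extra pass that while (!in_file.eof()) performs at end of file
    while (std::getline(in_file, line))
        out_file << line << '\n';   // '\n', not endl: endl also flushes,
                                    // defeating the output buffer on every line
}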

PS: pubsetbuf() should be called before the file is opened. That your code calls pubsetbuf() after the file is already open may be the reason you do not observe any change in read speed.
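
A sketch of the order of operations that gives pubsetbuf() a chance to take effect: default-construct the stream, install the buffer, and only then open the file. Whether the buffer is actually honoured is still implementation-defined, as Answer 3 explains:

#include <fstream>

int main(int argc, char *argv[])
{
    char in_buffer[1024];

    std::ifstream in_file;                                    // not opened yet
    in_file.rdbuf()->pubsetbuf(in_buffer, sizeof in_buffer);  // buffer first...
    in_file.open(argv[1]);                                    // ...then open
    // ... read from in_file as before ...
}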



Answer 3:

The problem with your C++ code is that there's no way to set the buffer size in a streambuf. Calling pubsetbuf(0, 0) will make an output filestream unbuffered, but using any other values won't do anything in particular. From the spec:

basic_streambuf* setbuf(char_type* s, streamsize n);

Effects: If setbuf(0,0) is called on a stream before any I/O has occurred on that stream, the stream becomes unbuffered. Otherwise the results are implementation-defined. “Unbuffered” means that pbase() and pptr() always return null and output to the file should appear as soon as possible.

It looks like in your case, the implementation ignores setbuf...
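
For completeness, the one portable effect the standard does guarantee is the setbuf(0, 0) case quoted above. A sketch, assuming pubsetbuf() is called before any I/O on the stream:

#include <fstream>

int main()
{
    std::ofstream out_file;
    // setbuf(0, 0) before any I/O is the one specified case: the stream
    // becomes unbuffered and output appears as soon as possible; any other
    // arguments are implementation-defined and may be silently ignored
    out_file.rdbuf()->pubsetbuf(nullptr, 0);
    out_file.open("in");
    out_file << "unbuffered write\n";
}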