I want to process each line of a file on disk. Is it better to load the file as a whole and then split it on newline characters (using boost), or is it better to use getline()? My question is: does getline() read a single line each time it is called (resulting in multiple hard disk accesses), or does it read the whole file and hand it out line by line?
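For concreteness, a minimal sketch of the two approaches being compared (the file name, function names, and the exact boost call are placeholders, not code from the question):

    #include <fstream>
    #include <sstream>
    #include <string>
    #include <vector>
    #include <boost/algorithm/string.hpp>

    // Approach 1: read one line at a time with std::getline().
    void read_line_by_line(const char* path)
    {
        std::ifstream in(path);
        std::string line;
        while (std::getline(in, line))
        {
            // process(line);
        }
    }

    // Approach 2: slurp the whole file into memory, then split on '\n'.
    void read_whole_then_split(const char* path)
    {
        std::ifstream in(path, std::ios::binary);
        std::ostringstream buffer;
        buffer << in.rdbuf();                  // load the entire file into one string
        std::string contents = buffer.str();

        std::vector<std::string> lines;
        boost::split(lines, contents, boost::is_any_of("\n"));
        // process(lines[i]) ...
    }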
getline will call read() as a system call somewhere deep in the guts of the C library. Exactly how many times it is called, and how it is called, depends on the C library design. But most likely there is no distinct difference in reading a line at a time vs. the whole file, because the OS at the bottom layer will read (at least) one disk block at a time, and most likely at least a "page" (4KB), if not more.

Further, unless you do nearly nothing with your string after you have read it (e.g. you are writing something like "grep", so mostly just reading to find a string), it is unlikely that the overhead of reading a line at a time is a large part of the time you spend.
But "load the whole file in one go" has several distinct problems:
- You need enough memory to hold the entire file at once, which may not be available for very large files.
- Nothing can be processed until the entire read has finished, so the first line is not available any sooner.
Don't try to optimise something unless you have used profiling to prove that it's part of why your code is running slow. You are just causing more problems for yourself.
Edit: So, I wrote a program to measure this, since I think it's quite interesting.
And the results are definitely interesting. To make the comparison fair, I created three large files of 1297984192 bytes each (by copying all the source files in a directory with about a dozen different source files into one file, then copying that file several times over to "multiply" it, until the test took over 1.5 seconds to run, which is about how long I think you need to run things so that the timing isn't too susceptible to random outside influences, such as a "network packet came in" moment, taking time out of the process).
I also decided to measure the system and user time used by the process.
Here are the three different functions to read the file (there's some code to measure time and such as well, of course, but to keep this post shorter I chose not to include all of that; I also played around with the ordering to see if that made any difference, so the results above are not in the same order as the functions here).
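The measurement code itself isn't reproduced here, so what follows is only a rough sketch of what three such functions could look like (the function names, the fixed chunk size, and the running total used to keep the compiler from optimising the reads away are assumptions, not the original code):

    #include <cstddef>
    #include <fstream>
    #include <iterator>
    #include <string>
    #include <vector>

    // 1. Read one line at a time with std::getline.
    std::size_t by_line(const char* path)
    {
        std::ifstream in(path);
        std::string line;
        std::size_t total = 0;
        while (std::getline(in, line))
            total += line.size();              // "use" the data so it isn't optimised away
        return total;
    }

    // 2. Read the whole file into one string, then walk the newlines.
    std::size_t whole_file(const char* path)
    {
        std::ifstream in(path, std::ios::binary);
        std::string contents((std::istreambuf_iterator<char>(in)),
                             std::istreambuf_iterator<char>());
        std::size_t total = 0, start = 0;
        for (std::size_t pos; (pos = contents.find('\n', start)) != std::string::npos; start = pos + 1)
            total += pos - start;
        return total;
    }

    // 3. Read fixed-size chunks with istream::read.
    std::size_t by_chunk(const char* path)
    {
        std::ifstream in(path, std::ios::binary);
        std::vector<char> buf(64 * 1024);
        std::size_t total = 0;
        while (in.read(buf.data(), static_cast<std::streamsize>(buf.size())) || in.gcount() > 0)
            total += static_cast<std::size_t>(in.gcount());
        return total;
    }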
The OS will read a whole block of data (depending on how the disk is formatted, typically 4-8k at a time) and do some of the buffering for you. Let the OS take care of it for you, and read the data in the way that makes sense for your program.
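If you do want some control over how much buffering the stream layer adds on top of the OS, one knob that exists is the stream's own buffer; a hedged sketch follows (the 1 MiB size is arbitrary, the file name is a placeholder, and whether pubsetbuf honours the request before open() is implementation-defined):

    #include <fstream>
    #include <string>

    int main()
    {
        std::ifstream in;
        static char big_buffer[1 << 20];                        // 1 MiB, an arbitrary choice
        in.rdbuf()->pubsetbuf(big_buffer, sizeof big_buffer);   // effect is implementation-defined
        in.open("input.txt");                                   // placeholder file name

        std::string line;
        while (std::getline(in, line))
        {
            // process(line); the stream, not your code, decides how often read() is hit
        }
    }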
I believe the C++ idiom would be to read the file line by line, and create a line-based container as you read the file. Most likely the iostreams (getline) will be buffered enough that you won't notice a significant difference. However, for very large files you may get better performance by reading larger chunks of the file (not the whole file at once) and splitting internally as newlines are found, as sketched below.
If you want to know specifically which method is faster and by how much, you'll have to profile your code.
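A minimal timing harness along those lines, assuming reader functions like the ones sketched earlier in this thread (std::chrono only gives wall-clock time here; getrusage or your platform's profiler is needed for user/system time):

    #include <chrono>
    #include <cstdio>

    // Time a single call to a file-reading function (wall-clock only).
    template <typename Fn>
    void time_it(const char* label, Fn&& fn)
    {
        const auto start = std::chrono::steady_clock::now();
        fn();
        const auto stop = std::chrono::steady_clock::now();
        const auto ms =
            std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
        std::printf("%s: %lld ms\n", label, static_cast<long long>(ms));
    }

    // Usage (reader functions as sketched earlier, file name is a placeholder):
    //   time_it("getline",    [] { read_line_by_line("big.txt"); });
    //   time_it("whole file", [] { read_whole_then_split("big.txt"); });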
If it's a small file on disk, it's probably more efficient to read the entire file and parse it line by line rather than reading one line at a time, which would take lots of disk accesses.
The fstreams are buffered reasonably. The underlying accesses to the hard disk by the OS are buffered reasonably. The hard disk itself has a reasonable buffer. You most surely will not trigger more hard disk accesses if you read the file line by line. Or character by character, for that matter.
So there is no reason to load the whole file into a big buffer and work on that buffer, because it already is in a buffer. And there often is no reason to buffer one line at a time, either. Why allocate memory to buffer something in a string that is already buffered in the ifstream? If you can, work on the stream directly; don't bother tossing everything around twice or more from one buffer to the next, unless it helps readability and/or your profiler tells you that disk access is slowing your program down significantly.
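As an example of working on the stream directly, here is a hedged sketch that counts lines straight from the stream buffer, without copying anything into a std::string first (whether that actually matters is exactly what the profiler should decide):

    #include <algorithm>
    #include <cstddef>
    #include <fstream>
    #include <iterator>

    // Count lines by scanning the stream buffer directly - no per-line string copies.
    std::size_t count_lines(const char* path)
    {
        std::ifstream in(path, std::ios::binary);
        return static_cast<std::size_t>(
            std::count(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>(),
                       '\n'));
    }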
It's better to fetch all the data if it can be accommodated in memory, because whenever you request I/O your program loses the processor and is put in a wait queue.
However, if the file is big, then it's better to read only as much data at a time as is required for processing, because a bigger read operation takes much longer to complete than small ones, and the CPU's process-switching time is much smaller than the time it takes to read the entire file.