I am currently writing a program in C++ which involves reading lots of large text files. Each has ~400,000 lines, in extreme cases with 4,000 or more characters per line. Just for testing, I read one of the files using ifstream and the implementation offered by cplusplus.com. It took around 60 seconds, which is way too long. Now I was wondering: is there a straightforward way to improve reading speed?
edit: The code I am using is more or less this:
string tmpString;
ifstream txtFile(path);
if(txtFile.is_open())
{
while(getline(txtFile, tmpString)) // getline fails at end-of-file, so each line is counted exactly once
{
m_numLines++;
}
txtFile.close();
}
edit 2: The file I read is only 82 MB in size. I mainly mentioned that lines could reach 4000 characters because I thought that might be necessary to know in order to do buffering.
edit 3: Thank you all for your answers, but it seems like there is not much room for improvement given my problem. I have to use getline, since I want to count the number of lines. Opening the ifstream in binary mode didn't make reading any faster either. I will try to parallelize it as much as I can, that should work at least.
edit 4: So apparently there are some things I can do. Big thank you to sehe for putting so much time into this, I appreciate it a lot! =)
Updates: Be sure to check the (surprising) updates below the initial answer
Memory mapped files have served me well¹:
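(The answer's snippet isn't reproduced here; below is a minimal sketch of the approach, assuming Boost.Iostreams' mapped_file; "input.txt" is a placeholder file name.)

#include <boost/iostreams/device/mapped_file.hpp>
#include <cstring>
#include <cstddef>
#include <iostream>

int main()
{
    // map the file read-only
    boost::iostreams::mapped_file mmap("input.txt", boost::iostreams::mapped_file::readonly);
    const char* f = mmap.const_data();
    const char* l = f + mmap.size();

    std::size_t m_numLines = 0;
    while (f && f != l)
    {
        // jump from newline to newline with memchr instead of reading line by line
        f = static_cast<const char*>(memchr(f, '\n', l - f));
        if (f)
        {
            ++m_numLines;
            ++f;
        }
    }
    std::cout << "m_numLines = " << m_numLines << "\n";
}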
This should be rather quick.
Update
In case it helps you test this approach, here's a version using mmap directly instead of using Boost: see it live on Coliru.
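(That version isn't included here either; a rough sketch of a plain POSIX mmap equivalent, with a placeholder file name and minimal error handling, could look like this.)

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstring>
#include <cstddef>
#include <iostream>

int main()
{
    int fd = open("input.txt", O_RDONLY);   // placeholder file name
    if (fd == -1)
        return 1;

    struct stat sb;
    if (fstat(fd, &sb) == -1)
        return 1;

    // map the whole file read-only
    void* mapped = mmap(nullptr, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (mapped == MAP_FAILED)
        return 1;

    const char* f = static_cast<const char*>(mapped);
    const char* l = f + sb.st_size;

    std::size_t m_numLines = 0;
    while (f && f != l)
    {
        // same newline-counting loop as the Boost version
        f = static_cast<const char*>(memchr(f, '\n', l - f));
        if (f)
        {
            ++m_numLines;
            ++f;
        }
    }
    std::cout << "m_numLines = " << m_numLines << "\n";

    munmap(mapped, sb.st_size);
    close(fd);
}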
¹ see e.g. the benchmark here: How to parse space-separated floats in C++ quickly?

Update

The last bit of performance I could squeeze out of this I found by looking at the source of GNU coreutils wc. To my surprise, the following (greatly simplified) code adapted from wc runs in about 84% of the time taken with the memory mapped file above:
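(The answer's exact code isn't reproduced here; the following is a simplified sketch in the same spirit, using plain read(2) calls into a fixed buffer and counting '\n' with memchr; the file name and buffer size are placeholders.)

#include <fcntl.h>
#include <unistd.h>
#include <cstring>
#include <cstddef>
#include <iostream>

int main()
{
    int fd = open("input.txt", O_RDONLY);   // placeholder file name
    if (fd == -1)
        return 1;

    static char buf[16 * 1024];             // wc-sized read buffer
    std::size_t m_numLines = 0;

    ssize_t bytes;
    while ((bytes = read(fd, buf, sizeof(buf))) > 0)
    {
        const char* p = buf;
        const char* end = buf + bytes;
        // count the newlines in this chunk
        while ((p = static_cast<const char*>(memchr(p, '\n', end - p))))
        {
            ++m_numLines;
            ++p;
        }
    }
    std::cout << "m_numLines = " << m_numLines << "\n";
    close(fd);
}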
Do you have to read all files at the same time? (at the start of your application for example)
If you do, consider parallelizing the operation.
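(As an illustration, not from the original answer: a minimal sketch that counts the lines of several files concurrently with std::async; the file names are placeholders.)

#include <future>
#include <fstream>
#include <string>
#include <vector>
#include <cstddef>
#include <iostream>

// count the lines of one file; each call runs in its own task
static std::size_t count_lines(const std::string& path)
{
    std::ifstream in(path);
    std::string line;
    std::size_t n = 0;
    while (std::getline(in, line))
        ++n;
    return n;
}

int main()
{
    std::vector<std::string> paths = { "a.txt", "b.txt", "c.txt" };  // placeholder names

    std::vector<std::future<std::size_t>> tasks;
    for (const auto& p : paths)
        tasks.push_back(std::async(std::launch::async, count_lines, p));

    for (std::size_t i = 0; i < paths.size(); ++i)
        std::cout << paths[i] << ": " << tasks[i].get() << " lines\n";
}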
Either way, consider using binary streams, or unbuffered reads for blocks of data.
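(A sketch of that idea: open the stream in binary mode and read fixed-size chunks, counting newlines yourself; the file name and 1 MiB chunk size are placeholders.)

#include <fstream>
#include <vector>
#include <algorithm>
#include <cstddef>
#include <iostream>

int main()
{
    std::ifstream file("input.txt", std::ios::binary);  // placeholder file name
    std::vector<char> buffer(1 << 20);                  // 1 MiB chunks

    std::size_t lines = 0;
    while (file)
    {
        file.read(buffer.data(), buffer.size());        // the last read is usually partial
        lines += std::count(buffer.data(), buffer.data() + file.gcount(), '\n');
    }
    std::cout << lines << " lines\n";
}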
Neil Kirk, unfortunately I cannot reply to your comment (not enough reputation), but I did a performance test on ifstream and stringstream, and the performance, reading a text file line by line, is exactly the same.
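(The two measured snippets aren't reproduced here. The first, timed below, was presumably a plain getline loop over an ifstream, along these lines; the file name is a placeholder.)

#include <fstream>
#include <string>
#include <cstddef>

int main()
{
    std::ifstream file("input.txt");   // placeholder file name
    std::string line;
    std::size_t count = 0;
    while (std::getline(file, line))   // read line by line straight from the stream
        ++count;
}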
This takes 1426ms on a 106MB file.
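(The second variant used a stringstream; presumably it pulled the whole file into memory with rdbuf() first and then ran getline on the in-memory copy, roughly like this.)

#include <fstream>
#include <sstream>
#include <string>
#include <cstddef>

int main()
{
    std::ifstream file("input.txt");   // placeholder file name
    std::stringstream buffer;
    buffer << file.rdbuf();            // one bulk read of the whole file into memory

    std::string line;
    std::size_t count = 0;
    while (std::getline(buffer, line)) // then iterate over the in-memory copy
        ++count;
}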
This takes 1433ms on the same file.
The following code is faster instead:
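(The snippet isn't reproduced here; given the fixed-size line buffer described below, it presumably used C stdio's fgets, roughly like this.)

#include <cstdio>
#include <cstddef>

int main()
{
    const int LINE_BUFFER_SIZE = 8192;              // assumption: must exceed the longest line
    char line[LINE_BUFFER_SIZE];

    std::FILE* file = std::fopen("input.txt", "r"); // placeholder file name
    if (!file)
        return 1;

    std::size_t count = 0;
    while (std::fgets(line, LINE_BUFFER_SIZE, file))
        ++count;

    std::printf("%zu lines\n", count);
    std::fclose(file);
}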
This takes 884ms on the same file. It is just a little tricky since you have to set the maximum size of your buffer (i.e. maximum length for each line in the input file).
4000 * 400,000 = 1.6 GB. If your hard drive isn't an SSD, you're likely getting ~100 MB/s sequential read. That's 16 seconds just in I/O.
Since you don't elaborate on the specific code you're using or how you need to parse these files (do you need to read them line by line? Does the system have a lot of RAM, so you could read the whole file into a large RAM buffer and then parse it?), there's little you can do to speed up the process.
Memory mapped files won't offer any performance improvement when reading a file sequentially. Perhaps manually parsing large chunks for new lines rather than using "getline" would offer an improvement.
EDIT: After doing some learning (thanks @sehe), here's the memory mapped solution I would likely use.
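(The solution isn't reproduced here; a minimal sketch of a memory-mapped line count, assuming Boost.Iostreams' mapped_file_source and a placeholder file name, is shown below.)

#include <boost/iostreams/device/mapped_file.hpp>
#include <algorithm>
#include <cstddef>
#include <iostream>

int main()
{
    // map the file read-only and count newlines over the raw bytes
    boost::iostreams::mapped_file_source file("input.txt");   // placeholder file name
    std::size_t lines = std::count(file.data(), file.data() + file.size(), '\n');
    std::cout << lines << " lines\n";
}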
As someone with a little background in competitive programming, I can tell you: at least for simple things like integer parsing, the main cost in C is locking the file streams (which is by default done for multi-threading). Use the unlocked_stdio versions instead (fgetc_unlocked(), fread_unlocked()). For C++, the common lore is to use std::ios::sync_with_stdio(false), but I don't know if it's as fast as unlocked_stdio.

For reference, here is my standard integer parsing code. It's a lot faster than scanf, as I said mainly due to not locking the stream. For me it was as fast as the best hand-coded mmap or custom buffered versions I'd used previously, without the insane maintenance debt.
(Note: This one only works if there is precisely one non-digit character between any two integers).
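(For reference, a minimal sketch of that style of parser, assuming glibc's fgetc_unlocked and non-negative integers separated by single non-digit characters; it is not the answer's exact code.)

#include <cstdio>

// reads one non-negative integer from stdin using the unlocked getc variant;
// assumes exactly one non-digit separator follows each number
static int read_int()
{
    int value = 0;
    int c = fgetc_unlocked(stdin);
    while (c >= '0' && c <= '9')
    {
        value = value * 10 + (c - '0');
        c = fgetc_unlocked(stdin);
    }
    // the single trailing non-digit separator has now been consumed
    return value;
}

int main()
{
    int a = read_int();
    int b = read_int();
    std::printf("%d %d\n", a, b);
}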
And of course avoid memory allocation if possible...
Use random file access, or use binary mode. For sequential reading this makes a big difference, but it still depends on what you are reading.