I need to read the number of lines in a file before doing some operations on that file. When I try to read the file and increment a line_count variable at each iteration until I reach EOF, it is not fast enough in my case. I used both ifstream and fgets, and both were slow. Is there a hacky way to do this, as used for instance by BSD, the Linux kernel, or Berkeley DB (maybe by using bitwise operations)?
As I said, there are millions of lines in that file and it keeps getting larger; each line has about 40 or 50 characters. I'm using Linux.
Note: I'm sure there will be people who might say "use a DB, idiot". But briefly, in my case I can't use a DB.
The thing that takes time is loading 40+ MB into memory. The fastest way to do that is to either memory-map it, or load it in one go into a big buffer. Once you have it in memory, one way or another, a loop traversing the data looking for \n characters is almost instantaneous, no matter how it is implemented.

So really, the most important trick is to load the file into memory as fast as possible. And the fastest way to do that is to do it as a single operation.
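As a rough sketch of the "load it in one go, then scan" idea (the file name is just a placeholder, and std::count stands in for whatever newline scan you prefer):

    #include <fstream>
    #include <vector>
    #include <algorithm>
    #include <cstddef>
    #include <iostream>

    // Load the whole file into one buffer with a single read, then count '\n'.
    std::size_t count_lines(const char *path)
    {
        std::ifstream in(path, std::ios::binary);
        if (!in) return 0;

        in.seekg(0, std::ios::end);
        std::size_t size = static_cast<std::size_t>(in.tellg());
        in.seekg(0, std::ios::beg);

        std::vector<char> buf(size);
        in.read(buf.data(), static_cast<std::streamsize>(size));

        return std::count(buf.begin(), buf.end(), '\n');
    }

    int main()
    {
        std::cout << count_lines("big.txt") << '\n';   // "big.txt" is a placeholder
    }

Note that a final line without a trailing newline isn't counted by this; adjust if that matters to you.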
Otherwise, plenty of tricks may exist to speed up the algorithm. If lines are only added, never modified or removed, and if you're reading the file repeatedly, you can cache the lines read previously, and the next time you have to read the file, only read the newly added lines.
Or perhaps you can maintain a separate index file showing the location of known '\n' characters, so those parts of the file can be skipped over.
Reading large amounts of data from the hard drive is slow. There's no way around that.
Don't use C++ STL strings and getline (or C's fgets), just C-style raw pointers, and either block-read in page-size chunks or mmap the file.

Then scan the block at the native word size of your system (i.e. either uint32_t or uint64_t) using one of the magic algorithms from 'SIMD Within A Register (SWAR) Operations' for testing the bytes within the word. An example is here; the loop with the 0x0a0a0a0a0a0a0a0aLL in it scans for line breaks. (That code gets to around 5 cycles per input byte matching a regex on each line of a file.)

If the file is only a few tens or a hundred or so megabytes, and it keeps growing (i.e. something keeps writing to it), then there's a good likelihood that Linux has it cached in memory, so it won't be disk-IO limited, but memory-bandwidth limited.
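If you want to experiment with the SWAR idea without following the link, here is a minimal sketch of a generic zero-byte detection variant (this is not the linked code; __builtin_popcountll is a GCC/Clang builtin, so substitute std::popcount or a portable fallback elsewhere):

    #include <cstdint>
    #include <cstring>
    #include <cstddef>

    // Count '\n' bytes in a buffer, 8 bytes at a time (SWAR zero-byte detection).
    std::size_t count_newlines_swar(const char *data, std::size_t len)
    {
        const std::uint64_t pattern = 0x0a0a0a0a0a0a0a0aULL;
        std::size_t count = 0;
        std::size_t i = 0;

        for (; i + 8 <= len; i += 8) {
            std::uint64_t word;
            std::memcpy(&word, data + i, 8);       // avoid unaligned-access issues
            std::uint64_t x = word ^ pattern;      // bytes equal to '\n' become 0x00

            // Set the high bit of every byte of x that is exactly zero
            // (Hacker's Delight style zero-byte test, no false positives).
            std::uint64_t y = (x & 0x7f7f7f7f7f7f7f7fULL) + 0x7f7f7f7f7f7f7f7fULL;
            y = ~(y | x | 0x7f7f7f7f7f7f7f7fULL);

            count += static_cast<std::size_t>(__builtin_popcountll(y));
        }

        for (; i < len; ++i)                       // handle the tail bytes
            count += (data[i] == '\n');

        return count;
    }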
If the file is only ever being appended to, you could also remember the number of lines and previous length, and start from there.
It has been pointed out that you could use mmap with C++ STL algorithms and create a functor to pass to std::for_each. I suggested that you shouldn't do it, not because you can't do it that way, but because there is no gain in writing the extra code to do so. Or you can use Boost's mmapped iterator, which handles it all for you; but for the problem the code I linked to was written for, this was much, much slower, and the question was about speed, not style.
It isn't slow because of your algorithm; it is slow because IO operations are slow. I suppose you are using a simple O(n) algorithm that simply goes over the file sequentially. In that case, there is no faster algorithm that can optimize your program.
However, while there is no faster algorithm, there is a faster mechanism called a "memory-mapped file". There are some drawbacks to mapped files and it might not be appropriate for your case, so you'll have to read about it and figure that out yourself.
Memory-mapped files won't let you implement an algorithm better than O(n), but they may reduce IO access time.
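A minimal sketch of the memory-mapped approach on Linux (error handling is kept to the bare minimum, and the file name is a placeholder):

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <algorithm>
    #include <cstddef>
    #include <iostream>

    // Map the whole file read-only and count '\n' in the mapped region.
    std::size_t count_lines_mmap(const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return 0;

        struct stat st;
        fstat(fd, &st);
        std::size_t size = static_cast<std::size_t>(st.st_size);
        if (size == 0) { close(fd); return 0; }

        void *p = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) { close(fd); return 0; }
        const char *data = static_cast<const char *>(p);

        std::size_t lines = std::count(data, data + size, '\n');

        munmap(p, size);
        close(fd);
        return lines;
    }

    int main()
    {
        std::cout << count_lines_mmap("big.txt") << '\n';   // placeholder file name
    }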
Remember that all fstreams are buffered. So they do, in effect, read in chunks, and you do not have to recreate this functionality. All you need to do is scan the buffer. Don't use getline() though, as this would force you to size a string. So I would just use the STL std::count and stream iterators.
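For what it's worth, a minimal sketch of that std::count / stream-iterator approach (the file name is just an example):

    #include <fstream>
    #include <algorithm>
    #include <iterator>
    #include <cstddef>
    #include <iostream>

    int main()
    {
        std::ifstream in("big.txt", std::ios::binary);    // "big.txt" is a placeholder

        // istreambuf_iterator walks the stream's own buffer; no strings are built.
        std::size_t lines = std::count(std::istreambuf_iterator<char>(in),
                                       std::istreambuf_iterator<char>(),
                                       '\n');
        std::cout << lines << '\n';
    }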
You wrote that it keeps getting larger. This sounds like it is a log file or something similar where new lines are appended but existing lines are not changed. If this is the case you could try an incremental approach:

Parse to the end of the file. Remember the line count and the offset of EOF. When the file grows, fseek to the offset, parse to EOF, and update the line count and the offset.
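A hedged sketch of that incremental idea (the state would have to be kept somewhere between scans, e.g. in memory or a side file; the variable names are made up for illustration):

    #include <cstdio>
    #include <cstddef>

    // Hypothetical persistent state, carried over between scans.
    static long        g_offset = 0;   // where the previous scan stopped
    static std::size_t g_lines  = 0;   // lines counted so far

    // Count only the lines appended since the last scan.
    void update_line_count(const char *path)
    {
        std::FILE *f = std::fopen(path, "rb");
        if (!f) return;

        std::fseek(f, g_offset, SEEK_SET);      // skip the part already counted

        char buf[1 << 16];
        std::size_t n;
        while ((n = std::fread(buf, 1, sizeof buf, f)) > 0) {
            for (std::size_t i = 0; i < n; ++i)
                g_lines += (buf[i] == '\n');
        }

        g_offset = std::ftell(f);               // remember the new EOF offset
        std::fclose(f);
    }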
You can only get a definitive answer by scanning the entire file looking for newline characters. There's no way around that.

However, there are a couple of possibilities which you may want to consider.
1/ If you're using a simplistic loop, reading one character at a time checking for newlines, don't. Even though the I/O may be buffered, function calls themselves are expensive, time-wise.
A better option is to read large chunks of the file (say 5M) into memory with a single I/O operation, then process that. You probably don't need to worry too much about special assembly instructions since the C runtime library will be optimized anyway - a simple strchr() should do it (see the sketch at the end of this answer).

2/ If you're saying that the general line length is about 40-50 characters and you don't need an exact line count, just grab the file size and divide by 45 (or whatever average you deem to use).
3/ If this is something like a log file and you don't have to keep it in one file (may require rework on other parts of the system), consider splitting the file periodically.
For example, when it gets to 5M, move it (e.g., x.log) to a dated file name (e.g., x_20090101_1022.log), work out how many lines there are at that point (storing that in x_20090101_1022.count), then start a new x.log log file. Characteristics of log files mean that this dated section that was created will never change, so you will never have to recalculate the number of lines.

To process the log "file", you'd just cat x_*.log through some process pipe rather than cat x.log. To get the line count of the "file", do a wc -l on the current x.log (relatively fast) and add it to the sum of all the values in the x_*.count files.
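Regarding point 1/ above, here is a rough sketch of the "read a big chunk, then scan it" idea; it uses memchr rather than strchr because the buffer isn't NUL-terminated (the buffer size and file name are arbitrary choices, not anything mandated above):

    #include <cstdio>
    #include <cstring>
    #include <cstddef>
    #include <vector>
    #include <iostream>

    int main()
    {
        std::FILE *f = std::fopen("x.log", "rb");   // file name from the example above
        if (!f) return 1;

        std::vector<char> buf(5u * 1024 * 1024);    // ~5M chunk, as suggested above
        std::size_t lines = 0, n;

        while ((n = std::fread(buf.data(), 1, buf.size(), f)) > 0) {
            const char *p   = buf.data();
            const char *end = buf.data() + n;
            // memchr scans the chunk for '\n' without per-character function calls.
            while ((p = static_cast<const char *>(std::memchr(p, '\n', end - p))) != nullptr) {
                ++lines;
                ++p;
            }
        }

        std::fclose(f);
        std::cout << lines << '\n';
    }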