Fastest way to find the number of lines in a text

Posted 2019-01-10 19:44

I need to read the number of lines in a file before doing some operations on that file. When I try to read the file and increment a line_count variable at each iteration until I reach EOF, it is not fast enough for my case. I used both ifstream and fgets; they were both slow. Is there a hacky way to do this, of the kind used by, for instance, BSD, the Linux kernel, or Berkeley DB (maybe using bitwise operations)?

As I said before, there are millions of lines in that file and it keeps getting larger; each line is about 40 or 50 characters. I'm using Linux.

Note: I'm sure there will be people who say "use a DB, idiot". But briefly, in my case I can't use a DB.

8 answers
萌系小妹纸
#2 · 2019-01-10 19:48

The thing that takes time is loading 40+ MB into memory. The fastest way to do that is to either memory-map it or load it in one go into a big buffer. Once you have it in memory, one way or another, a loop traversing the data looking for \n characters is almost instantaneous, no matter how it is implemented.

So really, the most important trick is to load the file into memory as fast as possible. And the fastest way to do that is to do it as a single operation.

Otherwise, plenty of tricks may exist to speed up the algorithm. If lines are only added, never modified or removed, and if you're reading the file repeatedly, you can cache the lines read previously, and the next time you have to read the file, only read the newly added lines.

Or perhaps you can maintain a separate index file showing the location of known '\n' characters, so those parts of the file can be skipped over.

Reading large amounts of data from the hard drive is slow. There's no way around that.
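The memory-mapping idea above can be sketched as follows. This is a minimal POSIX/Linux sketch, not the answerer's actual code; the function name `count_lines_mmap` is mine, and error handling is kept deliberately thin.

```cpp
#include <algorithm>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Count '\n' bytes in a memory-mapped file. Returns -1 on error.
long count_lines_mmap(const char* path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return -1; }
    if (st.st_size == 0)    { close(fd); return 0; }  // mmap of length 0 fails

    // Map the whole file read-only; the kernel pages it in on demand.
    const char* data = static_cast<const char*>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
    close(fd);  // the mapping stays valid after close
    if (data == MAP_FAILED) return -1;

    long lines = std::count(data, data + st.st_size, '\n');

    munmap(const_cast<char*>(data), st.st_size);
    return lines;
}
```

Note that this counts newline bytes, so a final line without a trailing '\n' is not counted; add one if that matters for your definition of "line".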

We Are One
#3 · 2019-01-10 19:49

Don't use C++ STL strings and getline (or C's fgets); just use C-style raw pointers and either block-read in page-size chunks or mmap the file.

Then scan the block at the native word size of your system (i.e. either uint32_t or uint64_t) using one of the magic 'SIMD Within A Register' (SWAR) algorithms for testing the bytes within the word. An example is here; the loop with the 0x0a0a0a0a0a0a0a0aLL in it scans for line breaks. (That code gets to around 5 cycles per input byte while matching a regex on each line of a file.)
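A sketch of the SWAR idea, assuming a GCC/Clang compiler (for `__builtin_popcountll`): XOR each 64-bit word with '\n' replicated into every byte, so matching bytes become zero, then use the exact zero-byte detector from Bit Twiddling Hacks and popcount the flags. The function name is mine, not from the linked code.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

std::size_t count_newlines_swar(const char* buf, std::size_t len)
{
    const std::uint64_t lo7 = 0x7F7F7F7F7F7F7F7FULL;
    const std::uint64_t nl  = 0x0A0A0A0A0A0A0A0AULL;  // '\n' in every byte

    std::size_t count = 0, i = 0;
    for (; i + 8 <= len; i += 8) {
        std::uint64_t w;
        std::memcpy(&w, buf + i, 8);  // safe unaligned load
        std::uint64_t x = w ^ nl;     // bytes equal to '\n' become 0x00
        // Exact zero-byte detector: z has the high bit set in each byte
        // of x that is zero, and only those (Bit Twiddling Hacks).
        std::uint64_t z = ~((((x & lo7) + lo7) | x) | lo7);
        count += __builtin_popcountll(z);  // GCC/Clang builtin
    }
    for (; i < len; ++i)              // handle the tail bytes one at a time
        count += (buf[i] == '\n');
    return count;
}
```

The simpler `(v - 0x0101…) & ~v & 0x8080…` test is cheaper but can flag a false positive in the byte above a genuine zero, so it detects *whether* a word contains a newline but cannot be popcounted directly for an exact count.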

If the file is only a few tens or a hundred or so megabytes, and it keeps growing (i.e. something keeps writing to it), then there's a good likelihood that Linux has it cached in memory, so it won't be disk-IO limited but memory-bandwidth limited.

If the file is only ever being appended to, you could also remember the number of lines and previous length, and start from there.


It has been pointed out that you could use mmap with C++ STL algorithms and create a functor to pass to std::for_each. I suggested that you shouldn't do it, not because you can't do it that way, but because there is no gain in writing the extra code to do so. Or you can use Boost's mmapped iterator, which handles it all for you; but for the problem the code I linked to was written for, this was much, much slower, and the question was about speed, not style.

时光不老,我们不散
#4 · 2019-01-10 19:52

It isn't slow because of your algorithm; it is slow because IO operations are slow. I suppose you are using a simple O(n) algorithm that goes over the file sequentially. In that case, there is no faster algorithm that can optimize your program.

However, while there is no faster algorithm, there is a faster mechanism, called a "memory-mapped file". There are some drawbacks to mapped files and it might not be appropriate for your case, so you'll have to read about it and figure that out yourself.

Memory-mapped files won't let you implement an algorithm better than O(n), but they may reduce IO access time.

时光不老,我们不散
#5 · 2019-01-10 19:58

Remember that all fstreams are buffered, so they in effect already read in chunks and you do not have to recreate that functionality. All you need to do is scan the buffer. Don't use getline() though, as this would force you to size a string. So I would just use the STL std::count_if and stream iterators.

#include <iostream>
#include <fstream>
#include <iterator>
#include <algorithm>
#include <functional>


struct TestEOL
{
    bool operator()(char c)
    {
        last    = c;
        return last == '\n';
    }
    char    last = '\n';    // initialized so an empty file counts as 0 lines
};

int main()
{
    std::ifstream file("Plop.txt");

    TestEOL       test;
    // Pass the functor via std::ref: count_if takes its predicate by
    // value, so without it the state stored in `test.last` would be
    // updated only in a copy and lost.
    std::size_t   count   = std::count_if(std::istreambuf_iterator<char>(file),
                                          std::istreambuf_iterator<char>(),
                                          std::ref(test));

    if (test.last != '\n')  // If the last character checked is not '\n'
    {                       // then the last line in the file has not been
        ++count;            // counted. So increment the count so we count
    }                       // the last line even if it is not '\n' terminated.

    std::cout << count << "\n";
}
"骚年 ilove
#6 · 2019-01-10 19:59

You wrote that it keeps getting larger. This sounds like a log file or something similar, where new lines are appended but existing lines are not changed. If this is the case, you could try an incremental approach.

Parse to the end of the file. Remember the line count and the offset of EOF. When the file grows, fseek to the offset, parse to EOF, and update the line count and the offset.
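The incremental approach above could look roughly like this. It is a sketch under the answer's assumption that the file is append-only; the `LineTracker` struct and `update_count` name are mine.

```cpp
#include <cstdio>
#include <cstring>

struct LineTracker {
    long offset = 0;   // byte offset scanned so far
    long lines  = 0;   // newlines seen up to `offset`
};

// Re-scan only the part of `path` appended since the last call.
bool update_count(const char* path, LineTracker& t)
{
    FILE* f = std::fopen(path, "rb");
    if (!f) return false;
    if (std::fseek(f, t.offset, SEEK_SET) != 0) { std::fclose(f); return false; }

    char buf[1 << 16];
    std::size_t n;
    while ((n = std::fread(buf, 1, sizeof buf, f)) > 0) {
        const char* p   = buf;
        const char* end = buf + n;
        while ((p = static_cast<const char*>(
                    std::memchr(p, '\n', end - p))) != nullptr) {
            ++t.lines;
            ++p;  // continue searching after the match
        }
        t.offset += static_cast<long>(n);
    }
    std::fclose(f);
    return true;
}
```

Each call only reads the new tail, so the cost is proportional to how much the file grew, not to its total size.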

smile是对你的礼貌
#7 · 2019-01-10 20:00

You can only get a definitive answer by scanning the entire file looking for newline characters. There's no way around that.

However, there are a couple of possibilities which you may want to consider.

1/ If you're using a simplistic loop, reading one character at a time and checking for newlines, don't. Even though the I/O may be buffered, the function calls themselves are expensive, time-wise.

A better option is to read large chunks of the file (say 5M) into memory with a single I/O operation, then process that. You probably don't need to worry too much about special assembly instructions, since the C runtime library will be optimized anyway; a simple strchr() should do it.
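A sketch of that chunked approach (using memchr rather than strchr, since the buffer is not NUL-terminated; the 1 MB chunk size and the function name are my choices, not from the answer):

```cpp
#include <cstring>
#include <fstream>
#include <vector>

// Read `path` in large chunks and count '\n' bytes. Returns -1 on error.
// Note: a final line without a trailing '\n' is not counted.
long count_lines_chunked(const char* path, std::size_t chunk = 1 << 20)
{
    std::ifstream file(path, std::ios::binary);
    if (!file) return -1;

    std::vector<char> buf(chunk);
    long lines = 0;
    while (file) {
        file.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        std::streamsize got = file.gcount();   // may be < chunk at EOF
        if (got <= 0) break;

        const char* p   = buf.data();
        const char* end = p + got;
        while ((p = static_cast<const char*>(
                    std::memchr(p, '\n', end - p))) != nullptr) {
            ++lines;
            ++p;  // resume the scan just past the match
        }
    }
    return lines;
}
```

memchr is typically vectorized by the C library, which is the "already optimized" point the answer is making.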

2/ If you're saying that the general line length is about 40-50 characters and you don't need an exact line count, just grab the file size and divide by 45 (or whatever average you deem appropriate).
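That estimate is a one-liner; a hedged sketch (the 45-byte default is just the figure from the question, and `estimate_lines` is a name I made up):

```cpp
#include <sys/stat.h>

// Approximate line count as file size / average line length.
// Returns -1 if the file cannot be stat'ed or the average is invalid.
long estimate_lines(const char* path, long avg_line_len = 45)
{
    struct stat st;
    if (stat(path, &st) != 0 || avg_line_len <= 0) return -1;
    return static_cast<long>(st.st_size / avg_line_len);
}
```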

3/ If this is something like a log file and you don't have to keep it in one file (may require rework on other parts of the system), consider splitting the file periodically.

For example, when it gets to 5M, move it (e.g., x.log) to a dated file name (e.g., x_20090101_1022.log), work out how many lines there are at that point (storing that in x_20090101_1022.count), then start a new x.log file. The characteristics of log files mean that this dated section will never change, so you will never have to recalculate its number of lines.

To process the log "file", you'd just cat x_*.log through some process pipe rather than cat x.log. To get the line count of the "file", do a wc -l on the current x.log (relatively fast) and add it to the sum of all the values in the x_*.count files.
