Here's a bit of code that, after some measuring, turns out to be a considerable bottleneck:

#include <fstream>
#include <string>
#include <unordered_set>

//-----------------------------------------------------------------------------
//  Construct dictionary hash set from dictionary file
//-----------------------------------------------------------------------------
void constructDictionary(std::unordered_set<std::string> &dict)
{
    std::ifstream wordListFile;
    wordListFile.open("dictionary.txt");

    // operator>> skips whitespace, so each extraction yields one word
    std::string word;
    while( wordListFile >> word )
    {
        if( !word.empty() )
        {
            dict.insert(word);
        }
    }

    wordListFile.close();
}

I'm reading in ~200,000 words and this takes about 240 ms on my machine. Is the use of ifstream here efficient? Can I do better? I'm reading about mmap() implementations but I'm not understanding them 100%. The input file is simply text strings with *nix line terminations.

EDIT: Follow-up question for the alternatives being suggested: would any alternative (other than increasing the stream buffer size) mean that I have to write a parser that examines each character for newlines? I kind of like the simple syntax of streams, but I can rewrite something more nitty-gritty if I have to for speed. Reading the entire file into memory is a viable option; it's only about 2 MB.

EDIT #2: I've found that the slowdown for me was due to the set insert, but for those who are still interested in speeding up line-by-line file IO, please read the answers here AND check out Matthieu M.'s continuation on the topic.

Tags: c++ optimization file-io ifstream

9 answers

Deceive 欺骗

#2 · 2019-02-04 18:38

Quick profiling on my system (linux-2.6.37, gcc-4.5.2, compiled with -O3) shows that I/O is not the bottleneck. Whether using fscanf into a char array followed by dict.insert() or operator>> as in your exact code, it takes about the same time (155 - 160 ms to read a 240k word file).

Replacing gcc's std::unordered_set with std::vector<std::string> in your code drops the execution time to 45 ms (fscanf) - 55 ms (operator>>) for me. Try to profile IO and set insertion separately.
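One way to profile IO and set insertion separately is to split the load into two timed phases. The following is only a minimal sketch: `elapsedMs`, `LoadTimings` and `timeLoad` are names made up for illustration, and it takes any `std::istream` so it can be tried on a file or on in-memory text.

```cpp
#include <cassert>   // only needed to try the sketch on in-memory text
#include <chrono>
#include <istream>
#include <sstream>   // only needed to try the sketch on in-memory text
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: wall-clock milliseconds spent in a callable.
template <typename F>
double elapsedMs(F&& work)
{
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

struct LoadTimings
{
    double readMs;    // stream extraction only
    double insertMs;  // hash set insertion only
};

// Phase 1 extracts every word into a vector; phase 2 inserts them into the set.
LoadTimings timeLoad(std::istream& in, std::unordered_set<std::string>& dict)
{
    std::vector<std::string> words;
    LoadTimings timings{};

    timings.readMs = elapsedMs([&] {
        std::string word;
        while (in >> word)
            words.push_back(word);
    });

    timings.insertMs = elapsedMs([&] {
        for (const std::string& w : words)
            dict.insert(w);
    });

    return timings;
}
```

Passing a `std::ifstream` opened on dictionary.txt and printing both fields should show which phase dominates on your machine. The vector costs some extra memory, so treat the numbers as a rough split rather than an exact one.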


何必那么认真

#3 · 2019-02-04 18:39

My system: kernel 3.2.0-52-generic, g++-4.7 (Ubuntu/Linaro 4.7.3-2ubuntu1~12.04) 4.7.3, compiled with -O2 unless stated otherwise, CPU: i3-2125.

In my tests I used a dictionary of 295068 words (so about 100k more words than yours): http://dl.dropboxusercontent.com/u/4076606/words.txt

From a time-complexity point of view:

Worst-case complexity of your program: O(n*n) = O(n [read data] * n [insert into unordered_set])
Average-case complexity of your program: O(n) = O(n [read data] * 1 [insert into unordered_set])

Practical tips:

The simplest data structures tend to have the least time overhead. A plain array is usually faster than a vector, and a char array is usually faster than a string. There is plenty of information on the web about this.

My measurements:

Note: I didn't flush my OS cache or HDD cache. I can't control the latter, but the former can be dropped with:

sync; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'

I also didn't discard the measurements that included a lot of context switches and the like, so there is room for better measurements.

My code:

14-16 ms to read from the file and insert the data into a 2D char array (read & insert n times)

65-75 ms to look up all the words with binary search (search n times)

Total = 79-91 ms

61-78 ms to read from the file and insert the data into an unordered_set of char arrays (read & insert n times)

7-9 ms to search by hash n times

Total = 68-87 ms

If you search more often than you insert, choose a hash table (unordered_set); otherwise use binary search (on a simple array).
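To make the two lookup strategies concrete, here is a minimal sketch of both. It uses a sorted `std::vector<std::string>` rather than the 2D char array measured above, so it illustrates the trade-off and does not reproduce those timings; the function names are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>   // only needed to try the sketch
#include <string>
#include <unordered_set>
#include <vector>

// Sorted-array lookup: O(log n) string comparisons per query.
// Precondition: sortedWords is sorted with the default operator<.
bool containsSorted(const std::vector<std::string>& sortedWords,
                    const std::string& word)
{
    return std::binary_search(sortedWords.begin(), sortedWords.end(), word);
}

// Hash lookup: one hash computation plus, on average, O(1) comparisons.
bool containsHashed(const std::unordered_set<std::string>& dict,
                    const std::string& word)
{
    return dict.find(word) != dict.end();
}
```

The sorted vector is cheap to build (append, then one `std::sort`) but each query costs more; the hash set costs more to build but each query is cheaper.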

Your original code (search & insert):

Compiled with -O2: 157-182 ms

Compiled with -O0 (if you omit the -O flag, the default level is also 0): 223-248 ms

So compiler options matter too; in this case they give a speed boost of about 66 ms. You didn't specify any, so my best guess is that you didn't use them. Now, to try to answer your main question.

What is the simplest thing you can do to improve your current code?

[better usage of high level API] Use "getline(wordListFile, word)" instead of "wordListFile >> word". I also think "getline" is more readable than the ">>" operator.

Compiled with -O2: 142-170 ms. ~ 12-15 ms speed boost compared with your original code.

Compiled with -O0 (if you omit the -O flag, the default level is also 0): 213-235 ms. ~ 10-13 ms speed boost compared with your original code.

[better usage of high level API] Avoid rehashing with "dict.reserve(XXXXXX);" (@David Rodríguez - dribeas also mentioned this). This works if your dictionary is static or if you can estimate its size (for example, file size divided by average word length). I first ran without "reserve" and printed bucket_count (cout << "bucket_count = " << dict.bucket_count() << "\n";), which in my case was 299951. Then I added "dict.reserve(299951);".

Compiled with -O2: 99-121-[137] ms. ~ 33-43-[49] ms speed boost compared with your original code.
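Put together, the two tips above (getline plus reserve) could look roughly like this. It is a sketch, not the code behind the timings: `constructDictionaryReserved` and its `expectedWords` parameter are hypothetical names, and the function takes a `std::istream` so it works with an `ifstream` or with in-memory text.

```cpp
#include <cassert>   // only needed to try the sketch on in-memory text
#include <cstddef>
#include <istream>
#include <sstream>   // only needed to try the sketch on in-memory text
#include <string>
#include <unordered_set>

// expectedWords is a caller-supplied estimate, e.g. a bucket_count() printed
// on an earlier run, or file size divided by average word length.
void constructDictionaryReserved(std::istream& in,
                                 std::unordered_set<std::string>& dict,
                                 std::size_t expectedWords)
{
    // Allocate the buckets once so no rehash happens while inserting.
    dict.reserve(expectedWords);

    // getline reads up to the next '\n' and discards the terminator.
    std::string word;
    while (std::getline(in, word))
    {
        if (!word.empty())
            dict.insert(word);
    }
}
```

Note that getline is not an exact drop-in for ">>": it keeps embedded spaces, and it would keep a trailing '\r' if the file had Windows line endings. With one word per line and *nix terminators, as in the question, both approaches should yield the same set.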

What more advanced things can you do to speed it up?

Implement your own hash function for your specific input data. Use char arrays instead of STL strings. Only after you have done that, write code with direct OS I/O. As you noticed (and as my measurements also show), the data structure is the bottleneck. If the media is very slow but the CPU is very fast, compress the file and decompress it in your program.

My code is not perfect, but it is still better than anything shown above: http://pastebin.com/gQdkmqt8 (the hash function is from the web and could also be improved)

Could you provide more details about which system (or range of systems) you plan to optimize for?

Info on time complexities: these should be links, but I don't have enough reputation yet as I'm a beginner on Stack Overflow.

Is my answer still relevant to anything? Please add a comment or vote, as there is no PM as far as I can see.
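As an illustration of plugging a custom hash function into unordered_set, here is a sketch using 64-bit FNV-1a. This is my own choice of hash for the example, not the one used in the pastebin code, and `Fnv1aHash` and `FastDict` are made-up names. Whether it beats the library's std::hash on your data is something only a measurement can tell.

```cpp
#include <cassert>   // only needed to try the sketch
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_set>

// 64-bit FNV-1a: XOR each byte into the state, then multiply by the FNV prime.
struct Fnv1aHash
{
    std::size_t operator()(const std::string& s) const
    {
        std::uint64_t h = 14695981039346656037ULL;  // FNV offset basis
        for (std::string::size_type i = 0; i < s.size(); ++i)
        {
            h ^= static_cast<unsigned char>(s[i]);
            h *= 1099511628211ULL;                   // FNV prime
        }
        return static_cast<std::size_t>(h);
    }
};

// The hash functor is the second template argument of unordered_set.
typedef std::unordered_set<std::string, Fnv1aHash> FastDict;
```

A `FastDict` is used exactly like the original set (insert, find, reserve); only the hashing step changes.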


走好不送

#4 · 2019-02-04 18:39

Unfortunately, there's not much you can do to increase performance when using an fstream.

You may be able to get a very slight speed improvement by reading in larger chunks of the file and then parsing out single words, but this depends on how your fstream implementation does buffering.

The only way to get a big improvement is to use your OS's I/O functions. For example, on Windows, opening the file with the FILE_FLAG_SEQUENTIAL_SCAN flag may speed up reads, as well as using asynchronous reads to grab data from disk and parse it in parallel.


家丑人穷心不美

#5 · 2019-02-04 18:40

Reading the whole file into memory in one go and then operating on it would probably be faster, as it avoids repeatedly going back to the disk to read another chunk.
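A minimal sketch of that approach, assuming one word per line with '\n' terminators as described in the question (`constructDictionaryFromBuffer` is a made-up name): slurp the stream into one string, then split on newlines by hand.

```cpp
#include <cassert>   // only needed to try the sketch on in-memory text
#include <cstddef>
#include <istream>
#include <iterator>
#include <sstream>   // only needed to try the sketch on in-memory text
#include <string>
#include <unordered_set>

void constructDictionaryFromBuffer(std::istream& in,
                                   std::unordered_set<std::string>& dict)
{
    // Read everything that is left in the stream into a single buffer.
    const std::string buffer((std::istreambuf_iterator<char>(in)),
                             std::istreambuf_iterator<char>());

    // Walk the buffer, cutting it at each '\n'.
    std::size_t start = 0;
    while (start < buffer.size())
    {
        std::size_t end = buffer.find('\n', start);
        if (end == std::string::npos)
            end = buffer.size();           // last line without a terminator

        if (end > start)                   // skip empty lines
            dict.insert(buffer.substr(start, end - start));

        start = end + 1;
    }
}
```

So yes, this route does mean scanning for newlines yourself, but `std::string::find` does the character-by-character work. For a 2 MB file the extra memory is negligible.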

Is 0.25s actually a problem? If you're not intending on loading much larger files is there any need to make it faster if it makes it less readable?


叛逆

#6 · 2019-02-04 18:42

The C++ and C libraries read data off the disk equally fast, and both are already buffered to compensate for disk I/O lag. You are not going to make it faster by adding more buffering.

The biggest difference is that C++ streams do a load of manipulation based on the locale: character conversions, punctuation, and so on.

As a result, the C library functions will be faster.

Replaced Dead Link

For some reason the linked question was deleted. So I am moving the relevant information here. The linked question was about hidden features in C++.

Though not technically part of the STL, the streams library is part of the standard C++ library.

For streams:

Locales.

Very few people actually bother to learn how to correctly set and/or manipulate the locale of a stream.

The second coolest thing is the iterator templates.
For me the most useful of these are the stream iterators, which basically turn streams into very basic containers that can then be used in conjunction with the standard algorithms.
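Applied to the dictionary from the question, stream iterators let the whole read loop collapse into one range insert. A small sketch (the function name is made up for illustration):

```cpp
#include <cassert>   // only needed to try the sketch on in-memory text
#include <istream>
#include <iterator>
#include <sstream>   // only needed to try the sketch on in-memory text
#include <string>
#include <unordered_set>

// istream_iterator<std::string> extracts with operator>>, so it yields one
// whitespace-delimited word per step; the default-constructed iterator marks
// the end of the stream.
void constructDictionaryWithIterators(std::istream& in,
                                      std::unordered_set<std::string>& dict)
{
    dict.insert(std::istream_iterator<std::string>(in),
                std::istream_iterator<std::string>());
}
```

This behaves like the original `while (wordListFile >> word)` loop, so it is a readability change and should not be expected to run faster.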

Examples:

Did you know that locales will change the '.' in a decimal number to any other character automatically?
Did you know that locales will add a ',' every third digit to make numbers easier to read?
Did you know that locales can be used to manipulate the text on the way through (e.g. conversion from UTF-16 to UTF-8 when writing to a file)?

etc.
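For instance, the digit grouping can be demonstrated with a custom numpunct facet imbued into a stream. This is a sketch with made-up names (`GroupedDigits`, `formatWithGrouping`); it also hints at why stream formatting does more work than the plain C functions.

```cpp
#include <cassert>   // only needed to try the sketch
#include <locale>
#include <sstream>
#include <string>

// A numpunct facet that asks for a ',' after every group of three digits.
struct GroupedDigits : std::numpunct<char>
{
protected:
    char do_thousands_sep() const override { return ','; }
    std::string do_grouping() const override { return "\3"; }
};

std::string formatWithGrouping(long value)
{
    std::ostringstream out;
    // The new locale takes ownership of the facet and deletes it when done.
    out.imbue(std::locale(out.getloc(), new GroupedDigits));
    out << value;
    return out.str();
}
```

With this facet, `formatWithGrouping(1234567)` produces "1,234,567". Overriding `do_decimal_point` in the same way changes the decimal separator.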

Examples:


成全新的幸福

#7 · 2019-02-04 18:46

Believe it or not, the performance of the stdlib stream in reading data is far below that of the C library routines. If you need top IO read performance, don't use c++ streams. I discovered this the hard way on algorithm competition sites -- my code would hit the test timeout using c++ streams to read stdin, but would finish in plenty of time using plain C FILE operations.

Edit: Just try out these two programs on some sample data. I ran them on Mac OS X 10.6.6 using g++ i686-apple-darwin10-g++-4.2.1 (GCC) 4.2.1 (Apple Inc. build 5664) on a file with 1 million lines of "howdythere", and the scanf version runs consistently 5 times faster than the cin version:

#include <stdio.h>

int main()
{
    int count = 0;
    char buf[1024];
    while ( scanf("%s", buf) == 1 )
        ++ count;

    printf( "%d lines\n", count );
}

and

#include <iostream>

int main()
{
    char buf[1024];
    int count = 0;

    while ( ! std::cin.eof() )
    {
        std::cin.getline( buf, 1023 );
        if ( ! std::cin.eof() )
            ++count;
    }
    std::cout << count << " lines" << std::endl;
}

Edit: changed the data file to "howdythere" to eliminate the difference between the two cases. The timing results did not change.

Edit: I think the amount of interest (and the downvotes) in this answer shows how contrary to popular opinion the reality is. People just can't believe that the simple case of reading input in both C and streams can be so different. Before you downvote: go measure it yourself. The point is not to set tons of state (that nobody typically sets), but just the code that people most frequently write. Opinion means nothing in performance: measure, measure, measure is all that matters.


How can I speed up line by line reading of an ASCI
