Reading key-value pairs as fast as possible in C++

2019-09-18 10:27发布

问题:

I have a file with roughly 2 million lines like this:

2s,3s,4s,5s,6s 100000
2s,3s,4s,5s,8s 101
2s,3s,4s,5s,9s 102

The first comma separated part indicates a poker result in Omaha, while the latter score is an example "value" of the cards. It is very important for me to read this file as fast as possible in C++, but I cannot seem to get it to be faster than a simple approach in Python (4.5 seconds) using the base library.

Using the Qt framework (QHash and QString), I was able to read the file in 2.5 seconds in release mode. However, I do not want to have the Qt dependency. The goal is to allow quick simulations using those 2 million lines, i.e. some_container["2s,3s,4s,5s,6s"] to yield 100 (though if applying a translation function or any non-readable format will allow for faster reading that's okay as well).

My current implementation is extremely slow (8 seconds!):

std::map<std::string, int> get_file_contents(const char *filename)
{
    std::map<std::string, int> outcomes;
    std::ifstream infile(filename);

    std::string c;
    int d;

    while (infile.good())
    {
        infile >> c;
        infile >> d;
        //std::cout << c << d << std::endl;
        outcomes[c] = d;
    }
    return outcomes;
}

What can I do to read this data into some kind of a key/value hash as fast as possible?

Note: The first 16 characters are always going to be there (the cards), while the score can go up to around 1 million.

Some further informations gathered from various comments:

  • sample file: http://pastebin.com/rB1hFViM
  • ram restrictions: 750MB
  • initialization time restriction: 5s
  • computation time per hand restriction: 0.5s

回答1:

As I see it, there are two bottlenecks on your code.

1 Bottleneck

I believe that the file reading is the biggest problem there. Having a binary file is the fastest option. Not only you can read it directly in an array with a raw istream::read in a single operation (which is very fast), but you can even map the file in memory if your OS supports it. Here is a link that's very informative on how to use memory mapped files.


2 Bottleneck

The std::map is usually implemented with a self-balancing BST that will store all the data in order. This makes the insertion to be an O(logn) operation. You can change it to std::unordered_map, wich uses a hash table instead. A hash table have a constant time insertion if the number of colisions are low. As the ammount of elements that you need to read is known, you can reserve a suitable ammount of chuncks before inserting the elements. Keep in mind that you need more chuncks than the number of elements that will be inserted in the hash to avoid the maximum ammount of colisions.



回答2:

Ian Medeiros already mentioned the two major botlenecks.

a few thoughts about data structures:

the amount of different cards is known: 4 colors of each 13 cards -> 52 cards. so a card requires less than 6 bits to store. your current file format currently uses 24 bit (includig the comma). so by simply enumerating the cards and omitting the comma you can save ~2/3 of file size and allows you to determine a card with reading only one character per card. if you want to keep the file text based you may use a-m, n-z, A-M and N-Z for the four colors.

another thing that bugs me is the string based map. string operations are innefficient. One hand contains 5 cards. that means 52^5 posiibilities if we keep it simple and do not consider the already drawn cards.

--> 52^5 = 380.204.032 < 2^32

that means we can enumuerate every possible hand with a uint32 number. by defining a special sorting scheme of the cards (since order is irrelevant), we can assign a number to the hand and use this number as key in our map that is a lot faster than using strings.

if we have enough memory (1.5 GB) we do not even need a map but we can simply use an array. of course the most cells are unused but access may be very fast. we even can ommit the ordering of the cards since the cells are present independet if we fill them or not. So we can use them. but in this case you should not forget to fill all possible permutations of the hand read from the file.

with this scheme we also (may be) can further optimize our file reading speed. if we only store the hands number and the rating so that only 2 values need to be parsed.

infact we can optimize the required storage space by using a more complex adressing scheme for the different hands, since in reality there are only 52*51*50*49*48 = 311.875.200 possible hands.additional to that the ordering is irrelevant as mentioned but i think that this saving is not worth the increased complexity of the encoding of the hands.



回答3:

A simple idea might be to use the C API, which is considerably simpler:

#include <cstdio>

int n;
char s[128];

while (std::fscanf(stdin, "%127s %d", s, &n) == 2)
{
    outcomes[s] = n;
}

A rough test showed a considerable speedup for me compared to the iostreams library.

Further speedups may be achieved by storing the data in a contiguous array, e.g. a vector of std::pair<std::string, int>; it depends on whether your data is already sorted and how you need to access it later.

For a serious solution, though, you should probably step back further and think of a better way to represent your data. For example, a fixed-width, binary encoding would be much more space-efficient and faster to parse, since you won't need to look ahead for line endings or parse strings.

Update: From some quick experimentation I've found it fairly fast to first read the entire file into memory and then perform alternating strtok calls with either " " or "\n" as the delimiter; whenever a pair of calls succeed, apply strtol on the second pointer to parse the integer. Here's a skeleton:

#include <cerrno>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <vector>

int main()
{
    std::vector<char> data;

    // Read entire file to memory
    {
        data.reserve(100000000);

        char buf[4096];
        for (std::size_t n; (n = std::fread(buf, 1, sizeof buf, stdin)) > 0; )
        {
            data.insert(data.end(), buf, buf + n);
        }
        data.push_back('\0');
    }

    // Tokenize the in-memory data
    char * p = &data.front();
    for (char * q = std::strtok(p, " "); q; q = std::strtok(nullptr, " "))
    {
        if (char * r = std::strtok(nullptr, "\n"))
        {
            char * e;
            errno = 0;
            int const n = std::strtol(r, &e, 10);
            if (*e != '\0' || errno != 0) { continue; }

            // At this point we have data:
            // * the string is "q"
            // * the integer is "n"
        }
    }
}