How to use std::string as key in stxxl::map

2019-08-09 06:46发布

问题:

I am trying to use std::string as a key in the stxxl::map The insertion was fine for small number of strings about 10-100. But while trying to insert large number of strings about 100000 in it, I am getting segmentation fault.

The code is as follows:

struct CompareGreaterString {
    bool operator () (const std::string& a, const std::string& b) const {
       return a > b;
    }
    static std::string max_value() {
       return "";
    } 
};

// template parameter <KeyType, DataType, CompareType, RawNodeSize, RawLeafSize, PDAllocStrategy (optional)>
typedef stxxl::map<std::string, unsigned int, CompareGreaterString, DATA_NODE_BLOCK_SIZE, DATA_LEAF_BLOCK_SIZE> name_map;
name_map strMap((name_map::node_block_type::raw_size)*3, (name_map::leaf_block_type::raw_size)*3);
for (unsigned int i = 0; i < 1000000; i++) { /// Inserting 1 million strings
    std::stringstream strStream;
    strStream << (i);
    Console::println("Inserting: " + strStream.str());
    strMap[strStream.str()]=i;
}

In here I am unable to identify why I am unable to insert more number of strings. I am getting segmentation fault exactly while inserting "1377". Plus I am able to add any number of integers as key. I feel that the variable size of string might be causing this trouble.

Also I am unable to understand what to return for max_value of the string. I simply returned a blank string.

回答1:

According to documentation:

CompareType must also provide a static max_value method, that returns a value of type KeyType that is larger than any key stored in map

Because empty string happens to compare as smaller than any other string, it breaks this precondition and may thus cause unspecified behaviour.

Here's a max_value that should work. MAX_KEY_LEN is just an integer which is larger or equal to the length of the longest possible string key that the map can have.

struct CompareGreaterString {
    // ...
    static std::string max_value() {
        return std::string(MAX_KEY_LEN, std::numeric_limits<unsigned char>::max());
    }
};


回答2:

I have finally found the solution to my problem with great help from Timo bingmann, user2079303 and Martin Ba. Thank you.

I would like to share it with you.

Firstly stxxl supports POD only. That means it stores fixed sized structures only. Hence std::string cannot be a key. stxxl::map worked for about 100-1000 strings because they were contained in the physical memory itself. When more strings are inserted it has to write on disk which is internally causing some problems.

Hence we need to use a fixed string using char[] as follows:

static const int MAX_KEY_LEN = 16;

class FixedString { 
public:
    char charStr[MAX_KEY_LEN];

    bool operator< (const FixedString& fixedString) const {
        return std::lexicographical_compare(charStr, charStr+MAX_KEY_LEN,
            fixedString.charStr, fixedString.charStr+MAX_KEY_LEN);
    }

    bool operator==(const FixedString& fixedString) const {
        return std::equal(charStr, charStr+MAX_KEY_LEN, fixedString.charStr);
    }

    bool operator!=(const FixedString& fixedString) const {
        return !std::equal(charStr, charStr+MAX_KEY_LEN, fixedString.charStr);
    } 
};

struct comp_type : public std::less<FixedString> {
    static FixedString max_value()
    {
        FixedString s;
        std::fill(s.charStr, s.charStr+MAX_KEY_LEN, 0x7f);
        return s;
    } 
};

Please note that all the operators mainly((), ==, !=) need to be overriden for all the stxxl::map functions to work Now we may define fixed_name_map for map as follows:

typedef stxxl::map<FixedString, unsigned int, comp_type, DATA_NODE_BLOCK_SIZE, DATA_LEAF_BLOCK_SIZE> fixed_name_map;
fixed_name_map myFixedMap((fixed_name_map::node_block_type::raw_size)*5, (fixed_name_map::leaf_block_type::raw_size)*5);

Now the program is compiling fine and is accepting about 10^8 strings without any problem. also we can use myFixedMap like std::map itself. {for ex: myFixedMap[fixedString] = 10}



回答3:

If you are using C++11, then as an alternative to the FixedString class you could use std::array<char, MAX_KEY_LEN>. It is an STL layer on top of an ordinary fixed-size C array, implementing comparisons and iterators as you are used to from std::string, but it's a POD type, so STXXL should support it.

Alternatively, you can use serialization_sort in TPIE. It can sort elements of type std::pair<std::string, unsigned int> just fine, so if all you need is to insert everything in bulk and then access it in bulk, this will be sufficient for your case (and probably faster depending on the exact case).