Data structure for matching sets

I have an application where I have a number of sets. A set might be
{4, 7, 12, 18}
unique numbers and all less than 50.

I then have several data items:
1 {1, 2, 4, 7, 8, 12, 18, 23, 29}
2 {3, 4, 6, 7, 15, 23, 34, 38}
3 {4, 7, 12, 18}
4 {1, 4, 7, 12, 13, 14, 15, 16, 17, 18}
5 {2, 4, 6, 7, 13, 15}

Data items 1, 3 and 4 match the set because they contain all items in the set.

I need to design a data structure that is super fast at identifying whether a data item ~~is a member of a set~~ includes all the members that are part of the set (so the data item is a superset of the set). My best estimates at the moment suggest that there will be fewer than 50,000 sets.

My current implementation has my sets and data as unsigned 64 bit integers and the sets stored in a list. Then to check a data item I iterate through the list doing a ((set & data) == set) comparison. It works and it's space efficient but it's slow (O(n)) and I'd be happy to trade some memory for some performance. Does anyone have any better ideas about how to organize this?

Edit: Thanks very much for all the answers. It looks like I need to provide some more information about the problem. I get the sets first and I then get the data items one by one. I need to check whether the data item is matches one of the sets.
The sets are very likely to be 'clumpy' for example for a given problem 1, 3 and 9 might be contained in 95% of sets; I can predict this to some degree in advance (but not well).

For those suggesting memoization: this is this the data structure for a memoized function. The sets represent general solutions that have already been computed and the data items are new inputs to the function. By matching a data item to a general solution I can avoid a whole lot of processing.

标签： c++ c algorithm data-structures

13条回答

我想做一个坏孩纸

2楼-- · 2020-05-19 07:32

I'm surprised no one has mentioned that the STL contains an algorithm to handle this sort of thing for you. Hence, you should use includes. As it describes it performs at most 2*(N+M)-1 comparisons for a worst case performance of O(M+N).

Hence:

bool isContained = includes( myVector.begin(), myVector.end(), another.begin(), another.end() );

if you're needing O( log N ) time, I'll have to yield to the other responders.

0人赞添加讨论(0) 举报

Evening l夕情丶

3楼-- · 2020-05-19 07:33

This is not a real answer more an observation: this problem looks like it could be efficiently parallellized or even distributed, which would at least reduce the running time to O(n / number of cores)

0人赞添加讨论(0) 举报

劳资没心，怎么记你

4楼-- · 2020-05-19 07:36

The index of the sets that match the search criterion resemble the sets themselves. Instead of having the unique indexes less than 50, we have unique indexes less than 50000. Since you don't mind using a bit of memory, you can precompute matching sets in a 50 element array of 50000 bit integers. Then you index into the precomputed matches and basically just do your ((set & data) == set) but on the 50000 bit numbers which represent the matching sets. Here's what I mean.

#include <iostream>

enum
{
    max_sets = 50000, // should be >= 64
    num_boxes = max_sets / 64 + 1,
    max_entry = 50
};

uint64_t sets_containing[max_entry][num_boxes];

#define _(x) (uint64_t(1) << x)

uint64_t sets[] =
{
    _(1) | _(2) | _(4) | _(7) | _(8) | _(12) | _(18) | _(23) | _(29),
    _(3) | _(4) | _(6) | _(7) | _(15) | _(23) | _(34) | _(38),
    _(4) | _(7) | _(12) | _(18),
    _(1) | _(4) | _(7) | _(12) | _(13) | _(14) | _(15) | _(16) | _(17) | _(18),
    _(2) | _(4) | _(6) | _(7) | _(13) | _(15),
    0,
};

void big_and_equals(uint64_t lhs[num_boxes], uint64_t rhs[num_boxes])
{
    static int comparison_counter = 0;
    for (int i = 0; i < num_boxes; ++i, ++comparison_counter)
    {
        lhs[i] &= rhs[i];
    }
    std::cout
        << "performed "
        << comparison_counter
        << " comparisons"
        << std::endl;
}

int main()
{
    // Precompute matches
    memset(sets_containing, 0, sizeof(uint64_t) * max_entry * num_boxes);

    int set_number = 0;
    for (uint64_t* p = &sets[0]; *p; ++p, ++set_number)
    {
        int entry = 0;
        for (uint64_t set = *p; set; set >>= 1, ++entry)
        {
            if (set & 1)
            {
                std::cout
                    << "sets_containing["
                    << entry
                    << "]["
                    << (set_number / 64)
                    << "] gets bit "
                    << set_number % 64
                    << std::endl;

                uint64_t& flag_location =
                    sets_containing[entry][set_number / 64];

                flag_location |= _(set_number % 64);
            }
        }
    }

    // Perform search for a key
    int key[] = {4, 7, 12, 18};
    uint64_t answer[num_boxes];
    memset(answer, 0xff, sizeof(uint64_t) * num_boxes);

    for (int i = 0; i < sizeof(key) / sizeof(key[0]); ++i)
    {
        big_and_equals(answer, sets_containing[key[i]]);
    }

    // Display the matches
    for (int set_number = 0; set_number < max_sets; ++set_number)
    {
        if (answer[set_number / 64] & _(set_number % 64))
        {
            std::cout
                << "set "
                << set_number
                << " matches"
                << std::endl;
        }
    }

    return 0;
}

Running this program yields:

sets_containing[1][0] gets bit 0
sets_containing[2][0] gets bit 0
sets_containing[4][0] gets bit 0
sets_containing[7][0] gets bit 0
sets_containing[8][0] gets bit 0
sets_containing[12][0] gets bit 0
sets_containing[18][0] gets bit 0
sets_containing[23][0] gets bit 0
sets_containing[29][0] gets bit 0
sets_containing[3][0] gets bit 1
sets_containing[4][0] gets bit 1
sets_containing[6][0] gets bit 1
sets_containing[7][0] gets bit 1
sets_containing[15][0] gets bit 1
sets_containing[23][0] gets bit 1
sets_containing[34][0] gets bit 1
sets_containing[38][0] gets bit 1
sets_containing[4][0] gets bit 2
sets_containing[7][0] gets bit 2
sets_containing[12][0] gets bit 2
sets_containing[18][0] gets bit 2
sets_containing[1][0] gets bit 3
sets_containing[4][0] gets bit 3
sets_containing[7][0] gets bit 3
sets_containing[12][0] gets bit 3
sets_containing[13][0] gets bit 3
sets_containing[14][0] gets bit 3
sets_containing[15][0] gets bit 3
sets_containing[16][0] gets bit 3
sets_containing[17][0] gets bit 3
sets_containing[18][0] gets bit 3
sets_containing[2][0] gets bit 4
sets_containing[4][0] gets bit 4
sets_containing[6][0] gets bit 4
sets_containing[7][0] gets bit 4
sets_containing[13][0] gets bit 4
sets_containing[15][0] gets bit 4
performed 782 comparisons
performed 1564 comparisons
performed 2346 comparisons
performed 3128 comparisons
set 0 matches
set 2 matches
set 3 matches

3128 uint64_t comparisons beats 50000 comparisons so you win. Even in the worst case, which would be a key which has all 50 items, you only have to do num_boxes * max_entry comparisons which in this case is 39100. Still better than 50000.

0人赞添加讨论(0) 举报

forever°为你锁心

5楼-- · 2020-05-19 07:38

Since the numbers are less than 50, you could build a one-to-one hash using a 64-bit integer and then use bitwise operations to test the sets in O(1) time. The hash creation would also be O(1). I think either an XOR followed by a test for zero or an AND followed by a test for equality would work. (If I understood the problem correctly.)

0人赞添加讨论(0) 举报

放我归山

6楼-- · 2020-05-19 07:40

You could use inverted index of your data items. For your example

1 {1, 2, 4, 7, 8, 12, 18, 23, 29}
2 {3, 4, 6, 7, 15, 23, 34, 38}
3 {4, 7, 12, 18}
4 {1, 4, 7, 12, 13, 14, 15, 16, 17, 18}
5 {2, 4, 6, 7, 13, 15}

the inverted index will be

1: {1, 4}
2: {1, 5}
3: {2}
4: {1, 2, 3, 4, 5}
5: {}
6: {2, 5}
...

So, for any particular set {x_0, x_1, ..., x_i} you need to intersect sets for x_0, x_1 and others. For example, for the set {2,3,4} you need to intersect {1,5} with {2} and with {1,2,3,4,5}. Because you could have all your sets in inverted index sorted, you could intersect sets in min of lengths of sets that are to be intersected.

Here could be an issue, if you have very 'popular' items (as 4 in our example) with huge index set.

Some words about intersecting. You could use sorted lists in inverted index, and intersect sets in pairs (in increasing length order). Or as you have no more than 50K items, you could use compressed bit sets (about 6Kb for every number, fewer for sparse bit sets, about 50 numbers, not so greedily), and intersect bit sets bitwise. For sparse bit sets that will be efficiently, I think.

0人赞添加讨论(0) 举报

虎瘦雄心在

7楼-- · 2020-05-19 07:43

I see another solution which is dual to yours (i.e., testing a data item against every set) and that is using a binary tree where each node tests whether a specific item is included or not.

For instance if you had the sets A = { 2, 3 } and B = { 4 } and C = { 1, 3 } you'd have the following tree

                      _NOT_HAVE_[1]___HAVE____
                      |                      |            
                _____[2]_____          _____[2]_____
                |           |          |           |
             __[3]__     __[3]__    __[3]__     __[3]__
             |     |     |     |    |     |     |     |
            [4]   [4]   [4]   [4]  [4]   [4]   [4]   [4]
            / \   / \   / \   / \  / \   / \   / \   / \
           .   B .   B .   B .   B    B C   B A   A A   A
                                            C     B C   B
                                                        C

After making the tree, you'd simply need to make 50 comparisons---or how ever many items you can have in a set.

For instance, for { 1, 4 }, you branch through the tree : right (the set has 1), left (doesn't have 2), left, right, and you get [ B ], meaning only set B is included in { 1, 4 }.

This is basically called a "Binary Decision Diagram". If you are offended by the redundancy in the nodes (as you should be, because 2^50 is a lot of nodes...) then you should consider the reduced form, which is called a "Reduced, Ordered Binary Decision Diagram" and is a commonly used data-structure. In this version, nodes are merged when they are redundant, and you no longer have a binary tree, but a directed acyclic graph.

The Wikipedia page on ROBBDs can provide you with more information, as well as links to libraries which implement this data-structure for various languages.

0人赞添加讨论(0) 举报

1 2 3 下一页

Data structure for matching sets

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间