Data structure for matching sets

2020-05-19 07:10发布

I have an application where I have a number of sets. A set might be
{4, 7, 12, 18}
unique numbers and all less than 50.

I then have several data items:
1 {1, 2, 4, 7, 8, 12, 18, 23, 29}
2 {3, 4, 6, 7, 15, 23, 34, 38}
3 {4, 7, 12, 18}
4 {1, 4, 7, 12, 13, 14, 15, 16, 17, 18}
5 {2, 4, 6, 7, 13, 15}

Data items 1, 3 and 4 match the set because they contain all items in the set.

I need to design a data structure that is super fast at identifying whether a data item is a member of a set includes all the members that are part of the set (so the data item is a superset of the set). My best estimates at the moment suggest that there will be fewer than 50,000 sets.

My current implementation has my sets and data as unsigned 64 bit integers and the sets stored in a list. Then to check a data item I iterate through the list doing a ((set & data) == set) comparison. It works and it's space efficient but it's slow (O(n)) and I'd be happy to trade some memory for some performance. Does anyone have any better ideas about how to organize this?

Edit: Thanks very much for all the answers. It looks like I need to provide some more information about the problem. I get the sets first and I then get the data items one by one. I need to check whether the data item is matches one of the sets.
The sets are very likely to be 'clumpy' for example for a given problem 1, 3 and 9 might be contained in 95% of sets; I can predict this to some degree in advance (but not well).

For those suggesting memoization: this is this the data structure for a memoized function. The sets represent general solutions that have already been computed and the data items are new inputs to the function. By matching a data item to a general solution I can avoid a whole lot of processing.

13条回答
smile是对你的礼貌
2楼-- · 2020-05-19 07:54

You can build a reverse index of "haystack" lists that contain each element:

std::set<int> needle;  // {4, 7, 12, 18}
std::vector<std::set<int>> haystacks;
// A list of your each of your data sets:
// 1 {1, 2, 4, 7, 8, 12, 18, 23, 29}
// 2 {3, 4, 6, 7, 15, 23, 34, 38}
// 3 {4, 7, 12, 18}
// 4 {1, 4, 7, 12, 13, 14, 15, 16, 17, 18}
// 5 {2, 4, 6, 7, 13, 

std::hash_map[int, set<int>>  element_haystacks;
// element_haystacks maps each integer to the sets that contain it
// (the key is the integers from the haystacks sets, and 
// the set values are the index into the 'haystacks' vector):
// 1 -> {1, 4}  Element 1 is in sets 1 and 4.
// 2 -> {1, 5}  Element 2 is in sets 2 and 4.
// 3 -> {2}  Element 3 is in set 3.
// 4 -> {1, 2, 3, 4, 5}  Element 4 is in sets 1 through 5.  
std::set<int> answer_sets;  // The list of haystack sets that contain your set.
for (set<int>::const_iterator it = needle.begin(); it != needle.end(); ++it) {
  const std::set<int> &new_answer = element_haystacks[i];
  std::set<int> existing_answer;
  std::swap(existing_answer, answer_sets);
  // Remove all answers that don't occur in the new element list.
  std::set_intersection(existing_answer.begin(), existing_answer.end(),
                        new_answer.begin(), new_answer.end(),
                        inserter(answer_sets, answer_sets.begin()));
  if (answer_sets.empty()) break;  // No matches :(
}

// answer_sets now lists the haystack_ids that include all your needle elements.
for (int i = 0; i < answer_sets.size(); ++i) {
  cout << "set: " << element_haystacks[answer_sets[i]];
}

If I'm not mistaken, this will have a max runtime of O(k*m), where is the avg number of sets that an integer belongs to and m is the avg size of the needle set (<50). Unfortunately, it'll have a significant memory overhead due to building the reverse mapping (element_haystacks).

I'm sure you could improve this a bit if you stored sorted vectors instead of sets and element_haystacks could be a 50 element vector instead of a hash_map.

查看更多
登录 后发表回答