Data structure for matching sets - answers, page 3

I have an application that works with a number of sets. A set might be
{4, 7, 12, 18}
that is, unique numbers, all less than 50.

I then have several data items:
1 {1, 2, 4, 7, 8, 12, 18, 23, 29}
2 {3, 4, 6, 7, 15, 23, 34, 38}
3 {4, 7, 12, 18}
4 {1, 4, 7, 12, 13, 14, 15, 16, 17, 18}
5 {2, 4, 6, 7, 13, 15}

Data items 1, 3 and 4 match the set because they contain all items in the set.

I need to design a data structure that is super fast at identifying whether a data item includes all the members of the set (so the data item is a superset of the set). My best estimates at the moment suggest that there will be fewer than 50,000 sets.

My current implementation stores my sets and data as unsigned 64-bit integers, with the sets kept in a list. To check a data item I iterate through the list doing a ((set & data) == set) comparison. It works and it's space efficient, but it's slow (O(n) in the number of sets) and I'd be happy to trade some memory for some performance. Does anyone have any better ideas about how to organize this?
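A minimal sketch of the linear scan described above, assuming each value v is stored as bit v of a 64-bit word. The names `make_mask` and `matches_any` are illustrative only and are not taken from the real implementation:

```cpp
#include <cstdint>
#include <initializer_list>
#include <vector>

// Encode a collection of values (each assumed to be in [0, 63]) as a bitmask.
std::uint64_t make_mask(std::initializer_list<int> values) {
  std::uint64_t mask = 0;
  for (int v : values) mask |= std::uint64_t{1} << v;
  return mask;
}

// Linear scan: true if the data item is a superset of at least one stored set.
bool matches_any(const std::vector<std::uint64_t>& sets, std::uint64_t data) {
  for (std::uint64_t set : sets) {
    if ((set & data) == set) return true;  // every bit of `set` is also in `data`
  }
  return false;
}
```

Each comparison is a single AND and a compare, so the cost is dominated by the number of stored sets rather than by the size of any one set.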

Edit: Thanks very much for all the answers. It looks like I need to provide some more information about the problem. I get the sets first and I then get the data items one by one. I need to check whether the data item matches one of the sets.
The sets are very likely to be 'clumpy'. For example, for a given problem, 1, 3 and 9 might be contained in 95% of sets; I can predict this to some degree in advance (but not well).

For those suggesting memoization: this is the data structure for a memoized function. The sets represent general solutions that have already been computed and the data items are new inputs to the function. By matching a data item to a general solution I can avoid a whole lot of processing.

Tags: c++ c algorithm data-structures

13 answers

smile是对你的礼貌

#2 · 2020-05-19 07:54

You can build a reverse index of "haystack" lists that contain each element:

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <iterator>
#include <set>
#include <unordered_map>
#include <utility>
#include <vector>

int main() {
  std::set<int> needle = {4, 7, 12, 18};
  // A list of each of your data sets (ids 1 through 5):
  std::vector<std::set<int>> haystacks = {
      {1, 2, 4, 7, 8, 12, 18, 23, 29},        // 1
      {3, 4, 6, 7, 15, 23, 34, 38},           // 2
      {4, 7, 12, 18},                         // 3
      {1, 4, 7, 12, 13, 14, 15, 16, 17, 18},  // 4
      {2, 4, 6, 7, 13, 15}};                  // 5

  // element_haystacks maps each integer to the sets that contain it
  // (the keys are the integers from the haystack sets, and the set
  // values are the 1-based ids of the haystacks listed above):
  // 1 -> {1, 4}  Element 1 is in sets 1 and 4.
  // 2 -> {1, 5}  Element 2 is in sets 1 and 5.
  // 3 -> {2}  Element 3 is in set 2.
  // 4 -> {1, 2, 3, 4, 5}  Element 4 is in sets 1 through 5.
  std::unordered_map<int, std::set<int>> element_haystacks;
  for (std::size_t id = 0; id < haystacks.size(); ++id)
    for (int element : haystacks[id])
      element_haystacks[element].insert(static_cast<int>(id) + 1);

  std::set<int> answer_sets;  // The haystack ids that contain your set.
  for (std::set<int>::const_iterator it = needle.begin(); it != needle.end(); ++it) {
    const std::set<int> &new_answer = element_haystacks[*it];
    if (it == needle.begin()) {
      // Seed with the first element's list; intersecting with an
      // empty answer set would always stay empty.
      answer_sets = new_answer;
    } else {
      std::set<int> existing_answer;
      std::swap(existing_answer, answer_sets);
      // Remove all answers that don't occur in the new element list.
      std::set_intersection(existing_answer.begin(), existing_answer.end(),
                            new_answer.begin(), new_answer.end(),
                            std::inserter(answer_sets, answer_sets.begin()));
    }
    if (answer_sets.empty()) break;  // No matches :(
  }

  // answer_sets now lists the haystack ids that include all your needle elements.
  for (int id : answer_sets) {
    std::cout << "set: " << id << "\n";
  }
}

If I'm not mistaken, this will have a max runtime of O(k*m), where k is the avg number of sets that an integer belongs to and m is the avg size of the needle set (<50). Unfortunately, it'll have a significant memory overhead due to building the reverse mapping (element_haystacks).

I'm sure you could improve this a bit if you stored sorted vectors instead of sets, and element_haystacks could be a 50-element vector instead of a hash map.
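That improvement might look roughly like the sketch below. It rests on a few assumptions: all values lie in [0, 50), each haystack holds unique values, and haystack ids are 0-based positions in the input vector. The names `kMaxValue`, `Index`, `build_index` and `find_supersets` are made up for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

constexpr int kMaxValue = 50;  // Assumption: every element is in [0, 50).

// index[v] holds the sorted ids (0-based positions) of the haystacks containing v.
using Index = std::vector<std::vector<int>>;

Index build_index(const std::vector<std::vector<int>>& haystacks) {
  Index index(kMaxValue);
  for (std::size_t id = 0; id < haystacks.size(); ++id) {
    for (int v : haystacks[id]) {
      // Ids are appended in increasing order, so each list stays sorted.
      index[v].push_back(static_cast<int>(id));
    }
  }
  return index;
}

// Returns the ids of all haystacks that contain every element of the needle.
// By convention in this sketch, an empty needle yields no ids.
std::vector<int> find_supersets(const Index& index, const std::vector<int>& needle) {
  if (needle.empty()) return {};
  std::vector<int> answer = index[needle.front()];
  std::vector<int> scratch;
  for (std::size_t i = 1; i < needle.size() && !answer.empty(); ++i) {
    const std::vector<int>& ids = index[needle[i]];
    scratch.clear();
    std::set_intersection(answer.begin(), answer.end(), ids.begin(), ids.end(),
                          std::back_inserter(scratch));
    answer.swap(scratch);  // keep only ids present in both lists
  }
  return answer;
}
```

With the five data items from the question as haystacks, looking up the needle {4, 7, 12, 18} yields ids 0, 2 and 3, which correspond to data items 1, 3 and 4. Starting from the needle element with the shortest id list would shrink the intermediate results sooner, though that refinement is not shown here.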
