How to find the subset with the greatest number of

2019-05-25 05:59发布

Let's say I have a number of 'known' sets:

1 {a, b, c, d, e}
2 {b, c, d, e}
3 {a, c, d}
4 {c, d}

I'd like a function which takes a set as an input, (for example {a, c, d, e}) and finds the set that has the highest number of elements, and no more other items in common. In other words, the subset with the greatest cardinality. The answer doesn't have to be a proper subset. The answer in this case would be {a, c, d}.

EDIT: the above example was wrong, now fixed.

I'm trying to find the absolute most efficient way of doing this.

(In the below, I am assuming that the cost of comparing two sets is O(1) for the sake of simplicity. That operation is outside my control so there's no point thinking about it. In truth it would be a function of the cardinality of the two sets being compared.)

Candiate 1:

Generate all subsets of the input, then iterate over the known sets and return the largest one that is a subset. The downside to this is that the complexity will be something like O(n! × m), where n is the cardinality of the input set and m is the number of 'known' subsets.

Candidate 1a (thanks @bratbrat):

Iterate over all 'known' sets and calculate the cardinatlity of the intersection, and take the one with the highest value. This would be O(n) where n is the number of subsets.

Candidate 2:

Create an inverse table and calculate the euclidean distance between the input and the known sets. This could be quite quick. I'm not clear how I could limit this to include only subsets without a subsequent O(n) filter.

Candidate 3:

Iterate over all known sets and compare against the input. The complexity would be O(n) where n is the number of known sets.

I have at my disposal the set functions built into Python and Redis.

None of these seems particularly great. Ideas? The number of sets may get large (around 100,000 at a guess).

3条回答
The star\"
2楼-- · 2019-05-25 06:33
  1. Take the union of the known sets. This becomes a dictionary of known elements.
  2. Sort the known elements by their value (they're integers, right). This defines a given integer's position in a bit string.
  3. Use the above to define bit strings for each of the known sets. This is a one time operation - the results should be stored to avoid recomputation.
  4. For an input set, run it through the same transform to obtain its bit string.
  5. To get the largest subset, run through the list of known bit strings, taking the intersection (logical and) with the input bit string. Count the '1' elements. Remember the largest one.

http://packages.python.org/bitstring

As mentioned in the comments, this can be paralleled up by subdividing the known sets and giving each thread its own subset to work on. Each thread serves up its best match and then the parent thread picks the best from the threads.

查看更多
我想做一个坏孩纸
3楼-- · 2019-05-25 06:37

There's no possible way to do this in less than O(n) time... just reading the input is O(n).

A couple ideas:

Sort the sets by size (biggest first), and search for the first set which is a subset of the input set. Once you find one, you don't have to examine the rest.

If the number of possible items which could be in the sets is limited, you could represent them by bit-vectors. Then you could calculate a lookup table to tell you whether a given set is a subset of the input set. (Walk down the bits for each input set under consideration, word by word, indexing each word into the appropriate table. If you find an entry telling you that it's not a subset, again, you can move on directly to the next input set.) Whether this would actually buy you performance, depends on the implementation language. I imagine it would be most effective in a language with primitive integral types, like C or Java.

查看更多
劫难
4楼-- · 2019-05-25 06:38

How many searches are you making? In case you are searching multiple input sets you should be able to pre-process all the known sets (perhaps as a tree structure) and your search time for each query would be in the order of your query set size.

Eg: Create a Trie structure with all the known sets. Make sure to sort each set before inserting them. For the query, follow the links that are in the set.

查看更多
登录 后发表回答