Algorithm for dividing a set of strings into a minimal number of subsets

Posted 2019-05-14 19:36

Question:

I have a large set of strings. I want to divide the strings into subsets such that:

  1. Each item in a subset shares 1 or more contiguous characters.
  2. The shared contiguous characters that define a subset are unique to that subset (i.e. the shared characters are sufficient to define a subset of strings that is mutually exclusive with the other subsets).
  3. The subsets are roughly the same size.
  4. The resulting set of subsets is the minimum number of subsets needed to satisfy the above criteria.

For example given the following set of names:

Alan,Larry,Alfred,Barbara,Alphonse,Carl

I can divide this set into two subsets of equal size. Subset 1 defined by the contiguous characters "AL" would be

Alan, Alfred, Alphonse

Subset 2 defined by the contiguous characters "ar" would be

Larry, Barbara, Carl.

I am looking for an algorithm that would do this for any arbitrary set of strings. The resulting set of subsets does not have to contain exactly two subsets, but it should be the minimum number of subsets, and the resulting subsets should be approximately equal in size.
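To make the example concrete, here is a minimal sketch (the function name `substring_index` is my own, not from the question) that indexes every contiguous substring of the names and confirms that the keys "al" and "ar", compared case-insensitively, split the six names into the two disjoint subsets above:

```python
from collections import defaultdict

names = ["Alan", "Larry", "Alfred", "Barbara", "Alphonse", "Carl"]

def substring_index(words):
    """Map each lowercase contiguous substring to the set of words containing it."""
    index = defaultdict(set)
    for word in words:
        w = word.lower()
        for i in range(len(w)):
            for j in range(i + 1, len(w) + 1):
                index[w[i:j]].add(word)
    return index

index = substring_index(names)
subset1 = index["al"]   # {'Alan', 'Alfred', 'Alphonse'}
subset2 = index["ar"]   # {'Larry', 'Barbara', 'Carl'}

# The two defining substrings are mutually exclusive and cover all names.
assert subset1.isdisjoint(subset2)
assert subset1 | subset2 == set(names)
```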

Elliott

Answer 1:

Have a look at http://en.wikipedia.org/wiki/Suffix_array. It is possible that what you really want to do is create a suffix array for each document and then merge all the suffix arrays, with pointers back to the original versions, so that you can search the collection as one by looking up a string as the prefix of a suffix in the array.
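As a rough illustration of this suggestion (the function names are my own, and a real implementation would build the suffix array far more efficiently than this naive version), the sketch below stores every suffix of every string with a pointer back to its source, sorts them, and binary-searches the result to find which strings contain a given substring:

```python
import bisect

def build_generalized_suffix_array(strings):
    """Return a sorted list of (suffix, source_index) over all input strings."""
    entries = []
    for idx, s in enumerate(strings):
        s = s.lower()
        for start in range(len(s)):
            entries.append((s[start:], idx))
    entries.sort()
    return entries

def strings_containing(pattern, suffix_array, strings):
    """Find the source strings that contain `pattern` as a substring."""
    pattern = pattern.lower()
    lo = bisect.bisect_left(suffix_array, (pattern,))
    hits = set()
    # Every suffix starting with the pattern sits in one contiguous run.
    while lo < len(suffix_array) and suffix_array[lo][0].startswith(pattern):
        hits.add(strings[suffix_array[lo][1]])
        lo += 1
    return hits

names = ["Alan", "Larry", "Alfred", "Barbara", "Alphonse", "Carl"]
sa = build_generalized_suffix_array(names)
print(strings_containing("al", sa, names))   # {'Alan', 'Alfred', 'Alphonse'}
```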



Answer 2:

This is tricky. I wonder if there's some higher purpose (like word indexing) or if this is just an academic question?

It's not solvable in general, unless you accept the trivial solution of a single set defined by the empty sequence (which occurs in all words). For example, take the strings: a, ab, b.

  1. a must go into the set defined by a.
  2. b must go into the set defined by b.
  3. ab must go into both, because it contains both subsequences.

Will a similar example occur with the kind of words you're dealing with? I don't know. Perhaps you can deal with words mapping to more than one set, or you can have a tie-breaking system to determine where to put it.

Assuming this isn't a problem, the Burrows-Wheeler transform might help with finding good substrings.
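For reference, a naive sketch of the Burrows-Wheeler transform (rotation sort with a "$" sentinel assumed absent from the input; production code would use a suffix array instead) looks something like this:

```python
def bwt(text, sentinel="$"):
    """Return the last column of the sorted rotation matrix of text + sentinel."""
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)

# Characters that precede similar contexts end up adjacent in the output,
# which is what makes the transform useful for spotting repeated substrings.
print(bwt("barbara"))
```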

Or how about something like:

  1. Generate all substrings occurring in the words.
  2. Build an interference graph of the substrings, with an edge connecting two substrings whenever they both occur in the same word.
  3. Colour the graph.
  4. Pick a representative substring for each colour.
  5. Make a set defined by each representative substring. If all words of that colour contain that substring, put them all into that set.
  6. Otherwise, delete that substring from the graph and repeat from step 3.

This algorithm is probably broken, but it might give you some ideas about a solution (or at least some idea of the trickiness of your question ;-). A rough sketch of the first few steps follows.
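Here is a rough, non-authoritative sketch of steps 1 through 4 above (all function names are mine, and the colouring is a simple greedy pass rather than an optimal one):

```python
from collections import defaultdict
from itertools import combinations

def all_substrings(word):
    """Every contiguous substring of the word, lowercased."""
    w = word.lower()
    return {w[i:j] for i in range(len(w)) for j in range(i + 1, len(w) + 1)}

def interference_graph(words):
    """Edge between two substrings whenever both occur in the same word."""
    graph = defaultdict(set)
    for word in words:
        for a, b in combinations(all_substrings(word), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def greedy_colouring(graph):
    """Assign each node the smallest colour unused by its coloured neighbours."""
    colour = {}
    for node in sorted(graph, key=lambda n: len(graph[n]), reverse=True):
        used = {colour[n] for n in graph[node] if n in colour}
        colour[node] = next(c for c in range(len(graph) + 1) if c not in used)
    return colour

words = ["Alan", "Larry", "Alfred", "Barbara", "Alphonse", "Carl"]
colours = greedy_colouring(interference_graph(words))

# Step 4: pick one representative substring per colour (here, the longest).
representatives = {}
for sub, c in colours.items():
    if c not in representatives or len(sub) > len(representatives[c]):
        representatives[c] = sub
```

Note that all substrings of a single word form a clique in this graph, so the number of colours grows quickly; that is part of why the answer above hedges that the approach is probably broken as stated.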