I have a large set of strings. I want to divide the strings into subsets such that:
- Each item in a subset shares 1 or more contiguous characters.
- The shared contiguous characters that define a subset are unique for the set of subsets (i.e. the shared characters are sufficient for defining a subset of strings that stands in a mutually exclusive relationship with other subsets).
- The subsets are roughly the same size.
- The resulting set of subsets is the minimal number of subsets needed that fit the above criteria.
For example given the following set of names:
Alan,Larry,Alfred,Barbara,Alphonse,Carl
I can divide this set into two subsets of equal size. Subset 1 defined by the contiguous characters "AL" would be
Alan, Alfred, Alphonse
Subset 2 defined by contiguous characters ar would be
Larry, Barbara, Carl.
I am looking for an algorithm that would do this for any arbitrary set of strings. The resulting set of subsets does not have to equal 2 but it should be the minimum set and the resulting subsets should be approximately equal.
Elliott
Have a look at http://en.wikipedia.org/wiki/Suffix_array. It is possible that what you really want to do is to create a suffix array for each document, and them merge all the suffix arrays, with pointers back to the original versions, so that you can search the collection as one for a string by looking for it as a suffix in the array.
This is tricky. I wonder if there's some higher purpose (like word indexing) or is this just an academic question?
It's not solvable in general, unless you accept the trivial solution of a single set defined by the empty sequence (which occurs in all words). For example, take the strings: a
, ab
, b
.
a
must go into the set defined by a
.
b
must go into the set defined by b
.
ab
must go into both, because it contains both subsequences.
Will a similar example occur with the kind of words you're dealing with? I don't know. Perhaps you can deal with words mapping to more than one set, or you can have a tie-breaking system to determine where to put it.
Assuming this isn't a problem, the burrows-wheeler transform might help with finding good substrings.
Or how about something like:
- Generate all subsequences in the words.
- Build an interference graph of the subsequences, with an edge connecting two subsequences if they both occur in a single word.
- Colour the graph.
- Pick a representative subsequence for each colour.
- Make a set defined by each representative subsequence. If all words of that colour have that substring, put them all into that set.
- Otherwise, delete that substring from the graph, and repeat from step 3.
This algorithm is probably broken but it might give you some ideas about a solution (or at least some idea of the trickiness of your question ;-).