-->

Find K-most longest common suffix in a set of stri

2019-08-24 08:15发布

问题:

I want to find most longest common suffix in a set of strings to detect some potential important morpheme in my natural language process project.

Given frequency K>=2,find the K-most common longest suffix in a list of strings S1,S2,S3...SN

To simplify the problem, here are some examples:

Input1:

 K=2
 S=["fireman","woman","businessman","policeman","businesswoman"]

Output1:

 ["man","eman","woman"]

Explanation1:

"man" occur 4 times, "eman" occur 2 times,"woman" occur 2 times


It would be appreciated if the output also tracked the frequency of each common suffix

{"man":4,"eman":2,"woman":2}

It is not guaranteed that each word will have at least one-length common suffix, see the example below.

Input2:

 K=2
 S=["fireman","woman","businessman","policeman","businesswoman","apple","pineapple","people"]

Output2:

 ["man","eman","woman","ple","apple"]

Explanation2:

"man" occur 4 times, "eman" occur 2 times,"woman" occur 2 times

"ple" occur 3 times, "apple" occur 2 times


Any efficient algorithm to solve this problem ?

I have referred to Suffix Tree and Generalised Suffix Tree algorithm but it seems not to fit this problem well.

(BTW, I am researching on a Chinese NLP project which means there are far more characters than English has)

回答1:

If you want to be really efficient this paper could help. The algorithm gives you the most frequent (longest) substrings from a set of strings in O(n log n). The basic idea is the following:

  1. Concatenate all strings with whitespace between them (these are used as seperators later) in order to from a single long string

  2. Build a suffix array SA over the long string

  3. For each pair of adjacent entries in the suffix array, check how many of the starting characters are equal of the two suffixes that the entries refer to (you should stop when you reach a whitespace). Put the count of equal characters in an additional array (called LCP). Each entry LCP[i] refers to the count of equal characters of the string at position SA[i-1] and SA[i] in the long string.

  4. Go over the LCP an find successive entries with a count larger than K.

Example from the paper for3 strings

s1= bbabab

s2 = abacac

s3 = bbaaa