I want to find most longest common suffix in a set of strings to detect some potential important morpheme in my natural language process project.
Given frequency K>=2
,find the K-most common longest suffix in a list of strings S1,S2,S3...SN
To simplify the problem, here are some examples:
Input1:
K=2
S=["fireman","woman","businessman","policeman","businesswoman"]
Output1:
["man","eman","woman"]
Explanation1:
"man" occur 4 times, "eman" occur 2 times,"woman" occur 2 times
It would be appreciated if the output also tracked the frequency of each common suffix
{"man":4,"eman":2,"woman":2}
It is not guaranteed that each word will have at least one-length common suffix, see the example below.
Input2:
K=2
S=["fireman","woman","businessman","policeman","businesswoman","apple","pineapple","people"]
Output2:
["man","eman","woman","ple","apple"]
Explanation2:
"man" occur 4 times, "eman" occur 2 times,"woman" occur 2 times
"ple" occur 3 times, "apple" occur 2 times
Any efficient algorithm to solve this problem ?
I have referred to Suffix Tree and Generalised Suffix Tree algorithm but it seems not to fit this problem well.
(BTW, I am researching on a Chinese NLP project which means there are far more characters than English has)