Given the following, i can find the longest common substring:
s1 = "this is a foo bar sentence ."
s2 = "what the foo bar blah blah black sheep is doing ?"
def longest_common_substring(s1, s2):
m = [[0] * (1 + len(s2)) for i in xrange(1 + len(s1))]
longest, x_longest = 0, 0
for x in xrange(1, 1 + len(s1)):
for y in xrange(1, 1 + len(s2)):
if s1[x - 1] == s2[y - 1]:
m[x][y] = m[x - 1][y - 1] + 1
if m[x][y] > longest:
longest = m[x][y]
x_longest = x
else:
m[x][y] = 0
return s1[x_longest - longest: x_longest]
print longest_common_substring(s1, s2)
[out]:
foo bar
But how do i ensure that the longest common substring respect English word boundary and don't cut up a word? For example, the following sentences:
s1 = "this is a foo bar sentence ."
s2 = "what a kappa foo bar black sheep ?"
print longest_common_substring(s1, s2)
outputs the follow which is NOT desired since it breaks up the word kappa
from s2:
a foo bar
The desired output is still:
foo bar
I've tried also an ngram way of getting the longest common substring respecting word boundary but is there other way that deals with strings without calculating ngrams? (see answer)
Just add an acceptance condition in your code:
This is more of an interesting problem then I initially gave it credit for. When you think about it there are 4 possible outcomes.
Now your code takes care of the trivial case so we can leverage that; all that's left is to wrap your results in a few checks for the other cases. So what should these checks look like? Let's take your failure case:
So from a string
find
perspective we can find all those letters in that order in bothstring1
andstring2
, but if we separate everything around spaces into lists, and look for the lists in order onlystring1
will match.Now I'm primarily a C guy, so I want to write that in a function:
With this function we can check if either of the two strings do not contain all of the words of our result from
longest_common_substring(s1, s2)
in order. Perfect. So the last step is to combine these two functions and check for each of the 4 cases listed above:Now the function
longest_whole_substring(s1, s2)
will provide the longest whole substring not chopping off any words. Let’s just test it out in each of the cases:Trivial:
Word boundary at the beginning:
Word boundry at the end:
And a Word boundry at both ends:
Lookin' Good!