Given the following, I can find the longest common substring:
s1 = "this is a foo bar sentence ."
s2 = "what the foo bar blah blah black sheep is doing ?"
def longest_common_substring(s1, s2):
    m = [[0] * (1 + len(s2)) for i in xrange(1 + len(s1))]
    longest, x_longest = 0, 0
    for x in xrange(1, 1 + len(s1)):
        for y in xrange(1, 1 + len(s2)):
            if s1[x - 1] == s2[y - 1]:
                m[x][y] = m[x - 1][y - 1] + 1
                if m[x][y] > longest:
                    longest = m[x][y]
                    x_longest = x
            else:
                m[x][y] = 0
    return s1[x_longest - longest: x_longest]

print longest_common_substring(s1, s2)
[out]:
foo bar
But how do I ensure that the longest common substring respects English word boundaries and doesn't cut up a word? For example, with the following sentences:
s1 = "this is a foo bar sentence ."
s2 = "what a kappa foo bar black sheep ?"
print longest_common_substring(s1, s2)
the output is the following, which is NOT desired since it breaks up the word kappa from s2:
a foo bar
The desired output is still:
foo bar
I've also tried an n-gram way of getting the longest common substring that respects word boundaries, but is there another way to deal with the strings without calculating n-grams? (see answer)
This is simple enough to understand. I used your code to do 75% of the job. I first split the sentences into words, then pass the word lists to your function to get the longest common substring (in this case it will be the longest run of consecutive words). Your function gives me ['foo', 'bar'], and I join the elements of that list to produce the desired result.
Here is the online working copy for you to test and verify and fiddle with it.
http://repl.it/RU0/1
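In Python 3 the idea can be sketched like this (the DP is the question's, only made to accept arbitrary sequences; variable names are mine):

```python
def longest_common_substring(a, b):
    # The question's DP, unchanged except that it accepts any sequences,
    # so it works on lists of words as well as on strings.
    m = [[0] * (1 + len(b)) for _ in range(1 + len(a))]
    longest, x_longest = 0, 0
    for x in range(1, 1 + len(a)):
        for y in range(1, 1 + len(b)):
            if a[x - 1] == b[y - 1]:
                m[x][y] = m[x - 1][y - 1] + 1
                if m[x][y] > longest:
                    longest, x_longest = m[x][y], x
            else:
                m[x][y] = 0
    return a[x_longest - longest: x_longest]

s1 = "this is a foo bar sentence ."
s2 = "what a kappa foo bar black sheep ?"

# Split into words first; the DP then matches runs of whole words.
words = longest_common_substring(s1.split(), s2.split())
print(" ".join(words))  # foo bar
```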
Edge cases
'.' and '?' are also treated as valid words, as in your case, if there is a space between the last word and the punctuation mark. If you don't leave a space, they will be counted as part of the last word; in that case 'sheep' and 'sheep?' would no longer be the same word. It's up to you to decide what to do with such characters before calling the function. In that case:
import re
s1 = re.sub('[.?]', '', s1)
s2 = re.sub('[.?]', '', s2)
and then continue as usual.
I did it recursively:
Just classify the two strings into 'shorter' and 'longer' before applying:
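A recursive sketch along those lines (my own reconstruction, assuming space-separated words; the function names are made up): check whether the shorter word list occurs contiguously in the longer one, and otherwise recurse after dropping a word from either end. This is exponential in the worst case, which is fine for short sentences:

```python
def common_words(shorter, longer):
    # Hypothetical helper, my reconstruction of the recursive idea.
    # If every word of `shorter` occurs contiguously in `longer`, that is
    # the answer; otherwise recurse after dropping a word from either end.
    if not shorter:
        return []
    n = len(shorter)
    if any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1)):
        return shorter
    head = common_words(shorter[1:], longer)
    tail = common_words(shorter[:-1], longer)
    return head if len(head) >= len(tail) else tail

def longest_common_word_run(s1, s2):
    # Classify into 'shorter' and 'longer' before applying, as suggested.
    shorter, longer = sorted([s1.split(), s2.split()], key=len)
    return " ".join(common_words(shorter, longer))

print(longest_common_word_run("this is a foo bar sentence .",
                              "what a kappa foo bar black sheep ?"))  # foo bar
```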
All you need to do is add checks for beginning and end of a word.
You then update m only for valid match ends, like so:
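A sketch of that idea (my own interpretation in Python 3, not the answer's original code, and assuming words are space-separated): only start a new run when both positions sit at a word start, and only record a candidate when both ends sit at a word end:

```python
def lcs_whole_words(s1, s2):
    # My interpretation of the boundary checks; assumes space-separated words.
    def word_start(s, i):
        return i == 0 or s[i - 1] == " "

    def word_end(s, i):
        return i == len(s) - 1 or s[i + 1] == " "

    m = [[0] * (1 + len(s2)) for _ in range(1 + len(s1))]
    longest, x_longest = 0, 0
    for x in range(1, 1 + len(s1)):
        for y in range(1, 1 + len(s2)):
            if s1[x - 1] == s2[y - 1]:
                if m[x - 1][y - 1] > 0:
                    m[x][y] = m[x - 1][y - 1] + 1  # extend the current run
                elif word_start(s1, x - 1) and word_start(s2, y - 1):
                    m[x][y] = 1                    # start runs only at word starts
                # update the answer only for valid match ends
                if (m[x][y] > longest
                        and word_end(s1, x - 1) and word_end(s2, y - 1)):
                    longest, x_longest = m[x][y], x
            else:
                m[x][y] = 0
    return s1[x_longest - longest: x_longest]

print(lcs_whole_words("this is a foo bar sentence .",
                      "what a kappa foo bar black sheep ?"))  # foo bar
```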
My answer does not draw from any official sources, but just a simple observation: at least in my installation, there is a difference between the output of your LCS function as it is on the pair (s1, s2) and (s1, s3):
As you may notice, if complete words are matched, then also the surrounding whitespace is matched.
You can then modify the function before its output is returned, like this:
I am sure there are other edge cases, like the substring appearing at the end of the string, recursively calling the function with either s1 or s2, whether to trim the answer's front or back, and others -- but at least in the cases you show, this simple modification does what you want. Do you think this direction is worth exploring?
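One way to act on that observation is a small post-processing helper (a hypothetical trim_partial_words, my sketch rather than the answer's code): when the raw character-level match does not begin or end with matched whitespace, the token at that end may be a word fragment, so drop it:

```python
def trim_partial_words(match):
    # Hypothetical helper: if the raw character-level match does not start
    # (or end) with matched whitespace, its first (or last) token may be a
    # word fragment, so drop it. Known caveat: this also drops a genuine
    # first/last word when the match sits at the very start/end of a string.
    if " " in match:
        if not match.startswith(" "):
            match = match[match.index(" "):]   # drop leading fragment
        if not match.endswith(" "):
            match = match[:match.rindex(" ")]  # drop trailing fragment
    return match.strip()

print(trim_partial_words("a foo bar "))  # foo bar  (leading 'a' was part of 'kappa')
print(trim_partial_words(" foo bar "))  # foo bar  (whole words, just stripped)
```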
Here's an ngram way:
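A sketch of an n-gram approach (my reconstruction, not the answer's original code): enumerate the word n-grams of both sentences from the longest possible n downwards, and return the first n-gram they share:

```python
def word_ngrams(words, n):
    # All contiguous word n-grams of a token list.
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def longest_common_ngram(s1, s2):
    w1, w2 = s1.split(), s2.split()
    # Try the longest possible n first and stop at the first shared n-gram.
    for n in range(min(len(w1), len(w2)), 0, -1):
        common = word_ngrams(w1, n) & word_ngrams(w2, n)
        if common:
            # If several n-grams tie for the same length, an arbitrary
            # one of them is returned.
            return " ".join(common.pop())
    return ""

print(longest_common_ngram("this is a foo bar sentence .",
                           "what a kappa foo bar black sheep ?"))  # foo bar
```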
One efficient method for finding longest common substrings is a suffix tree (see http://en.wikipedia.org/wiki/Suffix_tree and http://en.wikipedia.org/wiki/Longest_common_substring_problem). I don't see any reason you couldn't build a suffix tree over words instead of characters, in which case the longest common substring extracted from the tree would respect token boundaries. This approach is particularly efficient if you want to find common substrings between one fixed string and a large number of other strings.
See the accepted answer to "python: library for generalized suffix trees" for a list of Python suffix tree implementations.
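A related trick, sketched here as my own suggestion rather than anything from the answer: map each distinct word to a unique character (I use Unicode private-use codepoints, an assumption that there are fewer distinct words than available codepoints), run any character-level longest-common-substring routine (including the question's DP) on the encoded strings, then decode the result. Matches on the encoded strings can only fall on whole words:

```python
def longest_common_substring(a, b):
    # The question's character-level DP, unchanged in spirit.
    m = [[0] * (1 + len(b)) for _ in range(1 + len(a))]
    longest, x_longest = 0, 0
    for x in range(1, 1 + len(a)):
        for y in range(1, 1 + len(b)):
            if a[x - 1] == b[y - 1]:
                m[x][y] = m[x - 1][y - 1] + 1
                if m[x][y] > longest:
                    longest, x_longest = m[x][y], x
            else:
                m[x][y] = 0
    return a[x_longest - longest: x_longest]

def encode_words(sentences):
    # Map each distinct word to a unique character (Unicode private-use
    # area) so any character-level algorithm treats words atomically.
    # Sketch: assumes fewer distinct words than available codepoints.
    table = {}
    encoded = []
    for s in sentences:
        out = []
        for w in s.split():
            if w not in table:
                table[w] = chr(0xE000 + len(table))
            out.append(table[w])
        encoded.append("".join(out))
    decode = {c: w for w, c in table.items()}
    return encoded, decode

(e1, e2), decode = encode_words(["this is a foo bar sentence .",
                                 "what a kappa foo bar black sheep ?"])
match = longest_common_substring(e1, e2)
print(" ".join(decode[c] for c in match))  # foo bar
```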