Longest Common Substring without cutting a word- p

2019-05-12 23:45发布

Given the following, i can find the longest common substring:

s1 = "this is a foo bar sentence ."
s2 = "what the foo bar blah blah black sheep is doing ?"

def longest_common_substring(s1, s2):
  m = [[0] * (1 + len(s2)) for i in xrange(1 + len(s1))]
  longest, x_longest = 0, 0
  for x in xrange(1, 1 + len(s1)):
    for y in xrange(1, 1 + len(s2)):
      if s1[x - 1] == s2[y - 1]:
        m[x][y] = m[x - 1][y - 1] + 1
        if m[x][y] > longest:
          longest = m[x][y]
          x_longest = x
      else:
        m[x][y] = 0
  return s1[x_longest - longest: x_longest]

print longest_common_substring(s1, s2)

[out]:

foo bar

But how do i ensure that the longest common substring respect English word boundary and don't cut up a word? For example, the following sentences:

s1 = "this is a foo bar sentence ."
s2 = "what a kappa foo bar black sheep ?"
print longest_common_substring(s1, s2)

outputs the follow which is NOT desired since it breaks up the word kappa from s2:

a foo bar

The desired output is still:

foo bar

I've tried also an ngram way of getting the longest common substring respecting word boundary but is there other way that deals with strings without calculating ngrams? (see answer)

8条回答
迷人小祖宗
2楼-- · 2019-05-13 00:46

Just add an acceptance condition in your code:

s1 = "this is a foo bar sentence ."
s2 = "what the foo bar blah blah black sheep is doing ?"

def longest_common_substring(s1, s2):
  m = [[0] * (1 + len(s2)) for i in xrange(1 + len(s1))]
  longest, x_longest = 0, 0
  for x in xrange(1, 1 + len(s1)):
    for y in xrange(1, 1 + len(s2)):
      if s1[x - 1] == s2[y - 1]:
        m[x][y] = m[x - 1][y - 1] + 1
        if m[x][y] > longest and word_aligned(x, y, m[x][y]):  # acceptance condition
          longest = m[x][y]
          x_longest = x
      else:
        m[x][y] = 0
  return s1[x_longest - longest: x_longest]

def word_aligned(x, y, length):
    """check that a match starting at s1[x - 1] and s2[y - 1] is aligned on a word boundary"""
    # check start of match in s1
    if s1[x - 1].isspace():
        # match doesn't start with a character, reject
        return False
    if x - 2 > 0 and not s1[x - 2].isspace():
        # char before match is not start of line or space, reject
        return False
    # check start of match in s2
    ... same as above ...
    # check end of match in s1
    ... your code is a bit hard for me follow, what is end of match? ...
    # check end of match in s2
    ... same as above ...
    return True

print longest_common_substring(s1, s2)
查看更多
干净又极端
3楼-- · 2019-05-13 00:49

This is more of an interesting problem then I initially gave it credit for. When you think about it there are 4 possible outcomes.

  1. The trivial case, the whole string is matched no boundaries (your first example)
  2. Cross a word boundary at the beginning (your second example)
  3. Cross a word boundary at the end
  4. Have a word boundary at each end

Now your code takes care of the trivial case so we can leverage that; all that's left is to wrap your results in a few checks for the other cases. So what should these checks look like? Let's take your failure case:

string 1 = "this is a foo bar sentence ."
string 2 = "what a kappa foo bar black sheep ?"
output string = "a foo bar"

So from a string find perspective we can find all those letters in that order in both string1 and string2, but if we separate everything around spaces into lists, and look for the lists in order only string1 will match.

Now I'm primarily a C guy, so I want to write that in a function:

def full_string(str1, str2, chkstr):
  l1 = str1.split()
  l2 = str2.split()
  chkl = chkstr.split()
  return (any(l1[i:i+len(chkl)]==chkl for i in xrange(len(l1)-len(chkl)+1)) and
          any(l2[i:i+len(chkl)]==chkl for i in xrange(len(l2)-len(chkl)+1)))

With this function we can check if either of the two strings do not contain all of the words of our result from longest_common_substring(s1, s2) in order. Perfect. So the last step is to combine these two functions and check for each of the 4 cases listed above:

def longest_whole_substring(s1, s2):
  subs = longest_common_substring(s1, s2)
  if not full_string(s1, s2, subs):
    if full_string(s1, s2, ' '.join(subs.split()[1:])):
      subs = ' '.join(subs.split()[1:])
    elif full_string(s1, s2, ' '.join(subs.split()[:-1])):
      subs = ' '.join(subs.split()[:-1])
    else:
      subs = ' '.join(subs.split()[1:-1])
  return subs

Now the function longest_whole_substring(s1, s2) will provide the longest whole substring not chopping off any words. Let’s just test it out in each of the cases:

Trivial:

>>> a = 'this is a foo bar bar foo string'
>>> b = 'foo bar'
>>> 
>>> longest_whole_substring(a,b)
'foo bar'

Word boundary at the beginning:

>>> b = 's a foo bar'
>>> 
>>> longest_whole_substring(a,b)
'a foo bar '

Word boundry at the end:

>>> b = 'foo bar f'
>>> 
>>> longest_whole_substring(a,b)
'foo bar'

And a Word boundry at both ends:

>>> b = 's a foo bar f'
>>> 
>>> longest_whole_substring(a,b)
'a foo bar'

Lookin' Good!

查看更多
登录 后发表回答