As a novel approach to solving my challenge described here, I have put together the following:
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

diffs = [
    """- It contains a Title II provision that changes the age at which workers
compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA).""",
    """+ It contains a Title II provision that changes the age at which workers
compensation/public disability offset ends for disability beneficiaries from age 68 to full retirement age (FRA).""",
    """+ Here's a new paragraph I added for testing."""]

for s in diffs:
    others = [i for i in diffs if i != s]
    for j in others:
        if similar(s, j) > 0.7:
            print '"{}" and "{}" refer to the same sentence'.format(s, j)
            print
            diffs.remove(j)
        else:
            print '"{}" is a new sentence'.format(s)
The idea is to loop over the strings and compare each one with all the others. If a given string is deemed similar to another (a ratio above 0.7), the other string is removed from the list; otherwise, the given string is deemed to be unique in the list.
Here's the output:
"- It contains a Title II provision that changes the age at which workers
compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA)." and "+ It contains a Title II provision that changes the age at which workers
compensation/public disability offset ends for disability beneficiaries from age 68 to full retirement age (FRA)." refer to the same sentence
"- It contains a Title II provision that changes the age at which workers
compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA)." is a new sentence
"+ Here's a new paragraph I added for testing." is a new sentence
So it's correctly detecting that the first two sentences are similar, and that the last is unique. The problem is it's then going back and deeming the first sentence to be unique (which it isn't, and it should not be returning to this sentence anyway).
Where's the flaw in my looping logic? Can this be achieved without nested for loops and removal of elements?
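
For reference, here is a minimal sketch of the behaviour I'm after, written so that the list is never modified while it is being looped over. Tracing the loop above by hand suggests that the "is a new sentence" branch runs once per non-matching pair rather than once per string, so the sketch only reports a string as new after every candidate has been checked. The function name classify and the threshold parameter are hypothetical names introduced for illustration, the code assumes Python 3, and it still compares strings pairwise, so treat it as one possible direction rather than a definitive solution.

```python
from difflib import SequenceMatcher


def similar(a, b):
    """Return a similarity ratio between 0.0 and 1.0 for two strings."""
    return SequenceMatcher(None, a, b).ratio()


def classify(diffs, threshold=0.7):
    """Pair each string with its first similar later string.

    Returns (pairs, new): pairs is a list of (string, match) tuples and
    new is a list of strings with no match. The input list is not modified.
    """
    matched = set()  # indices already used up as another string's match
    pairs, new = [], []
    for i, s in enumerate(diffs):
        if i in matched:
            continue  # already reported as part of an earlier pair
        # Index of the first later, still unmatched string similar to s,
        # or None when no such string exists.
        partner = next(
            (j for j in range(i + 1, len(diffs))
             if j not in matched and similar(s, diffs[j]) > threshold),
            None,
        )
        if partner is None:
            new.append(s)  # decided only after all candidates were checked
        else:
            matched.add(partner)
            pairs.append((s, diffs[partner]))
    return pairs, new
```

Calling classify(diffs) on the three strings above would be expected to return one pair (the two Title II sentences) and one new string (the test paragraph), since skipping matched indices takes the place of diffs.remove(j).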