I downloaded this page and made a very minor edit to it, changing the first 65 in one paragraph to 68.
I then ran both versions through the following code to pull out the diffs.
import difflib
import urllib2
import bs4
import lxml.html as lh
url = 'https://secure.ssa.gov/apps10/reference.nsf/links/02092016062645AM'
response = urllib2.urlopen(url)
content = response.read() # read the whole response body as a single string
root = lh.fromstring(content)
section1 = root.xpath("//div[@class = 'column-12']")[0]
section1_text = section1.text_content()
url2 = 'file:///Users/Pyderman/repos/02092016062645AM-modified.html'
response2 = urllib2.urlopen(url2)
content2 = response2.read() # read the whole response body as a single string
root2 = lh.fromstring(content2)
section2 = root2.xpath("//div[@class = 'column-12']")[0]
section2_text = section2.text_content()
d = difflib.Differ()
soup = bs4.BeautifulSoup(unicode(section1_text))
soup2 = bs4.BeautifulSoup(unicode(section2_text))
from nltk import sent_tokenize
sentences = [sentence for string in soup.stripped_strings for sentence in sent_tokenize(string)]
sentences2 = [sentence for string in soup2.stripped_strings for sentence in sent_tokenize(string)]
diff = d.compare(sentences, sentences2)
changes = [change for change in diff if change.startswith('-') or change.startswith('+')]
for change in changes:
    print(change)
Printing the changes gives:
- It contains a Title II provision that changes the age at which workers compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA).
+ It contains a Title II provision that changes the age at which workers compensation/public disability offset ends for disability beneficiaries from age 68 to full retirement age (FRA).
So a minor change to an existing sentence shows up as a - line followed by a + line, and a brand new full sentence also gets marked with a +. As it stands then, unless my program does some additional processing, it will think that a new sentence was added and another one was removed.
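To make the ambiguity concrete, here is a minimal reproduction that skips the page download. The two sentence lists are made up for illustration (they are not taken from the page), and the filter is the same one used above.

```python
import difflib

# Hypothetical sentence lists standing in for the tokenized page text.
old_sentences = [
    "Intro sentence.",
    "The offset ends at age 65.",
    "Closing sentence.",
]
new_sentences = [
    "Intro sentence.",
    "The offset ends at age 68.",   # in-place edit of an existing sentence
    "Closing sentence.",
    "A brand new sentence.",        # genuinely new sentence
]

full_diff = list(difflib.Differ().compare(old_sentences, new_sentences))

# Same filter as in the question: keep only the '-' and '+' lines.
changes = [line for line in full_diff
           if line.startswith('-') or line.startswith('+')]

for change in changes:
    print(change)
```

After filtering, the edited sentence and the brand new sentence both arrive as + lines, so they cannot be told apart. Note that the unfiltered output also contains lines starting with ?, which Differ emits next to pairs of lines it considers similar; the filter above throws those hints away.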
How can we take advantage of the fact that what difflib sees as the apparently 'removed' sentence and the apparently 'added' sentence are very similar, in order to determine that we are in fact dealing with an in-place change to an existing sentence?
NOTE: The solution will need to be able to process potentially several changes in a single page, so it won't be sufficient to apply something like "if sentence1 is very similar to sentence2, then it's a modification", since there will be several diffs to compare and contrast.
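One direction that might work, sketched here under assumptions rather than offered as a definitive answer: walk the opcodes from difflib.SequenceMatcher over the two sentence lists, and inside each changed block pair every removed sentence with its most similar added sentence. The function name classify_changes, the 0.75 threshold, and the sample sentences are all hypothetical choices for illustration; the threshold in particular is not a tuned value.

```python
import difflib

def classify_changes(old_sentences, new_sentences, threshold=0.75):
    """Label each difference as 'modified', 'removed' or 'added'.

    threshold is an assumed cutoff on SequenceMatcher.ratio().
    """
    results = []
    matcher = difflib.SequenceMatcher(None, old_sentences, new_sentences,
                                      autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == 'equal':
            continue
        removed = old_sentences[i1:i2]
        added = list(new_sentences[j1:j2])   # copy, so pairs can be consumed
        for old in removed:
            # Greedily pick the most similar still-unpaired added sentence.
            best, best_ratio = None, threshold
            for new in added:
                ratio = difflib.SequenceMatcher(None, old, new).ratio()
                if ratio >= best_ratio:
                    best, best_ratio = new, ratio
            if best is None:
                results.append(('removed', old, None))
            else:
                added.remove(best)
                results.append(('modified', old, best))
        # Anything left unpaired in this block is treated as a new sentence.
        for new in added:
            results.append(('added', None, new))
    return results

# Hypothetical sentence lists with one edit, one removal and one addition.
old_sentences = [
    "Intro sentence.",
    "The offset ends at age 65.",
    "An outdated sentence that is dropped.",
    "Closing sentence.",
]
new_sentences = [
    "Intro sentence.",
    "The offset ends at age 68.",
    "Closing sentence.",
    "A brand new sentence.",
]

results = classify_changes(old_sentences, new_sentences)
for kind, old, new in results:
    print(kind, old, new)
```

Because the pairing happens inside each opcode block, several independent changes on the same page are handled separately instead of being compared against each other. The greedy pairing is a simplification: if two removed sentences both resemble the same added sentence, the first one wins.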