Code:
from pprint import pprint
from nltk.tokenize import sent_tokenize
from unidecode import unidecode
pprint(sent_tokenize(unidecode(text)))
Output:
['After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker.',
'Finally they pushed you out of the cold emergency room.',
'I failed to protect you.',
'"Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.',]
Input:
After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."
Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.
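The closing quote should be included in the previous sentence, i.e. the third sentence should end with you." instead of the fourth sentence starting with "Li.
The tokenizer fails at the ." boundary.
How can I fix this?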
Edit: Explaining the extraction of text.
html = open(path, "r").read() #reads html code
article = extractor.extract(raw_html=html) #extracts content
text = unidecode(article.cleaned_text) #changes encoding
Here, article.cleaned_text is in unicode. The idea behind using unidecode is to change characters such as “ to ".
Incorrect result with @alvas's solution:
['After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker.',
'Finally they pushed you out of the cold emergency room.',
'I failed to protect you.',
'"',
'Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.'
]
Edit2: (Updated) nltk and python version
python -c "import nltk; print nltk.__version__"
3.0.4
python -V
Python 2.7.9
I'm not sure what the desired output is, but I think you might need some paragraph segmentation before nltk.sent_tokenize.
Possibly, you might want the strings within the double quotes too. If so, you could try this:
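As a minimal sketch of both ideas, assuming paragraphs are separated by newlines and that only straight, balanced double quotes occur (which is what unidecode should leave behind), something like the following could work. The helper names split_paragraphs and extract_quotes are hypothetical, chosen for illustration.

```python
import re

# Text enclosed in straight double quotes (assumes the quotes are balanced).
QUOTED = re.compile(r'"([^"]*)"')

def split_paragraphs(text):
    """Split on newlines and drop empty paragraphs (hypothetical helper)."""
    return [pg.strip() for pg in text.split("\n") if pg.strip()]

def extract_quotes(text):
    """Return the contents of every double-quoted span (hypothetical helper)."""
    return QUOTED.findall(text)

# The input from the question, with a newline between the two paragraphs.
text = (
    'After Du died of suffocation, her boyfriend posted a heartbreaking '
    'message online: "Losing consciousness in my arms, your breath and '
    'heartbeat became weaker and weaker. Finally they pushed you out of '
    'the cold emergency room. I failed to protect you."\n'
    'Li Na, 23, a migrant worker from a farming family in Jiangxi province, '
    'was looking forward to getting married in 2015.'
)

paragraphs = split_paragraphs(text)  # one string per paragraph
quotes = extract_quotes(text)        # contents of each quoted span
```

Each paragraph can then be passed to sent_tokenize on its own, so that a closing quote at the end of one paragraph cannot be attached to the first sentence of the next.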
Or maybe you would need this:
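One possible reading of this, sketched under the assumption that the quotes are straight and balanced, is to cut each paragraph into quoted and unquoted segments before tokenizing them separately. The function split_on_quotes is a hypothetical name, not part of nltk.

```python
import re

def split_on_quotes(paragraph):
    """Split a paragraph into (is_quote, segment) pairs, keeping quotes intact."""
    # The capture group makes re.split keep the quoted spans in the result.
    parts = re.split(r'("[^"]*")', paragraph)
    return [(p.startswith('"'), p.strip()) for p in parts if p.strip()]

# First paragraph of the input from the question.
paragraph = (
    'After Du died of suffocation, her boyfriend posted a heartbreaking '
    'message online: "Losing consciousness in my arms, your breath and '
    'heartbeat became weaker and weaker. Finally they pushed you out of '
    'the cold emergency room. I failed to protect you."'
)

segments = split_on_quotes(paragraph)
for is_quote, segment in segments:
    print("QUOTE" if is_quote else "TEXT ", segment)
```

The quoted segment can then be tokenized by itself, which keeps its closing quote away from whatever text follows it.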
When reading from a file, try to use the io package.
And with the paragraph and quote extraction hacks:
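A minimal sketch of the file-reading step combined with paragraph splitting might look like this. It assumes a UTF-8 file with one paragraph per line; read_paragraphs is a hypothetical helper name.

```python
import io

def read_paragraphs(path, encoding="utf8"):
    """Read a text file as unicode and return its non-empty paragraphs."""
    # io.open decodes while reading and works the same on Python 2 and 3.
    with io.open(path, "r", encoding=encoding) as fin:
        return [line.strip() for line in fin if line.strip()]
```

Each returned paragraph is already unicode, so it can go through unidecode and then sent_tokenize without an extra decoding step.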
For the magic to concatenate the pre-quote sentence with the quotes (don't blink, it looks quite the same as above):
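As an alternative illustration (a post-processing sketch, not necessarily the approach described here), the closing quote can be re-attached after tokenization. The function merge_dangling_quotes is a hypothetical helper that works on the list returned by sent_tokenize and assumes straight double quotes only.

```python
def merge_dangling_quotes(sentences, quote='"'):
    """Move a closing quote back onto the sentence it belongs to."""
    merged = []
    open_quote = False  # True while we are inside an unclosed quotation
    for sent in sentences:
        if open_quote and merged and sent.startswith(quote):
            # The leading quote closes the quotation begun in an earlier sentence.
            merged[-1] += quote
            sent = sent[len(quote):].lstrip()
            open_quote = False
        if sent:
            merged.append(sent)
            # An odd number of quotes toggles the open/closed state.
            if sent.count(quote) % 2 == 1:
                open_quote = not open_quote
    return merged

# Tokenizer output as reported in the question.
tokenized = [
    'After Du died of suffocation, her boyfriend posted a heartbreaking '
    'message online: "Losing consciousness in my arms, your breath and '
    'heartbeat became weaker and weaker.',
    'Finally they pushed you out of the cold emergency room.',
    'I failed to protect you.',
    '"Li Na, 23, a migrant worker from a farming family in Jiangxi '
    'province, was looking forward to getting married in 2015.',
]

fixed = merge_dangling_quotes(tokenized)
```

The same function also handles the case where the tokenizer emits the closing quote as a sentence of its own, since the leftover empty string is dropped. It relies on counting quotes, so it will misbehave on text with unbalanced or nested double quotes.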
The problem with the above code is that it is limited to sentences like:
And cannot handle:
Just to make sure, my python/nltk versions are:
Beyond the computational aspect of the text processing, there's something subtly different about the grammar of the text in the question.
The fact that a quote is introduced by a colon (:) is atypical of traditional English grammar. This might have been popularized in Chinese news because in Chinese:
In traditional English, in a very prescriptive grammatical sense, it would have been:
And a post-quotation statement would have been signalled by an ending comma instead of a full stop, e.g.: