I am using nltk to split a text into sentences. However, I need any sentence that contains a quotation to be extracted as a single unit. Right now each sentence, even one inside a quotation, is extracted as a separate part.
This is an example of something that I am trying to extract as a single unit:
"This is a sentence. This is also a sentence," said the cat.
Right now I have this code:
import nltk.data
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
text = 'This is a sentence. This is also a sentence," said the cat.'
print('\n-----\n'.join(tokenizer.tokenize(text, realign_boundaries=True)))
This works pretty well, but I want a sentence that contains a quotation to stay intact even when the quotation itself spans multiple sentences.
The code above produces:
This is a sentence.
-----
This is also a sentence," said the cat.
I am trying to get that whole text extracted as a single unit:
"This is a sentence. This is also a sentence," said the cat.
Is there an easy way to do this with nltk, or should I use a regex instead? I was impressed with how easy it was to get started with nltk, but I am stuck now.
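For reference, here is a minimal sketch of one possible direction: post-process the tokenizer output and merge consecutive sentences until the double quotes balance. This is not an nltk feature. The helper name `merge_quoted_sentences` is hypothetical, and the sketch assumes straight double quotes (`"`) that are never nested or left unclosed. The input list is hard-coded to the kind of split the tokenizer might produce, so the snippet runs without nltk data.

```python
def merge_quoted_sentences(sentences, quote='"'):
    """Join consecutive sentences until the quote marks are balanced.

    Hypothetical helper: assumes straight double quotes and no nesting.
    """
    merged = []
    buffer = []
    for sentence in sentences:
        buffer.append(sentence)
        candidate = ' '.join(buffer)
        # An odd number of quote marks means a quotation is still open,
        # so keep accumulating sentences.
        if candidate.count(quote) % 2 == 0:
            merged.append(candidate)
            buffer = []
    if buffer:
        # Unclosed quotation at the end of the text: keep the rest as one unit.
        merged.append(' '.join(buffer))
    return merged


# Sentences as a sentence tokenizer might return them (assumed, hard-coded).
sentences = [
    '"This is a sentence.',
    'This is also a sentence," said the cat.',
    'The dog left.',
]
units = merge_quoted_sentences(sentences)
print('\n-----\n'.join(units))
```

In practice the list passed in would be the result of `tokenizer.tokenize(text)`. Note that merged pieces are rejoined with a single space, so the original whitespace between sentences is not preserved, and curly quotes or apostrophe-style single quotes would need separate handling.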