Is there an algorithm that can be used to extract simple sentences from paragraphs?
My ultimate goal is to later run another algorithm on the resulting simple sentences to determine the author's sentiment.
I've researched this from sources such as Chae-Deug Park, but none discuss preparing simple sentences as training data.
Thanks in advance
I have just used OpenNLP for the same purpose.
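import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.List;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

public static List<String> breakIntoSentencesOpenNlp(String paragraph) throws IOException {
    // try-with-resources closes the model stream even if loading fails
    try (InputStream is = new FileInputStream("resources/models/en-sent.bin")) {
        SentenceModel model = new SentenceModel(is);
        SentenceDetectorME sdetector = new SentenceDetectorME(model);
        return Arrays.asList(sdetector.sentDetect(paragraph));
    }
}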
Example
//Failed at Hi.
paragraph = "Hi. How are you? This is Mike.";
SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(sentence -> System.out.println(sentence));
//Failed at Door.Noone
paragraph = "Close the Door.Noone is out there";
SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(sentence -> System.out.println(sentence));//not able to break on noone
paragraph = "Really!! I cant believe. Mr. Wilson can come any moment to receive mrs. watson.";
SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(sentence -> System.out.println(sentence));
//Failed at dr.
paragraph = "Radhika, Mohan, and Shaik went to meet dr. Kashyap to raise fund for poor patients.";
SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(sentence -> System.out.println(sentence));//breaking on dr.
paragraph = "This is how I tried to split a paragraph into a sentence. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.S. and numbers like 2.2. They all got splitted by the above code.";
SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(sentence -> System.out.println(sentence));
paragraph = "www.thinkzarahatke.com is the second site I developed. You can send mail to admin@thinkzarahatke.com";
SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(sentence -> System.out.println(sentence));
It fails only where the input contains a human mistake. E.g. the abbreviation "Dr." should have a capital D, and at least one space is expected between two sentences.
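One way to reduce such failures is to repair these two mistakes before handing the text to the sentence detector. The following is only a minimal sketch under stated assumptions: the class `ParagraphNormalizer`, its method `normalize`, and the short list of titles are hypothetical and not part of OpenNLP, and the rules are deliberately narrow so that tokens like `U.S.`, `2.2`, `Jan.13` and `www.thinkzarahatke.com` are left alone.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of OpenNLP): repairs two common typing
// mistakes before the paragraph is passed to a sentence detector.
final class ParagraphNormalizer {
    // A period squeezed between a lowercase and an uppercase letter,
    // e.g. "Door.Noone". "U.S." and "2.2" do not match this shape.
    private static final Pattern MISSING_SPACE =
            Pattern.compile("(?<=[a-z])\\.(?=[A-Z])");

    // Lowercase titles followed by a period. The list is illustrative only.
    private static final Pattern LOWERCASE_TITLE =
            Pattern.compile("\\b(dr|mr|mrs|ms)\\.");

    static String normalize(String paragraph) {
        // Step 1: insert the missing space after the period.
        String spaced = MISSING_SPACE.matcher(paragraph).replaceAll(". ");

        // Step 2: capitalize the first letter of each lowercase title.
        Matcher m = LOWERCASE_TITLE.matcher(spaced);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String title = m.group(1);
            String fixed = Character.toUpperCase(title.charAt(0)) + title.substring(1) + ".";
            m.appendReplacement(out, fixed);
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

The normalized string can then be passed to `breakIntoSentencesOpenNlp` in place of the raw paragraph. Whether this actually improves the detector's output on your data is something to verify, since the rules above only cover the two mistakes named here.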
You can also split the paragraph with a regular expression, in the following way:
public static List<String> breakIntoSentencesCustomRESplitter(String paragraph){
    List<String> sentences = new ArrayList<String>();
    Pattern re = Pattern.compile("[^.!?\\s][^.!?]*(?:[.!?](?!['\"]?\\s|$)[^.!?]*)*[.!?]?['\"]?(?=\\s|$)", Pattern.MULTILINE | Pattern.COMMENTS);
    Matcher reMatcher = re.matcher(paragraph);
    while (reMatcher.find()) {
        sentences.add(reMatcher.group());
    }
    return sentences;
}
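The pattern in the method above is dense, so here is the same expression split into commented pieces. This is only a sketch for reading purposes: the class name `CommentedSentenceRegex` and the method `split` are made up for illustration. It relies on `Pattern.COMMENTS`, which makes the regex engine ignore whitespace and everything from `#` to the end of the line inside the pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative class name; the pattern itself is the one from the answer.
final class CommentedSentenceRegex {
    static final Pattern SENTENCE = Pattern.compile(
          "[^.!?\\s]          # first character: neither a terminator nor whitespace\n"
        + "[^.!?]*            # a run of non-terminator characters\n"
        + "(?:                # then, zero or more times:\n"
        + "  [.!?]            #   a terminator that is\n"
        + "  (?!['\"]?\\s|$)  #   NOT followed by (optional quote + whitespace) or end of line\n"
        + "  [^.!?]*          #   plus the text after that inner terminator\n"
        + ")*\n"
        + "[.!?]?             # the optional terminator that closes the sentence\n"
        + "['\"]?             # an optional closing quote\n"
        + "(?=\\s|$)          # the match must stop before whitespace or end of line\n",
        Pattern.MULTILINE | Pattern.COMMENTS);

    static List<String> split(String paragraph) {
        List<String> sentences = new ArrayList<>();
        Matcher matcher = SENTENCE.matcher(paragraph);
        while (matcher.find()) {
            sentences.add(matcher.group());
        }
        return sentences;
    }
}
```

Read this way, it is easier to see why a period only ends a sentence when whitespace or the end of the line follows it: in `Door.Noone` the period is followed by a letter, so it is treated as an inner terminator and the text stays in one piece.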
Example
paragraph = "Hi. How are you? This is Mike.";
SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(sentence -> System.out.println(sentence));
//Failed at Door.Noone
paragraph = "Close the Door.Noone is out there";
SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(sentence -> System.out.println(sentence));
//Failed at Mr., mrs.
paragraph = "Really!! I cant believe. Mr. Wilson can come any moment to receive mrs. watson.";
SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(sentence -> System.out.println(sentence));
//Failed at dr.
paragraph = "Radhika, Mohan, and Shaik went to meet dr. Kashyap to raise fund for poor patients.";
SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(sentence -> System.out.println(sentence));
//Failed at U.S.
paragraph = "This is how I tried to split a paragraph into a sentence. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.S. and numbers like 2.2. They all got splitted by the above code.";
SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(sentence -> System.out.println(sentence));
paragraph = "www.thinkzarahatke.com is the second site I developed. You can send mail to admin@thinkzarahatke.com";
SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(sentence -> System.out.println(sentence));
But the error rate is comparatively high. Another way is to use BreakIterator:
public static List<String> breakIntoSentencesBreakIterator(String paragraph){
    List<String> sentences = new ArrayList<String>();
    BreakIterator sentenceIterator = BreakIterator.getSentenceInstance(Locale.ENGLISH);
    sentenceIterator.setText(paragraph);
    // Walk the boundaries front to back so sentences keep their original order
    int start = sentenceIterator.first();
    for (int end = sentenceIterator.next();
            end != BreakIterator.DONE;
            start = end, end = sentenceIterator.next()) {
        sentences.add(paragraph.substring(start, end));
    }
    return sentences;
}
Example:
paragraph = "Hi. How are you? This is Mike.";
SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(sentence -> System.out.println(sentence));
//Failed at Door.Noone
paragraph = "Close the Door.Noone is out there";
SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(sentence -> System.out.println(sentence));
//Failed at Mr.
paragraph = "Really!! I cant believe. Mr. Wilson can come any moment to receive mrs. watson.";
SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(sentence -> System.out.println(sentence));
//Failed at dr.
paragraph = "Radhika, Mohan, and Shaik went to meet dr. Kashyap to raise fund for poor patients.";
SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(sentence -> System.out.println(sentence));
paragraph = "This is how I tried to split a paragraph into a sentence. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.S. and numbers like 2.2. They all got splitted by the above code.";
SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(sentence -> System.out.println(sentence));
paragraph = "www.thinkzarahatke.com is the second site I developed. You can send mail to admin@thinkzarahatke.com";
SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(sentence -> System.out.println(sentence));
Benchmarking:
- custom RE : 7 ms
- BreakIterator : 143 ms
- openNlp : 255 ms
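How the figures above were measured is not stated, so the following is only one possible way to time the three methods, not the harness behind those numbers. The class `SplitterTimer` and the method `averageMillis` are hypothetical names. The sketch runs a warm-up pass first, because a single cold call in Java mostly measures class loading and JIT compilation (and, for OpenNLP, reading the model file).

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical timing helper: averages wall-clock time over many calls.
final class SplitterTimer {
    /** Returns the average time per call in milliseconds. */
    static double averageMillis(Function<String, List<String>> splitter,
                                String paragraph, int iterations) {
        // Warm-up pass so class loading and JIT compilation are not counted.
        for (int i = 0; i < iterations; i++) {
            splitter.apply(paragraph);
        }
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            splitter.apply(paragraph);
        }
        long elapsedNanos = System.nanoTime() - start;
        return elapsedNanos / 1_000_000.0 / iterations;
    }
}
```

For example, `SplitterTimer.averageMillis(SentenceDetector::breakIntoSentencesBreakIterator, paragraph, 1000)` would time the BreakIterator version. The OpenNLP method declares a checked exception, so it would need a small wrapper lambda that catches it before it can be passed in.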
Take a look at Apache OpenNLP; it has a Sentence Detector module. The documentation has examples of how to use it from the command line and from the API.