I'm trying to break up a paragraph into sentences. Here is my code so far:
import java.util.*;

public class StringSplit {
    public static void main(String[] args) throws Exception {
        String testString = "The outcome of the negotiations is vital, because the current tax levels signed into law by President George W. Bush expire on Dec. 31. Unless Congress acts, tax rates on virtually all Americans who pay income taxes will rise on Jan. 1. That could affect economic growth and even holiday sales.";
        String[] sentences = testString.split("[\\.\\!\\?]");
        for (int i = 0; i < sentences.length; i++) {
            System.out.println(i);
            System.out.println(sentences[i]);
        }
    }
}
Two problems were found:
- The code splits anytime it comes to a period (".") symbol, even when it's actually one sentence. How do I prevent this?
- Each sentence that is split starts with a space. How do I delete the redundant space?
You can try the
java.text.BreakIterator
class for parsing sentences.
Trim each resulting sentence to remove the leading space.
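A minimal sketch of the BreakIterator suggestion, combined with trimming. The class name SentenceBreaker and the method split are made up for illustration, and Locale.US is an assumption about the input language. BreakIterator uses locale rules rather than a dictionary of abbreviations, so it may still break after things like "W." or "Dec."; check its output on your own text.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, not part of the original question.
final class SentenceBreaker {

    // Returns the trimmed pieces of text between the sentence
    // boundaries reported by BreakIterator.
    static List<String> split(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);

        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // trim() drops the whitespace that separates sentences
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        for (String s : split("Hello world. How are you? Fine!")) {
            System.out.println(s);
        }
    }
}
```

Unlike the regex split, this keeps the closing punctuation mark with each sentence.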
You can use the
SentenceSplitter
class provided by this open source library.
The first one is a pretty hard problem to do properly, since you'd have to implement sentence detection. I suggest you don't do that, and just separate sentences with two blank lines after a punctuation mark.
The second one can be solved using String.trim().
Example:
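A small sketch of the String.trim() fix applied to the split from the question. The class name TrimSentences and the method splitAndTrim are illustrative, not from the original post. It only removes the leading space; it does not solve the abbreviation problem.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the original question.
final class TrimSentences {

    // Splits on '.', '!' or '?' as in the question, then trims each piece
    // and skips pieces that are empty after trimming.
    static List<String> splitAndTrim(String text) {
        List<String> result = new ArrayList<>();
        for (String piece : text.split("[.!?]")) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String s : splitAndTrim("One. Two? Three!")) {
            System.out.println("[" + s + "]");
        }
    }
}
```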
First trim() your string, then see these links:
http://www.java-examples.com/java-string-split-example and http://www.rgagnon.com/javadetails/java-0438.html
You can also use the StringBuffer class. I hope this helps.
Given the current input format, it will be difficult to split into sentences. You have to impose an additional rule, beyond the period, to identify the end of a sentence. For instance, this rule could be "a sentence should end with a period (.) and two spaces". (This is how the UNIX tool
vi
identifies sentences.)
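A rough sketch of the "punctuation mark plus two spaces" rule, assuming the input has been written that way. The class name TwoSpaceSplitter is hypothetical. The regex uses a lookbehind so the punctuation mark stays with its sentence, and a single space after "W." or "Dec." does not trigger a split.

```java
// Hypothetical helper class, not part of the original question.
final class TwoSpaceSplitter {

    // Splits where '.', '!' or '?' is followed by two or more spaces.
    // The lookbehind (?<=[.!?]) matches the position after the mark
    // without consuming it, so only the spaces are removed.
    static String[] split(String text) {
        return text.trim().split("(?<=[.!?]) {2,}");
    }

    public static void main(String[] args) {
        String text = "Signed by George W. Bush on Dec. 31.  Rates rise on Jan. 1.  That hurts sales.";
        for (String s : split(text)) {
            System.out.println(s);
        }
    }
}
```

This only works if whoever produces the text follows the two-space convention consistently.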