String x=" i am going to the party at 6.00 in the evening. are you coming with me?";
Given the string above, I need to break it into sentences at sentence-boundary punctuation (such as . and ?),
but it should not split at 6.00 just because there is a period there. Is there a way to identify the correct sentence boundaries in Java? I have tried StringTokenizer from the java.util package, but it breaks the sentence wherever it finds a period. Can someone suggest a way to do this correctly?
This is the method I tried for tokenizing text into sentences:
public static ArrayList<String> sentence_segmenter(String text) {
    ArrayList<String> Sentences = new ArrayList<String>();
    StringTokenizer st = new StringTokenizer(text, ".?!");
    while (st.hasMoreTokens()) {
        Sentences.add(st.nextToken());
    }
    return Sentences;
}
I also have a method to segment sentences into phrases, but it splits the text whenever it finds a comma (,). I do not want it to split when the comma sits inside a number such as 60,000. This is the method I am using to segment phrases:
public static ArrayList<String> phrasesSegmenter(String text) {
    ArrayList<String> phrases = new ArrayList<String>();
    StringTokenizer st = new StringTokenizer(text, ",");
    while (st.hasMoreTokens()) {
        phrases.add(st.nextToken());
    }
    return phrases;
}
From the documentation of StringTokenizer:
StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.
If you use split, you can use any regular expression to split the text into sentences. You probably want something like any of ?, ! or . followed by either whitespace or the end of the text:
text.split("[?!.]($|\\s)")
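As a rough sketch of the same idea, the variant below uses a lookbehind so the punctuation stays attached to each sentence, and applies a similar trick to commas so that numbers like 60,000 stay in one piece. The class and method names (`RegexSegmenter`, `splitSentences`, `splitPhrases`) are made up for illustration, and the patterns are assumptions about what counts as a boundary, not a complete solution (abbreviations such as "Dr." would still be split).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only.
class RegexSegmenter {

    // Split at whitespace that directly follows '.', '?' or '!'.
    // The lookbehind (?<=...) does not consume the punctuation,
    // and "6.00" is left alone because no whitespace follows its '.'.
    static List<String> splitSentences(String text) {
        List<String> sentences = new ArrayList<>();
        for (String part : text.trim().split("(?<=[.?!])\\s+")) {
            if (!part.isEmpty()) {
                sentences.add(part);
            }
        }
        return sentences;
    }

    // Split at a comma unless it has a digit on both sides,
    // so "60,000" is kept together.
    static List<String> splitPhrases(String text) {
        List<String> phrases = new ArrayList<>();
        for (String part : text.split("(?<!\\d),|,(?!\\d)")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                phrases.add(trimmed);
            }
        }
        return phrases;
    }

    public static void main(String[] args) {
        String x = " i am going to the party at 6.00 in the evening. are you coming with me?";
        System.out.println(splitSentences(x));
        System.out.println(splitPhrases("it costs 60,000 rupees, which is a lot"));
    }
}
```

For the string from the question, `splitSentences` should return two sentences, with "6.00" intact inside the first one.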
Here is my solution to the problem.
/**
 * Tries to decide whether there is a sentence end at index i of the given text.
 *
 * @param text the text to examine
 * @param i    index of the character to test
 * @return true if the character at i looks like the end of a sentence
 */
public static boolean isSentenceEnd(String text, int i) {
    char c = text.charAt(i);
    return isSentenceEndChar(c) && !isPeriodWord(text, i);
}
// Abbreviations whose trailing period does not end a sentence.
private static String[] periodWords = { "Mr.", "Mrs.", "Ms.", "Prof.", "Dr.", "Gen.", "Rep.", "Sen.", "St.",
        "Sr.", "Jr.", "Ph.", "Ph.D.", "M.D.", "B.A.", "M.A.", "D.D.", "D.D.S.",
        "B.C.", "b.c.", "a.m.", "A.M.", "p.m.", "P.M.", "A.D.", "a.d.", "B.C.E.", "C.E.",
        "i.e.", "etc.", "e.g.", "al."};

/**
 * PeriodWords are words such as 'Dr.' or 'Mr.'
 *
 * @param text the text to examine
 * @param i    index of the period '.' character
 * @return true if the period at i belongs to a period word or a number
 */
private static boolean isPeriodWord(String text, int i) {
    if (i < 4) return true;
    if (text.charAt(i - 2) == ' ') return true; // one-char words are definitely period words
    // Include the character at i, otherwise the endsWith and regex checks never see the period.
    String txt = text.substring(0, i + 1);
    for (String pword : periodWords) {
        if (txt.endsWith(pword)) return true;
    }
    if (txt.matches("^.*\\d\\.$")) return true; // dates separated with "." or numbers with a fraction
    return false;
}
private static final char[] sentenceEndChars = {'.', '?', '!'};
private static boolean isSentenceEndChar(char c) {
    for (char sec : sentenceEndChars) {
        if (c == sec) return true;
    }
    return false;
}
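The methods above only answer whether a given index is a sentence end; they do not produce the list of sentences. One way to turn such a check into a splitter might look like the sketch below. Everything here is hypothetical: `BoundarySplitter`, `split` and `simpleEnd` are names invented for this example, and `simpleEnd` is only a small stand-in so the block runs on its own. In real use you would pass a reference to `isSentenceEnd` instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

// Hypothetical driver, for illustration only.
class BoundarySplitter {

    // Walks the text once and cuts after every index the predicate accepts.
    static List<String> split(String text, BiPredicate<String, Integer> isEnd) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            if (isEnd.test(text, i)) {
                String sentence = text.substring(start, i + 1).trim();
                if (!sentence.isEmpty()) {
                    sentences.add(sentence);
                }
                start = i + 1;
            }
        }
        // Keep any trailing text that has no closing punctuation.
        String rest = text.substring(start).trim();
        if (!rest.isEmpty()) {
            sentences.add(rest);
        }
        return sentences;
    }

    // Stand-in predicate (an assumption, not the check from the answer):
    // punctuation counts as a sentence end only when followed by
    // whitespace or the end of the text.
    static boolean simpleEnd(String text, int i) {
        char c = text.charAt(i);
        boolean punctuation = c == '.' || c == '?' || c == '!';
        boolean boundary = i + 1 == text.length()
                || Character.isWhitespace(text.charAt(i + 1));
        return punctuation && boundary;
    }

    public static void main(String[] args) {
        String x = " i am going to the party at 6.00 in the evening. are you coming with me?";
        System.out.println(split(x, BoundarySplitter::simpleEnd));
    }
}
```

Passing the boundary check as a predicate keeps the loop independent of the heuristics, so the abbreviation list can be tuned without touching the splitting code.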