Consider this paragraph of text:
Conservation groups call the 20-year ban a crucial protection for an American icon. The mining industry and some Republican members of Congress say it is detrimental to Arizona's economy and the nation's energy independence. "Despite significant pressure from the mining industry, the president and Secretary Salazar did not back down," said Jane Danowitz, U.S. public lands director for the Pew Environment Group.
In the above, it's easy to split sentences on the period (.), but that gives incorrect results when it reaches the period in an abbreviation such as U.S. or U.S.A. Assume I have a list of abbreviations such as:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// sx is the String holding the paragraph above
String[] abbrev = {"u.s.a", "u.a.e", "u.k", "p.r.c", "u.s.s.r"};
String regex = "\\.";
Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(sx);
int beginIndex = 0;
// Check all occurrences
while (matcher.find()) {
    System.out.print("Start index: " + matcher.start());
    System.out.print(" End index: " + matcher.end() + " ");
    String group = matcher.group();
    System.out.println("group: " + group);
    int dotIndex = group.indexOf(".");
    // Keep the period with its sentence, then resume just after it
    String sub = sx.substring(beginIndex, matcher.start() + dotIndex + 1);
    beginIndex = matcher.start() + dotIndex + 1;
    System.out.println(sub);
}
I could do a brute-force match against all the abbreviations around dotIndex. Is there a better approach?
This problem cannot be solved by relying on regular expressions. Deciding whether a sentence ends at any given period is not simple. An abbreviation may or may not fall at the end of a sentence. Ellipses may be written as three periods (or, in some circumstances, four, depending on the prevailing style). A sentence sometimes ends with a closing quotation mark that follows the period marking the end of the sentence (again depending on prevailing style).
You can use heuristics to get the answer right most of the time. But it's more of a statistical problem than a regex problem.
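As a rough illustration of the heuristic route, here is a minimal sketch that walks the text and treats a period as a sentence end unless it sits inside a token (as in "U.S." or "3.14"), is part of a run of periods, or closes a known abbreviation. The class name HeuristicSentenceSplitter and the abbreviation set are assumptions made for this example (the entry "u.s" was added because the sample paragraph uses it), and Set.of needs Java 9 or later. It will still get some cases wrong, for example a sentence that ends with an abbreviation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class HeuristicSentenceSplitter {
    // Hypothetical list: lowercase, without the trailing period.
    private static final Set<String> ABBREVIATIONS =
            Set.of("u.s", "u.s.a", "u.a.e", "u.k", "p.r.c", "u.s.s.r");

    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) != '.') {
                continue;
            }
            // Skip periods inside a token ("U.S", "3.14") or inside an ellipsis.
            if (i + 1 < text.length()) {
                char next = text.charAt(i + 1);
                if (Character.isLetterOrDigit(next) || next == '.') {
                    continue;
                }
            }
            // Collect the run of letters and periods that ends at this period.
            int tokenStart = i;
            while (tokenStart > start) {
                char c = text.charAt(tokenStart - 1);
                if (!Character.isLetter(c) && c != '.') {
                    break;
                }
                tokenStart--;
            }
            String token = text.substring(tokenStart, i).toLowerCase(Locale.ROOT);
            if (ABBREVIATIONS.contains(token)) {
                continue; // the period closes an abbreviation, not a sentence
            }
            String sentence = text.substring(start, i + 1).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
            start = i + 1;
        }
        // Whatever follows the last accepted period is kept as a final piece.
        String tail = text.substring(start).trim();
        if (!tail.isEmpty()) {
            sentences.add(tail);
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "The ban is a crucial protection. It was backed by Jane Danowitz, "
                + "U.S. public lands director for the Pew Environment Group. Done.";
        for (String sentence : split(text)) {
            System.out.println(sentence);
        }
    }
}
```

A hash-set lookup on the token before each period avoids comparing every abbreviation at every period, which is the brute-force approach the question wanted to get away from.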
My best guess would be something like:
(?<!\.[a-zA-Z])\.(?![a-zA-Z]\.)
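which would translate to: match a period that is not preceded by a period followed by a letter, and is not followed by a letter and then a period. Then you can perform the replace from there.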
This would require a lot more effort if you needed to catch periods inside quotation marks, though, which the above pattern does not account for.
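For reference, here is a minimal sketch of how the pattern above might be used from Java to cut the text after each matching period. The class name RegexSentenceSplitter is made up for this example. The pattern only protects dotted abbreviations made of single letters; it would still split inside a decimal such as 3.14, and it never splits when a sentence ends with an abbreviation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexSentenceSplitter {
    // A period not preceded by ".<letter>" and not followed by "<letter>.".
    private static final Pattern SENTENCE_END =
            Pattern.compile("(?<!\\.[a-zA-Z])\\.(?![a-zA-Z]\\.)");

    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        Matcher matcher = SENTENCE_END.matcher(text);
        int start = 0;
        while (matcher.find()) {
            // Keep the matched period with the sentence it ends.
            String sentence = text.substring(start, matcher.end()).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
            start = matcher.end();
        }
        // Text after the last matched period, if any.
        String tail = text.substring(start).trim();
        if (!tail.isEmpty()) {
            sentences.add(tail);
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "He moved to the U.S.A. last year. She stayed in the U.K. for now.";
        for (String sentence : split(text)) {
            System.out.println(sentence);
        }
    }
}
```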