How to split a paragraph into sentences

Posted 2019-07-07 10:43

Consider this paragraph of text:

Conservation groups call the 20-year ban a crucial protection for an American icon. The mining industry and some Republican members of Congress say it is detrimental to Arizona's economy and the nation's energy independence."Despite significant pressure from the mining industry, the president and Secretary Salazar did not back down," said Jane Danowitz, U.S. public lands director for the Pew Environment Group.

In the above, it's easy to split sentences on the period (.), but that leads to incorrect results when it reaches the periods in an abbreviation such as U.S.A. Assume I have a list of abbreviations such as

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// sx holds the paragraph shown above
String abbrev[] = {"u.s.a", "u.a.e", "u.k", "p.r.c", "u.s.s.r"};
String regex = "\\.";
Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(sx);
int beginIndex = 0;

// Check all occurrences of a period
while (matcher.find()) {
    System.out.print("Start index: " + matcher.start());
    System.out.print(" End index: " + matcher.end() + " ");

    String group = matcher.group();
    System.out.println("group: " + group);

    // The pattern matches a single ".", so the candidate sentence ends at matcher.end()
    String sub = sx.substring(beginIndex, matcher.end());
    // Start the next candidate sentence just after this period
    beginIndex = matcher.end();

    System.out.println(sub);
}

I could do a brute-force match against all the abbreviations around each period. Is there a better approach?

2 Answers

#2 · 2019-07-07 11:25

This problem cannot be solved by relying on regular expressions. Deciding whether a sentence ends at a given period is not simple. Abbreviations may or may not be the end of a sentence. Ellipses may be written as three periods (or, in some circumstances, four, depending on the prevailing style). Sentences sometimes end at a closing quotation mark that follows the sentence-ending period (again depending on the prevailing style).

You can use heuristics to get the answer right most of the time. But it's more of a statistical problem than a regex problem.
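As a rough illustration of what such heuristics might look like, here is a minimal sketch in Java. It is not a definitive implementation: the class name HeuristicSentenceSplitter, the abbreviation set, and the two rules it applies are all assumptions made up for this example. It treats a period as a sentence end only if it is followed by whitespace, a double quote, or the end of the text, and only if the token before it is not a known abbreviation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class HeuristicSentenceSplitter {
    // Hypothetical abbreviation list, stored lowercase and without the trailing period.
    private static final Set<String> ABBREVIATIONS =
            Set.of("u.s", "u.s.a", "u.a.e", "u.k", "p.r.c", "u.s.s.r");

    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        int begin = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) != '.') {
                continue;
            }
            // Rule 1 (assumed): a period inside a token, such as the first
            // period in "U.S.", is not a sentence boundary.
            boolean atEnd = i + 1 == text.length();
            if (!atEnd) {
                char next = text.charAt(i + 1);
                if (!Character.isWhitespace(next) && next != '"') {
                    continue;
                }
            }
            // Rule 2 (assumed): a known abbreviation right before the period
            // is not a sentence boundary.
            int tokenStart = i;
            while (tokenStart > begin
                    && !Character.isWhitespace(text.charAt(tokenStart - 1))) {
                tokenStart--;
            }
            String token = text.substring(tokenStart, i).toLowerCase(Locale.ROOT);
            if (ABBREVIATIONS.contains(token)) {
                continue;
            }
            sentences.add(text.substring(begin, i + 1).trim());
            begin = i + 1;
        }
        // Keep any trailing text that does not end with a period.
        String tail = text.substring(begin).trim();
        if (!tail.isEmpty()) {
            sentences.add(tail);
        }
        return sentences;
    }
}
```

The sketch also shows why this is only right most of the time: for "He moved to the U.K. It rained." it returns a single sentence, because it cannot tell that the abbreviation also ends the sentence. It likewise leaves a closing quotation mark that follows a period at the start of the next sentence.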

#3 · 2019-07-07 11:32

My best guess would be something like: (?<!\.[a-zA-Z])\.(?![a-zA-Z]\.) which would translate to:

(?<!\.[a-zA-Z])    # can't be preceded by a period followed by a single letter
\.
(?![a-zA-Z]\.)     # nor can it be followed by a letter and another period

Then you can perform the replace from there.


This would require a lot more effort if you needed to catch periods within quotes, though, which the above pattern does not account for.
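One way the pattern above might be applied in Java is sketched below. The class name RegexSentenceSplitter and the sample text are assumptions for illustration. Instead of a replace, the sketch walks the matches with Matcher.find() and cuts the text after each period the pattern accepts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexSentenceSplitter {
    // The pattern from the answer: a period that is neither preceded by
    // ".<letter>" nor followed by "<letter>.".
    private static final Pattern SENTENCE_END =
            Pattern.compile("(?<!\\.[a-zA-Z])\\.(?![a-zA-Z]\\.)");

    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        Matcher matcher = SENTENCE_END.matcher(text);
        int begin = 0;
        while (matcher.find()) {
            // Include the matched period in the sentence.
            String sentence = text.substring(begin, matcher.end()).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
            begin = matcher.end();
        }
        // Keep any trailing text that does not end with a period.
        String tail = text.substring(begin).trim();
        if (!tail.isEmpty()) {
            sentences.add(tail);
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Sample text made up for this example.
        String text = "The ban is crucial. Jane Danowitz is U.S. public lands director. Done.";
        for (String sentence : split(text)) {
            System.out.println(sentence);
        }
    }
}
```

Note that the pattern never treats the final period of a dotted abbreviation as a sentence end, so a sentence that ends with "U.S.A." is merged with the one that follows it.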
