RegEx to exclude academic title

Posted 2020-07-17 15:36

I want to split a paragraph string into an array of sentences. Naturally, I am using a regular expression that splits on the dot character (.). The problem is the academic title abbreviations inside the sentences: every abbreviation also ends with a dot (.), so my regex splits the paragraph in the wrong places.

Here is an example paragraph:

Meanwhile Rector of Bogor Agricultural University, Prof. Dr. Herry Suhardiyanto, in his remarks requested that the graduate students should keep on studying and will finalize their studies on time. Present in that general audience were the Deputy Dean of the Graduate School of Bogor Agricultural University, Dr.Dedi Jusadi, Secretary of the Graduate School for Doctoral Program of Bogor Agricultural University, Prof.Dr. Marimin.

Using only the dot (.) as the regex, I get:

Array (
[0] => Meanwhile Rector of Bogor Agricultural University, Prof
[1] => Dr
[2] => Herry Suhardiyanto, in his remarks requested that the graduate students should keep on studying and will finalize their studies on time
[3] => ...
)

And this is what I actually want:

Array (
[0] => Meanwhile Rector of Bogor Agricultural University, Prof. Dr. Herry Suhardiyanto, in his remarks requested that the graduate students should keep on studying and will finalize their studies on time
[1] => Present in that general audience were the Deputy Dean of the Graduate School of Bogor Agricultural University, Dr.Dedi Jusadi, Secretary of the Graduate School for Doctoral Program of Bogor Agricultural University, Prof.Dr. Marimin
)

Tags: php regex text
2 answers
够拽才男人
#2 · 2020-07-17 16:10

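This seems to work, though it relies on a PHP loop over the already-split fragments rather than on a pure regex:

$begin = array( 0=>'Meanwhile in geography,',
            1=>'Dr',
            2=>'Henry Suhardiyanto, in his remarks, stated that ',
            3=>'Dr',
            4=>'Prof',
            5=>'Jedi Dusadi was another ',
            6=>'Prof');

$exclusions = array("Dr", "Prof", "Mr", "Mrs");

// Build a new array instead of editing $begin while iterating over it.
$sentences = array();
$carry = '';
foreach ($begin as $fragment) {
    if (in_array($fragment, $exclusions)) {
        // A bare title: hold it (with its dot) for the next fragment.
        $carry .= $fragment . ". ";
        continue;
    }
    $sentences[] = $carry . $fragment;
    $carry = '';
}
if ($carry !== '') {
    // A trailing title with nothing after it.
    $sentences[] = rtrim($carry);
}
print_r($sentences);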
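
The loop above assumes that each title arrives as its own fragment. When the paragraph is split directly on the dot, as in the question, a title is more often the tail of a longer fragment (for example "... University, Prof"). Below is a minimal sketch for that case, assuming the fragments come from explode('.', $text). The function name mergeTitleFragments, the title list, and the shortened sample text are all hypothetical and only serve the illustration:

```php
<?php
// Hypothetical helper: rejoin fragments from explode('.', ...) whose tail is a title.
function mergeTitleFragments(array $fragments, array $titles): array
{
    // Escape each title, then match a buffer that ends with one of them as a whole word.
    $quoted = array_map(function ($title) {
        return preg_quote($title, '/');
    }, $titles);
    $endsWithTitle = '/\b(?:' . implode('|', $quoted) . ')$/';

    $sentences = array();
    $buffer = '';
    foreach ($fragments as $fragment) {
        $buffer .= $fragment;
        if (preg_match($endsWithTitle, $buffer)) {
            // The dot belonged to an abbreviation: restore it and keep collecting.
            $buffer .= '.';
            continue;
        }
        if (trim($buffer) !== '') {
            $sentences[] = trim($buffer);
        }
        $buffer = '';
    }
    if (trim($buffer) !== '') {
        // Leftover text, e.g. input that ended right after a title.
        $sentences[] = trim($buffer);
    }
    return $sentences;
}

// Shortened, made-up sample text in the style of the question.
$text = "Prof. Dr. Herry Suhardiyanto spoke first. Then came Dr.Dedi Jusadi and Prof.Dr. Marimin.";
print_r(mergeTitleFragments(explode('.', $text), array('Prof', 'Dr', 'Mr', 'Mrs', 'Ms')));
```

For the sample text this should give two sentences, "Prof. Dr. Herry Suhardiyanto spoke first" and "Then came Dr.Dedi Jusadi and Prof.Dr. Marimin". A sentence that genuinely ends in one of the listed words would be glued to the next one, so this is only a heuristic.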
霸刀☆藐视天下
#3 · 2020-07-17 16:27

You could use negative lookbehinds:

((?<!Prof)(?<!Dr)(?<!Mr)(?<!Mrs)(?<!Ms))\.

Add more titles if needed.

Explained demo here: http://regex101.com/r/xQ3xF9

And the code could look like this:

$text="Meanwhile Rector of Bogor Agricultural University, Prof. Dr. Herry Suhardiyanto, in his remarks about Mr. John requested that the graduate students should keep on studying and will finalize their studies on time. Present in that general audience were Mrs. Peterson of the Graduate School of Bogor Agricultural University, Dr.Dedi Jusadi, Secretary of the Graduate School for Doctoral Program of Bogor Agricultural University, Prof.Dr. Marimin.";

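$titles = array('(?<!Prof)', '(?<!Dr)', '(?<!Mr)', '(?<!Mrs)', '(?<!Ms)');
// Split on dots not preceded by a title. \s* consumes the space after the dot,
// and PREG_SPLIT_NO_EMPTY drops the empty piece after the final dot.
$sentences = preg_split('/(?:' . implode('', $titles) . ')\.\s*/', $text, -1, PREG_SPLIT_NO_EMPTY);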
print_r($sentences);
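
If the list of titles is likely to grow, the lookbehinds can be generated from plain strings instead of being written by hand. The following is a rough sketch of that idea, not a tested library routine. The function name splitSentences and the shortened sample text are hypothetical, and the added \b (so that only whole words count as titles) is an extra assumption on top of the pattern above:

```php
<?php
// Hypothetical helper: split on dots that are not preceded by one of the given titles.
function splitSentences(string $text, array $titles): array
{
    $lookbehinds = '';
    foreach ($titles as $title) {
        // preg_quote guards against regex metacharacters inside a title;
        // \b requires the title to start at a word boundary.
        $lookbehinds .= '(?<!\b' . preg_quote($title, '/') . ')';
    }
    // Consume whitespace after the dot and drop empty trailing pieces.
    return preg_split('/' . $lookbehinds . '\.\s*/', $text, -1, PREG_SPLIT_NO_EMPTY);
}

// Shortened, made-up sample text in the style of the question.
$text = "Prof. Dr. Herry Suhardiyanto spoke about Mr. John. Mrs. Peterson and Dr.Dedi Jusadi were present.";
print_r(splitSentences($text, array('Prof', 'Dr', 'Mr', 'Mrs', 'Ms')));
```

For the sample text this should return "Prof. Dr. Herry Suhardiyanto spoke about Mr. John" and "Mrs. Peterson and Dr.Dedi Jusadi were present". Each lookbehind has a fixed length, which is why one assertion per title is used rather than a single alternation of titles with different lengths.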