I would like to divide a text into sentences in PHP. I'm currently using a regex, which gives ~95% accuracy, and I would like to improve on that with a better approach. I've seen NLP tools that do this in Perl, Java, and C, but I haven't found anything that fits PHP. Do you know of such a tool?
I was using this regex:
Won't work on a sentence starting with a number, but should have very few false positives as well. Of course what you are doing matters as well. My program now uses
because I decided speed was more important than accuracy.
@ridgerunner I ported your PHP code to C#.
I get 2 sentences as the result:
The correct result should be the single sentence: Mr. J. Dujardin régle sa T.V.A. en esp. uniquement
And with our test paragraph
the result is
C# code:
Try these -
https://stackoverflow.com/questions/366284/natural-language-identification-in-php
http://pear.php.net/package/Text_LanguageDetect
Slight improvement on someone else's work:
An enhanced regex solution
Assuming you do care about handling Mr. and Mrs. etc. abbreviations, then the following single regex solution works pretty well:
Note that you can easily add or take away abbreviations from the expression. Given the following test paragraph:
Here is the output from the script:
Sentence[1] = [This is sentence one.]
Sentence[2] = [Sentence two!]
Sentence[3] = [Sentence three?]
Sentence[4] = [Sentence "four".]
Sentence[5] = [Sentence "five"!]
Sentence[6] = [Sentence "six"?]
Sentence[7] = [Sentence "seven."]
Sentence[8] = [Sentence 'eight!']
Sentence[9] = [Dr. Jones said: "Mrs. Smith you have a lovely daughter!"]
Sentence[10] = [The T.V.A. is a big project!]
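A regex in the spirit of the one described above might look like the following minimal sketch. The abbreviation list (Mr., Mrs., Ms., Dr., T.V.A.) and the variable names are assumptions for illustration, not necessarily the exact expression from this answer. The test paragraph is simply the ten sentences from the output, joined by single spaces.

```php
<?php
// Illustrative pattern (an assumption, not necessarily the answer's exact regex).
// The x flag lets the pattern carry whitespace and comments.
$re = '/
    (?<= [.!?] | [.!?][\'"] )                      # after end punctuation, optionally followed by a quote
    (?<! Mr\. | Mrs\. | Ms\. | Dr\. | T\.V\.A\. )  # but not right after a listed abbreviation
    \s+                                            # split on the whitespace between sentences
    (?= \S )                                       # and never on trailing whitespace
/x';

// The ten sentences from the output, joined by single spaces.
$text = 'This is sentence one. Sentence two! Sentence three? '
      . 'Sentence "four". Sentence "five"! Sentence "six"? '
      . 'Sentence "seven." Sentence \'eight!\' '
      . 'Dr. Jones said: "Mrs. Smith you have a lovely daughter!" '
      . 'The T.V.A. is a big project!';

$sentences = preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);

foreach ($sentences as $i => $sentence) {
    printf("Sentence[%d] = [%s]\n", $i + 1, $sentence);
}
```

Each alternative inside a lookbehind has to be fixed-length, which is why every abbreviation is spelled out as its own top-level branch.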
The essential regex solution
The author of the question commented that the above solution "overlooks many options" and is not generic enough. I'm not sure what that means, but the essence of the above expression is about as clean and simple as you can get. Here it is:
Note that both solutions correctly identify sentences ending with a quotation mark after the ending punctuation. If you don't care about matching sentences ending in a quotation mark the regex can be simplified to just:
/(?<=[.!?])\s+(?=\S)/
Edit: 20130820_1000 Added T.V.A. (another punctuated word to be ignored) to regex and test string. (to answer PapyRef's comment question)
Edit: 20130820_1800 Tidied and renamed regex and added shebang. Also fixed regexes to prevent splitting text on trailing whitespace.
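To see what the simplified expression does in practice, here is a short sketch that feeds it to preg_split. The sample text and variable names are made up for illustration. Since this version has no abbreviation handling, it also splits after "Dr.".

```php
<?php
// The simplified pattern quoted above: split on whitespace that follows
// ., ! or ? and is itself followed by a non-space character.
$re = '/(?<=[.!?])\s+(?=\S)/';

// Hypothetical sample input (note the double space before "Dr.").
$text = 'Sentence one. Sentence two!  Dr. Jones agreed.';

$parts = preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);

print_r($parts);
// Four pieces: "Sentence one.", "Sentence two!", "Dr." and "Jones agreed."
```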
Build a list of abbreviations like this
Compile them into an expression
Finally, run this preg_split to break the text into sentences.
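The three steps can be sketched roughly as follows. The abbreviation list and the function names build_sentence_splitter and split_sentences are hypothetical, chosen only for this example, and the pattern is one possible way to compile the list, not necessarily the one this answer used.

```php
<?php
// Step 1: a hypothetical abbreviation list; extend it for your own texts.
$abbreviations = ['Mr.', 'Mrs.', 'Ms.', 'Dr.', 'Prof.', 'Jr.', 'Sr.', 'T.V.A.'];

// Step 2: compile the list into one pattern with a negative lookbehind.
// Each abbreviation becomes its own fixed-length alternative.
function build_sentence_splitter(array $abbreviations): string
{
    $alternatives = [];
    foreach ($abbreviations as $abbr) {
        // \b keeps "Dr." from matching the tail of a longer word.
        $alternatives[] = '\b' . preg_quote($abbr, '/');
    }
    return '/(?<=[.!?])(?<!' . implode('|', $alternatives) . ')\s+(?=\S)/';
}

// Step 3: split the text on the compiled pattern.
function split_sentences(string $text, array $abbreviations): array
{
    $pattern = build_sentence_splitter($abbreviations);
    return preg_split($pattern, trim($text), -1, PREG_SPLIT_NO_EMPTY);
}

$sentences = split_sentences(
    'Dr. Jones met Mrs. Smith. The T.V.A. is a big project! Is it?',
    $abbreviations
);
print_r($sentences);
```

Keeping the abbreviations in an array means the list can grow (or be loaded from a file) without anyone editing the regex by hand.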
And if you're processing HTML, watch for tags such as <p></p> getting deleted, which eliminates the space between sentences.
If you have situations.Like this where.They stick together, it becomes immensely more difficult to parse.
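One way to avoid that, sketched here under the assumption that the markup is simple enough for a regex, is to replace each tag with a space instead of deleting it. The helper name html_to_text is hypothetical.

```php
<?php
// Hypothetical helper: turn HTML into plain text without fusing sentences.
function html_to_text(string $html): string
{
    // Put a space where each tag was, so adjacent blocks stay separated.
    $spaced = preg_replace('/<[^>]+>/', ' ', $html);
    // Decode entities such as &amp; and &nbsp;.
    $decoded = html_entity_decode($spaced, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Collapse the whitespace runs that the substitution leaves behind.
    return trim(preg_replace('/\s+/', ' ', $decoded));
}

$html = '<p>If you have situations.</p><p>Like this.</p>';

$fused  = strip_tags($html);    // "If you have situations.Like this."
$spaced = html_to_text($html);  // "If you have situations. Like this."

echo $fused, "\n", $spaced, "\n";
```

The trade-off is that inline tags in the middle of a word (for example bold on part of a word) also turn into spaces, so this only suits text where tags fall between words.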