How do you parse a paragraph of text into sentence-第3页回答

How do you take paragraph or large amount of text and break it into sentences (perferably using Ruby) taking into account cases such as Mr. and Dr. and U.S.A? (Assuming you just put the sentences into an array of arrays)

UPDATE: One possible solution I thought of involves using a parts-of-speech tagger (POST) and a classifier to determine the end of a sentence:

Getting data from Mr. Jones felt the warm sun on his face as he stepped out onto the balcony of his summer home in Italy. He was happy to be alive.

CLASSIFIER Mr./PERSON Jones/PERSON felt/O the/O warm/O sun/O on/O his/O face/O as/O he/O stepped/O out/O onto/O the/O balcony/O of/O his/O summer/O home/O in/O Italy/LOCATION ./O He/O was/O happy/O to/O be/O alive/O ./O

POST Mr./NNP Jones/NNP felt/VBD the/DT warm/JJ sun/NN on/IN his/PRP$ face/NN as/IN he/PRP stepped/VBD out/RP onto/IN the/DT balcony/NN of/IN his/PRP$ summer/NN home/NN in/IN Italy./NNP He/PRP was/VBD happy/JJ to/TO be/VB alive./IN

Can we assume, since Italy is a location, the period is the valid end of the sentence? Since ending on "Mr." would have no other parts-of-speech, can we assume this is not a valid end-of-sentence period? Is this the best answer to the my question?

Thoughts?

标签： ruby text parsing split nlp

15条回答

啃猪蹄的小仙女

2楼-- · 2020-01-29 05:29

Unfortunately I'm not a ruby guy but maybe an example in perl will get you headed in the right direction. Using a non matching look behind for the ending punctuation then some special cases in a not behind followed by any amount of space followed by look ahead for a capital letter. I'm sure this isn't perfect but I hope it points you in the right direction. Not sure how you would know if U.S.A. is actually at the end of the sentence...

#!/usr/bin/perl

$string = "Mr. Thompson is from the U.S.A. and is 75 years old. Dr. Bob is a dentist. This is a string that contains several sentances. For example this is one. Followed by another. Can it deal with a question?  It sure can!";

my @sentances = split(/(?:(?<=\.|\!|\?)(?<!Mr\.|Dr\.)(?<!U\.S\.A\.)\s+(?=[A-Z]))/, $string);

for (@sentances) {
    print $_."\n";
}

0人赞添加讨论(0) 举报

我只想做你的唯一

3楼-- · 2020-01-29 05:31

Just to make it clear, there is no simple solution to that. This is topic of NLP research as a quick Google search shows.

However, it seems that there are some open source projects dealing with NLP supporting sentence detection, I found the following Java-based toolset:

openNLP

Additional comment: The problem of deciding where sentences begin and end is also called sentence boundary disambiguation (SBD) in natural language processing.

0人赞添加讨论(0) 举报

神经病院院长

4楼-- · 2020-01-29 05:33

Maybe try splitting it up by a period followed by a space followed by an uppercase letter? I'm not sure how to find uppercase letters, but that would be the pattern I'd start looking at.

Edit: Finding uppercase letters with Ruby.

Another Edit:

Check for sentence ending punctuation that follow words that don't start with uppercase letters.

0人赞添加讨论(0) 举报

上一页 1 2 3

How do you parse a paragraph of text into sentence

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间