How do you parse a paragraph of text into sentences?

Posted 2020-01-29 05:00

How do you take a paragraph or a large amount of text and break it into sentences (preferably using Ruby), taking into account cases such as Mr., Dr., and U.S.A.? (Assume you just put the sentences into an array of arrays.)

UPDATE: One possible solution I thought of involves using a parts-of-speech tagger (POST) and a classifier to determine the end of a sentence:

Getting data from: Mr. Jones felt the warm sun on his face as he stepped out onto the balcony of his summer home in Italy. He was happy to be alive.

CLASSIFIER: Mr./PERSON Jones/PERSON felt/O the/O warm/O sun/O on/O his/O face/O as/O he/O stepped/O out/O onto/O the/O balcony/O of/O his/O summer/O home/O in/O Italy/LOCATION ./O He/O was/O happy/O to/O be/O alive/O ./O

POST: Mr./NNP Jones/NNP felt/VBD the/DT warm/JJ sun/NN on/IN his/PRP$ face/NN as/IN he/PRP stepped/VBD out/RP onto/IN the/DT balcony/NN of/IN his/PRP$ summer/NN home/NN in/IN Italy./NNP He/PRP was/VBD happy/JJ to/TO be/VB alive./IN

Can we assume, since Italy is a location, that the period is a valid end of the sentence? Since ending on "Mr." would leave no other parts of speech, can we assume this is not a valid end-of-sentence period? Is this the best answer to my question?

Thoughts?

15 answers
爱情/是我丢掉的垃圾
Reply #2 · 2020-01-29 05:09

Breaking on a period followed by a space and a capitalized letter wouldn't fly for titles like "Mr. Brown."

The periods make things difficult; exclamation points and question marks are an easier case to handle. Even so, there are cases where this breaks, e.g. the corporate name Yahoo!
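To make both failure modes concrete, here is a minimal sketch of the naive rule (split after `.`, `!` or `?` when followed by whitespace and a capital letter). The names `NAIVE_BOUNDARY` and `naive_split` are made up for this illustration.

```ruby
# Naive rule: a sentence ends at ".", "!" or "?" when followed by
# whitespace and a capital letter. Hypothetical helper, for illustration only.
NAIVE_BOUNDARY = /(?<=[.!?])\s+(?=[A-Z])/

def naive_split(text)
  text.split(NAIVE_BOUNDARY)
end

# The title "Mr." is wrongly treated as a sentence of its own.
p naive_split("Mr. Brown went home. He was tired.")
# => ["Mr.", "Brown went home.", "He was tired."]

# The "!" in the company name is wrongly treated as a sentence end.
p naive_split("I use Yahoo! Mail every day.")
# => ["I use Yahoo!", "Mail every day."]
```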

看我几分像从前
Reply #3 · 2020-01-29 05:16

I agree with the accepted answer: using Stanford Core NLP is a no-brainer.

However, as of 2016 there are some incompatibilities when interfacing the Stanford Parser with later versions of Stanford Core NLP (I had issues with Stanford Core NLP v3.5).

Here is what I did to parse text into sentences using Ruby interfacing with Stanford Core NLP:

  1. Install the Stanford CoreNLP gem. It's still maintained and works; it has been a struggle to find NLP Ruby gems that work lately:

    gem install stanford-core-nlp

  2. Then follow the instructions in the README under "Using the latest version of the Stanford CoreNLP":

Using the latest version of the Stanford CoreNLP (version 3.5.0 as of 31/10/2014) requires some additional manual steps:

  • Download Stanford CoreNLP version 3.5.0 from http://nlp.stanford.edu/.

  • Place the contents of the extracted archive inside the /bin/ folder of the stanford-core-nlp gem (e.g. [...]/gems/stanford-core-nlp-0.x/bin/) or inside the directory location configured by setting StanfordCoreNLP.jar_path.

  • Download the full Stanford Tagger version 3.5.0 from http://nlp.stanford.edu/.
  • Make a directory named 'taggers' inside the /bin/ folder of the stanford-core-nlp gem (e.g. [...]/gems/stanford-core-nlp-0.x/bin/) or inside the directory configured by setting StanfordCoreNLP.jar_path.
  • Place the contents of the extracted archive inside taggers directory.
  • Download the bridge.jar file from https://github.com/louismullie/stanford-core-nlp.
  • Place the downloaded bridge.jar file inside the /bin/ folder of the stanford-core-nlp gem (e.g. [...]/gems/stanford-core-nlp-0.x/bin/taggers/) or inside the directory configured by setting StanfordCoreNLP.jar_path.

Then the ruby code to split text into sentences:

require "stanford-core-nlp"

# I downloaded Stanford CoreNLP to a custom path:
StanfordCoreNLP.jar_path = "/home/josh/stanford-corenlp-full-2014-10-31/"

StanfordCoreNLP.use :english
StanfordCoreNLP.model_files = {}
StanfordCoreNLP.default_jars = [
  'joda-time.jar',
  'xom.jar',
  'stanford-corenlp-3.5.0.jar',
  'stanford-corenlp-3.5.0-models.jar',
  'jollyday.jar',
  'bridge.jar'
]

# Only tokenization and sentence splitting are needed here.
pipeline = StanfordCoreNLP.load(:tokenize, :ssplit)

text = 'Mr. Josh Weir is writing some code. ' +
  'I am Josh Weir Sr. my son may be Josh Weir Jr. etc. etc.'
annotation = StanfordCoreNLP::Annotation.new(text)
pipeline.annotate(annotation)
annotation.get(:sentences).each { |s| puts "sentence: " + s.to_s }

#output:
#sentence: Mr. Josh Weir is writing some code.
#sentence: I am Josh Weir Sr. my son may be Josh Weir Jr. etc. etc.
做自己的国王
Reply #4 · 2020-01-29 05:19

I don't think this is always solvable, but you could split on ". " (a period followed by a space) and verify that the word before the period isn't in a list of words like Mr, Dr, etc.

But, of course, your list may omit some words, and in that case you will get bad results.
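A minimal sketch of this idea follows. It walks the text word by word instead of literally splitting on ". ", which has the same effect for a period followed by a space. The abbreviation list and the names `KNOWN_ABBREVIATIONS` and `split_sentences` are assumptions made for the example.

```ruby
require "set"

# Illustrative, deliberately incomplete list of words that may carry a
# trailing period without ending a sentence.
KNOWN_ABBREVIATIONS = Set.new(%w[Mr Mrs Ms Dr Sr Jr]).freeze

# Hypothetical helper: end a sentence at every word that finishes with "."
# unless the word (minus the period) is a known abbreviation.
def split_sentences(text, abbreviations = KNOWN_ABBREVIATIONS)
  sentences = []
  current = []
  text.split(" ").each do |word|
    current << word
    next unless word.end_with?(".")
    next if abbreviations.include?(word.chomp("."))
    sentences << current.join(" ")
    current = []
  end
  sentences << current.join(" ") unless current.empty?
  sentences
end

p split_sentences("Mr. Brown met Dr. Smith. They talked.")
# => ["Mr. Brown met Dr. Smith.", "They talked."]

# "Inc" is missing from the list, so the result is wrong, as warned above.
p split_sentences("He works at Acme Inc. in Boston.")
# => ["He works at Acme Inc.", "in Boston."]
```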

爷的心禁止访问
Reply #5 · 2020-01-29 05:22

Take a look at the Python sentence splitter in NLTK (Natural Language Toolkit):

Punkt sentence tokenizer

It's based on the following paper:

Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence Boundary Detection. Computational Linguistics 32: 485-525.

The approach in the paper is quite interesting. They reduce the problem of sentence splitting to the problem of determining how strongly a word is associated with following punctuation. The overloading of periods after abbreviations is responsible for most of the ambiguous periods, so if you can identify the abbreviations you can identify the sentence boundaries with a high probability.

I've tested this tool informally a bit and it seems to give good results for a variety of (human) languages.

Porting it to Ruby would be non-trivial, but it might give you some ideas.
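As a rough idea of where such a port might start, here is a toy Ruby sketch of the "how strongly is a word associated with a following period" intuition. It is not the Punkt algorithm: the paper uses likelihood-based statistics, while this uses a plain ratio. The name `likely_abbreviations` and the thresholds `min_count` and `min_ratio` are assumptions made for the example.

```ruby
# Toy heuristic, NOT the Kiss & Strunk method: a word that (almost) always
# appears with a trailing period in the corpus is guessed to be an abbreviation.
def likely_abbreviations(text, min_count: 2, min_ratio: 0.9)
  with_period = Hash.new(0)
  total = Hash.new(0)

  text.split.each do |token|
    word = token.chomp(".")
    next if word.empty?
    total[word] += 1
    with_period[word] += 1 if token.end_with?(".")
  end

  # Keep words seen often enough that nearly always carry a period.
  total.keys.select do |word|
    total[word] >= min_count &&
      with_period[word].to_f / total[word] >= min_ratio
  end
end

corpus = "Mr. Jones met Mr. Smith in Italy. Italy was warm. " \
         "Mr. Smith liked Italy and Mr. Jones liked Smith."
p likely_abbreviations(corpus)
# => ["Mr"]
```

"Italy" and "Smith" also appear before a period, but they occur without one often enough that the ratio stays low, so only "Mr" is flagged.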

Ridiculous、
Reply #6 · 2020-01-29 05:25

Well, obviously paragraph.split('.') won't cut it.

#split will take a regex as an argument, so you might try using a zero-width lookahead to check for a word starting with a capital letter. Of course, this will split on proper nouns, so you may have to resort to a regex like this /(Mr\.|Mrs\.|U\.S\.A ...) which would be horrendously ugly unless you built the regex programmatically.
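One way to build such a regex programmatically is sketched below, assuming a hand-maintained abbreviation list. The names `TITLE_ABBREVIATIONS` and `build_boundary_regex` are hypothetical. The alternatives are joined as plain text so they sit at the top level of the negative lookbehind, which is where Ruby's regex engine accepts alternatives of different lengths.

```ruby
# Illustrative list; extend it for real text.
TITLE_ABBREVIATIONS = %w[Mr Mrs Ms Dr U.S.A].freeze

# Hypothetical helper: build a boundary regex that matches whitespace after
# ".", "!" or "?" and before a capital letter, unless the text just before
# the whitespace is one of the listed abbreviations plus a period.
def build_boundary_regex(abbreviations)
  alternatives = abbreviations.map { |a| Regexp.escape("#{a}.") }.join("|")
  /(?<!#{alternatives})(?<=[.!?])\s+(?=[A-Z])/
end

boundary = build_boundary_regex(TITLE_ABBREVIATIONS)

p "Mr. Jones felt the warm sun in Italy. He was happy to be alive.".split(boundary)
# => ["Mr. Jones felt the warm sun in Italy.", "He was happy to be alive."]
```

A known limitation of this sketch: a sentence that really does end in a listed abbreviation (for example "... in the U.S.A. He left.") will not be split.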

戒情不戒烟
Reply #7 · 2020-01-29 05:26

I'm not a Ruby guy, but a RegEx that split on

 ^(Mr|Mrs|Ms|Mme|Sta|Sr|Sra|Dr|U\.S\.A)[\.\!\?\"] [A-Z]

would be my best bet, once you've got the paragraph (split on \r\n). This assumes that your sentences are properly cased.

Obviously this is a fairly ugly RegEx. What about forcing two spaces between sentences?
