How do you parse a paragraph of text into sentence

How do you take paragraph or large amount of text and break it into sentences (perferably using Ruby) taking into account cases such as Mr. and Dr. and U.S.A? (Assuming you just put the sentences into an array of arrays)

UPDATE: One possible solution I thought of involves using a parts-of-speech tagger (POST) and a classifier to determine the end of a sentence:

Getting data from Mr. Jones felt the warm sun on his face as he stepped out onto the balcony of his summer home in Italy. He was happy to be alive.

CLASSIFIER Mr./PERSON Jones/PERSON felt/O the/O warm/O sun/O on/O his/O face/O as/O he/O stepped/O out/O onto/O the/O balcony/O of/O his/O summer/O home/O in/O Italy/LOCATION ./O He/O was/O happy/O to/O be/O alive/O ./O

POST Mr./NNP Jones/NNP felt/VBD the/DT warm/JJ sun/NN on/IN his/PRP$ face/NN as/IN he/PRP stepped/VBD out/RP onto/IN the/DT balcony/NN of/IN his/PRP$ summer/NN home/NN in/IN Italy./NNP He/PRP was/VBD happy/JJ to/TO be/VB alive./IN

Can we assume, since Italy is a location, the period is the valid end of the sentence? Since ending on "Mr." would have no other parts-of-speech, can we assume this is not a valid end-of-sentence period? Is this the best answer to the my question?

Thoughts?

标签： ruby text parsing split nlp

15条回答

老娘就宠你

2楼-- · 2020-01-29 05:07

This is a hard problem if you really care about getting it right. You'll find that NLP parser packages probably provide this functionality. If you want something faster, you'll need to end up duplicating some of that functionality with a trained probabilistic function of a window of tokens (you'd probably want to count a line feed as a token, since i may drop a period if it's the end of a paragraph).

Edit: I recommend the Stanford parser if you can use Java. I have no recommendation for other languages, but I'm very interested in hearing what else is out there that is open source.

0人赞添加讨论(0) 举报

叼着烟拽天下

3楼-- · 2020-01-29 05:07

I've not tried it but if English is the only language you are concerned with I'd suggest giving Lingua::EN::Readability a look.

Lingua::EN::Readability is a Ruby module which calculates statistics on English text. It can supply counts of words, sentences and syllables. It can also calculate several readability measures, such as a Fog Index and a Flesch-Kincaid level. The package includes the module Lingua::EN::Sentence, which breaks English text into sentences heeding abbreviations, and Lingua::EN::Syllable, which can guess the number of syllables in a written English word. If a pronouncing dictionary is available it can look up the number of syllables in the dictionary for greater accuracy

The bit you want is in sentence.rb as follows:

module Lingua
module EN
# The module Lingua::EN::Sentence takes English text, and attempts to split it
# up into sentences, respecting abbreviations.

module Sentence
  EOS = "\001" # temporary end of sentence marker

  Titles   = [ 'jr', 'mr', 'mrs', 'ms', 'dr', 'prof', 'sr', 'sen', 'rep', 
         'rev', 'gov', 'atty', 'supt', 'det', 'rev', 'col','gen', 'lt', 
         'cmdr', 'adm', 'capt', 'sgt', 'cpl', 'maj' ]

  Entities = [ 'dept', 'univ', 'uni', 'assn', 'bros', 'inc', 'ltd', 'co', 
         'corp', 'plc' ]

  Months   = [ 'jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 
         'aug', 'sep', 'oct', 'nov', 'dec', 'sept' ]

  Days     = [ 'mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun' ]

  Misc     = [ 'vs', 'etc', 'no', 'esp', 'cf' ]

  Streets  = [ 'ave', 'bld', 'blvd', 'cl', 'ct', 'cres', 'dr', 'rd', 'st' ]

  @@abbreviations = Titles + Entities + Months + Days + Streets + Misc

  # Split the passed text into individual sentences, trim these and return
  # as an array. A sentence is marked by one of the punctuation marks ".", "?"
  # or "!" followed by whitespace. Sequences of full stops (such as an
  # ellipsis marker "..." and stops after a known abbreviation are ignored.
  def Sentence.sentences(text)

    text = text.dup

    # initial split after punctuation - have to preserve trailing whitespace
    # for the ellipsis correction next
    # would be nicer to use look-behind and look-ahead assertions to skip
    # ellipsis marks, but Ruby doesn't support look-behind
    text.gsub!( /([\.?!](?:\"|\'|\)|\]|\})?)(\s+)/ ) { $1 << EOS << $2 }

    # correct ellipsis marks and rows of stops
    text.gsub!( /(\.\.\.*)#{EOS}/ ) { $1 }

    # correct abbreviations
    # TODO - precompile this regex?
    text.gsub!( /(#{@@abbreviations.join("|")})\.#{EOS}/i ) { $1 << '.' }

    # split on EOS marker, strip gets rid of trailing whitespace
    text.split(EOS).map { | sentence | sentence.strip }
  end

  # add a list of abbreviations to the list that's used to detect false
  # sentence ends. Return the current list of abbreviations in use.
  def Sentence.abbreviation(*abbreviations)
    @@abbreviations += abbreviations
    @@abbreviations
  end
end
end
end

0人赞添加讨论(0) 举报

老娘就宠你

4楼-- · 2020-01-29 05:08

Try looking at the Ruby wrapper around the Stanford Parser. It has a getSentencesFromString() function.

0人赞添加讨论(0) 举报

▲ chillily

5楼-- · 2020-01-29 05:08

THANKS!

I really enjoyed this discussion, so I got interested in the parser. I tried it and I wrote down some notes on how to get everything working with Ruby and or Rails!

Trying to go with the regular expression was a nightmare..

my 2 cents

0人赞添加讨论(0) 举报

对你真心纯属浪费

6楼-- · 2020-01-29 05:09

Looks like this ruby gem might do the trick.

https://github.com/zencephalon/Tactful_Tokenizer

0人赞添加讨论(0) 举报

对你真心纯属浪费

7楼-- · 2020-01-29 05:09

The answer by Dr. Manning is the most appropriate if you are considering the JAVA (and Ruby too in hard way ;)). It is here-

There is a sentence splitter: edu.stanford.nlp.process.DocumentPreprocessor . Try the command: java edu.stanford.nlp.process.DocumentPreprocessor /u/nlp/data/lexparser/textDocument.txt

oneTokenizedSentencePerLine.txt . (This is done via a (good but heuristic) FSM, so it's fast; you're not running the probabilistic parser.)

But a little suggestion if we modify the command java edu.stanford.nlp.process.DocumentPreprocessor /u/nlp/data/lexparser/textDocument.txt > oneTokenizedSentencePerLine.txt TO java edu.stanford.nlp.process.DocumentPreprocessor -file /u/nlp/data/lexparser/textDocument.txt > oneTokenizedSentencePerLine.txt . It will work fine as you need to specify what kind of file is being presented as input. So -file for Text file, -html for HTML, etc.

0人赞添加讨论(0) 举报

1 2 3 下一页

How do you parse a paragraph of text into sentence

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间