How do you take a paragraph or a large amount of text and break it into sentences (preferably using Ruby), taking into account cases such as Mr., Dr., and U.S.A.? (Assume you just put the sentences into an array of arrays.)
UPDATE: One possible solution I thought of involves using a parts-of-speech tagger (POST) and a classifier to determine the end of a sentence:
Getting data from:
Mr. Jones felt the warm sun on his face as he stepped out onto the balcony of his summer home in Italy. He was happy to be alive.
CLASSIFIER:
Mr./PERSON Jones/PERSON felt/O the/O warm/O sun/O on/O his/O face/O as/O he/O stepped/O out/O onto/O the/O balcony/O of/O his/O summer/O home/O in/O Italy/LOCATION ./O He/O was/O happy/O to/O be/O alive/O ./O
POST:
Mr./NNP Jones/NNP felt/VBD the/DT warm/JJ sun/NN on/IN his/PRP$ face/NN as/IN he/PRP stepped/VBD out/RP onto/IN the/DT balcony/NN of/IN his/PRP$ summer/NN home/NN in/IN Italy./NNP He/PRP was/VBD happy/JJ to/TO be/VB alive./IN
Can we assume, since Italy is a location, that the period is a valid end of the sentence? Since ending on "Mr." would leave no other parts of speech, can we assume this is not a valid end-of-sentence period? Is this the best answer to my question?
Thoughts?
Breaking on a period followed by a space and a capitalized letter wouldn't fly for titles like "Mr. Brown."
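To see why, here is a minimal Ruby sketch of that naive rule. The constant and method names are made up for illustration:

```ruby
# Naive rule: break after a period when whitespace and a capital letter follow.
# NAIVE_BOUNDARY and naive_split are hypothetical names used only in this sketch.
NAIVE_BOUNDARY = /(?<=\.)\s+(?=[A-Z])/

def naive_split(text)
  text.split(NAIVE_BOUNDARY)
end

p naive_split("Mr. Brown felt the warm sun. He was happy.")
# prints ["Mr.", "Brown felt the warm sun.", "He was happy."]
```

The title "Mr." gets cut off as its own "sentence", which is exactly the problem.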
The periods make things difficult, but exclamation points and question marks are an easy case to handle. However, there are cases where even this would not work, e.g. the corporate name Yahoo!
I agree with the accepted answer; using Stanford CoreNLP is a no-brainer.
However, in 2016 there are some incompatibilities when interfacing the Stanford Parser with the later versions of Stanford CoreNLP (I had issues with Stanford CoreNLP v3.5).
Here is what I did to parse text into sentences using Ruby interfacing with Stanford CoreNLP:
Install the Stanford CoreNLP gem. It's still maintained and works; it has been a struggle to find NLP Ruby gems that work lately:
gem install stanford-core-nlp
Then follow the instructions in the README under "Using the latest version of the Stanford CoreNLP":
Then the Ruby code to split text into sentences:
I think this is not always solvable, but you could split on ". " (a period followed by a space) and verify that the word before the period isn't in a list of words like Mr, Dr, etc.
But, of course, your list may omit some words, and in that case you will get bad results.
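A rough sketch of that idea in Ruby might look like the following. The abbreviation list is a short illustrative assumption (a real one would need to be much longer), and `split_sentences` is just a name picked for this example:

```ruby
# Illustrative and deliberately incomplete list of abbreviations (an assumption).
ABBREVIATIONS = %w[Mr Mrs Ms Dr Prof Sr Jr St].freeze

def split_sentences(text)
  sentences = []
  current = +""
  # Split on whitespace that follows a period; the period stays with the left piece.
  text.split(/(?<=\.)\s+/).each do |piece|
    current << " " unless current.empty?
    current << piece
    # The word right before the final period of this piece, if there is one.
    last_word = piece[/(\S+)\.\z/, 1]
    # A known abbreviation means the period was not a sentence end, so keep going.
    next if ABBREVIATIONS.include?(last_word)
    sentences << current
    current = +""
  end
  sentences << current unless current.empty?
  sentences
end

p split_sentences("Mr. Jones felt the warm sun on his face. He was happy to be alive.")
# prints ["Mr. Jones felt the warm sun on his face.", "He was happy to be alive."]
```

Note that pieces are rejoined with a single space, so the original whitespace is not preserved, and any abbreviation missing from the list still causes a wrong split.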
Take a look at the Python sentence splitter in NLTK (Natural Language Toolkit):
Punkt sentence tokenizer
It's based on the following paper:
Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence Boundary Detection. Computational Linguistics 32: 485-525.
The approach in the paper is quite interesting. They reduce the problem of sentence splitting to the problem of determining how strongly a word is associated with following punctuation. The overloading of periods after abbreviations is responsible for most of the ambiguous periods, so if you can identify the abbreviations you can identify the sentence boundaries with a high probability.
I've tested this tool informally a bit and it seems to give good results for a variety of (human) languages.
Porting it to Ruby would be non-trivial, but it might give you some ideas.
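This is nowhere near a port of Punkt, but the basic intuition (a word that is almost always followed by a period is probably an abbreviation) can be illustrated with a very crude Ruby sketch. The method name, the thresholds, and the simple "always followed by a period" test are all assumptions made for this example; the paper's statistics are considerably more refined:

```ruby
# Crude heuristic: a short word type that occurs several times and is followed
# by a period every single time is flagged as a likely abbreviation.
# likely_abbreviations, min_count and max_length are hypothetical names/values.
def likely_abbreviations(text, min_count: 2, max_length: 4)
  total = Hash.new(0)        # occurrences of each word type
  with_period = Hash.new(0)  # occurrences directly followed by a period

  text.split.each do |token|
    word = token.sub(/\.\z/, "")
    next if word.empty?
    total[word] += 1
    with_period[word] += 1 if token.end_with?(".")
  end

  candidates = total.keys.select do |word|
    total[word] >= min_count &&
      word.length <= max_length &&
      with_period[word] == total[word]
  end
  candidates.sort
end

sample = "Mr. Jones met Dr. Lee. Dr. Lee greeted Mr. Jones. " \
         "Mr. Jones left. Lee stayed with Jones."
p likely_abbreviations(sample)
# prints ["Dr", "Mr"]
```

Ordinary words such as "Jones" also show up without a trailing period, so they are not flagged. The detected abbreviations could then feed a list-based splitter instead of a hand-written list.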
Well, obviously
paragraph.split('.')
won't cut it.
#split will take a regex as an argument, so you might try using a zero-width lookbehind to check for a word starting with a capital letter. Of course, this will split on proper nouns, so you may have to resort to a regex like this:
/(Mr\.|Mrs\.|U\.S\.A ...)
which would be horrendously ugly unless you built the regex programmatically.
I'm not a Ruby guy, but a RegEx that split on
would be my best bet, once you've got the paragraph (split on \r\n). This assumes that your sentences are properly cased.
Obviously this is a fairly ugly RegEx. What about forcing two spaces between sentences?
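If the text really does follow a two-space convention, the split becomes simple. Here is a minimal Ruby sketch, assuming the author typed two or more spaces after every sentence and that nothing (an HTML renderer, for instance) has collapsed the whitespace; `split_on_double_space` is a made-up name:

```ruby
# Split after sentence-ending punctuation that is followed by two or more spaces.
# A single space (as in "Mr. Jones" or "Yahoo! again") never counts as a boundary.
def split_on_double_space(text)
  text.strip.split(/(?<=[.!?]) {2,}/)
end

p split_on_double_space("Mr. Jones felt the warm sun.  He was happy.  Was it Yahoo! again?")
# prints ["Mr. Jones felt the warm sun.", "He was happy.", "Was it Yahoo! again?"]
```

This moves the burden onto whoever writes the text, so it only helps when you control the input.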