I remember skimming the sentence segmentation section from the NLTK site a long time ago.
To achieve sentence segmentation, I use a crude text replacement that turns "period, space" into "period, manual line break", for example a Microsoft Word find-and-replace (. -> .^p) or a Chrome extension:
https://github.com/AhmadHassanAwan/Sentence-Segmentation
https://chrome.google.com/webstore/detail/sentence-segmenter/jfbhkblbhhigbgdnijncccdndhbflcha
This is instead of an NLP method like the Punkt tokenizer of NLTK.
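For illustration, here is a minimal Python sketch of that crude replacement, showing where it goes wrong on abbreviations (the kind of ambiguity a trained tokenizer like Punkt is designed to resolve):

```python
import re

def naive_segment(text):
    # Crude "period + space" -> "period + line break" replacement,
    # equivalent to Word's ". " -> ".^p" find-and-replace.
    return re.sub(r'\. ', '.\n', text).split('\n')

# Splits wrongly after the abbreviation "Dr." -- the ambiguity
# that NLTK's Punkt tokenizer learns to handle from data.
print(naive_segment("Dr. Smith arrived. He sat down."))
# → ['Dr.', 'Smith arrived.', 'He sat down.']
```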
I segment to help me more easily locate and reread sentences, which can sometimes help with reading comprehension.
What about independent clause boundary disambiguation, and independent clause segmentation? Are there any tools that attempt to do this?
Below is some example text. Wherever an independent clause can be identified within a sentence, there is a split; starting from the end of each sentence, the procedure moves left and splits greedily:
E.g.
Sentence boundary disambiguation
(SBD), also known as sentence
breaking, is the problem in natural
language processing of deciding where
sentences begin and end.
Often, natural language processing
tools
require their input to be divided into
sentences for a number of reasons.
However, sentence boundary
identification is challenging because punctuation
marks are often ambiguous.
For example, a period may
denote an abbreviation, decimal point,
an ellipsis, or an email address - not
the end of a sentence.
About 47% of the periods in the Wall
Street Journal corpus
denote abbreviations.[1]
As well, question marks and
exclamation marks may
appear in embedded quotations,
emoticons, computer code, and slang.
Another approach is to automatically
learn a set of rules from a set of
documents where the sentence
breaks are pre-marked.
Languages like Japanese and Chinese
have unambiguous sentence-ending
markers.
The standard 'vanilla' approach to
locate the end of a sentence:
(a) If
it's a period,
it ends a sentence.
(b) If the preceding
token is on my hand-compiled list of
abbreviations, then
it doesn't end a sentence.
(c) If the next
token is capitalized, then
it ends a sentence.
This
strategy gets about 95% of sentences
correct.[2]
Solutions have been based on a maximum
entropy model.[3]
The SATZ architecture uses a neural
network to
disambiguate sentence boundaries and
achieves 98.5% accuracy.
(I’m not sure if I split it properly.)
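As an aside, the three-rule "vanilla" heuristic quoted in the example above can be sketched in a few lines of Python. The abbreviation list here is illustrative, not the hand-compiled one the rules assume, and the later rules override the earlier ones in the order given:

```python
# Illustrative (not hand-compiled) abbreviation list.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "al.", "e.g.", "i.e.", "etc."}

def is_sentence_end(tokens, i):
    tok = tokens[i]
    ends = tok.endswith(".")                    # (a) a period ends a sentence...
    if ends and tok.lower() in ABBREVIATIONS:
        ends = False                            # (b) ...unless it's a known abbreviation...
    if tok.endswith(".") and i + 1 < len(tokens) and tokens[i + 1][:1].isupper():
        ends = True                             # (c) ...but a capitalized next token ends it anyway.
    return ends

def segment(text):
    tokens = text.split()
    sentences, start = [], 0
    for i in range(len(tokens)):
        if is_sentence_end(tokens, i):
            sentences.append(" ".join(tokens[start:i + 1]))
            start = i + 1
    if start < len(tokens):
        sentences.append(" ".join(tokens[start:]))
    return sentences

print(segment("Confer with Smith et al. for details. Then decide."))
# → ['Confer with Smith et al. for details.', 'Then decide.']
```

Note that rule (c) overriding rule (b) means an abbreviation immediately followed by a capitalized word (e.g. "Dr. Smith") still triggers a split, which is one reason this strategy tops out around 95%.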
If there are no means to segment independent clauses, are there any search terms that I can use to further explore this topic?
Thanks.
To the best of my knowledge, there is no readily available tool that solves this exact problem. NLP systems usually do not address identifying the different types of sentences and clauses defined by English grammar. There is one paper published in EMNLP that provides an algorithm using the SBAR
tag in parse trees to identify independent and dependent clauses in a sentence.
You should find section 3 of this paper useful. It discusses English syntax in some detail, though I don't think the entire paper is relevant to your question.
Note that they have used the Berkeley parser (demo available here), but you can obviously use any other constituency parsing tool (e.g. the Stanford parser demo available here).
Chthonic Project gives some good information here:
Clause Extraction using Stanford parser
Part of the answer:
It is probably better if you primarily use the constituency-based
parse tree, and not the dependencies.
The clauses are indicated by the SBAR tag, which marks a clause
introduced by a (possibly empty) subordinating conjunction.
All you need to do is the following:
- Identify the non-root clausal nodes in the parse tree
- Remove (but retain separately) the subtrees rooted at these clausal nodes from the main tree.
- In the main tree (after removal of subtrees in step 2), remove any hanging prepositions, subordinating conjunctions and adverbs.
For a list of all clausal tags (and, in fact, all Penn Treebank tags),
see this list:
http://www.surdeanu.info/mihai/teaching/ista555-fall13/readings/PennTreebankConstituents.html
For an online parse-tree visualization, you may want to use the
online Berkeley parser demo.
It helps a lot in forming a better intuition.
Here's the image generated for your example sentence:
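To make the three steps above concrete, here is a rough Python sketch. It operates on a hand-written bracketed parse string (in a real pipeline this would be the output of the Berkeley or Stanford parser), and the tiny reader below is only a stand-in for a proper Treebank reader:

```python
def parse(s):
    """Read a Penn-Treebank-style bracketed string into (label, children)
    tuples; leaves are plain word strings."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    def helper(i):
        label = tokens[i + 1]   # tokens[i] is the opening "("
        i += 2
        children = []
        while tokens[i] != ")":
            if tokens[i] == "(":
                child, i = helper(i)
            else:
                child, i = tokens[i], i + 1
            children.append(child)
        return (label, children), i + 1
    tree, _ = helper(0)
    return tree

def words(node):
    # Yield of a subtree, left to right.
    if isinstance(node, str):
        return [node]
    return [w for child in node[1] for w in words(child)]

def extract_sbar(node, found):
    """Step 1-2: detach SBAR subtrees from the main tree, collecting their
    yields separately."""
    if isinstance(node, str):
        return node
    label, children = node
    kept = []
    for child in children:
        if not isinstance(child, str) and child[0] == "SBAR":
            found.append(" ".join(words(child)))
        else:
            kept.append(extract_sbar(child, found))
    return (label, kept)

# Toy parse of "He left because it rained." (hand-written, illustrative).
tree = parse("(S (NP (PRP He)) (VP (VBD left) (SBAR (IN because) "
             "(S (NP (PRP it)) (VP (VBD rained))))) (. .))")
clauses = []
main = extract_sbar(tree, clauses)
print(" ".join(words(main)))   # main clause: "He left ."
print(clauses)                 # detached subordinate clause(s)
```

Step 3 (pruning hanging prepositions, conjunctions, and adverbs left in the main tree) is omitted here; it would be a similar traversal that drops IN/RB nodes whose clausal sibling was removed.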
I don't know of any tools that do clause segmentation, but in rhetorical structure theory there is a concept called the "elementary discourse unit" (EDU), which works much like a clause. EDUs are sometimes, however, slightly smaller than clauses.
See section 2.0 of this manual for more information about this concept:
https://www.isi.edu/~marcu/discourse/tagging-ref-manual.pdf
Some software available online can segment sentences into their elementary discourse units, for instance:
http://alt.qcri.org/tools/discourse-parser/
and
https://github.com/jiyfeng/DPLP
Via user YourWelcomeOrMine from the subreddit /r/LanguageTechnology/:
“I would check out Stanford's CoreNLP. I believe you can customize how
a sentence is broken up.”
Via user Breakthrough from Superuser:
I've found different classifiers using
the NPS Chat Corpus training set to be
very effective for a similar
application.