Independent clause boundary disambiguation, and independent clause segmentation

Posted 2020-07-11 05:12

Question:

I remember skimming the sentence segmentation section from the NLTK site a long time ago.

I use a crude text replacement, swapping "period, space" for "period, manual line break", to achieve sentence segmentation, such as with a Microsoft Word find-and-replace (. -> .^p) or a Chrome extension:

https://github.com/AhmadHassanAwan/Sentence-Segmentation

https://chrome.google.com/webstore/detail/sentence-segmenter/jfbhkblbhhigbgdnijncccdndhbflcha

This is instead of an NLP method like the Punkt tokenizer of NLTK.
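For comparison, here is a minimal sketch of both approaches in Python (assuming NLTK is installed and its Punkt model has been downloaded; the sample text is only an illustration):

    # Crude replacement vs. NLTK's pre-trained Punkt sentence tokenizer.
    # Assumes: pip install nltk, then nltk.download('punkt') once.
    import re
    import nltk

    text = "Dr. Smith went to Washington. He arrived at 5 p.m. and left early."

    # Crude approach: replace "period + space" with "period + line break".
    print(re.sub(r"\. ", ".\n", text))

    # NLP approach: Punkt sentence tokenizer.
    for sentence in nltk.sent_tokenize(text):
        print(sentence)

The crude replacement also breaks after "Dr." and "p.m.", which is exactly the kind of period ambiguity Punkt is designed to handle.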

I segment to help me more easily locate and reread sentences, which can sometimes help with reading comprehension.

What about independent clause boundary disambiguation, and independent clause segmentation? Are there any tools that attempt to do this?

Below is some example text. Wherever an independent clause can be identified within a sentence, there's a split. Starting from the end of each sentence, the process moves left and greedily splits:

E.g.

Sentence boundary disambiguation (SBD), also known as sentence breaking, is the problem in natural language processing of deciding where

sentences begin and end.

Often, natural language processing tools

require their input to be divided into sentences for a number of reasons.

However, sentence boundary identification is challenging because punctuation

marks are often ambiguous.

For example, a period may

denote an abbreviation, decimal point, an ellipsis, or an email address - not the end of a sentence.

About 47% of the periods in the Wall Street Journal corpus

denote abbreviations.[1]

As well, question marks and exclamation marks may

appear in embedded quotations, emoticons, computer code, and slang.

Another approach is to automatically

learn a set of rules from a set of documents where the sentence

breaks are pre-marked.

Languages like Japanese and Chinese

have unambiguous sentence-ending markers.

The standard 'vanilla' approach to

locate the end of a sentence:

(a) If

it's a period,

it ends a sentence.

(b) If the preceding

token is on my hand-compiled list of abbreviations, then

it doesn't end a sentence.

(c) If the next

token is capitalized, then

it ends a sentence.

This

strategy gets about 95% of sentences correct.[2]

Solutions have been based on a maximum entropy model.[3]

The SATZ architecture uses a neural network to

disambiguate sentence boundaries and achieves 98.5% accuracy.

(I’m not sure if I split it properly.)

If there are no means to segment independent clauses, are there any search terms that I can use to further explore this topic?

Thanks.

Answer 1:

To the best of my knowledge, there is no readily available tool to solve this exact problem. NLP systems do not usually get into the problem of identifying the different types of sentences and clauses defined by English grammar. There is one paper published at EMNLP that provides an algorithm using the SBAR tag in parse trees to identify independent and dependent clauses in a sentence.

You should find section 3 of this paper useful. It talks about English-language syntax in some detail, but I don't think the entire paper is relevant to your question.

Note that they used the Berkeley parser (demo available here), but you can obviously use any other constituency parsing tool (e.g. the Stanford parser, demo available here).
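As a rough sketch of how one of these parsers might be called programmatically (assuming a Stanford CoreNLP server is running locally on port 9000; CoreNLPParser lives in nltk.parse.corenlp):

    # Obtain a constituency parse tree from a locally running CoreNLP server.
    from nltk.parse.corenlp import CoreNLPParser

    parser = CoreNLPParser(url="http://localhost:9000")
    sentence = ("Sentence boundary identification is challenging "
                "because punctuation marks are often ambiguous.")
    tree = next(parser.raw_parse(sentence))  # an nltk.Tree with Penn Treebank labels
    tree.pretty_print()  # SBAR nodes, if any, mark subordinate clauses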



Answer 2:

Chthonic Project gives some good information here:

Clause Extraction using Stanford parser

Part of the answer:

It is probably better if you primarily use the constituency-based parse tree, and not the dependencies.

The clauses are indicated by the SBAR tag, which marks a clause introduced by a (possibly empty) subordinating conjunction.

All you need to do is the following:

  1. Identify the non-root clausal nodes in the parse tree.
  2. Remove (but retain separately) the subtrees rooted at these clausal nodes from the main tree.
  3. In the main tree (after removal of subtrees in step 2), remove any hanging prepositions, subordinating conjunctions, and adverbs.
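A minimal sketch of steps 1 and 2 in Python, assuming the sentence has already been parsed into a Penn Treebank-style bracketed tree (the parse string below is a made-up example, and step 3, pruning hanging conjunctions, is omitted for brevity):

    from nltk.tree import Tree

    # Hypothetical bracketed parse; any constituency parser emitting
    # Penn Treebank labels (Berkeley, Stanford, ...) can produce one.
    parse = Tree.fromstring(
        "(ROOT (S (NP (PRP I)) (VP (VBD left) "
        "(SBAR (IN because) (S (NP (PRP it)) (VP (VBD rained))))) (. .)))"
    )

    # Clausal tags other than plain S, so the top-level clause stays in the main tree.
    CLAUSE_LABELS = {"SBAR", "SBARQ", "SINV", "SQ"}

    def split_clauses(tree):
        """Steps 1-2: collect clausal subtrees and prune them from a copy of the tree."""
        main = tree.copy(deep=True)
        extracted = []
        # Visit positions in reverse order so deletions never shift
        # positions that have not been visited yet.
        for pos in sorted(main.treepositions(), reverse=True):
            node = main[pos]
            if pos != () and isinstance(node, Tree) and node.label() in CLAUSE_LABELS:
                extracted.append(node.copy(deep=True))
                del main[pos]
        return main, extracted

    main_tree, clauses = split_clauses(parse)
    print("Main clause:", " ".join(main_tree.leaves()))           # I left .
    print("Extracted:", [" ".join(c.leaves()) for c in clauses])  # ['because it rained']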

For a list of all clausal tags (and, in fact, all Penn Treebank tags), see this list: http://www.surdeanu.info/mihai/teaching/ista555-fall13/readings/PennTreebankConstituents.html

For an online parse-tree visualization, you may want to use the online Berkeley parser demo.

It helps a lot in forming a better intuition.

Here's the image generated for your example sentence:



Answer 3:

I don't know of any tools that do clause segmentation, but in Rhetorical Structure Theory there is a concept called the "elementary discourse unit" that works in a similar way to a clause. Elementary discourse units are, however, sometimes slightly smaller than clauses.

See section 2.0 of this manual for more information about this concept:

https://www.isi.edu/~marcu/discourse/tagging-ref-manual.pdf

There is some software available online that can segment sentences into their elementary discourse units, for instance:

http://alt.qcri.org/tools/discourse-parser/

and

https://github.com/jiyfeng/DPLP



Answer 4:

Via user YourWelcomeOrMine from the subreddit /r/LanguageTechnology/:

“I would check out Stanford's CoreNLP. I believe you can customize how a sentence is broken up.”
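As a rough sketch of the kind of customization that answer refers to (assuming a CoreNLP server is running at localhost:9000; the ssplit.* properties are standard options of CoreNLP's sentence-splitting annotator):

    # Send text to a CoreNLP server with custom sentence-splitting properties.
    import json
    import requests

    props = {
        "annotators": "tokenize,ssplit",
        "ssplit.newlineIsSentenceBreak": "two",  # treat blank lines as sentence breaks
        "outputFormat": "json",
    }
    text = "Dr. Smith went to Washington. He arrived at 5 p.m."
    resp = requests.post(
        "http://localhost:9000/",
        params={"properties": json.dumps(props)},
        data=text.encode("utf-8"),
    )
    for sent in resp.json()["sentences"]:
        print(" ".join(tok["word"] for tok in sent["tokens"]))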



Answer 5:

Via user Breakthrough from Superuser:

I've found different classifiers using the NPS Chat Corpus training set to be very effective for a similar application.
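For reference, a minimal sketch of one such classifier, following the dialogue-act example from the NLTK book (assumes nltk plus its punkt and nps_chat data have been downloaded; the bag-of-words features are only a starting point):

    import nltk
    from nltk.corpus import nps_chat

    # Each post in the NPS Chat Corpus carries a dialogue-act class label.
    posts = nps_chat.xml_posts()

    def bag_of_words(text):
        return {f"contains({w.lower()})": True for w in nltk.word_tokenize(text)}

    labeled = [(bag_of_words(p.text), p.get("class")) for p in posts]
    train, test = labeled[1000:], labeled[:1000]

    classifier = nltk.NaiveBayesClassifier.train(train)
    print("accuracy:", nltk.classify.accuracy(classifier, test))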