I have a complex sentence and I need to separate it into main and dependent clause.
For example for the sentence
ABC cites the fact that chemical additives are banned in many countries and feels they may be banned in this state too.
The split required
1)ABC cites the fact
2)chemical additives are banned in many countries
3)ABC feels they may be banned in this state too.
I think I could use the Stanford Parser tree or dependencies, but I am not sure how to proceed from here.
The tree
(ROOT (S (NP (NNP ABC)) (VP (VBZ cites) (NP (DT the) (NN fact)) (SBAR (IN that) (S (NP (NN chemical) (NNS additives)) (VP (VP (VBP are) (VP (VBN banned) (PP (IN in) (NP (JJ many) (NNS countries))))) (CC and) (VP (VBZ feels) (SBAR (S (NP (PRP they)) (VP (MD may) (VP (VB be) (VP (VBN banned) (PP (IN in) (NP (DT this) (NN state))) (ADVP (RB too)))))))))))) (. .)))
and the collapsed dependency parse
nsubj(cites-2, ABC-1) root(ROOT-0, cites-2) det(fact-4, the-3) dobj(cites-2, fact-4) mark(banned-9, that-5) nn(additives-7, chemical-6) nsubjpass(banned-9, additives-7) nsubj(feels-14, additives-7) auxpass(banned-9, are-8) ccomp(cites-2, banned-9) amod(countries-12, many-11) prep_in(banned-9, countries-12) ccomp(cites-2, feels-14) conj_and(banned-9, feels-14) nsubjpass(banned-18, they-15) aux(banned-18, may-16) auxpass(banned-18, be-17) ccomp(feels-14, banned-18) det(state-21, this-20) prep_in(banned-18, state-21) advmod(banned-18, too-22)
It is probably better if you primarily use the constituenty-based parse tree, and not the dependencies. The dependencies will be helpful, but only after the main work is done! I am going to explain this towards the end of my answer.
This is because constituency-parse is based on phrase structure grammar, which is the most relevant if you are seeking to extract clauses from a sentence. It can be done using dependencies as well, but in that case, you will essentially be reconstructing the phrase structure -- starting from the root and looking at dependent nodes (e.g.
ABC
andfacts
are the nominal subject and direct object of the verbcites
, and so on ... ).It is helpful to visualize the parse tree, however. In your example, the clauses are indicated by the SBAR tag, which is a clause introduced by a (possibly empty) subordinating conjunction. All you need to do is the following:
In step 3, what I mean by "hanging" is that any prepositions, etc. whose dependency has been removed in step 2. E.g., from "ABC cites the fact that", you need to remove the preposition/subordinating-conjunction "that" because its dependent node "banned" was removed in step 2. You will thus have three independent clauses:
The only issue here is the connection ABC--feels. For this, note that both "banned" and "feels" are complements of the verb "cites", and hence, have the same subject, which is "ABC"! And you're done. When this is done, you will get a fourth clause, "ABC feels", which is something you may or may not want to include in your final result.
For a list of all clausal tags (and, in fact, all Penn Treebank tags), see this list: http://www.surdeanu.info/mihai/teaching/ista555-fall13/readings/PennTreebankConstituents.html
For an online parse-tree visualization, you may want to use the online Berkeley parser demo. It helps a lot in forming a better intuition. Here's the image generated for your example sentence:
Caveats
Apart from these three pitfalls, the above algorithm should work quite accurately.