Creating a simple concept graph from unstructured text

Posted 2019-05-16 21:10

I need to parse unstructured text and convert the relevant concepts into a triplet format, so that all the triplets can be merged to form a graph. For example, if I have two sentences like "A improves B" and "B improves C", I should be able to create a graph like

A ---> B (improves), B ---> C (improves).

Later on, if asked a question like "What is the use of A?", the system should provide an answer like "A improves B and C".
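To make the goal concrete, here is a minimal sketch of the merge-and-query step only. It assumes the (subject, relation, object) triplets have already been extracted from the text; the function names `build_graph`, `reachable`, and `describe` are hypothetical and not taken from any library.

```python
from collections import defaultdict, deque

def build_graph(triplets):
    """Merge (subject, relation, object) triplets into an adjacency map."""
    graph = defaultdict(list)
    for subject, relation, obj in triplets:
        edge = (relation, obj)
        if edge not in graph[subject]:  # skip duplicate facts
            graph[subject].append(edge)
    return graph

def reachable(graph, start, relation):
    """Breadth-first walk that follows only edges labelled `relation`."""
    seen = {start}
    found = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for rel, target in graph.get(node, []):
            if rel == relation and target not in seen:
                seen.add(target)
                found.append(target)
                queue.append(target)
    return found

def describe(graph, start, relation):
    """Turn the reachable targets into a one-line answer."""
    targets = reachable(graph, start, relation)
    if not targets:
        return f"No '{relation}' facts known about {start}"
    return f"{start} {relation} {' and '.join(targets)}"

graph = build_graph([("A", "improves", "B"), ("B", "improves", "C")])
print(describe(graph, "A", "improves"))
```

Following the relation transitively is a design choice made here to match the "A improves B and C" answer; whether a relation is really transitive depends on the domain.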

As far as I know, there is no library that does this directly. I tried POS tagging using the Stanford OpenNLP lib, followed by triplet formation and combining the triplets. However, this leads to many special cases.

What is the best way to do this? Will ontology-based parsing help?

1 answer
祖国的老花朵
Reply #2 · 2019-05-16 21:50

This is an interesting problem.... one of my favorites :)

I did something like this once, and I took a hybrid approach: some pieces were NLP, others were simple rules. In my particular case, I was generating a graph based on organization entities (extracted with NER), and then using a verb phrase categorizer (based on rules and regex). In essence, I ran NER on each sentence and got some solid org names. Then I ran the sentence chunker on the same sentence and parsed out the verb phrases. Then I used a simple keyword->concept regex to categorize each verb phrase.

I did not try to use the position of each in the sentence to infer any kind of graph directionality, so I just ended up writing triplets of {EntityA, EntityB, VerbPhrases[], VerbCategories[]} to an index. Obviously I had to make sure my org entities were not the same tokens as the verb phrase in noisy sentences, and I assumed that co-occurrence within the sentence was justification enough to create an edge between the two entities. This is just a concrete example of what I did. It has flaws, but in practice it worked very well and enabled very powerful searches.

My approach did not account for correlation across adjacent sentences (the data source I was using generally had long-winded sentences), but I did contemplate using a proximity-based scoring technique to assign a probability of actual correlation when joining entities from nearby sentences, likely using paragraph boundaries as another heuristic to help with validation.
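A rough sketch of the record-building step described above might look like the following. It is not the original code: it assumes the entity names and verb phrases for a sentence have already been produced by whatever NER and chunking tools you use, and the keyword patterns, category names, and the function names `categorize` and `sentence_to_edges` are all made up for illustration.

```python
import re
from itertools import combinations

# Hypothetical keyword -> concept patterns; a real list would be domain-specific.
CONCEPT_PATTERNS = {
    "acquisition": re.compile(r"\b(acquir\w*|bought|buy\w*|purchas\w*)\b", re.I),
    "partnership": re.compile(r"\b(partner\w*|collaborat\w*)\b", re.I),
}

def categorize(verb_phrase):
    """Return the sorted concept names whose pattern matches the verb phrase."""
    return sorted(name for name, pattern in CONCEPT_PATTERNS.items()
                  if pattern.search(verb_phrase))

def sentence_to_edges(entities, verb_phrases):
    """Build one undirected record per entity pair found in a single sentence.

    `entities` and `verb_phrases` are assumed to come from upstream NER and
    chunking steps that are not shown here.
    """
    # Drop verb phrases that share tokens with an entity (noisy sentences).
    entity_tokens = {tok.lower() for ent in entities for tok in ent.split()}
    clean_phrases = [vp for vp in verb_phrases
                     if not entity_tokens & {tok.lower() for tok in vp.split()}]
    categories = sorted({cat for vp in clean_phrases for cat in categorize(vp)})
    # Co-occurrence in the sentence is treated as enough to create an edge.
    return [
        {"EntityA": a, "EntityB": b,
         "VerbPhrases": clean_phrases, "VerbCategories": categories}
        for a, b in combinations(sorted(set(entities)), 2)
    ]

for edge in sentence_to_edges(["Acme Corp", "Globex"], ["has acquired"]):
    print(edge)
```

Entity pairs are sorted before pairing only so that the same two entities always land in the same EntityA/EntityB slots, since no direction is being inferred.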

There are many ways to "attempt" this. All of them will suck in some way or another, and edge cases will be bountiful and interesting; it's about pragmatism and what you are trying to enable. In fact, I predict that coreference resolution will be your next problem (when the entity in sentence A is referred to as "he" or "her" in subsequent sentences, etc.), and your next problem after that will be cross-document entity resolution (Bob in DocA may or may not be Bob from DocB). Also, I highly doubt anything will produce the triplet format for you; YOU will have to create it using the tokens that NER gives you from the sentences.

HTH
