I'm having some trouble understanding the changes made to the coref resolver in the latest version of the Stanford NLP tools. As an example, below is a sentence and the corresponding CorefChainAnnotation:
The atom is a basic unit of matter, it consists of a dense central nucleus surrounded by a cloud of negatively charged electrons.
{1=[1 1, 1 2], 5=[1 3], 7=[1 4], 9=[1 5]}
I am not sure I understand the meaning of these numbers. Looking at the source doesn't really help either.
Thank you
The first number is a cluster id (representing tokens that stand for the same entity); see the source code of SieveCoreferenceSystem#coref(Document). The number pairs are the output of CorefChain#toString(), where position is a set of position pairs of the entity's mentions (to get them, use CorefChain.getCorefMentions()).
Here is a complete example (in Groovy) that shows how to get from positions to tokens:
Output (I do not understand where 's' comes from):
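As a separate illustration of going from positions to tokens, here is a minimal Java sketch that does not call CoreNLP at all. The Mention class is a hypothetical stand-in that only models three position fields (sentNum, startIndex, endIndex), and it assumes 1-based indices with an exclusive end index; check that assumption against the version of the library you are using.

```java
import java.util.Arrays;
import java.util.List;

class MentionSpanSketch {
    // Hypothetical stand-in for a coreference mention; only positions are modelled.
    static final class Mention {
        final int sentNum;    // assumed 1-based sentence number
        final int startIndex; // assumed 1-based index of the first token
        final int endIndex;   // assumed 1-based index one past the last token

        Mention(int sentNum, int startIndex, int endIndex) {
            this.sentNum = sentNum;
            this.startIndex = startIndex;
            this.endIndex = endIndex;
        }
    }

    // Returns the text covered by a mention, given the tokens of every sentence.
    static String textOf(Mention m, List<List<String>> sentences) {
        List<String> tokens = sentences.get(m.sentNum - 1);
        // Shift both 1-based bounds down by one to get a 0-based, end-exclusive range.
        return String.join(" ", tokens.subList(m.startIndex - 1, m.endIndex - 1));
    }

    public static void main(String[] args) {
        // Hand-tokenized prefix of the example sentence, not annotator output.
        List<List<String>> sentences = Arrays.asList(Arrays.asList(
            "The", "atom", "is", "a", "basic", "unit", "of", "matter", ",", "it", "consists"));
        System.out.println(textOf(new Mention(1, 1, 3), sentences));
        System.out.println(textOf(new Mention(1, 10, 11), sentences));
    }
}
```

With the hand-built spans in main, the two lookups cover the tokens "The atom" and "it". An off-by-one in either bound is the usual cause of stray tokens showing up in this kind of output.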
These are the recent results from the annotator.
The markings are as follows:
The text belonging to the same cluster refers to the same context.
I've been working with the coreference dependency graph, and I started by using the other answer to this question. After a while, though, I realized that the algorithm above is not exactly correct. The output it produced is not even close to what my modified version produces.
For anyone else who uses this article, here is the algorithm I ended up with. It also filters out self-references, because every representativeMention also mentions itself and many mentions reference only themselves.
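A minimal sketch of the self-reference filtering step described above, written in plain Java without CoreNLP; it is not the exact code behind the output reported in this answer. Chain and Mention are hypothetical stand-ins, and the sample chains in main are built by hand rather than taken from the annotator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CorefLinkSketch {
    // Hypothetical stand-in for a mention: its position and surface text.
    static final class Mention {
        final int sentNum;
        final int startIndex;
        final String text;

        Mention(int sentNum, int startIndex, String text) {
            this.sentNum = sentNum;
            this.startIndex = startIndex;
            this.text = text;
        }

        // Two mentions are the same if they start at the same place.
        boolean samePosition(Mention other) {
            return sentNum == other.sentNum && startIndex == other.startIndex;
        }
    }

    // Hypothetical stand-in for a chain: the mention list includes the representative.
    static final class Chain {
        final Mention representative;
        final List<Mention> mentions;

        Chain(Mention representative, List<Mention> mentions) {
            this.representative = representative;
            this.mentions = mentions;
        }
    }

    // Lists "mention -> representative" links, skipping every self-reference.
    static List<String> links(List<Chain> chains) {
        List<String> out = new ArrayList<>();
        for (Chain chain : chains) {
            for (Mention m : chain.mentions) {
                if (m.samePosition(chain.representative)) {
                    continue; // the representative mentioning itself
                }
                out.add(m.text + " -> " + chain.representative.text);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hand-built chains for illustration only.
        Mention war = new Mention(1, 1, "The Revolutionary War");
        Mention it = new Mention(1, 9, "it");
        Mention us = new Mention(1, 15, "the United States");
        List<Chain> chains = Arrays.asList(
            new Chain(war, Arrays.asList(war, it)),
            new Chain(us, Arrays.asList(us)));
        System.out.println(links(chains));
    }
}
```

Comparing positions rather than text keeps two distinct mentions with the same wording (for example two occurrences of "it") from being mistaken for a self-reference. Chains whose only mention is the representative contribute nothing to the result.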
And the final output for your example sentence is the following:
Usually "the atom" ends up being the representative mention, but surprisingly in this case it doesn't. Another example, with slightly more accurate output, is the following sentence:
The Revolutionary War occurred during the 1700s and it was the first war in the United States.
produces the following output: