I'm following this FAQ https://nlp.stanford.edu/software/crf-faq.shtml to train my own classifier, and I noticed that the performance evaluation output does not match the results (or at least not in the way I expect). Specifically, this section:
    CRFClassifier tagged 16119 words in 1 documents at 13824.19 words per second.
     Entity      P        R       F1   TP  FP  FN
    MYLABEL  1.0000   0.9961   0.9980  255   0   1
     Totals  1.0000   0.9961   0.9980  255   0   1
I expect TP to be all instances where the predicted label matched the golden label, FP to be all instances where MYLABEL was predicted but the golden label was O, and FN to be all instances where O was predicted but the golden label was MYLABEL.
If I calculate those numbers myself from the program's output, I get completely different values, with no relation to what the program prints. I've tried this with various test files.
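For reference, this is roughly how I compute the token-level counts myself (a minimal Python sketch; gold and pred are assumed to be parallel lists of the gold and predicted labels for each token, read from the program's output):

    # Token-level counting, following the TP/FP/FN definitions above.
    # gold and pred are assumed parallel lists of per-token labels.
    def token_counts(gold, pred, label="MYLABEL"):
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g == "O" and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p == "O")
        return tp, fp, fn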
I'm using Stanford NER v3.7.0 (2016-10-31).
Am I missing something?
The F1 scores are over entities, not labels.
Example: suppose the gold annotation for the sentence "Barack Obama met Angela Merkel" labels Barack, Obama, Angela, and Merkel as PERSON and met as O. In this example there are two possible entities: Barack Obama and Angela Merkel.
Entities are created by taking all adjacent tokens with the same label (unless you use a more complicated BIO labeling scheme; BIO schemes have tags like I-PERSON and B-PERSON to indicate whether a token is the beginning of an entity, etc.).
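For illustration, here is a minimal Python sketch of entity-level scoring under that definition (assuming plain labels with O for non-entity tokens, not a BIO scheme; the function names are mine, not Stanford NER's):

    def extract_entities(labels):
        # An entity is a maximal run of adjacent tokens with the same non-O
        # label, recorded as (label, start, end) with end exclusive.
        entities, start = set(), None
        for i, lab in enumerate(labels + ["O"]):  # sentinel closes the last run
            if start is not None and (lab == "O" or lab != labels[start]):
                entities.add((labels[start], start, i))
                start = None
            if lab != "O" and start is None:
                start = i
        return entities

    def entity_counts(gold_labels, pred_labels):
        # A predicted entity counts as a TP only if its label, start, and
        # end all match a gold entity exactly.
        gold, pred = extract_entities(gold_labels), extract_entities(pred_labels)
        return len(gold & pred), len(pred - gold), len(gold - pred)

    # One boundary error is punished twice at the entity level:
    gold = ["O", "MYLABEL", "MYLABEL", "O", "MYLABEL"]
    pred = ["O", "MYLABEL", "O",       "O", "MYLABEL"]
    print(entity_counts(gold, pred))  # (1, 1, 1): entity TP=1, FP=1, FN=1
    # ...whereas token-level counting gives TP=2, FP=0, FN=1.

This double penalty for boundary and label errors is one reason token-level counts computed by hand generally won't match the entity-level numbers the program prints.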