I would like to create a custom NER model. This is what I did:
TRAINING DATA (stanford-ner.tsv):
Hello O
! O
My O
name O
is O
Damiano PERSON
. O
PROPERTIES (stanford-ner.prop):
trainFile = stanford-ner.tsv
serializeTo = ner-model.ser.gz
map = word=0,answer=1
maxLeft=1
useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useDisjunctive=true
useSequences=true
usePrevSequences=true
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
useGazettes=true
gazette=gazzetta.txt
cleanGazette=true
GAZETTE (gazzetta.txt):
PERSON John
PERSON Andrea
I build the model via command line with:
java -classpath "stanford-ner.jar:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -prop stanford-ner.prop
And test with:
java -classpath "stanford-ner.jar:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier ner-model.ser.gz -textFile test.txt
I did two tests with the following texts:
>>> TEST 1 <<<
TEXT: Hello! My name is Damiano and this is a fake text to test.
OUTPUT Hello/O !/O My/O name/O is/O Damiano/PERSON and/O this/O is/O a/O fake/O text/O to/O test/O ./O
>>> TEST 2 <<<
TEXT: Hello! My name is John and this is a fake text to test.
OUTPUT Hello/O !/O My/O name/O is/O John/O and/O this/O is/O a/O fake/O text/O to/O test/O ./O
As you can see, only the "Damiano" entity is found. This entity is in my training data, whereas "John" (second test) is only in the gazette. So the question is:
Why is the "John" entity not recognized?
Thank you so much in advance.
It looks to me like your minimal example should add "Damiano" to the gazetteer under the PERSON category. Currently, the training data lets the model learn that "Damiano" is labeled PERSON, but I think nothing ties that label to the gazetteer's categories (i.e. having PERSON on both sides is not sufficient).
As Stanford FAQ says,
Btw, it is not good practice to test a machine learning pipeline in a 'unit-test' way, i.e. with only one or two examples, because it is meant to work on a much larger volume of data and, more importantly, it is probabilistic by nature.
If you want to check whether your gazette file is actually used, it may be better to take the existing examples (see the bottom of the page linked above for the austen.gaz.prop and austen.gaz.txt examples), replace several names with your own, and then check. If it fails, first try changing your test, e.g. add more names, reformulate the text and so on.
The gazette only helps by extracting extra features from the training data. If its words never occur in your training data and have no connection to labeled tokens, your model will not benefit from it. One experiment I would suggest is to add Damiano to your gazette.
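A quick way to check this before training is to see which gazette entries ever appear in the training file. The sketch below is a hypothetical helper, not part of Stanford NER. It assumes the whitespace-separated formats shown in the question (one "token label" pair per training line, one "CLASS phrase" entry per gazette line), and it only tests for word overlap, which is a rough proxy for whether a gazette feature can fire on any training token.

```python
def read_training_tokens(lines):
    """Collect the tokens (first column) from 'token label' training lines."""
    tokens = set()
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            tokens.add(parts[0])
    return tokens


def read_gazette(lines):
    """Parse 'CLASS phrase' gazette lines into (class, [words]) pairs."""
    entries = []
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            entries.append((parts[0], parts[1:]))
    return entries


def gazette_coverage(training_lines, gazette_lines):
    """Split gazette entries into those seen / not seen in the training tokens.

    Assumption: an entry counts as "seen" if any of its words occurs as a
    training token. This is only a proxy for the real feature extraction.
    """
    tokens = read_training_tokens(training_lines)
    seen, unseen = [], []
    for label, words in read_gazette(gazette_lines):
        entry = (label, " ".join(words))
        if any(word in tokens for word in words):
            seen.append(entry)
        else:
            unseen.append(entry)
    return seen, unseen


# Data copied from the question (stanford-ner.tsv and gazzetta.txt).
training = ["Hello O", "! O", "My O", "name O", "is O", "Damiano PERSON", ". O"]
gazette = ["PERSON John", "PERSON Andrea"]

seen, unseen = gazette_coverage(training, gazette)
print("seen in training:", seen)
print("never seen in training:", unseen)
```

With the files from the question, neither John nor Andrea occurs in the training data, so both end up in the "never seen" list. Adding "PERSON Damiano" to the gazette moves one entry into the "seen" list, which is the situation the suggested experiment sets up.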