Entities on my gazette are not recognized

I would like to create a custom NER model. That's what i did:

TRAINING DATA (stanford-ner.tsv):

Hello    O
!    O
My    O
name    O
is    O
Damiano    PERSON
.    O

PROPERTIES (stanford-ner.prop):

trainFile = stanford-ner.tsv
serializeTo = ner-model.ser.gz
map = word=0,answer=1
maxLeft=1
useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useDisjunctive=true
useSequences=true
usePrevSequences=true
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
useGazettes=true
gazette=gazzetta.txt
cleanGazette=true

GAZZETTE gazzetta.txt):

PERSON John
PERSON Andrea

I build the model via command line with:

java -classpath "stanford-ner.jar:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier  -prop stanford-ner.prop

And test with:

java -classpath "stanford-ner.jar:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier  -loadClassifier ner-model.ser.gz -textFile test.txt

I did two tests with the following texts:

>>> TEST 1 <<<

TEXT: Hello! My name is Damiano and this is a fake text to test.
OUTPUT Hello/O !/O My/O name/O is/O Damiano/PERSON and/O this/O is/O a/O fake/O text/O to/O test/O ./O

>>> TEST 2 <<<

TEXT: Hello! My name is John and this is a fake text to test.
OUTPUT Hello/O !/O My/O name/O is/O John/O and/O this/O is/O a/O fake/O text/O to/O test/O ./O

As you can see only "Damiano" entity is found. This entity is in my training data but "John" (second test) is inside the gazzette. So the question is.

Why does John entity is not recognized ?

Thank you so much in advance.

标签： machine-learning nlp stanford-nlp named-entity-recognition

3条回答

SAY GOODBYE

2楼-- · 2019-02-22 18:08

Why does John entity is not recognized ?

It looks to me that your minimal example should most probably add "Damiano" to the gazetteer as a PERSON category. Currently, the training data allows the model to learn that "Damiano" is a PERSON label, but I think this is not related to the gazetteer categories (i.e. having PERSON on both sides is not sufficient).

0人赞添加讨论(0) 举报

等我变得足够好

3楼-- · 2019-02-22 18:16

As Stanford FAQ says,

If a gazette is used, this does not guarantee that words in the gazette are always used as a member of the intended class, and it does not guarantee that words outside the gazette will not be chosen. It simply provides another feature for the CRF to train against. If the CRF has higher weights for other features, the gazette features may be overwhelmed.

If you want something that will recognize text as a member of a class if and only if it is in a list of words, you might prefer either the regexner or the tokensregex tools included in Stanford CoreNLP. The CRF NER is not guaranteed to accept all words in the gazette as part of the expected class, and it may also accept words outside the gazette as part of the class.

Btw, it is not a good practice to test machine learning pipelines in a 'unit-test'-way, i.e. with only one or two examples, because it is supposed to work on much greater volume of data and, more importantly, it is probabilistic by nature.

If you want to check if your gazette file is actually used, it may be better to take existent examples (see the bottom of the page linked above for austen.gaz.prop and austen.gaz.txt examples) and replace multiple names by your own ones, then check. If it fails, firstly try to change your test, e.g. add more names, reformulate text and so on.

0人赞添加讨论(0) 举报

Entities on my gazette are not recognized

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间