When I use CRF++ 0.58 to train an NER model, the program reports a problem:
"reading training data:tagger.cpp(399) [feature_index_->buildFeatures(this)] 0.00s"
- Development environment:
- Red Hat Linux 6.5, gcc 5.0, CRF++ 0.58
- Feature template:
- template
- Dataset:
- Boson_train.txt
- Boson_test.txt
- The first column is the word, the second column is the POS tag, and the third column is the NER tag.
- The problem:
- When I try to train the NER model, I run the command "crf_learn -f 3 -c 4.0 template Boson_train crf_model" and get this message: "reading training data:tagger.cpp(399) [feature_index_->buildFeatures(this)] 0.00s". I don't know C++, so I can't work out the cause from the source code.
- What I tried:
- 1. Changed the encoding of the dataset. I used Notepad++ to convert it from "UTF-8 without BOM" to "UTF-8". It didn't work.
- 2. Changed the delimiter from '\t' to ' ' (space). It didn't work.
- 3. I thought the template might be wrong, so I tested with crf++0.58/example/seg/template. It worked. But that template is simple, so I tried /example/JapaneseNE/template, which is more similar to my feature template. It didn't work. Then I checked the JapaneseNE example itself, and it works well. So I am confused. Can someone help me?
template
- U00:%x[-2,0]
- U01:%x[-1,0]
- U02:%x[0,0]
- U03:%x[1,0]
- U04:%x[2,0]
- U05:%x[-2,0]/%x[-1,0]/%x[0,0]
- U06:%x[-1,0]/%x[0,0]/%x[1,0]
- U07:%x[0,0]/%x[1,0]/%x[2,0]
- U08:%x[-1,0]/%x[0,0]
- U09:%x[0,0]/%x[1,0]
- U10:%x[-2,1]/%x[0,1]
- U11:%x[-2,1]/%x[1,1]
- U11:%x[-1,1]/%x[0,1]
- U12:%x[0,0]/%x[0,1]
- U13:%x[0,1]/%x[1,1]
- U14:%x[0,1]/%x[2,1]
- U15:%x[-1,0]/%x[0,1]
- U16:%x[-1,0]/%x[-1,1]
- U17:%x[1,0]/%x[1,1]
- U18:%x[1,0]/%x[1,1]
- U19:%x[2,0]/%x[2,1]
- U20:%x[-1,2]
- U21:%x[-2,2]
- U22:%x[0,1]/%x[-1,2]
- U23:%x[0,1]/%x[-2,2]
- U24:%x[0,0]/%x[-1,2]
- U25:%x[0,0]/%x[-2,2]
- U26:%x[-1,2]/%x[-2,2]/%x[0,1]
- U27:%x[-2,2]/%x[0,1]/%x[1,1]
- U28:%x[-1,1]/%x[-1,2]/%x[0,1]
- U29:%x[-1,2]/%x[0,0]/%x[0,1]
- Boson_train
- 浙江 ns B_product_name
- 在线 b I_product_name
- 杭州 ns I_product_name
- 4 m B_time
- 月 m I_time
- 25 m I_time
- 日 m I_time
- 讯 ng Out
- ( x Out
- 记者 n Out
- x Out
- x B_person_name
- 施宇翔 nr I_person_name
- x Out
- 通讯员 n B_person_name
- x Out
- 方英 nr B_person_name
- ) x Out
- 毒贩 n Out
- 很 zg Out
- “ x Out
- 时髦 nr Out
- ” x Out
- , x Out
- 用 p Out
- 微信 vn B_product_name
- 交易 n Out
- 毒品 n Out
- 。 x Out
- 没 v Out
- 料想 v Out
- 警方 n B_person_name
- 也 d Out
You were debugging in the right direction. The issue is indeed with your template file.
Your training data has 3 columns (column 0: word, column 1: pos-tag, and column 2: tag). You cannot use the tag column as a feature, because CRF++ treats the last column as the answer tag to be predicted, but your template file references it (i.e., column 2) in many feature definitions (see U20 to U29). Your training should work after removing or correcting these. Hope this helps :)
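If it helps, here is a minimal Python sketch of how you could check a template against your data before running crf_learn. It is not part of CRF++; the function names `column_counts` and `bad_template_lines` are made up for illustration, and it assumes whitespace-separated columns and the usual `%x[row,col]` macro syntax. It reports how many columns your token lines have and which template lines reference the tag column or anything beyond it.

```python
import re

# Matches CRF++ template macros such as %x[-1,2]; captures row and column.
MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")


def column_counts(data_lines):
    """Return the set of column counts seen across non-empty token lines."""
    return {len(line.split()) for line in data_lines if line.strip()}


def bad_template_lines(template_lines, n_columns):
    """Return template lines whose macros point at the tag column or beyond.

    With n_columns columns in the data, the last one (index n_columns - 1)
    is the answer tag, so only columns 0 .. n_columns - 2 may be used.
    """
    max_feature_col = n_columns - 2
    bad = []
    for line in template_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = [int(col) for _row, col in MACRO.findall(line)]
        if any(col > max_feature_col for col in cols):
            bad.append(line)
    return bad


if __name__ == "__main__":
    # Small inline samples taken from the question; swap in your own files.
    data = ["浙江 ns B_product_name", "在线 b I_product_name", ""]
    template = [
        "U02:%x[0,0]",
        "U12:%x[0,0]/%x[0,1]",
        "U20:%x[-1,2]",
        "U22:%x[0,1]/%x[-1,2]",
    ]
    counts = column_counts(data)
    print("column counts:", counts)
    for line in bad_template_lines(template, max(counts)):
        print("references the tag column:", line)
```

If `column_counts` returns more than one value for your real file, some token lines have a different number of columns than others, which is worth fixing too, since CRF++ expects the same number of columns on every token line.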
You can also check out these video tutorials for a better understanding of template files and training NER with CRF++:
1) https://youtu.be/GJHeTvDkIaE
2) https://youtu.be/Ur5umC4BwN4