Failure when training an NE model with CRF++ 0.58

Posted 2019-09-14 16:43

When I use CRF++ 0.58 to train an NE model, the program fails with this message:

"reading training data:tagger.cpp(399) [feature_index_->buildFeatures(this)] 0.00s"

  1. Development environment:
    • Red Hat Linux 6.5, gcc 5.0, CRF++ 0.58
  2. Feature template file:
    • template
  3. Dataset:
    • Boson_train.txt
    • Boson_test.txt
    • The first column is the word, the second column is the POS tag, and the third column is the NER tag.
  4. The problem:
    • When I try to train the NER model, I run the command "crf_learn -f 3 -c 4.0 template Boson_train crf_model" and get this message: "reading training data:tagger.cpp(399) [feature_index_->buildFeatures(this)] 0.00s". I don't know C++, so I can't fix the problem myself.
  5. What I tried:
    • 1. Changed the encoding of the dataset. I used Notepad++ to convert it from "UTF-8 without BOM" to "UTF-8". It didn't work.
    • 2. Changed the delimiter from '\t' to ' ' (space). It didn't work.
    • 3. I thought the template might be wrong, so I tested with crf++0.58/example/seg/template. It worked. But that template is simple, so I tried /example/JapaneseNE/template, which is closer to my feature template. It didn't work. Then I ran the JapaneseNE example itself, and it works well. So I am confused. Can anyone help me?
  6. template

    • U00:%x[-2,0]
    • U01:%x[-1,0]
    • U02:%x[0,0]
    • U03:%x[1,0]
    • U04:%x[2,0]
    • U05:%x[-2,0]/%x[-1,0]/%x[0,0]
    • U06:%x[-1,0]/%x[0,0]/%x[1,0]
    • U07:%x[0,0]/%x[1,0]/%x[2,0]
    • U08:%x[-1,0]/%x[0,0]
    • U09:%x[0,0]/%x[1,0]

    • U10:%x[-2,1]/%x[0,1]

    • U11:%x[-2,1]/%x[1,1]
    • U11:%x[-1,1]/%x[0,1]
    • U12:%x[0,0]/%x[0,1]
    • U13:%x[0,1]/%x[1,1]
    • U14:%x[0,1]/%x[2,1]
    • U15:%x[-1,0]/%x[0,1]
    • U16:%x[-1,0]/%x[-1,1]
    • U17:%x[1,0]/%x[1,1]
    • U18:%x[1,0]/%x[1,1]
    • U19:%x[2,0]/%x[2,1]

    • U20:%x[-1,2]

    • U21:%x[-2,2]
    • U22:%x[0,1]/%x[-1,2]
    • U23:%x[0,1]/%x[-2,2]
    • U24:%x[0,0]/%x[-1,2]
    • U25:%x[0,0]/%x[-2,2]
    • U26:%x[-1,2]/%x[-2,2]/%x[0,1]
    • U27:%x[-2,2]/%x[0,1]/%x[1,1]
    • U28:%x[-1,1]/%x[-1,2]/%x[0,1]
    • U29:%x[-1,2]/%x[0,0]/%x[0,1]
  7. Boson_train
    • 浙江 ns B_product_name
    • 在线 b I_product_name
    • 杭州 ns I_product_name
    • 4 m B_time
    • 月 m I_time
    • 25 m I_time
    • 日 m I_time
    • 讯 ng Out
    • ( x Out
    • 记者 n Out
    • x Out
    • x B_person_name
    • 施宇翔 nr I_person_name
    • x Out
    • 通讯员 n B_person_name
    • x Out
    • 方英 nr B_person_name
    • ) x Out
    • 毒贩 n Out
    • 很 zg Out
    • “ x Out
    • 时髦 nr Out
    • ” x Out
    • , x Out
    • 用 p Out
    • 微信 vn B_product_name
    • 交易 n Out
    • 毒品 n Out
    • 。 x Out
    • 没 v Out
    • 料想 v Out
    • 警方 n B_person_name
    • 也 d Out

Tags: linux nlp crf++
1 answer
你好瞎i
#2 · 2019-09-14 17:20

You were debugging in the right direction. The issue is indeed with your template file.

Your training data has 3 columns (column 0: word, column 1: POS tag, column 2: tag).
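If you want to confirm the column layout before training, here is a minimal Python sketch. It is not part of CRF++; `column_counts` and the `sample` string are hypothetical names made up for illustration. It assumes columns are separated by spaces or tabs and that blank lines separate sentences, and it counts how many token lines have each number of columns. CRF++ expects the same number of columns on every token line, so a healthy file should report a single count.

```python
from collections import Counter


def column_counts(data_text):
    """Count token lines by their number of whitespace-separated columns."""
    counts = Counter()
    for line in data_text.splitlines():
        if line.strip():  # skip blank lines, which separate sentences
            counts[len(line.split())] += 1
    return counts


# Hypothetical sample in the same layout as Boson_train: word, POS tag, NER tag.
sample = "浙江 ns B_product_name\n在线 b I_product_name\n\n4 m B_time\n"
print(column_counts(sample))
```

To check a real file, read it with `open(path, encoding="utf-8").read()` and pass the text in. Any line that reports fewer than 3 columns here would be worth inspecting.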

You cannot use the tag as a feature, but your template file references it (i.e., column 2) in many feature definitions (see U20 to U29). Training should work after you remove or correct these.
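To find the offending lines quickly, a small Python sketch like the one below can scan a template for `%x[row,col]` macros whose column index points at or past the label column. This is only an illustration, not a CRF++ tool: `find_bad_macros` and the shortened `template` string are hypothetical, and the check assumes the last column of the training file is the label, so a 3-column file has 2 usable feature columns (0 and 1).

```python
import re

# Matches CRF++ template macros of the form %x[row,col].
MACRO = re.compile(r"%x\[\s*(-?\d+)\s*,\s*(\d+)\s*\]")


def find_bad_macros(template_text, n_feature_columns):
    """Return (line_no, line, col) for each macro whose col is not a feature column."""
    bad = []
    for line_no, line in enumerate(template_text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        for _row, col in MACRO.findall(stripped):
            if int(col) >= n_feature_columns:
                bad.append((line_no, stripped, int(col)))
    return bad


# A few lines taken from the template in the question.
template = """\
U02:%x[0,0]
U12:%x[0,0]/%x[0,1]
U20:%x[-1,2]
U22:%x[0,1]/%x[-1,2]
"""

# 3 columns in the training file, the last one is the label: 2 feature columns.
for line_no, line, col in find_bad_macros(template, n_feature_columns=3 - 1):
    print(f"line {line_no}: {line} references column {col}")
```

On this sample the check flags the U20 and U22 lines, because both reference column 2. The lines that use only columns 0 and 1 pass.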

Hope this helps :)

You can also check out these video tutorials for a better understanding of template files and training NER with CRF++:

1) https://youtu.be/GJHeTvDkIaE

2) https://youtu.be/Ur5umC4BwN4
