Training n-gram NER with Stanford NLP

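Recently I have been trying to train n-gram entities with Stanford Core NLP. I followed this tutorial: http://nlp.stanford.edu/software/crf-faq.shtml#b

With this, I can only specify unigram tokens and the class each one belongs to. Can anyone guide me so that I can extend it to n-grams? I am trying to extract known entities, such as movie names, from a chat data set.

Please also tell me if I have misinterpreted the Stanford tutorial and the same approach can in fact be used for n-gram training.

What I am stuck on is the following property: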

#structure of your training file; this tells the classifier
#that the word is in column 0 and the correct answer is in
#column 1
map = word=0,answer=1

Here, the first column is the word (a unigram) and the second column is the entity, for example:

CHAPTER O
I   O
Emma    PERS
Woodhouse   PERS

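Now, I need to train known entities (say, movie names) such as Hulk, Titanic, etc. as movies, and that is easy with this approach. But if I need to train I know what you did last summer or Baby's day out, what is the best approach?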

Answer 1:

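It has been a long wait for an answer here. I have not been able to figure out how to get this done with Stanford Core NLP, but the task is accomplished: I used the LingPipe NLP library for the same purpose. I am posting the answer here because I think someone else might benefit from it.

Please check the LingPipe licensing before diving into an implementation, whether you are a developer, a researcher, or anything else.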

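LingPipe provides various NER methods:

1) Dictionary-based NER

2) Statistical NER (HMM-based)

3) Rule-based NER, etc.

I have used the dictionary-based as well as the statistical approach.

The first is a direct lookup method, and the second is training-based.

An example of dictionary-based NER can be found here.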
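To make the "direct lookup" idea concrete, here is a minimal sketch in plain Java. It is not the LingPipe API; the class name DictionaryLookupSketch and the method findEntities are made up for illustration. It assumes whitespace tokenization, lower-cased dictionary keys, and no punctuation attached to the words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of LingPipe: finds multi-word dictionary entries in a text.
class DictionaryLookupSketch {

    /** Returns "matched span >> TYPE" for every dictionary phrase found in the text. */
    static List<String> findEntities(String text, Map<String, String> dictionary) {
        List<String> hits = new ArrayList<>();
        String[] tokens = text.trim().split("\\s+");
        for (int start = 0; start < tokens.length; start++) {
            // Try the longest span first so a full title wins over a shorter one inside it.
            for (int end = tokens.length; end > start; end--) {
                String span = String.join(" ", Arrays.copyOfRange(tokens, start, end));
                String type = dictionary.get(span.toLowerCase(Locale.ROOT));
                if (type != null) {
                    hits.add(span + " >> " + type);
                    start = end - 1; // continue scanning after the matched span
                    break;
                }
            }
        }
        return hits;
    }
}
```

Because whole token spans are looked up, a title such as I know what you did last summer is matched as one entity rather than word by word.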

The statistical approach requires a training file. I used a file in the following format:

<root>
<s> data line with the <ENAMEX TYPE="myentity">entity1</ENAMEX>  to be trained</s>
...
<s> with the <ENAMEX TYPE="myentity">entity2</ENAMEX>  annotated </s>
</root>
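If the entity list is already known, lines in this format can be generated rather than written by hand. The following is only a sketch under that assumption: the class EnamexAnnotator and its methods are hypothetical names, it tags just the first occurrence of the phrase, and it uses exact, case-sensitive matching.

```java
// Hypothetical helper for producing <s>...</s> training lines with an ENAMEX tag.
class EnamexAnnotator {

    /** Escapes the characters that would otherwise break the XML training file. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Wraps the first occurrence of phrase in an ENAMEX tag and the sentence in <s>. */
    static String annotate(String sentence, String phrase, String type) {
        int at = sentence.indexOf(phrase);
        if (at < 0) {
            // No entity in this sentence: still emit it as an unannotated line.
            return "<s>" + escape(sentence) + "</s>";
        }
        String before = sentence.substring(0, at);
        String after = sentence.substring(at + phrase.length());
        return "<s>" + escape(before)
                + "<ENAMEX TYPE=\"" + type + "\">" + escape(phrase) + "</ENAMEX>"
                + escape(after) + "</s>";
    }
}
```

The whole multi-word title sits inside a single ENAMEX element, which is how an n-gram entity is expressed in this format.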

Then I used the following code to train the entities.

import java.io.File;
import java.io.IOException;

import com.aliasi.chunk.CharLmHmmChunker;
import com.aliasi.corpus.parsers.Muc6ChunkParser;
import com.aliasi.hmm.HmmCharLmEstimator;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.TokenizerFactory;
import com.aliasi.util.AbstractExternalizable;

@SuppressWarnings("deprecation")
public class TrainEntities {

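    // Note: these settings configure the character-level language models inside
    // the HMM; MAX_N_GRAM is a character n-gram length, not a word n-gram length.
    static final int MAX_N_GRAM = 50;
    static final int NUM_CHARS = 300;
    static final double LM_INTERPOLATION = MAX_N_GRAM; // default behavior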

    public static void main(String[] args) throws IOException {
        File corpusFile = new File("inputfile.txt");// my annotated file
        File modelFile = new File("outputmodelfile.model"); 

        System.out.println("Setting up Chunker Estimator");
        TokenizerFactory factory
            = IndoEuropeanTokenizerFactory.INSTANCE;
        HmmCharLmEstimator hmmEstimator
            = new HmmCharLmEstimator(MAX_N_GRAM,NUM_CHARS,LM_INTERPOLATION);
        CharLmHmmChunker chunkerEstimator
            = new CharLmHmmChunker(factory,hmmEstimator);

        System.out.println("Setting up Data Parser");
        Muc6ChunkParser parser = new Muc6ChunkParser();  
        parser.setHandler( chunkerEstimator);

        System.out.println("Training with Data from File=" + corpusFile);
        parser.parse(corpusFile);

        System.out.println("Compiling and Writing Model to File=" + modelFile);
        AbstractExternalizable.compileTo(chunkerEstimator,modelFile);
    }

}

To test the NER, I used the following class:

import java.io.File;
import java.util.Set;

import com.aliasi.chunk.Chunk;
import com.aliasi.chunk.Chunker;
import com.aliasi.chunk.Chunking;
import com.aliasi.util.AbstractExternalizable;

public class Recognition {
    public static void main(String[] args) throws Exception {
        File modelFile = new File("outputmodelfile.model");
        Chunker chunker = (Chunker) AbstractExternalizable
                .readObject(modelFile);
        String testString = "my test string";
        Chunking chunking = chunker.chunk(testString);
        Set<Chunk> test = chunking.chunkSet();
        for (Chunk c : test) {
            // Print the input, the matched span, and its entity type
            System.out.println(testString + " : "
                    + testString.substring(c.start(), c.end()) + " >> "
                    + c.type());
        }
    }
}

Code courtesy: Google :)

Answer 2:

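The answer is basically given in the example you quoted, where "Emma Woodhouse" is a single name. The default models we supply use IO encoding and assume that adjacent tokens of the same class are part of the same entity. In many circumstances this is almost always true, and it keeps the models simpler. However, if you don't want to do that, you can train NER models with other label encodings, such as the commonly used IOB encoding, where you would instead label things: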

Emma    B-PERSON
Woodhouse    I-PERSON

Then adjacent tokens of the same category that are not part of the same entity can be represented.
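The difference between the two encodings shows up when labels are turned back into entities. Below is a small plain-Java sketch of that decoding step. It is not Stanford NLP code; LabelDecoder and its methods are hypothetical names, and the labels are assumed to be well formed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical decoder that groups labelled tokens into entities.
class LabelDecoder {

    /** IO decoding: a run of adjacent tokens with the same non-O label is one entity. */
    static List<String> decodeIo(String[] tokens, String[] labels) {
        List<String> entities = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String currentType = "O";
        for (int i = 0; i < tokens.length; i++) {
            if (!labels[i].equals(currentType)) {
                flush(entities, current, currentType);
                currentType = labels[i];
            }
            append(current, tokens[i], labels[i]);
        }
        flush(entities, current, currentType);
        return entities;
    }

    /** IOB decoding: B-X always starts a new entity, I-X continues the current one. */
    static List<String> decodeIob(String[] tokens, String[] labels) {
        List<String> entities = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String currentType = "O";
        for (int i = 0; i < tokens.length; i++) {
            String label = labels[i];
            boolean continues = label.startsWith("I-")
                    && label.substring(2).equals(currentType);
            if (!continues) {
                flush(entities, current, currentType);
                currentType = label.equals("O") ? "O" : label.substring(2);
            }
            append(current, tokens[i], label);
        }
        flush(entities, current, currentType);
        return entities;
    }

    // Adds the token to the entity being built, unless the token is outside any entity.
    private static void append(StringBuilder current, String token, String label) {
        if (!label.equals("O")) {
            if (current.length() > 0) current.append(' ');
            current.append(token);
        }
    }

    // Emits the entity built so far, if any, and resets the buffer.
    private static void flush(List<String> entities, StringBuilder current, String type) {
        if (current.length() > 0) {
            entities.add(current + " >> " + type);
            current.setLength(0);
        }
    }
}
```

With IO labels, two movie titles mentioned back to back (say Titanic followed directly by Hulk, both labelled MOVIE) collapse into one entity. With B-MOVIE on each title they stay separate, while Emma B-PERSON, Woodhouse I-PERSON still decodes to a single name.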

Answer 3:

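I faced the same challenge of tagging n-gram phrases for the automotive domain. I was looking for an efficient keyword mapping that could be used to create training files at a later stage. I ended up using regexNER in the NLP pipeline, providing a mapping file with the regular expressions (the terms that make up each n-gram) and their corresponding labels. Note that no machine-learned NER is involved in this case. Hope this information helps someone!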
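For reference, a RegexNER mapping file is tab-separated: one regular expression per token, separated by spaces, then a tab, then the label. A minimal sketch for generating such lines from a list of known phrases could look like the following. The class RegexNerMappingSketch is a hypothetical name, and the sketch assumes the phrases contain no regex metacharacters and that whitespace splitting agrees with the pipeline's tokenizer, which may not hold for words like Baby's.

```java
// Hypothetical helper that formats one line of a RegexNER mapping file.
class RegexNerMappingSketch {

    /** Builds "token1 token2 ... <TAB> LABEL" for one known phrase. */
    static String mappingLine(String phrase, String label) {
        // Normalise whitespace so each token pattern is separated by a single space.
        String[] tokens = phrase.trim().split("\\s+");
        return String.join(" ", tokens) + "\t" + label;
    }
}
```

The resulting file would then be referenced from the pipeline properties, to my understanding by adding regexner to the annotators list and pointing regexner.mapping at the file.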

Source: Training n-gram NER with Stanford NLP