How do I use IOB tags with Stanford NER?

Posted 2019-01-22 10:14

There seem to be a few different settings:

iobtags
iobTags
entitySubclassification (IOB1 or IOB2?)
evaluateIOB

Which setting do I use, and how do I use it correctly?

I tried labelling like this:

1997    B-DATE
volvo   B-BRAND
wia64t  B-MODEL
highway B-TYPE
tractor I-TYPE

But on the training output, it seemed to think that B-TYPE and I-TYPE were different classes.

I am using the 2013-11-12 release.

1 answer
男人必须洒脱
#2 · 2019-01-22 10:54

How this can be done is currently (2013 releases) a bit of a mess, since there are two different sets of flags for two different DocumentReaderAndWriter implementations. Sorry.

The most flexible support for different IOB styles is found in CoNLLDocumentReaderAndWriter. You can have it map any IOB/IOE/... annotation done by hyphenated prefixes like your examples (B-BRAND) to any other while it is reading files with the flag:

-entitySubclassification IOB2

The resulting label set is then used for training and classification. The options are documented in the entitySubclassify() method of CoNLLDocumentReaderAndWriter: IOB1, IOB2, IOE1, IOE2, SBIEO, IO. You can find a discussion of IOB1 vs. IOB2 in Tjong Kim Sang and Veenstra 1999. By default the representation is mapped back to IOB1 on output, since that is the default used in the CoNLL conlleval program, but you can keep it as what you mapped it to with the flag:

-retainEntitySubclassification
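To make the IOB1 vs. IOB2 distinction concrete, here is a minimal Python sketch of the two conversions. It is an illustration only, not the code in entitySubclassify(); the function names are hypothetical, and it assumes every label is either "O" or carries a hyphenated prefix.

```python
def split_label(label):
    """Split 'B-TYPE' into ('B', 'TYPE'); 'O' becomes ('O', None)."""
    if label == "O":
        return "O", None
    prefix, _, entity = label.partition("-")
    return prefix, entity


def iob1_to_iob2(labels):
    """IOB2: every entity starts with B-, whatever precedes it."""
    converted = []
    previous_entity = None
    for label in labels:
        prefix, entity = split_label(label)
        if prefix == "O":
            converted.append("O")
        elif prefix == "B" or entity != previous_entity:
            converted.append("B-" + entity)
        else:
            converted.append("I-" + entity)
        previous_entity = entity
    return converted


def iob2_to_iob1(labels):
    """IOB1: B- is kept only where an entity directly follows
    another entity of the same type; all other tokens get I-."""
    converted = []
    previous_entity = None
    for label in labels:
        prefix, entity = split_label(label)
        if prefix == "O":
            converted.append("O")
        elif prefix == "B" and entity == previous_entity:
            converted.append("B-" + entity)
        else:
            converted.append("I-" + entity)
        previous_entity = entity
    return converted


iob1 = ["I-DATE", "I-BRAND", "I-MODEL", "I-TYPE", "I-TYPE", "O"]
print(iob1_to_iob2(iob1))
```

Under IOB1 the example from the question would be written almost entirely with I- labels, because no two entities of the same type are adjacent; IOB2 is the scheme that matches the B-/I- labelling shown in the question.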

To use this DocumentReaderAndWriter, you can give a training command like:

java8 -mx6g edu.stanford.nlp.ie.crf.CRFClassifier -prop conll.crf.chris2009.prop -readerAndWriter edu.stanford.nlp.sequences.CoNLLDocumentReaderAndWriter -entitySubclassification iob2

Alternatively, ColumnDocumentReaderAndWriter is the default DocumentReaderAndWriter which we use in the distributed models. The options you get with it are different and slightly more limited. You have these two flags:

  • -mergeTags will take either plain ("BRAND") or CoNLL-like ("I-BRAND") labels and map them down to a prefix-less IO label ("BRAND") and use that for training and classifying.
  • -iobTags will take either plain ("BRAND") or CoNLL-like ("I-BRAND") labels and map them to IOB2.
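The effect of the two flags, as described above, can be sketched in Python. This is a rough illustration under assumptions, not the ColumnDocumentReaderAndWriter code: the function names are hypothetical, "O" is assumed to be the background label, and a run of identical plain labels is assumed to form a single entity.

```python
def strip_prefix(label):
    """'I-BRAND' -> 'BRAND'; plain labels and 'O' pass through unchanged."""
    if len(label) > 2 and label[1] == "-" and label[0] in "BIES":
        return label[2:]
    return label


def merge_tags(labels):
    """Roughly what -mergeTags is described as doing: prefix-less IO labels."""
    return [strip_prefix(label) for label in labels]


def iob_tags(labels, background="O"):
    """Roughly what -iobTags is described as doing: map to IOB2."""
    result = []
    previous = background
    for label in labels:
        entity = strip_prefix(label)
        if entity == background:
            result.append(background)
        elif entity != previous or label.startswith("B-"):
            # A new entity begins: the type changed or B- was explicit.
            result.append("B-" + entity)
        else:
            result.append("I-" + entity)
        previous = entity
    return result


print(merge_tags(["B-TYPE", "I-TYPE", "O", "B-BRAND"]))
print(iob_tags(["TYPE", "TYPE", "O", "BRAND"]))
```

One consequence of the IO mapping is worth noting: once prefixes are dropped, two adjacent entities of the same type can no longer be told apart from one longer entity.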

In a sequence model, for any of the labeling schemes like IOB2, the labels are different classes. That is how these labeling schemes work. The special interpretation of "I-", "B-", etc. is left to the human observer and entity-level evaluation software. The included evaluation software will work with IOB1, IOB2, or prefixless IO encoding only.
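To illustrate that last point with the example from the question, here is a small Python sketch of the kind of grouping an entity-level evaluator performs on IOB2 output. It is not the evaluation code shipped with Stanford NER; extract_entities is a hypothetical helper written for this illustration.

```python
def extract_entities(tokens, labels):
    """Group IOB2-labelled tokens into (entity_type, text) spans."""
    entities = []
    current_type = None
    current_tokens = []
    for token, label in zip(tokens, labels):
        prefix, _, entity = label.partition("-")
        # B- always opens a new span; a stray I- of another type does too.
        starts_new = prefix == "B" or (prefix == "I" and entity != current_type)
        if label == "O" or starts_new:
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
        if label != "O":
            current_type = entity
            current_tokens.append(token)
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


tokens = ["1997", "volvo", "wia64t", "highway", "tractor"]
labels = ["B-DATE", "B-BRAND", "B-MODEL", "B-TYPE", "I-TYPE"]

# The sequence model sees five distinct classes here...
print(sorted(set(labels)))
# ...while entity-level grouping yields four entities, one of them two tokens long.
print(extract_entities(tokens, labels))
```

So seeing B-TYPE and I-TYPE listed as separate classes in the training output is expected; they are only merged into a single TYPE entity when spans are read off the label sequence.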
