I am trying to train a custom NER model for multiple entities. Here is the sample training data:
count all <START:item_type> operating tables <END> on the <START:location_id> third <END> <START:location_type> floor <END>
count all <START:item_type> items <END> on the <START:location_id> third <END> <START:location_type> floor <END>
how many <START:item_type> beds <END> are in <START:location_type> room <END> <START:location_id> 2 <END>
The NameFinderME.train(...) method takes a String parameter, type. What is this parameter used for? And how can I train a model for multiple entities (e.g. item_type, location_type, and location_id in my case)?
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.util.Collections;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.Span;
import opennlp.tools.util.featuregen.AdaptiveFeatureGenerator;

public class NerTrainingExample {

    public static void main(String[] args) throws Exception {
        String trainingDataFile = "/home/OpenNLPTest/lib/training_data.txt";
        String outputModelFile = "/tmp/model.bin";
        String sentence = "how many beds are in the hospital";
        train(trainingDataFile, outputModelFile, "location_type");
        predict(sentence, outputModelFile);
    }

    private static void train(String trainingDataFile, String outputModelFile, String tagToFind) throws Exception {
        int iterations = 100;
        int cutoff = 5;
        TokenNameFinderModel model;
        NameSampleDataStream nss = new NameSampleDataStream(
                new PlainTextByLineStream(new FileReader(new File(trainingDataFile))));
        try {
            // Does the 'type' parameter mean the entity type that I am trying to train the model for?
            // What if I need to train for multiple entities?
            model = NameFinderME.train("en", tagToFind, nss, (AdaptiveFeatureGenerator) null,
                    Collections.<String, Object>emptyMap(), iterations, cutoff);
        } finally {
            nss.close();
        }
        // Close the output stream even if serialization fails.
        try (FileOutputStream outFileStream = new FileOutputStream(new File(outputModelFile))) {
            model.serialize(outFileStream);
        }
    }

    private static void predict(String sentence, String modelFile) throws Exception {
        String[] tokens;
        try (FileInputStream modelInToken = new FileInputStream("/tmp/en-token.bin")) {
            Tokenizer tokenizer = new TokenizerME(new TokenizerModel(modelInToken));
            tokens = tokenizer.tokenize(sentence);
        }
        try (FileInputStream modelIn = new FileInputStream(modelFile)) {
            NameFinderME nameFinder = new NameFinderME(new TokenNameFinderModel(modelIn));
            Span[] nameSpans = nameFinder.find(tokens);
            double[] spanProbs = nameFinder.probs(nameSpans);
            for (int i = 0; i < nameSpans.length; i++) {
                System.out.println(nameSpans[i] + " (probability " + spanProbs[i] + ")");
            }
        }
    }
}
The type argument to NameFinderME.train is used as the default type for training data that does not include a type parameter. This is only relevant if a sample uses an untyped tag (<START> ... <END>) instead of a typed one (<START:item_type> ... <END>).
To train multiple types of entities, the developer documentation says
So you could try training on the sample from your question, which includes multiple types, and see how well it works. In this mailing list message, someone asks for the status of training for multiple types and gets this answer:
If you don't get good performance with a model that handles multiple types, the alternative would be to create multiple copies of your training data where each copy is modified to include only one type. You would then train a separate model on each set of training data. At that point you should have a (for example) item_type model, a location_type model, and a location_id model. You could then run your input through each model to detect the different types.
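Creating the per-type copies of the training data can be scripted. The following is a minimal sketch in plain Java, not part of OpenNLP: the class name SingleTypeFilter and the method keepOnly are hypothetical, and it assumes the whitespace-separated <START:type> ... <END> format shown in the question, with no nested spans. It strips the tags of every type except the one you want to keep and leaves the tokens themselves in place.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not an OpenNLP class): reduces an annotated training
// line to a single entity type so that a separate model can be trained on it.
class SingleTypeFilter {

    /** Keeps only the <START:keepType> ... <END> annotations; other tags are removed, tokens stay. */
    static String keepOnly(String line, String keepType) {
        String keepTag = "<START:" + keepType + ">";
        List<String> out = new ArrayList<>();
        boolean insideDropped = false; // true while inside a span whose tags are being removed
        for (String token : line.trim().split("\\s+")) {
            if (token.startsWith("<START") && token.endsWith(">")) {
                if (token.equals(keepTag)) {
                    out.add(token);          // opening tag of the type we keep
                } else {
                    insideDropped = true;    // drop the opening tag of any other (or untyped) span
                }
            } else if (token.equals("<END>")) {
                if (insideDropped) {
                    insideDropped = false;   // drop the closing tag that matches a dropped span
                } else {
                    out.add(token);
                }
            } else {
                out.add(token);              // ordinary token, always kept
            }
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        String line = "how many <START:item_type> beds <END> are in "
                + "<START:location_type> room <END> <START:location_id> 2 <END>";
        // prints: how many beds are in <START:location_type> room <END> 2
        System.out.println(keepOnly(line, "location_type"));
    }
}
```

You would run this over every line of the training file once per type, write each result to its own file, and train one model per file. At prediction time, run the same tokens through each model's find method and collect the spans from all of them.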