I'm using some NLP libraries now (Stanford and NLTK). I've seen the Stanford demo, but I want to ask whether it's possible to use it to identify more entity types.
Currently the Stanford NER system (as the demo shows) can recognize entities as person (name), organization, or location, but the organizations it recognizes are limited to universities and some big organizations. I'm wondering whether I can use its API to write a program that handles more entity types; for example, if my input is "Apple" or "Square", it would recognize it as a company.
Do I have to make my own training dataset?
Furthermore, if I ever want to extract entities and the relationships between them, I feel I should use the Stanford dependency parser: first extract the named entities and the other tokens tagged as nouns, then find relations between them.
Am I correct?
Thanks.
You could easily train your own model on your own corpus of data.
In the Stanford NER FAQ, the very first question explains how to train your own model for NER.
The link is http://nlp.stanford.edu/software/crf-faq.shtml
So, for example, you could give training data like:
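A hypothetical sample, where each line is a tab-separated token/label pair and a blank line separates sentences (the COMPANY label and the sentences here are just illustrations; you can use whatever label names fit your domain):

```
I	O
work	O
at	O
Apple	COMPANY
.	O

Square	COMPANY
released	O
a	O
product	O
.	O
```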
Likewise, you could build your own training data, train a model on it, and then use that model to get the desired output.
Yes, you need your own training set. The pre-trained Stanford models only recognise a word like "Stanford" as a named entity because they have been trained on data in which that word (or words that look similar according to the feature set they use; I don't know exactly what that is) was marked as a named entity.
Once you have more data, you need to put it in the right format described in this question and the Stanford tutorial.
It seems you want to train your own custom NER model.
Here is a detailed tutorial with full code:
https://dataturks.com/blog/stanford-core-nlp-ner-training-java-example.php?s=so
Training data format
Training data is passed as a text file where each line is one word-label pair. Each word in the line should be labeled in the format "word\tLABEL"; the word and the label name are separated by a tab, '\t'. For a text sentence, we break it down into words and add one line per word to the training file. To mark the start of the next sentence, we add an empty line to the training file.
Here is a sample of the input training file:
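A hypothetical example in that format (the COMPANY label and the sentence are assumptions, not part of the linked tutorial):

```
Apple	COMPANY
is	O
hiring	O
engineers	O
.	O

I	O
paid	O
with	O
Square	COMPANY
.	O
```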
Depending upon your domain, you can build such a dataset either automatically or manually. Building it manually can be really painful; tools such as an NER annotation tool can make the process much easier.
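If you already have sentences labeled in some other form, generating the training file programmatically is straightforward. Here is a minimal sketch (not from the linked tutorial; the `to_ner_training_format` helper, the COMPANY label, and the sentences are my own illustrative assumptions):

```python
def to_ner_training_format(sentences):
    """Convert labeled sentences into Stanford NER's training format.

    Each sentence is a list of (token, label) pairs. Tokens and labels
    are joined with a tab, one token per line, with a blank line
    between sentences.
    """
    blocks = []
    for sentence in sentences:
        blocks.append("\n".join(f"{token}\t{label}" for token, label in sentence))
    return "\n\n".join(blocks) + "\n"

sentences = [
    [("I", "O"), ("work", "O"), ("at", "O"), ("Apple", "COMPANY"), (".", "O")],
    [("Square", "COMPANY"), ("makes", "O"), ("payment", "O"), ("tools", "O"), (".", "O")],
]

# Write the result to the file the trainer will read.
with open("training-data.tsv", "w") as f:
    f.write(to_ner_training_format(sentences))
```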
Train model
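Training is driven by a properties file. A minimal sketch, assuming the training file and output file names from above (the exact set of feature flags you want may differ; the Stanford NER FAQ linked earlier documents the available options):

```
# ner.prop
trainFile = training-data.tsv
serializeTo = custom-ner-model.ser.gz
map = word=0,answer=1

useClassFeature = true
useWord = true
useNGrams = true
noMidNGrams = true
maxNGramLeng = 6
usePrev = true
useNext = true
useSequences = true
usePrevSequences = true
maxLeft = 1
useTypeSeqs = true
useTypeSeqs2 = true
useTypeySequences = true
wordShape = chris2useLC
useDisjunctive = true
```

You would then run something like `java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop ner.prop`, which trains the CRF and writes the serialized model to `custom-ner-model.ser.gz`.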
Use the model to generate tags:
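A sketch of how you might load the trained model from the Stanford NER API in Java (the model file name is the assumption from the training step above; this needs the Stanford NER jar on the classpath):

```java
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

public class NerTagger {
    public static void main(String[] args) throws Exception {
        // Load the serialized model produced by training.
        CRFClassifier<CoreLabel> classifier =
                CRFClassifier.getClassifier("custom-ner-model.ser.gz");

        String text = "Apple and Square announced a partnership.";

        // Prints the text with each token annotated with its predicted label.
        System.out.println(classifier.classifyToString(text));
    }
}
```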
Hope this helps.