Learning NER using category list

Published 2019-09-14 19:19

Question:

In the template for training CRF++, how can I include a custom dictionary.txt file for listed companies, another for popular European foods, for example, or for just about any other category?

I would then provide sample training data for each category, so that the model learns how those specific named entities are used in context for that category.
This way, both I and the system can be sure it has correctly understood how certain named entities are structured in text, whether a tweet or a Pulitzer Prize-winning news article, without my having to provide hundreds of megabytes of data.

This would be rather cool. The model would have a fixed dictionary of known entities (which does not need to be expanded) plus a statistical picture of how those known entities appear in human text.

PS: Just for clarity, I'm not after a regex-based NER. Those are only cool if you have lots of dictionary entries, lots of rules, and lots of dull time.

Answer 1:

I think what you are describing is a gazetteer list (your dictionary.txt).

You would have to add a corresponding feature for each word in the training data and then reference it in the template file.

For example, suppose your list contains the entity Hershey's and the training data has the sentence: I like Hershey's chocolates.

So when you arrange the data in CoNLL format (for CRF++), you can add a column with values 0 or 1 indicating whether the word is present in the dictionary. It will be 0 for every word except Hershey's. You also have to include this column as a feature in the template file.
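The column described above could be generated with a few lines of Python. This is only a rough sketch: the function name `to_crfpp_rows`, the tokenization, and the `B-ORG`/`O` labels are my own assumptions for illustration, and the lookup is an exact, case-sensitive match on single tokens (multi-word dictionary entries would need phrase matching).

```python
def to_crfpp_rows(tagged_tokens, gazetteer):
    """Return one CRF++ sentence block: token, gazetteer flag, label per line."""
    lines = []
    for token, label in tagged_tokens:
        # 1 if the token appears in the dictionary, otherwise 0.
        in_gazetteer = "1" if token in gazetteer else "0"
        lines.append("\t".join([token, in_gazetteer, label]))
    # CRF++ expects an empty line between sentences.
    return "\n".join(lines) + "\n\n"


# Hypothetical one-entry gazetteer; in practice, load one entry per line
# from dictionary.txt into a set.
gazetteer = {"Hershey's"}

# Hypothetical tokenization and labels for the example sentence.
sentence = [
    ("I", "O"),
    ("like", "O"),
    ("Hershey's", "B-ORG"),
    ("chocolates", "O"),
    (".", "O"),
]

print(to_crfpp_rows(sentence, gazetteer), end="")
```

The label stays in the last column, which is where CRF++ expects it; the new flag sits between the token and the label.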

To get a better understanding of the template file and NER training with CRF++, you can watch the videos below and comment with any questions :)

1) https://youtu.be/GJHeTvDkIaE

2) https://youtu.be/Ur5umC4BwN4

EDIT: (after viewing the OP's comment)

Sample training data with extra features: https://pastebin.com/fBgu8c67

I've added 3 features. The IsCountry feature value (1 or 0) can be obtained from a gazetteer list of countries. The other 2 features can be computed offline. Note that the headers are added to the file for reference only and should not be included in the training data file.

Sample template file for the above data: https://pastebin.com/LPvAGCVL
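For readers who cannot open the pastebin, here is a rough idea of what template lines for extra columns look like. This is not the template from the link; it is a sketch that assumes the token is in column 0 and the feature columns are 1 to 3 (the real column order may differ). The helper name `unigram_templates` and the one-word context window are my own choices. The `%x[row,col]` macro and the `U`/`B` prefixes are standard CRF++ template syntax.

```python
def unigram_templates(columns, window=(-1, 0, 1)):
    """Build CRF++ unigram template lines for the given column indices."""
    lines = []
    feature_id = 0
    for col in columns:
        for offset in window:
            # %x[row,col]: row is relative to the current token, col is absolute.
            lines.append(f"U{feature_id:02d}:%x[{offset},{col}]")
            feature_id += 1
    # A bare "B" line adds bigram features over adjacent output labels.
    lines.append("B")
    return "\n".join(lines) + "\n"


# Assumed layout: token in column 0, three extra feature columns in 1..3.
print(unigram_templates([0, 1, 2, 3]), end="")
```

Each feature column gets its own lines, so the model can use, for instance, the gazetteer flag of the previous, current, and next word.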

Note that the test data should be in the same format as the training data, with the same features and the same number of columns.
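A quick sanity check for this can be scripted before training. The sketch below is only an illustration: the function names and the inline sample strings are hypothetical, and it assumes whitespace-separated columns with blank lines between sentences.

```python
def column_counts(crfpp_text):
    """Return the set of column counts seen on non-empty lines."""
    counts = set()
    for line in crfpp_text.splitlines():
        if line.strip():  # blank lines only separate sentences
            counts.add(len(line.split()))
    return counts


def same_layout(train_text, test_text):
    """True if both files use one consistent, identical column count."""
    train_counts = column_counts(train_text)
    test_counts = column_counts(test_text)
    return len(train_counts) == 1 and train_counts == test_counts


# Hypothetical data: token, gazetteer flag, label.
train = "I\t0\tO\nlike\t0\tO\nHershey's\t1\tB-ORG\n\n"
good_test = "We\t0\tO\nate\t0\tO\n\n"
bad_test = "We\tO\nate\tO\n\n"  # gazetteer column missing

print(same_layout(train, good_test))
print(same_layout(train, bad_test))
```

In practice you would read the two files from disk and pass their contents to `same_layout`.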