What does NER model to find person names inside a

Posted 2019-03-19 08:29

I have just started with Stanford CoreNLP, and I would like to build a custom NER model to find persons.

Unfortunately, I did not find a good NER model for Italian. I need to find these entities inside a resume/CV document.

The problem here is that documents like these can have different structures. For example, I can have:

CASE 1

- Name: John

- Surname: Travolta

- Last name: Travolta

- Full name: John Travolta

(so many labels can represent the person entity I need to extract)

CASE 2

My name is John Travolta and I was born ...

Basically, I can have structured data (with different labels) or free text from which I have to extract these entities using the context.

What is the best approach for this kind of document? Can a MaxEnt model work in this case?


EDIT @vihari-piratla

At the moment, my strategy is to look for a pattern that has something on the left and something on the right; with this method I find the entity about 80-85% of the time.

Example:

Name: John
Birthdate: 2000-01-01

This means that I have "Name:" on the left of the pattern and a \n on the right (the match runs until it finds the \n). I can create a very long list of patterns like these. I thought about patterns because I do not need names that appear in "other" contexts.

For example, if the user writes other names inside a job experience, I do not need them, because I am looking for the personal name, not others. With this method I can reduce false positives because I look at specific patterns, not "general names".

A problem with this method is that I have a big list of patterns (1 pattern = 1 regex), so it does not scale well when I add more.
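One way to keep the pattern list manageable is to store the labels as data and compile them into a single regex. The following is only a minimal sketch under that assumption: the label list (including the Italian labels "Nome" and "Cognome") and the function name are made up for illustration.

```python
import re

# Hypothetical label list; extend it with the labels found in real CVs.
NAME_LABELS = ["Name", "Surname", "Last name", "Full name", "Nome", "Cognome"]

# One alternation instead of one regex per label. Longer labels are tried first.
_label_alt = "|".join(
    re.escape(label) for label in sorted(NAME_LABELS, key=len, reverse=True)
)
FIELD_RE = re.compile(
    rf"^\s*[-•*]?\s*(?P<label>{_label_alt})\s*:\s*(?P<value>[^\n]+?)\s*$",
    re.IGNORECASE | re.MULTILINE,
)

def extract_name_fields(text):
    """Return (label, value) pairs for every 'Label: value' line in text."""
    return [(m.group("label"), m.group("value")) for m in FIELD_RE.finditer(text)]

sample = "- Name: John\n- Surname: Travolta\nBirthdate: 2000-01-01\n"
print(extract_name_fields(sample))
```

Because the pattern is anchored to the start of a line and to a known label, names mentioned inside a job experience paragraph are not picked up, and adding a new label means adding one string rather than one more regex.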

If I could train a NER model on all those patterns it would be awesome, but I would need tons of documents to train it well.

4 answers
姐就是有狂的资本
#2 · 2019-03-19 08:56

You can use Stanford NLP. For example, here is some Python code that uses the NLTK and Stanford NLP libraries:

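import re
from nltk.corpus import stopwords
from nltk.tag import StanfordNERTagger, StanfordPOSTagger

doc_text = "your input string goes here"

# split the text on runs of non-word characters
words = re.split(r"\W+", doc_text)

stops = set(stopwords.words("english"))

# remove stop words and very short tokens from the list
words = [w for w in words if w.lower() not in stops and len(w) > 2]

text = " ".join(words)
print(text)

# assumes the Stanford jars and model files can be located,
# e.g. through the CLASSPATH and STANFORD_MODELS environment variables
stn = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')
stp = StanfordPOSTagger('english-bidirectional-distsim.tagger')

# keep only the tokens that the POS tagger labels as proper nouns (NNP)
proper_nouns = [word for word, pos in stp.tag(text.split()) if pos == 'NNP']

print("Stanford POS Tagged")
print(proper_nouns)

# run the NER tagger on the proper nouns only
tagged = stn.tag(proper_nouns)
print(tagged)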

This should give you all proper nouns in the input string, together with the NER tag assigned to each of them.

smile是对你的礼貌
#3 · 2019-03-19 09:13

The traditional (and probably best) approach for Case 1 is to write document segmentation code, whereas Case 2 is what most systems are designed for. You can search Google Scholar for "document segmentation" to get some ideas for the "best" approach. The most commonly implemented (and easiest) option is to simply use regular expressions, which can be highly effective if the document structure is consistent. Other approaches are more complex, but they are usually needed when there is more diversity in document structure.

Your NER Pipeline at a minimum will need:

  1. Pre-processing / text tokenization. Start with just a few simple tokenization rules
  2. Document segmentation (colons, dashes, spotting headers, any forms, etc.). I would start with regular expressions for this.
  3. POS tagging (preferably using something off the shelf like TreeTagger that has worked with Italian)
  4. NER, a MaxEnt model will work, some important features for this would be capitalization, POS tags and probably dictionary features (Italian phonebook?). You will need some labelled data.
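
The features mentioned in step 4 could be computed roughly as follows. This is only a sketch: the feature names, the hand-written POS tags and the tiny name dictionary are assumptions for illustration, and the resulting dictionaries would still have to be paired with labels and passed to whatever MaxEnt (logistic regression) trainer you choose.

```python
def token_features(tokens, pos_tags, i, name_dictionary):
    """Build a feature dict for the token at position i."""
    word = tokens[i]
    prev_word = tokens[i - 1] if i > 0 else "<START>"
    next_word = tokens[i + 1] if i < len(tokens) - 1 else "<END>"
    return {
        "word.lower": word.lower(),
        "is_capitalized": word[:1].isupper(),          # capitalization feature
        "is_all_caps": word.isupper(),
        "pos": pos_tags[i],                            # POS tag from step 3
        "in_name_dictionary": word.lower() in name_dictionary,  # dictionary feature
        "prev_word.lower": prev_word.lower(),          # left context
        "next_word.lower": next_word.lower(),          # right context
    }

tokens = ["My", "name", "is", "John", "Travolta"]
pos_tags = ["PRP$", "NN", "VBZ", "NNP", "NNP"]  # hand-written tags for illustration
names = {"john", "giovanni", "maria"}           # hypothetical first-name dictionary
features = [token_features(tokens, pos_tags, i, names) for i in range(len(tokens))]
print(features[3])
```

The context features are what let the model learn cues such as "name is" on the left, which is the kind of evidence Case 2 relies on.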
闹够了就滚
#4 · 2019-03-19 09:19

The first case could be trivial, and I agree with Ozborn's suggestion.

I would like to make a few suggestions for case-2.
Stanford NLP provides an excellent English name recognizer, but it may not find all the person names. OpenNLP also gives decent performance, though noticeably lower than Stanford's. There are many other entity recognizers available for English. I will focus on Stanford NLP here; these are a few things to consider.

  1. Gazettes. You can provide the model with a list of names and also customize how the gazette entries are matched. Stanford also provides a sloppy match option which, when set, allows partial matches with the gazette entries. Partial matches should work well for person names.
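
To make the difference between whole-entry and partial matching concrete, here is a plain-Python illustration. It does not call Stanford NER, and the function name is hypothetical. In Stanford NER itself these behaviours are, to my knowledge, switched on through the `useGazettes`, `gazette`, `cleanGazette` and `sloppyGazette` properties; please verify the names against the NERFeatureFactory documentation.

```python
def gazette_match(tokens, gazette, sloppy=False):
    """Return one boolean per token: True if the token is covered by a gazette hit."""
    lowered = [t.lower() for t in tokens]
    if sloppy:
        # Sloppy: a token matches if it appears in any gazette entry.
        vocabulary = {w for entry in gazette for w in entry.lower().split()}
        return [t in vocabulary for t in lowered]
    # Clean: the whole entry must appear as a contiguous token sequence.
    hits = [False] * len(tokens)
    for entry in gazette:
        words = entry.lower().split()
        n = len(words)
        for start in range(len(lowered) - n + 1):
            if lowered[start:start + n] == words:
                for k in range(start, start + n):
                    hits[k] = True
    return hits

gazette = ["John Travolta", "Maria Rossi"]   # hypothetical gazette entries
tokens = "Reference : Mr Travolta".split()
print(gazette_match(tokens, gazette))               # whole-entry matching
print(gazette_match(tokens, gazette, sloppy=True))  # partial matching
```

With whole-entry matching the lone surname is missed, while partial matching flags it, which is why the sloppy option suits person names that appear in shortened form.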

  2. Stanford recognizes entities constructively. If a name like "John Travolta" is recognized in a document, then "Travolta" will also be picked up in the same document, even if the model had no prior idea about "Travolta". So, append as much information to the document as possible. If "John Travolta" is recognized by the rules employed in case-1, add it to the document in a familiar context such as "My name is John Travolta." Adding dummy sentences like this can improve recall.
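
The dummy-sentence trick can be sketched in a few lines. The function name and the default template are assumptions for illustration; for Italian documents you would presumably use an Italian template.

```python
def augment_with_known_names(document, known_names, template="My name is {name}."):
    """Append one dummy sentence per distinct name found by the case-1 rules."""
    seen = set()
    sentences = []
    for name in known_names:
        cleaned = name.strip()
        key = cleaned.lower()
        if key and key not in seen:   # skip empty strings and duplicates
            seen.add(key)
            sentences.append(template.format(name=cleaned))
    if not sentences:
        return document
    return document.rstrip() + "\n\n" + "\n".join(sentences) + "\n"

doc = "Full name: John Travolta\nBirthdate: 2000-01-01"
print(augment_with_known_names(doc, ["John Travolta", "john travolta"]))
```

The augmented text is what you would pass to the NER tagger; the appended sentences only serve as context and should be ignored when reading the entities back.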

Building a benchmark for training is a very costly and boring process; you would have to annotate on the order of tens of thousands of sentences for decent test performance. I am sure that even if you have a model trained on annotated training data, the performance won't be any better than when you have the two steps above implemented.

@edit

Since the asker of this question is interested in unsupervised pattern-based approaches, I am expanding my answer to discuss these.

When supervised data is not available, a method called bootstrapped pattern learning is generally used. The algorithm starts with a small set of seed instances of interest (like a list of books) and outputs more instances of the same type.
Refer to the following resources for more information:

  • SPIED is software that uses the above-described technique and is available for download and use.
  • Sonal Gupta received her Ph.D. on this topic; her dissertation is available here.
  • For a light introduction on this topic, see these slides.
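
To give a feel for the seed, pattern, instance loop, here is a toy sketch. It is not SPIED's algorithm (which scores patterns and candidate instances); the function name, the "prefix before the value" notion of a pattern and the threshold are all simplifying assumptions.

```python
from collections import Counter

def bootstrap(lines, seeds, rounds=2, min_pattern_count=2):
    """Grow a set of names from seeds by alternating pattern and instance discovery."""
    known = set(seeds)
    for _ in range(rounds):
        # 1. Learn patterns: count the text that precedes a known instance
        #    when that instance ends a line.
        prefixes = Counter()
        for line in lines:
            for name in known:
                if line.endswith(name) and len(line) > len(name):
                    prefixes[line[: -len(name)]] += 1
        patterns = [p for p, c in prefixes.items() if c >= min_pattern_count]
        # 2. Apply patterns: whatever follows a learned prefix is a candidate.
        new = set()
        for line in lines:
            for p in patterns:
                if line.startswith(p) and len(line) > len(p):
                    new.add(line[len(p):])
        if new <= known:      # nothing new was found, so stop early
            break
        known |= new
    return known

lines = ["Name: John", "Name: Maria", "Name: Luca", "Birthdate: 2000-01-01"]
print(sorted(bootstrap(lines, {"John", "Maria"})))
```

Here the two seeds are enough to learn the prefix "Name: ", which then yields "Luca" without any annotated data. A real system needs pattern scoring to stop noisy patterns from drifting away from person names.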

Thanks

我欲成王,谁敢阻挡
#5 · 2019-03-19 09:19

If it is a resume/CV-type document you are talking about, then the best bet is to build a corpus, or to start with a reduced "accuracy" expectation and build the corpus dynamically by teaching the system as users use it, whether with OpenNLP, StanfordNLP or any other tool. Within the limits of my experience, NER systems are not really mature enough for resume/CV-type documents, even for English.
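
Building the corpus dynamically could look like the sketch below: each time a user confirms or corrects the extracted name, the sentence is turned into labelled rows. It assumes a two-column, tab-separated token/label layout (the kind of file Stanford NER is usually trained on); the function name and the PERSON/O labels are choices made for this illustration.

```python
def to_training_rows(tokens, confirmed_name, label="PERSON", outside="O"):
    """Label every occurrence of the user-confirmed name in a tokenized sentence."""
    name_tokens = confirmed_name.split()
    n = len(name_tokens)
    labels = [outside] * len(tokens)
    if n:
        for start in range(len(tokens) - n + 1):
            if tokens[start:start + n] == name_tokens:
                labels[start:start + n] = [label] * n
    # One "token<TAB>label" row per token.
    return "\n".join(f"{tok}\t{lab}" for tok, lab in zip(tokens, labels)) + "\n"

sentence = ["My", "name", "is", "John", "Travolta"]
print(to_training_rows(sentence, "John Travolta"))
```

Appending these rows to a file after each confirmed document gives you a growing training set without a separate annotation effort.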
