Name Extraction - CV/Resume - Stanford NER/OpenNLP

Posted 2019-02-21 06:36

Question:

I'm currently working on a learning project to extract an individual's name from their CV/Resume.

Currently I'm working with Stanford NER and OpenNLP, both of which perform with a degree of success out of the box but tend to struggle with "non-western" names (no offence intended towards anybody).

My question is: given the general lack of sentence structure or context around an individual's name in a CV/Resume, am I likely to gain any significant improvement in name identification by creating something akin to a CV corpus?

My initial thoughts are that I'd probably have more success by sentence splitting, removing obvious text and applying a bit of logic to make a best guess at the individual's name.

I can see how training would work if a name appears within a structured sentence; however, as a standalone entity without context (Akbar Agho, for example), I suspect it will struggle regardless of the training.

Is there a level of AI that, if given enough data, would begin to formulate a pattern for finding a name, or should I maybe just go for logic-based string extraction?

I'd appreciate people's thoughts, opinions and suggestions.

Side note: I have been using PHP with Apache Tika to do the initial text extraction from Doc/PDF and am experimenting with Stanford NER and OpenNLP via PHP/command line.

Chris

Answer 1:

My 2 cents on the problem.

The NER taggers you listed above would be my first block in the pipeline. If they identify the name, voilà, no need to go any further; if not, I suggest a rule-based approach. In a resume, the candidate's name is generally within the top 10% of the lines. In many cases it also appears with a label, as in "Name : Ankit Solanki". If that fails, try to find the email address and match it against the noun phrases (NPs) you get from the rest of the text in the resume; the one with the closest match should be the name. In the majority of cases, the email address people use for professional purposes such as a resume contains their name. For example, john.mayer89@abc.com gets cleaned to john.mayer, which then goes through an algorithm that finds the noun phrase closest to the cleaned email name.
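
To make the fallback rules above concrete, here is a minimal Python sketch. It is only an illustration under some assumptions: the function names (`clean_local_part`, `guess_name`), the thresholds and the regexes are all made up for this example, and instead of real noun-phrase chunking it simply treats short lines near the top of the resume as candidates. String similarity comes from `difflib.SequenceMatcher` in the standard library.

```python
import re
from difflib import SequenceMatcher

# Assumed patterns for this sketch; real resumes will need more forgiving ones.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
LABEL_RE = re.compile(r"^\s*name\s*[:\-]\s*(.+)$", re.IGNORECASE)


def clean_local_part(email):
    """Reduce 'john.mayer89@abc.com' to 'john mayer'."""
    local = email.split("@", 1)[0]
    local = re.sub(r"\d+", "", local)               # drop digits such as '89'
    return re.sub(r"[._+-]+", " ", local).strip().lower()


def guess_name(text, top_fraction=0.10, min_lines=5, min_ratio=0.5):
    """Rule-based guess: labelled line first, then email similarity."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return None
    # Only look near the top of the document (at least `min_lines` lines).
    head = lines[: max(min_lines, int(len(lines) * top_fraction))]

    # Rule 1: an explicit "Name : ..." label.
    for ln in head:
        match = LABEL_RE.match(ln)
        if match:
            return match.group(1).strip()

    # Rule 2: compare the cleaned email local part with short candidate lines.
    email = EMAIL_RE.search(text)
    if not email:
        return None
    target = clean_local_part(email.group(0))
    candidates = [
        ln for ln in head
        if len(ln.split()) <= 4
        and "@" not in ln
        and not any(ch.isdigit() for ch in ln)
    ]
    if not candidates:
        return None

    def score(ln):
        return SequenceMatcher(None, target, ln.lower()).ratio()

    best = max(candidates, key=score)
    return best if score(best) >= min_ratio else None


if __name__ == "__main__":
    cv = "Curriculum Vitae\nJohn Mayer\nSoftware Engineer\njohn.mayer89@abc.com\n"
    print(guess_name(cv))
```

In a real pipeline this would only run when the NER tagger returns nothing, and the candidate list would come from a proper noun-phrase chunker rather than raw lines.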

Let me know your thoughts on this.

Best,

Ankit



Answer 2:

I guess you'll probably improve name identification if you create a CV corpus, though this also depends on the size of your corpus (you could gather one by crawling CV websites).

Using data mining is probably, in my opinion, your best option. I don't know in detail what options Apache Tika offers, but the more information you have on the layout of the CV, the better. For instance, patterns should probably rely on the fact that names are at the top of the document, close to the birth date / marital status / image / address.

In that case, you are no longer in a sequence labelling setting (which is what Stanford NER does): in a CV, a name is usually not surrounded by text. It should most probably be treated as a classification task over candidate segments of text, with the patterns converted into (numeric or binary) attributes.
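
As a rough illustration of "candidate segments with numeric or binary attributes", the Python sketch below turns each short line of the extracted text into a candidate and computes a few layout-style features. Everything here is an assumption made for the example: the `Candidate` class, the feature names and the four-token cut-off are not taken from any library.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str          # the segment itself (here: one line)
    line_index: int    # position among the non-empty lines
    total_lines: int   # number of non-empty lines in the document


def candidates_from_text(text, max_tokens=4):
    """Treat every short non-empty line as a candidate segment."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return [
        Candidate(ln, i, len(lines))
        for i, ln in enumerate(lines)
        if len(ln.split()) <= max_tokens
    ]


def layout_features(c):
    """Numeric and binary attributes describing where and how a segment appears."""
    tokens = c.text.split()
    return {
        "relative_position": c.line_index / max(c.total_lines - 1, 1),
        "in_top_tenth": int(c.line_index < max(1, c.total_lines // 10)),
        "token_count": len(tokens),
        "all_capitalized": int(bool(tokens) and all(t[0].isupper() for t in tokens)),
        "has_digit": int(any(ch.isdigit() for ch in c.text)),
        "has_at_sign": int("@" in c.text),
        "ends_with_punct": int(c.text.rstrip().endswith(tuple(".:;,!?"))),
    }


if __name__ == "__main__":
    cv = (
        "Akbar Agho\n"
        "Software Engineer\n"
        "akbar@example.com\n"
        "I have ten years of experience in backend development.\n"
    )
    for cand in candidates_from_text(cv):
        print(cand.text, layout_features(cand))
```

Each feature dictionary, paired with a label saying whether the segment really is the candidate's name, could then be fed to any standard classifier.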

A pattern extractor can easily be found or implemented and should be considered a preprocessing step before machine learning. Don't forget to also use lists of first and last names (and frequent prefixes / suffixes: -son, -vitch, -man, Ben-, de, etc.), which are unavoidable criteria for deciding which segment is likely to be a name. Since other names often appear in a CV, I believe layout should also be an important feature.
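
The name lists and prefixes / suffixes can be turned into attributes in the same spirit. The Python sketch below is only an assumption about how that might look: the tiny `FIRST_NAMES` and `LAST_NAMES` sets are placeholders standing in for real name lists, and the function and feature names are invented for the example.

```python
# Placeholder lexicons; in practice these would be loaded from large name lists.
FIRST_NAMES = {"john", "akbar", "ankit", "maria"}
LAST_NAMES = {"mayer", "solanki", "agho"}

# Affixes mentioned above; they are weak hints, not proof of a name.
SUFFIXES = ("son", "vitch", "man")
PREFIXES = ("ben-",)
PARTICLES = {"de"}


def lexicon_features(segment):
    """Count lexicon hits and affix hints in one candidate segment."""
    tokens = [t.strip(".,").lower() for t in segment.split()]
    tokens = [t for t in tokens if t]
    n = max(len(tokens), 1)
    known = sum(t in FIRST_NAMES or t in LAST_NAMES for t in tokens)
    return {
        "first_name_hits": sum(t in FIRST_NAMES for t in tokens),
        "last_name_hits": sum(t in LAST_NAMES for t in tokens),
        "lexicon_coverage": known / n,
        "has_name_suffix": int(any(t.endswith(SUFFIXES) for t in tokens)),
        "has_name_prefix": int(any(t.startswith(PREFIXES) for t in tokens)),
        "has_particle": int(any(t in PARTICLES for t in tokens)),
    }


if __name__ == "__main__":
    for seg in ["John Mayer", "Erik Johansson", "David Ben-Gurion", "Software Engineer"]:
        print(seg, lexicon_features(seg))
```

Suffix matches will also fire on ordinary words (for example "human" ends in "man"), which is one reason to combine them with layout features rather than rely on them alone.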

I'd be interested to know which features turn out to be effective... would you let us know?