Parsing a full name into its constituents

2019-02-13 22:15发布

We are in need of developing a back end application that can parse a full name into

Prefix (Dr. Mr. Ms. etc)
First Name
Last Name
Middle Name
etc

Challenge here is that it has to support names of multiple countries and languages. One assumption that we have is we will always get a country and language along with the full name as input.

The full name may come in any format. For the same country / language combination, it may come in with first name last name or the reverse. Comma will not be a part of the Full Name.

Is is feasible? We are also open to any commercially available software.

9条回答
爷、活的狠高调
2楼-- · 2019-02-13 22:34

I wrote a simple human name parser in javascript as an npm module:

https://www.npmjs.org/package/humanparser

humanparser

Parse a human name string into salutation, first name, middle name, last name, suffix.

Install

npm install humanparser

Usage

var human = require('humanparser');

var fullName = 'Mr. William R. Jenkins, III'
    , attrs = human.parseName(fullName);

console.log(attrs);

//produces the following output

{ saluation: 'Mr.',
  firstName: 'William',
  suffix: 'III',
  lastName: 'Jenkins',
  middleName: 'R.',
  fullName: 'Mr. William R. Jenkins, III' }
查看更多
Evening l夕情丶
3楼-- · 2019-02-13 22:39

Ask yourself: do you really need the different parts of a name? Parsing names is inherently un-doable, since different cultures use different conventions (e.g. "middle name" is a typical USA-ism) and some small percentage of names will always be treated wrongly.

It is much preferable to treat a name as an "atomic" not-splittable entity.

查看更多
forever°为你锁心
4楼-- · 2019-02-13 22:41

Since the OP was open to any commercially available offering...

The "IBM InfoSphere Global Name Analytics" appears to be a commercial solution satisfying the original request for the parsing of a [free-form unstructured] personal name [full name]; apparently with a degree of certainty in regards to resolving some of the name ambiguity issues alluded to in other responses.
Note: I have no personal experience nor association with the product, I had merely encountered this discussion and the following reference links while re-investigating effectively the same concern as described by the OP. HTH.

A general product documentation link:
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_gna_con_gnaoverview.html

Refer to the "Parsing names using NameParser" at
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_np_con_parsingnamesusingnameparser.html

The NameParser is a component API for the product per
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_gnm_con_logicalarchitecturecapis.html

Refer to the "Parsing names using IBM NameWorks" at
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_gnm_con_parsingnamesusingnameworks.html

"IBM NameWorks combines the individual IBM InfoSphere Global Name Recognition components into a single, unified, easy-to-use application programming interface (API), and also extends this functionality to Java applications and as a Web service"

http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_gnm_con_logicalarchitecturenwapis.html

To clarify why I think this answers the question, ameliorating some of the previous alluded difficulties in accomplishing the task... If I understood correctly what I read, the APIs use the "NameHunter Server" to search the "IBM InfoSphere Global Name Data Archive (NDA)" which is described as "a collection of nearly one billion names from around the world, along with gender and country of association for each name. This large repository of name information powers the algorithms and rules that IBM InfoSphere Global Name Recognition products use to categorize, classify, parse, genderize , and match names."

FWiW I also ran across a "Name Parser" which uses a database of ~140K names as noted at:
http://www.melissadata.com/dqt/websmart-web-services.htm

查看更多
放荡不羁爱自由
5楼-- · 2019-02-13 22:47

I think this is impossible. Consider Ralph Vaughan Williams. His family name is "Vaughan Williams" and his first name is "Ralph". Contrast this with Charles Villiers Stanford, whose family name is "Stanford", with first name "Charles" and middle name "Villiers".

Both are English-speaking composers from England, so country and language information is not sufficient to establish the correct parsing logic.

查看更多
时光不老,我们不散
6楼-- · 2019-02-13 22:49

A basic algorithm could do the following:

  • First see if incoming string starts with a title such as Mrs and remove it if it does, checking against a fixed list of titles.
  • If there is one space left and one space exactly, assume first word is first name and second word is surname (which will be incorrect at times)

To go beyond that would be lots of work, see How to parse full names to identify avenues for improvement and see these involved IBM docs for further implementation clues

查看更多
Lonely孤独者°
7楼-- · 2019-02-13 22:54

The only reasonable approach is to avoid having to do so in the first place. The most obvious (and common) way to do that is to have the user enter the title, first/given name, last/family name, suffix, etc., separately from each other, rather than attempting to parse them out of a single string.

查看更多
登录 后发表回答