Question:

I have a database of companies. My application receives data that references a company by name, but the name may not exactly match the value in the database. I need to match the incoming data to the company it refers to.

For instance, my database might contain a company with name "A. B. Widgets & Co Ltd." while my incoming data might reference "AB Widgets Limited", "A.B. Widgets and Co", or "A B Widgets".

Some words in the company name (A B Widgets) are more important for matching than others (Co, Ltd, Inc, etc). It's important to avoid false matches.

The number of companies is small enough that I can maintain a map of their names in memory, i.e. I have the option of using Java rather than SQL to find the right name.

How would you do this in Java?

Answer 1:

You could standardize the formats as much as possible in both your DB/map and the input (e.g. convert everything to the same case), then use the Levenshtein (edit) distance, which is computed with dynamic programming, to score the input against all your known names.

You could then have the user confirm the match and, if they don't like it, give them the option to enter that value into your list of known names (on second thought, that might be too much power to give a user...)
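To make that concrete, here is a minimal sketch of the normalize-then-score idea. The class name, the `normalize` rule (lower-case, keep only letters and digits), and the `maxDistance` cut-off are all assumptions for illustration, not something from the question; you would tune them against your own data.

```java
import java.util.List;
import java.util.Locale;

public class NameDistance {

    /** Hypothetical normalization: lower-case, then keep only letters and digits. */
    public static String normalize(String name) {
        return name.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
    }

    /** Levenshtein (edit) distance via dynamic programming, keeping only two rows. */
    public static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insert, delete
                        prev[j - 1] + cost);                    // substitute or keep
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Returns the closest known name, or null if nothing is within maxDistance edits. */
    public static String bestMatch(String input, List<String> knownNames, int maxDistance) {
        String target = normalize(input);
        String best = null;
        int bestDistance = maxDistance + 1;
        for (String candidate : knownNames) {
            int d = levenshtein(target, normalize(candidate));
            if (d < bestDistance) {
                bestDistance = d;
                best = candidate;
            }
        }
        return best;
    }
}
```

Returning null when nothing is close enough is one way to guard against false matches: anything above the threshold goes to a human instead of being matched automatically.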

Answer 2:

Although this thread is a bit old, I recently did an investigation on the efficiency of string distance metrics for name matching and came across this library:

https://code.google.com/p/java-similarities/

If you don't want to spend ages implementing string distance algorithms, I recommend giving it a try as a first step. It already implements about 20 different algorithms (including Levenshtein, Jaro-Winkler, and Monge-Elkan), and its code is structured well enough that you don't have to understand the whole logic in depth; you can start using it in minutes.

(BTW, I'm not the author of the library, so kudos to its creators.)

Answer 3:

You can use an LCS algorithm to score them.

I do this in my photo album to make it easy to email in photos and get them to fall into security categories properly.

LCS code
Example usage (guessing a category based on what people entered)

Answer 4:

I'd do LCS ignoring spaces, punctuation, case, and variations on "co", "llc", "ltd", and so forth.
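A rough sketch of what that could look like, assuming LCS here means longest common subsequence. The noise-word list and the scoring formula (LCS length divided by the longer normalized name) are my own assumptions; extend the list with whatever suffixes show up in your data.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class LcsMatcher {

    // Assumed list of words that carry little meaning for matching.
    private static final Set<String> NOISE = new HashSet<>(Arrays.asList(
            "co", "company", "llc", "ltd", "limited", "inc", "incorporated", "and"));

    /** Lower-cases, splits on anything that is not a letter or digit, drops noise words. */
    public static String normalize(String name) {
        StringBuilder sb = new StringBuilder();
        for (String token : name.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !NOISE.contains(token)) {
                sb.append(token); // tokens are joined without spaces
            }
        }
        return sb.toString();
    }

    /** Length of the longest common subsequence, standard DP table. */
    public static int lcsLength(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    dp[i][j] = dp[i - 1][j - 1] + 1;
                } else {
                    dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
                }
            }
        }
        return dp[a.length()][b.length()];
    }

    /** Score in [0, 1]: LCS length relative to the longer normalized name. */
    public static double score(String x, String y) {
        String a = normalize(x);
        String b = normalize(y);
        int longer = Math.max(a.length(), b.length());
        return longer == 0 ? 0.0 : (double) lcsLength(a, b) / longer;
    }
}
```

With this normalization, all four spellings from the question collapse to the same string, so they score 1.0 against each other; names that only share a few letters score well below that.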

Answer 5:

Have a look at Lucene. It's an open source full text search Java library with 'near match' capabilities.

Answer 6:

Your database may support regular expressions (regex). See below for some Java tutorials; here is the link to the MySQL documentation (as an example):

http://dev.mysql.com/doc/refman/5.0/en/regexp.html#operator_regexp

You would probably want to store in the database a fairly complex regular expression for each company, one that encompasses the variations in spelling you anticipate, or the sub-elements of the company name that you would like to weight as significant.
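As an illustration of the per-company pattern idea, here is a small sketch using `java.util.regex`. The pattern itself, the class name, and the in-memory map are hypothetical; in practice the patterns would be loaded from wherever you store them, and each one has to be written by hand for its company.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class CompanyPatterns {

    // Canonical name -> hand-written pattern (both hypothetical examples).
    private static final Map<String, Pattern> PATTERNS = new LinkedHashMap<>();

    static {
        PATTERNS.put("A. B. Widgets & Co Ltd.", Pattern.compile(
                "a\\W*b\\W+widgets"                   // significant part: "A B Widgets"
                + "(?:\\W+(?:and|&))?"                // optional "and" / "&"
                + "(?:\\W+co(?:mpany)?)?"             // optional "Co" / "Company"
                + "(?:\\W+(?:ltd|limited|inc))?"      // optional legal suffix
                + "\\W*",                             // trailing punctuation
                Pattern.CASE_INSENSITIVE));
    }

    /** Returns the canonical name whose pattern matches the whole input, or null. */
    public static String resolve(String input) {
        String trimmed = input.trim();
        for (Map.Entry<String, Pattern> entry : PATTERNS.entrySet()) {
            if (entry.getValue().matcher(trimmed).matches()) {
                return entry.getKey();
            }
        }
        return null;
    }
}
```

Because `matches()` requires the whole input to match, something like "AB Widgets Holdings" is rejected rather than silently accepted, which helps with the false-match concern at the cost of maintaining the patterns.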

You can also use the regex library in Java

JDK 1.4.2
http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html

JDK 1.5.0
http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/Matcher.html

Using Regular Expressions in Java
http://www.regular-expressions.info/java.html

The Java Regex API Explained
http://www.sitepoint.com/article/java-regex-api-explained/

You might also want to see if your database supports Soundex capabilities (for example, see the following link to MySQL)
http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_soundex

Answer 7:

An earlier answer suggests scoring the names with an LCS (Longest Common Subsequence) algorithm.

To be more precise: rather than Longest Common Subsequence, Longest Common Substring should give more precise matches, since it requires the matching characters to be contiguous as well as in order.
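For comparison, a minimal sketch of the substring variant, which only counts a contiguous run of matching characters. The class and method names are made up for the example, and it assumes the names have already been normalized (case, punctuation) before being passed in.

```java
public class LongestCommonSubstring {

    /** Length of the longest contiguous run of characters shared by a and b. */
    public static int length(String a, String b) {
        int best = 0;
        int[] prev = new int[b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            int[] curr = new int[b.length() + 1];
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    // extend the run that ended at a[i-2], b[j-2]
                    curr[j] = prev[j - 1] + 1;
                    best = Math.max(best, curr[j]);
                }
                // on a mismatch curr[j] stays 0: the run is broken
            }
            prev = curr;
        }
        return best;
    }
}
```

The difference from the subsequence version is the mismatch case: here the count resets to zero instead of carrying over the best value so far, so scattered matching letters no longer add up.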

Answer 8:

You could use Lucene to index your database, then query the Lucene index. There are a number of search engines built on top of Lucene, including Solr.