I'm looking for a high performance Java library for fuzzy string search.
There are numerous algorithms to find similar strings, Levenshtein distance, Daitch-Mokotoff Soundex, n-grams etc.
What Java implementations exists? Pros and cons for them? I'm aware of Lucene, any other solution or Lucene is best?
I found these, does anyone have experience with them?
SimMetrics is probably what you need: http://sourceforge.net/projects/simmetrics/
It has several algorithms for calculating various flavours of edit-distance.
Lucene is a very powerful full-text search engine, but FT search isn't exactly the same thing as fuzzy string matching (eg. given a list of strings find me the one that is most similar to some candidate string).
You can use Apache Lucene, but depending on the use case this may be too heavy weight. For very simple fuzzy searches it may be a bit complex to use and (correct me if I'm wrong) it requires you to build an index.
If you need a simple online (= not maintaining an index) algorithm you can use the fuzzy Bitap algorithm. I found an implementation in Java here. It's code fits in a single relatively short method with an almost self-explaining signature:
Apache Commons
StringUtils
has an implementation of the Levenshtein algorithm for fuzzy String matching. It can be seen as the fuzzy version ofString.equals
, Bitap is like the fuzzy version ofString.indexOf
and still uses the Levenshtein distance measure. It is generally more efficient than naively using Levenshtein to compare the search pattern with each substring that could possibly match.Notes:
ArrayIndexOutOfBoundsException
on non-ASCII characters (>= 128) so you will have to filter these out.I tried using Bimap in an application to search an in-memory list of persons by name. I found that a Levenhstein distance of 2 gives way too many false positives. A Levenhstein distance of 1 works better, but it cannot detect a typo where you swap two letters, e.g. "William" and "Willaim". I can think of a few ways to solve this, e.g.
ArrayIndexOutOfBoundsException
If you are going to do 2 or 4, it may be better to use a proper full-text search library like Lucene anyway.
BitapOnlineSearcher
, but requires you to usejava.io.Reader
together with an Alphabet class. It's Javadoc is written in Russian.You can try bitap. I was playing with bitap written in ANSI C and it was pretty fast there is java implementation in http://www.crosswire.org.
Commons Lang has an implementation of Levenshtein distance.
Commons Codec has an implementation of soundex and metaphone.
To Lucene I would add SOLR http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
If you are mostly comparing short strings and want something portable and lightweight you can use the well known python algorithm fuzzywuzzy ported to Java.
You can read more about it here