I am trying to implement fuzzy search over locations (cities, regions, countries, objects) using a Solr server. Currently, my index contains about 0.8-1.0 M items. Fuzzy search (~0.7) gives really good results, but it is too slow for me (very often 0.2-0.6 sec). The tokenizer in use is <tokenizer class="solr.StandardTokenizerFactory"/>. As an alternative I tried <tokenizer class="solr.WhitespaceTokenizerFactory"/>: it is great in terms of performance (about 100x faster), but it does not offer fuzzy search :(
Do you know of a different approach I could use? I would like to keep the benefits of fuzzy search, but in a much faster way, if possible.
Thanks a lot!
Your problem is not related to the analyzer that you use. When you search for Califrna~0.7, Lucene iterates over all terms in the index and calculates the (Levenshtein) edit distance between "Califrna" and each of them. This is a very expensive operation.
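To make the cost concrete, here is a minimal sketch of that brute-force scan in Python. It is not Lucene's code: the function names are made up for illustration, and the similarity formula (1 - distance / length of the shorter string) is an assumption chosen only to mimic a ~0.7 style threshold.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def brute_force_fuzzy(query: str, terms: List[str], min_similarity: float) -> List[str]:
    """Compare the query against EVERY term; cost grows with the index size."""
    matches = []
    for term in terms:
        if not term or not query:
            continue  # avoid dividing by zero for empty strings
        distance = levenshtein(query.lower(), term.lower())
        # Assumed similarity measure, not necessarily Lucene's exact formula.
        similarity = 1.0 - distance / min(len(query), len(term))
        if similarity >= min_similarity:
            matches.append(term)
    return matches


if __name__ == "__main__":
    print(brute_force_fuzzy("Califrna", ["California", "Colorado", "Calgary"], 0.7))
```

With close to a million distinct terms, the inner distance computation runs close to a million times per query, which is where the time goes.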
This issue will be solved in Lucene 4.0. Unfortunately, the Lucene version that currently ships with Solr still uses the old brute-force approach.
https://issues.apache.org/jira/browse/LUCENE-2089
http://java.dzone.com/news/lucenes-fuzzyquery-100-times
If it is OK for you, I would suggest downloading Solr/Lucene from trunk and testing how the new fuzzy query works.
http://wiki.apache.org/solr/NightlyBuilds
Even though trunk is stable, it is not recommended for production use. In the meantime, I can suggest two similar alternatives:
1 - SpellChecker
http://wiki.apache.org/solr/SpellCheckComponent
http://www.lucidimagination.com/blog/2010/08/31/getting-started-spell-checking-with-apache-lucene-and-solr/
SpellChecker builds its own small index of n-grams in order to perform fast lookups. It also uses Levenshtein distance, but instead of iterating over all terms it only calculates the distance for related terms.
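The idea can be sketched in a few lines of Python. This is only an illustration of the n-gram technique, not the SpellChecker's actual implementation: the function names, the trigram size, the "^"/"$" padding and the maximum distance of 2 are all assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set


def ngrams(word: str, n: int = 3) -> Set[str]:
    """Character n-grams of a word, padded so word boundaries count too."""
    padded = "^" + word.lower() + "$"
    if len(padded) < n:
        return {padded}
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def build_ngram_index(terms: Iterable[str], n: int = 3) -> Dict[str, Set[str]]:
    """Map every n-gram to the set of terms that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for term in terms:
        for gram in ngrams(term, n):
            index[gram].add(term)
    return index


def suggest(query: str, index: Dict[str, Set[str]],
            max_distance: int = 2, n: int = 3) -> Optional[str]:
    """Return the closest term among those sharing at least one n-gram."""
    candidates: Set[str] = set()
    for gram in ngrams(query, n):
        candidates |= index.get(gram, set())
    best, best_distance = None, max_distance + 1
    # Edit distance is computed for the candidates only, not for all terms.
    for term in sorted(candidates):
        distance = levenshtein(query.lower(), term.lower())
        if distance < best_distance:
            best, best_distance = term, distance
    return best


if __name__ == "__main__":
    idx = build_ngram_index(["California", "Colorado", "Calgary", "Carolina"])
    print(suggest("Califrna", idx))
```

Here "Colorado" shares no trigram with "Califrna", so it is never compared at all. On a large index, most terms are skipped this way.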
You first need to run the spell checker for "Califrna", and it will suggest "California". Then you can use "California" in your query against your main index, without a fuzzy query.
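As a rough sketch of that two-step flow, the Python snippet below only builds the two request URLs. The base URL, the /spell and /select handler paths and the field name "name" are assumptions about your setup, so adjust them to match your solrconfig.xml and schema.

```python
from urllib.parse import urlencode

# Assumed location of the Solr instance; change to match your deployment.
SOLR_BASE = "http://localhost:8983/solr"


def spellcheck_url(user_input: str, handler: str = "/spell") -> str:
    """Step 1: ask the spell checker for a correction (no documents needed)."""
    params = {
        "q": user_input,
        "spellcheck": "true",
        "spellcheck.count": 1,  # only the best suggestion
        "rows": 0,
        "wt": "json",
    }
    return SOLR_BASE + handler + "?" + urlencode(params)


def search_url(corrected: str, field: str = "name", handler: str = "/select") -> str:
    """Step 2: run a plain (non-fuzzy) query with the corrected term."""
    params = {"q": '{}:"{}"'.format(field, corrected), "wt": "json"}
    return SOLR_BASE + handler + "?" + urlencode(params)


if __name__ == "__main__":
    print(spellcheck_url("Califrna"))
    print(search_url("California"))
```

The second request contains no "~" operator, so it avoids the term scan that makes the fuzzy query slow.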
2 - Auto Suggest
http://wiki.apache.org/solr/Suggester
With the Suggester component you can offer the correct spelling as the user types the query. This will be a lot faster. It supports fuzzy search with the JaspellLookup class. JaspellLookup needs to be updated in order to enable fuzzy search, but the wiki does not say much about what needs to be updated. If usePrefix is set to false, it should perform a fuzzy lookup, I guess.