The query 'laser~' doesn't find 'laser'.
I'm using Lucene's GermanAnalyzer to store documents in the index. I save two documents with "title" fields "laser" and "labor" respectively. Afterwards I perform a fuzzy query laser~. Lucene only finds the document that contains "labor". What is the Lucene 3.x way to implement such searches?
By taking a look at the Lucene source code, I guess that fuzzy searches are not designed to work with "analyzed" content, but I'm not sure whether this is the case.
Some background and remarks follow.
OpenCms
I noticed this behaviour after someone reported that our OpenCms searches were missing documents in the results page. The searches were failing on a German site. Investigating a bit, I found that:
- We are using OpenCms 8.5.1 to perform our searches, and this uses Lucene 3.6.1 to implement the search functionality.
- By default, OpenCms uses the org.apache.lucene.analysis.de.GermanAnalyzer for sites with German locale to parse content and queries.
- We are storing the site's content with Field.Index.ANALYZED.
- For the reported failing search, we were forcing a fuzzy search by appending a tilde to the search query.
Example code
To narrow down the problem, I wrote some code that exercises Lucene 3.6.1 directly (I have also tested 3.6.2; both behave identically). Note that Lucene 4+ has a slightly different API and a different fuzzy search implementation, so this problem doesn't arise there. (Unfortunately, I cannot control the Lucene version that OpenCms depends on.)
// For the import clauses, see below
public static void main(String[] args) throws Exception {
    final Version VER = Version.LUCENE_36;
    // With the StandardAnalyzer or the EnglishAnalyzer
    // the search works as expected
    Analyzer analyzer = new GermanAnalyzer(VER);
    Directory index = new RAMDirectory();
    IndexWriterConfig config = new IndexWriterConfig(VER, analyzer);
    IndexWriter w = new IndexWriter(index, config);
    addDoc(w, "labor");
    addDoc(w, "laser");
    addDoc(w, "latex");
    w.close();
    String querystr = "laser~"; // Fuzzy search for 'title'
    Query q = new QueryParser(VER, "title", analyzer).parse(querystr);
    System.out.println("Querystr: " + querystr + "; Query: " + q);
    int hitsPerPage = 10;
    IndexReader reader = IndexReader.open(index);
    IndexSearcher searcher = new IndexSearcher(reader);
    TopScoreDocCollector collector = TopScoreDocCollector.create(
            hitsPerPage, true);
    searcher.search(q, collector);
    ScoreDoc[] hits = collector.topDocs().scoreDocs;
    System.out.println("Found " + hits.length + " hits.");
    for (int i = 0; i < hits.length; ++i) {
        int docId = hits[i].doc;
        Document d = searcher.doc(docId);
        System.out.println((i + 1) + ". " + d.get("title"));
    }
}
private static void addDoc(IndexWriter w, String title) throws Exception {
    Document doc = new Document();
    doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
    w.addDocument(doc);
}
The output of this code:
Querystr: laser~; Query: title:laser~0.5
Found 2 hits.
1. labor
2. latex
I deliberately left out the imports to avoid cluttering the code. To build the project, you need lucene-core-3.6.2.jar and lucene-analyzers-3.6.2.jar (which you can download from the Apache archives) and the following imports:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
Some Lucene debugging details and remarks
When debugging the Lucene code, I found that Lucene with the GermanAnalyzer stores the document titles in the index as:
- 'laser' -> 'las'
- 'labor' -> 'labor'
- 'latex' -> 'latex'
I also found that with an exact search for laser, the query string is also analyzed. The output of the previous code for the laser query is:
Querystr: laser; Query: title:las
Found 1 hits.
1. laser
(Notice the different queries in the two runs: title:laser~0.5 in the first run vs title:las in the second.)
As already mentioned, with the StandardAnalyzer or the EnglishAnalyzer the fuzzy search works as expected:
Querystr: laser~; Query: title:laser~0.5
Found 3 hits.
1. laser
2. labor
3. latex
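Lucene calculates the similarity between two terms (in org.apache.lucene.search.FuzzyTermEnum.similarity(target: String)) relative to the length of the shortest term. Similarity returns:
[...] 1 - (editDistance / length) where length is the length of the shortest term (text or target) including a prefix that are identical and editDistance is the Levenshtein distance for the two words.
Notice that:
similarity("laser","las"  ) = 1 - (2 / 3) = 1/3
similarity("laser","labor") = 1 - (2 / 5) = 3/5
With the minimum similarity of 0.5 shown in the parsed query (title:laser~0.5), the stemmed term 'las' scores 1/3 and falls below the threshold, whereas 'labor' scores 3/5 and passes. That would explain why the document containing "laser" is missing from the fuzzy results.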
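To double-check the arithmetic without a Lucene dependency, here is a minimal sketch of the quoted formula. The class FuzzySimilaritySketch and its methods are hypothetical names of mine, not Lucene API. It uses a plain Levenshtein distance, ignores the prefix handling and the optimizations in FuzzyTermEnum, and assumes that a term matches when its similarity is strictly greater than the minimum similarity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene: re-implements the quoted
// formula 1 - (editDistance / length of the shortest term).
final class FuzzySimilaritySketch {

    // Plain Levenshtein distance, computed with two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Similarity relative to the shortest term (no common prefix assumed).
    static double similarity(String text, String target) {
        int shortest = Math.min(text.length(), target.length());
        if (shortest == 0) {
            return text.length() == target.length() ? 1.0 : 0.0;
        }
        return 1.0 - (double) editDistance(text, target) / shortest;
    }

    // Assumption: a term matches when similarity > minSimilarity.
    static List<String> matches(String query, List<String> indexedTerms,
                                double minSimilarity) {
        List<String> result = new ArrayList<String>();
        for (String term : indexedTerms) {
            if (similarity(query, term) > minSimilarity) {
                result.add(term);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Terms as stored by the GermanAnalyzer ('laser' is stemmed to 'las').
        System.out.println(matches("laser", Arrays.asList("las", "labor", "latex"), 0.5));
        // Terms as stored without stemming.
        System.out.println(matches("laser", Arrays.asList("laser", "labor", "latex"), 0.5));
    }
}
```

Under these assumptions the first call keeps only labor and latex, and the second keeps all three terms, which is consistent with the hit lists I get from the GermanAnalyzer and the StandardAnalyzer.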
Edit 1. Excluding "laser" explicitly from the analyzer also yields the expected search results:
Analyzer analyzer = new GermanAnalyzer(VER, null, new HashSet() { { add("laser"); } });
Output:
Querystr: laser~; Query: title:laser~0.5
Found 3 hits.
1. laser
2. labor
3. latex