The query 'laser~' doesn't find 'laser'.
I'm using Lucene's GermanAnalyzer to store documents in the index. I save two documents with "title" fields "laser" and "labor" respectively. Afterwards I perform a fuzzy query laser~. Lucene only finds the document that contains "labor". What is the Lucene 3.x way to implement such searches?
By taking a look at the Lucene source code, I guess that fuzzy searches are not designed to work with "analyzed" content, but I'm not sure whether this is the case.
Some background and remarks follow.
OpenCms
I noticed this behaviour after someone reported that our OpenCms searches were missing documents on the results page. The searches were failing on a German site. Investigating a bit, I found that:
- We are using OpenCms 8.5.1 to perform our searches, and this uses Lucene 3.6.1 to implement the search functionality.
- By default, OpenCms uses the org.apache.lucene.analysis.de.GermanAnalyzer for sites with German locale to parse content and queries.
- We are storing the sites' content with Field.Index.ANALYZED.
- For the reported failing search, we were forcing a fuzzy search by appending a tilde to the search query.
Example code
To narrow down the problem, I wrote some code that exercises Lucene 3.6.1 directly (I have also tested 3.6.2; both behave identically). Note that Lucene 4+ has a slightly different API and a different fuzzy search implementation, so this problem doesn't arise there. (Unfortunately, I cannot control the Lucene version that OpenCms depends on.)
    // For the import clauses, see below
    public static void main(String[] args) throws Exception {
        final Version VER = Version.LUCENE_36;

        // With the StandardAnalyzer or the EnglishAnalyzer
        // the search works as expected
        Analyzer analyzer = new GermanAnalyzer(VER);
        Directory index = new RAMDirectory();
        IndexWriterConfig config = new IndexWriterConfig(VER, analyzer);

        IndexWriter w = new IndexWriter(index, config);
        addDoc(w, "labor");
        addDoc(w, "laser");
        addDoc(w, "latex");
        w.close();

        String querystr = "laser~"; // Fuzzy search for 'title'
        Query q = new QueryParser(VER, "title", analyzer).parse(querystr);
        System.out.println("Querystr: " + querystr + "; Query: " + q);

        int hitsPerPage = 10;
        IndexReader reader = IndexReader.open(index);
        IndexSearcher searcher = new IndexSearcher(reader);
        TopScoreDocCollector collector = TopScoreDocCollector.create(
                hitsPerPage, true);
        searcher.search(q, collector);
        ScoreDoc[] hits = collector.topDocs().scoreDocs;

        System.out.println("Found " + hits.length + " hits.");
        for (int i = 0; i < hits.length; ++i) {
            int docId = hits[i].doc;
            Document d = searcher.doc(docId);
            System.out.println((i + 1) + ". " + d.get("title"));
        }
    }

    private static void addDoc(IndexWriter w, String title) throws Exception {
        Document doc = new Document();
        doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
        w.addDocument(doc);
    }
The output of this code:
Querystr: laser~; Query: title:laser~0.5
Found 2 hits.
1. labor
2. latex
I deliberately cut the imports section to avoid cluttering the code. To build the project, you need lucene-core-3.6.2.jar and lucene-analyzers-3.6.2.jar (which you can download from the Apache archives) and the following imports:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
Some Lucene debugging details and remarks
When debugging the Lucene code, I found that Lucene with the GermanAnalyzer stores the document titles in the index as:
- 'laser' -> 'las'
- 'labor' -> 'labor'
- 'latex' -> 'latex'
I also found that when using an exact search, laser, the query string is also analyzed. The output of the previous code for the laser query is:
Querystr: laser; Query: title:las
Found 1 hits.
1. laser
(Notice the different queries in the two runs: title:laser~0.5 in the first run vs. title:las in the second.)
As already mentioned, with the StandardAnalyzer or the EnglishAnalyzer the fuzzy search works as expected:
Querystr: laser~; Query: title:laser~0.5
Found 3 hits.
1. laser
2. labor
3. latex
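Lucene calculates the similarity between two terms (in org.apache.lucene.search.FuzzyTermEnum.similarity(target: String)) relative to the length of the shortest term. Similarity returns:
[...] 1 - (editDistance / length) where length is the length of the shortest term (text or target) including a prefix that are identical and editDistance is the Levenshtein distance for the two words.
Notice that:
similarity("laser", "las") = 1 - (2 / 3) = 1/3
similarity("laser", "labor") = 1 - (2 / 5) = 3/5
With the minimum similarity of 0.5 shown in the parsed query (title:laser~0.5), the stemmed term 'las' scores 1/3 and is rejected, while 'labor' scores 3/5 and is accepted. This is why the document containing "laser" is missing from the fuzzy results.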
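To double-check the arithmetic, here is a minimal standalone sketch of the quoted formula in plain Java. It is not Lucene's code: the class and method names (FuzzySimilaritySketch, editDistance, similarity) are made up for illustration, it assumes a prefix length of 0, and it assumes a term matches when its similarity is strictly greater than the minimum similarity.

```java
// Hypothetical sketch of the quoted formula, not the Lucene implementation.
class FuzzySimilaritySketch {

    // Plain Levenshtein distance: insertions, deletions and substitutions cost 1.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // 1 - (editDistance / length), with length = length of the shorter term
    // (assumption: no shared prefix is configured, i.e. prefix length 0).
    static double similarity(String text, String target) {
        int length = Math.min(text.length(), target.length());
        if (length == 0) {
            return text.length() == target.length() ? 1.0 : 0.0;
        }
        return 1.0 - (double) editDistance(text, target) / length;
    }

    public static void main(String[] args) {
        double minSimilarity = 0.5; // the value shown in title:laser~0.5
        // Terms as stored in the index by the GermanAnalyzer.
        for (String indexed : new String[] {"las", "labor", "latex"}) {
            double s = similarity("laser", indexed);
            System.out.println(indexed + ": " + s
                    + (s > minSimilarity ? " (match)" : " (no match)"));
        }
    }
}
```

Under these assumptions, 'las' lands below the 0.5 threshold while 'labor' and 'latex' land above it, which is consistent with the two hits reported for the GermanAnalyzer run.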
Edit 1. Excluding "laser" explicitly from the analyzer also yields the expected search results:
Analyzer analyzer = new GermanAnalyzer(VER, null, new HashSet() { { add("laser"); } });
Output:
Querystr: laser~; Query: title:laser~0.5
Found 3 hits.
1. laser
2. labor
3. latex
It turns out* that prior to the 3.6 branch, the query doesn't go through the Analyzer (the component that performs stemming and lowercasing). In the 3.6 branch, some filters have been added to the query analyzer chain (e.g. the LowerCaseFilterFactory). Finally, the GermanNormalizationFilterFactory was added to this chain in the 4.0 branch.
* Thanks @femtoRgon for your pointers.
An older article explains with an example why fuzzy searches were not passed through the Analyzer:
The bottom line is that if staying with Lucene 3.6.2, the user has to implement the analysis of the query herself.