How to find an analyzed term with a fuzzy (approxi

Posted 2019-07-15 07:20

Question:

The query 'laser~' doesn't find 'laser'.

I'm using Lucene's GermanAnalyzer to store documents in the index. I save two documents with "title" fields "laser" and "labor" respectively. Afterwards I perform the fuzzy query laser~. Lucene only finds the document that contains "labor". What is the Lucene 3.x way to implement such searches?

By taking a look at the Lucene source code, I guess that fuzzy searches are not designed to work with "analyzed" content, but I'm not sure whether this is the case.

Some background and remarks follow.


OpenCms

I noticed this behaviour after someone reported that our OpenCms searches were missing documents on the results page. The searches were failing on a German site. Investigating a bit, I found that:

  • We are using OpenCms 8.5.1 to perform our searches, and this uses Lucene 3.6.1 to implement the search functionality.
  • By default, OpenCms uses the org.apache.lucene.analysis.de.GermanAnalyzer for sites with German locale to parse content and queries.
  • We are storing the site's content with Field.Index.ANALYZED.
  • For the reported failing search, we were forcing a fuzzy search by appending a tilde to the search query.

Example code

To narrow down the problem, I wrote some code that exercises Lucene 3.6.1 directly (I have also tested 3.6.2; both behave identically). Note that Lucene 4+ has a slightly different API and a different fuzzy search implementation, so this problem doesn't arise there. (Unfortunately, I cannot control the Lucene version that OpenCms depends on.)

// For the import clauses, see below
public static void main(String[] args) throws Exception {
    final Version VER = Version.LUCENE_36;
    // With the StandardAnalyzer or the EnglishAnalyzer
    // the search works as expected
    Analyzer analyzer = new GermanAnalyzer(VER);

    Directory index = new RAMDirectory();
    IndexWriterConfig config = new IndexWriterConfig(VER, analyzer);

    IndexWriter w = new IndexWriter(index, config);
    addDoc(w, "labor");
    addDoc(w, "laser");
    addDoc(w, "latex");
    w.close();

    String querystr = "laser~"; // Fuzzy search for 'title'
    Query q = new QueryParser(VER, "title", analyzer).parse(querystr);
    System.out.println("Querystr: " + querystr + "; Query: " + q);

    int hitsPerPage = 10;
    IndexReader reader = IndexReader.open(index);
    IndexSearcher searcher = new IndexSearcher(reader);
    TopScoreDocCollector collector = TopScoreDocCollector.create(
            hitsPerPage, true);
    searcher.search(q, collector);
    ScoreDoc[] hits = collector.topDocs().scoreDocs;

    System.out.println("Found " + hits.length + " hits.");
    for (int i = 0; i < hits.length; ++i) {
        int docId = hits[i].doc;
        Document d = searcher.doc(docId);
        System.out.println((i + 1) + ". " + d.get("title"));
    }
}

private static void addDoc(IndexWriter w, String title) throws Exception {
    Document doc = new Document();
    doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
    w.addDocument(doc);
}

The output of this code:

Querystr: laser~; Query: title:laser~0.5
Found 2 hits.
1. labor
2. latex

I deliberately left out the imports section so as not to clutter the code. To build the project, you need lucene-core-3.6.2.jar and lucene-analyzers-3.6.2.jar (both can be downloaded from the Apache archives) and the following imports:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

Some Lucene debugging details and remarks

  1. When debugging the Lucene code, I found that Lucene with the GermanAnalyzer stores the document titles in the index as:

    • 'laser' -> 'las'
    • 'labor' -> 'labor'
    • 'latex' -> 'latex'
  2. I also found that with an exact search for laser, the query string is analyzed as well. The output of the previous code for the laser query is:

    Querystr: laser; Query: title:las
    Found 1 hits.
    1. laser
    

    (Notice the different queries in the two runs: title:laser~0.5 in the first run vs title:las in the second.)

  3. As already mentioned, with the StandardAnalyzer or the EnglishAnalyzer the fuzzy search works as expected:

    Querystr: laser~; Query: title:laser~0.5
    Found 3 hits.
    1. laser
    2. labor
    3. latex
    
  4. Lucene calculates the similarity between two terms (in org.apache.lucene.search.FuzzyTermEnum.similarity(target: String)) relative to the length of the shorter term. Similarity returns:

    [...]
    1 - (editDistance / length)
    where length is the length of the shorter term (text or target), including any identical prefix, and editDistance is the Levenshtein distance between the two words.

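    Notice that:

    similarity("laser","las"  ) = 1 - (2 / 3) = 1/3
    similarity("laser","labor") = 1 - (2 / 5) = 3/5

    With the minimum similarity of 0.5 shown in the parsed query (title:laser~0.5), the stored term 'las' scores below the threshold and is dropped, while 'labor' scores above it.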
    
  5. Edit 1. Explicitly excluding "laser" from stemming in the analyzer also yields the expected search results:

    // Third constructor argument: terms excluded from stemming (needs java.util.HashSet)
    Analyzer analyzer = new GermanAnalyzer(VER, null, new HashSet<String>() {
        {
            add("laser");
        }
    });
    

    output:

    Querystr: laser~; Query: title:laser~0.5
    Found 3 hits.
    1. laser
    2. labor
    3. latex
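
The arithmetic in remark 4 can be reproduced outside Lucene with a small standalone sketch. This is not Lucene's code: the class and method names (FuzzySimilarityDemo, editDistance, similarity) are made up for illustration, the prefix handling is left out (a prefix length of 0 is assumed), and a term is treated as a match when its similarity is greater than the 0.5 shown in the parsed query.

```java
class FuzzySimilarityDemo {

    /** Plain Levenshtein distance: insert, delete and substitute each cost 1. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty string
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** 1 - editDistance / length of the shorter term (assumption: no common prefix is configured). */
    static float similarity(String text, String target) {
        int shortest = Math.min(text.length(), target.length());
        if (shortest == 0) {
            return text.length() == target.length() ? 1.0f : 0.0f;
        }
        return 1.0f - (float) editDistance(text, target) / shortest;
    }

    public static void main(String[] args) {
        float minSimilarity = 0.5f; // the value shown in title:laser~0.5
        // Terms as stored in the index by the GermanAnalyzer, per remark 1
        for (String indexed : new String[] {"las", "labor", "latex"}) {
            float sim = similarity("laser", indexed);
            String verdict = sim > minSimilarity ? "match" : "no match";
            System.out.println("laser vs " + indexed + ": " + sim + " (" + verdict + ")");
        }
    }
}
```

Under these assumptions, the stemmed term 'las' is the only one of the three that falls below the threshold, which is consistent with the two hits reported for the GermanAnalyzer run.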
    

Answer 1:

It turns out* that prior to the 3.6 branch, the query did not go through the Analyzer (the component that performs stemming and lowercasing). In the 3.6 branch, some filters were added to the query analyzer chain (e.g. the LowerCaseFilterFactory). Finally, the GermanNormalizationFilterFactory was added to this chain in the 4.0 branch.

* Thanks @femtoRgon for your pointers

An older article explains with an example why fuzzy searches were not passed through the Analyzer:

The reason for skipping the Analyzer is that if you were searching for "dogs*" you would not want "dogs" first stemmed to "dog", since that would then match "dog*", which is not the intended query.

The bottom line is that anyone staying with Lucene 3.6.2 has to implement the analysis of the query themselves.
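
One possible way to do that analysis by hand, sketched against the Lucene 3.6 API and the two jars named in the question, is to run the raw term through the same Analyzer and build the FuzzyQuery yourself instead of going through the QueryParser. The class AnalyzedFuzzyQuery and its methods analyzeTerm and build are hypothetical names, not part of Lucene or OpenCms, and combining the raw and the analyzed term with SHOULD clauses is only one design choice among several.

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

class AnalyzedFuzzyQuery {

    /** Returns the first token the analyzer produces for text, or text itself if it produces none. */
    static String analyzeTerm(Analyzer analyzer, String field, String text) throws IOException {
        TokenStream ts = analyzer.tokenStream(field, new StringReader(text));
        try {
            CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            String result = ts.incrementToken() ? termAtt.toString() : text;
            ts.end();
            return result;
        } finally {
            ts.close();
        }
    }

    /** Fuzzy query on the raw term OR on its analyzed form (hypothetical helper). */
    static Query build(Analyzer analyzer, String field, String text, float minSimilarity)
            throws IOException {
        BooleanQuery query = new BooleanQuery();
        // Raw term: keeps the fuzzy matches the unanalyzed query already finds
        query.add(new FuzzyQuery(new Term(field, text), minSimilarity), BooleanClause.Occur.SHOULD);
        String analyzed = analyzeTerm(analyzer, field, text);
        if (!analyzed.equals(text)) {
            // Analyzed term: compared against the terms as they are stored in the index
            query.add(new FuzzyQuery(new Term(field, analyzed), minSimilarity), BooleanClause.Occur.SHOULD);
        }
        return query;
    }

    public static void main(String[] args) throws IOException {
        Analyzer analyzer = new GermanAnalyzer(Version.LUCENE_36);
        System.out.println(build(analyzer, "title", "laser", 0.5f));
    }
}
```

The resulting Query can be passed to searcher.search(...) in the question's example in place of the parsed one. Note that the sketch does not lowercase the raw term and only uses the first token, so multi-word input would need extra handling.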