Understanding Lucene leading wildcard performance

2019-01-24 17:53发布

Lucene does not by default allow leading wildcards in search terms, but this can be enabled with:

QueryParser#setAllowLeadingWildcard(true)

I understand that use of a leading wildcard prevents Lucene from using the index. Searches with a leading wildcard must scan the entire index.

How do I demonstrate the performance of a leading wildcard query? When is it OK to use setAllowLeadingWildcard(true)?

I have built a test index with 10 million documents in the form:

{ name: random_3_word_phrase }

The index is 360M on disk.

My test queries perform well and I have been unable to actually demonstrate a performance problem. For example, querying for name:*ing produces over 1.1 million documents in less than 1 second. Querying name:*ing* produces over 1.5 million documents in the same time.

What is going here? Why isn't this slow? Is 10,000,000 documents not enough? Do the documents need to contains more than just a single field?

2条回答
Luminary・发光体
2楼-- · 2019-01-24 18:17

Depends on how much memory you have, and how much of the token index is in memory.

A 360MB total index could be searched quite quickly on any old computer. A 360GB index would take a bit longer... ;)

As an example, I fired up an old 2GB index, and searched for "*e".

On a box with 8GB, it returned 500K hits in under 5 seconds. I tried the same index on a box with only 1GB of memory, and it took about 20 seconds.

To illustrate further, here's some generic C# code that basically does a "** E*" type search of 10 million random 3 word phrases.

static string substring = "E";

private static Random random = new Random((int)DateTime.Now.Ticks);//thanks to McAden

private static string RandomString(int size)
{
    StringBuilder builder = new StringBuilder();
    char ch;
    for (int i = 0; i < size; i++)
    {
        ch = Convert.ToChar(Convert.ToInt32(Math.Floor(26 * random.NextDouble() + 65)));
        builder.Append(ch);
    }

    return builder.ToString();
}

static void FindSubStringInPhrases()
{
    List<string> index = new List<string>();

    for (int i = 0; i < 10000000; i++)
    {
        index.Add(RandomString(5) + " " + RandomString(5) + " " + RandomString(5));
    }

    var matches = index.FindAll(SubstringPredicate);

}

static bool SubstringPredicate(string item)
{
    if (item.Contains(substring))
        return true;
    else
        return false;
}

After all 10 million phases have been loaded into the list, it still only takes about a second for "var matches = index.FindAll(SubstringPredicate);" to return over 4 million hits.

The point is, memory is fast. Once things can no longer fit into memory and you have to start swapping to disk is when you are going to see performance hits.

查看更多
▲ chillily
3楼-- · 2019-01-24 18:17

If I understand it correctly, a part of the index is the term dictionary, which is a sorted list of all the indexed terms. When searching with no wildcards or a trailing wildcard, Lucene can take advantage of the fact that many terms have common prefixes. On the other hand, searching with a leading wildcard scans the entire term dictionary. This is not optimal, but the term dictionary tends to be tiny compared to other parts of the index, such as frequency and position data, so a full scan of the term dictionary is usually not big problem by itself.

查看更多
登录 后发表回答