How do you configure Lucene in Sitecore to only in

Posted 2019-04-28 23:47

Question:

I recognise this is a moot point on the web database, so this question applies to the master db...

I have a custom index set up in Sitecore 6.4.1 as follows:

<index id="search_content_US" type="Sitecore.Search.Index, Sitecore.Kernel">
    <param desc="name">$(id)</param>
    <param desc="folder">_search_content_US</param>
    <Analyzer ref="search/analyzer" />
    <locations hint="list:AddCrawler">
        <search_content_home type="Sitecore.Search.Crawlers.DatabaseCrawler, Sitecore.Kernel">
            <Database>master</Database>
            <Root>/sitecore/content/usa home</Root>
            <Tags>home content</Tags>
        </search_content_home>
    </locations>
</index>

I query the index like this (I am using techphoria414's SortableIndexSearchContext from this answer: How to sort/filter using the new Sitecore.Search API):

private SearchHits GetSearchResults(SortableIndexSearchContext searchContext, string searchTerm)
{
    CombinedQuery query = new CombinedQuery();
    query.Add(new FullTextQuery(searchTerm), QueryOccurance.Must);
    return searchContext.Search(query, Sort.RELEVANCE);
}

...

SearchHits hits = GetSearchResults(searchContext, searchTerm);

hits is a collection of search hits from my index. When I iterate through hits I can see that there are many duplicates of the same items in Sitecore, 1 per version of the item.

I then do the following to get a SearchResultCollection:

SearchResultCollection results = hits.FetchResults(0, hits.Length);

This combines all of the duplicates into a single SearchResult object. This object represents 1 version of a particular item, and has a property called SubResults which is a collection of SearchResults that represent all of the other item versions.

Here's my problem:

The version of the item represented by the SearchResult is NOT the current published version of the item! It appears to be a randomly selected version (whichever the search method hit first in the index). The latest version is included in the SubResults collection, however.

E.g.:

SearchResult
 |
 |- Version 8 // main result
 ...
 |- SubResults
      |
      |- Version 9 // latest version
      |- Version 3
      |- Version 5
      ... // all versions in random order

How do I prevent this from happening on the master db? Either by preventing Lucene from indexing old versions of items, or by doing some manipulation of the result set to get the latest version from the SubResults?

As an aside, why does Lucene bother to index old versions of items anyway? Surely this is pointless for searching content on your website as the old versions are not visible?

Answer 1:

You can implement a custom crawler that overrides the following:

public class IndexCrawler : DatabaseCrawler
{
    protected override void IndexVersion(Item item, Item latestVersion, Sitecore.Search.IndexUpdateContext context)
    {
        if (item.Versions.Count > 0 && item.Version.Number != latestVersion.Version.Number)
            return;

        base.IndexVersion(item, latestVersion, context);
    }
}

This ensures that only the latest version of an item gets into your index, so it is the only version that can be pulled out of that index.

You will also need to update your configuration file so that the crawler element under locations (search_content_home in the question) uses this class as its type instead of DatabaseCrawler.
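
As a rough sketch, the crawler entry from the question might then look like the fragment below. The namespace MyProject.Search and the assembly name MyProject are placeholders assumed for illustration; substitute wherever your IndexCrawler class actually lives.

```xml
<locations hint="list:AddCrawler">
    <!-- MyProject.Search.IndexCrawler, MyProject is a hypothetical type name -->
    <search_content_home type="MyProject.Search.IndexCrawler, MyProject">
        <Database>master</Database>
        <Root>/sitecore/content/usa home</Root>
        <Tags>home content</Tags>
    </search_content_home>
</locations>
```

The index would need to be rebuilt afterwards so that entries for old versions already in it are dropped.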



Answer 2:

In Sitecore 7 a field named _latestversion was added to the index. It contains '1' for the latest version; other versions have an empty value.



Answer 3:

If you let Lucene search in your Web database instead of the Master, it should only index the last published version.

<Database>web</Database>


Answer 4:

Although the solution provided by theyetiman, by using an adjusted sort mechanism, is an interesting approach, it does not provide a perfect solution when the Lucene result scores for the two versions tend to differ. E.g. out of v1 with score 0.7, and v2 with score 0.5, his solution will still return the first version of the item. (At least in my tests.)

After some more digging, the most obvious solution apparently lies in implementing your own Sitecore.Pipelines.Search.SearchSystemIndex and using that one instead of the default. If you decompile that code using ILSpy or similar, you will notice the following at the bottom of the Process method:

foreach (SearchResult current in searchHits.FetchResults(0, searchHits.Length)){
  // ...
}

Each such SearchResult is actually a grouping, in which the first result returned from Lucene (thus the one with the highest score) is the main result. Hits on other versions (and also other languages) of the same item are accessible through the Subresults property of each instance, which is null when there are none.

Depending on your requirements, you can adjust this part of the class to fit your needs.
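
One possible adjustment, sketched here independently of the Sitecore types, is to pick the highest-numbered version out of the main result and its subresults. PickLatest and the versionOf selector are hypothetical names; with Sitecore you would pass the SearchResult, its subresults, and a function that reads the version number from each result. Since subresults can also contain other languages, this assumes you have already filtered them to the language you care about.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the Sitecore API.
public static class ResultGrouping
{
    // Returns whichever of main and subResults has the highest version number.
    // subResults may be null, mirroring the "null when there are none" case.
    public static T PickLatest<T>(T main, IEnumerable<T> subResults, Func<T, int> versionOf)
    {
        T best = main;
        if (subResults == null)
        {
            return best;
        }

        foreach (T candidate in subResults)
        {
            if (versionOf(candidate) > versionOf(best))
            {
                best = candidate;
            }
        }

        return best;
    }
}
```

Because the comparison uses the version number rather than the Lucene score, the outcome does not depend on which version happened to score highest.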



Answer 5:

Whilst I haven't figured out the exact answer (to stop Lucene indexing old versions on the master db) I have come up with an acceptable work-around...

When Lucene returns its results from the index, each hit has a field called "_id" which is formatted something like this (3 versions of the same item, where the last number is the version):

"CCB75380-4E9A-4921-99EC-65E532E330FF%en%1"
"CCB75380-4E9A-4921-99EC-65E532E330FF%en%2"
"CCB75380-4E9A-4921-99EC-65E532E330FF%en%3"
...
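
The format above can also be taken apart in code. The following is only a sketch, assuming every "_id" value has the item ID, language, and version separated by '%' exactly as in the three samples; VersionedId and its methods are hypothetical helpers, not part of Sitecore or Lucene.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical helper for "_id" values shaped like "GUID%language%version".
public static class VersionedId
{
    // Returns the "GUID%language" part, which identifies an item in one language.
    public static string ItemKeyOf(string id)
    {
        int cut = id.LastIndexOf('%');
        if (cut < 0) throw new FormatException("Unexpected _id value: " + id);
        return id.Substring(0, cut);
    }

    // Returns the trailing version number as an integer.
    public static int VersionOf(string id)
    {
        int cut = id.LastIndexOf('%');
        if (cut < 0) throw new FormatException("Unexpected _id value: " + id);
        return int.Parse(id.Substring(cut + 1), CultureInfo.InvariantCulture);
    }

    // Keeps one _id per item and language: the one with the highest version.
    public static List<string> LatestOnly(IEnumerable<string> ids)
    {
        return ids
            .GroupBy(id => ItemKeyOf(id))
            .Select(group => group.OrderByDescending(id => VersionOf(id)).First())
            .ToList();
    }
}
```

Parsing the version as a number means version 10 ranks above version 9, which a plain string comparison would not guarantee.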

I'm currently sorting by Sort.RELEVANCE which is the default. This is fine if we only had one version of an item in the index, but with several almost identical versions, they all have the same relevance score and Lucene just churns them out in any order. Sitecore then takes the first instance of the item version (even if it's old).

The solution is to specify a secondary sort field. In the searchContext.Search() method, you can pass a custom Sort object.

searchContext.Search(query, new Sort(...));

By sorting by Lucene's built-in Sort.RELEVANCE first, and then by the _id field (descending) in the index, I can ensure that the first hit that Sitecore sees will be the latest version and not just a random one:

searchContext.Search(query, new Sort
                            (
                                new SortField[2] 
                                {
                                    SortField.FIELD_SCORE, // equivalent to Sort.RELEVANCE
                                    new SortField("_id",SortField.STRING, true) // sort by _id, descending
                                }
                            )
);

The SortField parameters are as follows:

SortField(string fieldName, int type, bool reverse)
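
One caveat worth checking: because the field is sorted as a string, the version suffix is compared character by character, not numerically. The standalone sketch below mimics that ordering with an ordinal string comparison (an assumption about how the index orders the values; it does not call Lucene) and shows what happens once an item reaches ten or more versions. IdSortDemo is a hypothetical name.

```csharp
using System;
using System.Linq;

// Hypothetical demo of descending string order on "_id"-style values.
public static class IdSortDemo
{
    // Sorts the values in descending ordinal (character-by-character) order.
    public static string[] SortDescendingOrdinal(string[] ids)
    {
        return ids.OrderByDescending(id => id, StringComparer.Ordinal).ToArray();
    }
}
```

For the values "X%en%9", "X%en%10" and "X%en%2", this ordering puts "X%en%9" first and "X%en%10" last, so under that assumption the secondary sort would only pick the latest version for items with fewer than ten versions.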

This approach has fixed my problem, but if anyone can actually find out how to only index the latest version, please answer!