Excluding items selectively from Sitecore's Lu

Posted 2019-04-26 14:21

Question:

On a site powered by Sitecore 6.2, I need to give the user the ability to selectively exclude items from search results.

To accomplish this, I have added a checkbox field entitled "Include in Search Results", and I created a custom database crawler to check that field's value:

~\App_Config\Include\Search Indexes\Website.config:

<search>
  <configuration type="Sitecore.Search.SearchConfiguration, Sitecore.Kernel" singleInstance="true">
    <indexes hint="list:AddIndex">
      <index id="website" singleInstance="true" type="Sitecore.Search.Index, Sitecore.Kernel">
        ...

        <locations hint="list:AddCrawler">
          <master type="MyProject.Lib.Search.Indexing.CustomCrawler, MyProject">
            ...
          </master>

          <!-- Similar entry for web database. -->
        </locations>
      </index>
    </indexes>
  </configuration>
</search>

~\Lib\Search\Indexing\CustomCrawler.cs:

using Lucene.Net.Documents;
using Sitecore.Search.Crawlers;
using Sitecore.Data.Items;

namespace MyProject.Lib.Search.Indexing
{
  public class CustomCrawler : DatabaseCrawler
  {
    /// <summary>
    ///   Determines if the item should be included in the index.
    /// </summary>
    /// <param name="item"></param>
    /// <returns></returns>
    protected override bool IsMatch(Item item)
    {
      if (item["include in search results"] != "1")
      {
        return false;
      }

      return base.IsMatch(item);
    }
  }
}

Interestingly, if I rebuild the index using the Index Viewer application, everything behaves as expected: items whose "Include in Search Results" checkbox is not checked are left out of the search index.

However, when I use the search index rebuilder in the Sitecore Control Panel application or when the IndexingManager auto-updates the search index, all items are included, regardless of the state of their "Include in Search Results" checkbox.

I've also set numerous breakpoints in my custom crawler class, and the application never hits any of them when I rebuild the search index using the built-in indexer. When I use Index Viewer, it does hit all the breakpoints I've set.

How do I get Sitecore's built-in indexing processes to respect my "Include in Search Results" checkbox?

Answer 1:

I spoke with Alex Shyba yesterday, and we were able to figure out what was going on. There were a couple of problems with my configuration that were preventing everything from working correctly:

  • As Seth noted, there are two distinct search APIs in Sitecore. My configuration file was using both of them. To use the newer API, only the sitecore/search/configuration section needs to be set up. In addition to what I posted in my OP, I was also adding indexes in sitecore/indexes and sitecore/databases/database/indexes, which is not correct.

  • Instead of overriding IsMatch(), I should have been overriding AddItem(). Because of the way Lucene works, you can't update a document in place; instead, you have to first delete it and then add the updated version.

    When Sitecore.Search.Crawlers.DatabaseCrawler.UpdateItem() runs, it checks IsMatch() to see if it should delete and re-add the item. If IsMatch() returns false, neither step happens, so the item's existing documents stay in the index even though the item shouldn't be there in the first place.

    By overriding AddItem(), I was able to instruct the crawler whether the item should be added to the index after its existing documents had already been removed. Here is what the updated class looks like:

    ~\Lib\Search\Indexing\CustomCrawler.cs:

    using Sitecore.Data.Items;
    using Sitecore.Search;
    using Sitecore.Search.Crawlers;
    
    namespace MyProject.Lib.Search.Indexing
    {
      public class CustomCrawler : DatabaseCrawler
      {
        protected override void AddItem(Item item, IndexUpdateContext context)
        {
          if (item["include in search results"] == "1")
          {
            base.AddItem(item, context);
          }
        }
      }
    }
    

Alex also pointed out that some of my scalability settings were incorrect. Specifically:

  • The InstanceName setting was empty, which can cause problems on ephemeral (cloud) instances where the machine name might change between executions. We changed this setting on each instance to have a constant and distinct value (e.g., CMS and CD).

  • The Indexing.ServerSpecificProperties setting needs to be true so that each instance maintains its own record of when it last updated its search index.

  • The EnableEventQueues setting needs to be true to prevent race conditions between the search indexing and cache flush processes.

  • When in development, the Indexing.UpdateInterval should be set to a relatively small value (e.g., 00:00:15). This is not great for production environments, but it cuts down on the amount of waiting you have to do when troubleshooting search indexing problems.

  • Make sure the history engine is turned on for each web database, including remote publishing targets:

    <database id="production">
      <Engines.HistoryEngine.Storage>
        <obj type="Sitecore.Data.$(database).$(database)HistoryStorage, Sitecore.Kernel">
          <param connectionStringName="$(id)" />
          <EntryLifeTime>30.00:00:00</EntryLifeTime>
        </obj>
      </Engines.HistoryEngine.Storage>
      <Engines.HistoryEngine.SaveDotNetCallStack>false</Engines.HistoryEngine.SaveDotNetCallStack>
    </database>
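
As a rough sketch of the settings listed above, the relevant entries might look like this on the CMS instance. This assumes they sit in the <settings> element under <sitecore> in web.config; the values are illustrative, and each instance would use its own InstanceName (e.g., CD on the delivery server):

```xml
<settings>
  <!-- Constant, distinct name per instance (assumed value: CMS here, CD on the delivery instance). -->
  <setting name="InstanceName" value="CMS" />

  <!-- Each instance keeps its own record of when it last updated its search index. -->
  <setting name="Indexing.ServerSpecificProperties" value="true" />

  <!-- Avoids race conditions between search indexing and cache flushes. -->
  <setting name="EnableEventQueues" value="true" />

  <!-- Short interval for development only; use a larger value in production. -->
  <setting name="Indexing.UpdateInterval" value="00:00:15" />
</settings>
```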
    

Because CD instances have no access to the Sitecore backend, I also installed RebuildDatabaseCrawlers.aspx (from this article) so that I can rebuild their search indexes manually.



Answer 2:

I think I've figured out a halfway solution.

Here's an interesting snippet from Sitecore.Shell.Applications.Search.RebuildSearchIndex.RebuildSearchIndexForm.Builder.Build(), which is invoked by the search index rebuilder in the Control Panel application:

for (int i = 0; i < database.Indexes.Count; i++)
{
  database.Indexes[i].Rebuild(database);
  ...
}

database.Indexes contains a set of Sitecore.Data.Indexing.Index objects, which do not use a database crawler to rebuild the index!

In other words, the built-in search indexer rebuilds the search index with a completely different class, one that ignores the search configuration settings in web.config entirely.

To work around this, I changed the following files:

~\App_Config\Include\Search Indexes\Website.config:

<indexes>
  <index id="website" ... type="MyProject.Lib.Search.Indexing.CustomIndex, MyProject">
    ...
  </index>

  ...
</indexes>

~\Lib\Search\Indexing\CustomIndex.cs:

using Sitecore.Data;
using Sitecore.Data.Indexing;
using Sitecore.Diagnostics;

namespace MyProject.Lib.Search.Indexing
{
  public class CustomIndex : Index
  {
    public CustomIndex(string name)
      : base(name)
    {
    }

    public override void Rebuild(Database database)
    {
      Sitecore.Search.Index index = Sitecore.Search.SearchManager.GetIndex(Name);
      if (index != null)
      {
        index.Rebuild();
      }
    }
  }
}

The only caveat to this method is that it will rebuild the index for every database, not just the selected one (which I'm guessing is why Sitecore has two completely separate methods for rebuilding indexes).



Answer 3:

I believe Sitecore 6.2 uses both the old and the newer search APIs, hence the differences in how the index gets built. CMS 6.5 (soon to be released) uses only the newer API, i.e., Sitecore.Search.