On a site powered by Sitecore 6.2, I need to give the user the ability to selectively exclude items from search results.
To accomplish this, I have added a checkbox field entitled "Include in Search Results", and I created a custom database crawler to check that field's value:
~\App_Config\Include\Search Indexes\Website.config:
<search>
  <configuration type="Sitecore.Search.SearchConfiguration, Sitecore.Kernel" singleInstance="true">
    <indexes hint="list:AddIndex">
      <index id="website" singleInstance="true" type="Sitecore.Search.Index, Sitecore.Kernel">
        ...
        <locations hint="list:AddCrawler">
          <master type="MyProject.Lib.Search.Indexing.CustomCrawler, MyProject">
            ...
          </master>
          <!-- Similar entry for web database. -->
        </locations>
      </index>
    </indexes>
  </configuration>
</search>
~\Lib\Search\Indexing\CustomCrawler.cs:
using Lucene.Net.Documents;
using Sitecore.Search.Crawlers;
using Sitecore.Data.Items;

namespace MyProject.Lib.Search.Indexing
{
    public class CustomCrawler : DatabaseCrawler
    {
        /// <summary>
        /// Determines if the item should be included in the index.
        /// </summary>
        /// <param name="item"></param>
        /// <returns></returns>
        protected override bool IsMatch(Item item)
        {
            if (item["include in search results"] != "1")
            {
                return false;
            }
            return base.IsMatch(item);
        }
    }
}
What's interesting is that if I rebuild the index using the Index Viewer application, everything behaves as expected: items whose "Include in Search Results" checkbox is not checked are left out of the search index.
However, when I use the search index rebuilder in the Sitecore Control Panel application or when the IndexingManager auto-updates the search index, all items are included, regardless of the state of their "Include in Search Results" checkbox.
I've also set numerous breakpoints in my custom crawler class, and the application never hits any of them when I rebuild the search index using the built-in indexer. When I use Index Viewer, it does hit all the breakpoints I've set.
How do I get Sitecore's built-in indexing processes to respect my "Include in Search Results" checkbox?
I spoke with Alex Shyba yesterday, and we were able to figure out what was going on. There were a couple of problems with my configuration that were preventing everything from working correctly:
- As Seth noted, there are two distinct search APIs in Sitecore. My configuration file was using both of them. To use the newer API, only the sitecore/search/configuration section needs to be set up. (In addition to what I posted in my OP, I was also adding indexes in sitecore/indexes and sitecore/databases/database/indexes, which is not correct.)
- Instead of overriding IsMatch(), I should have been overriding AddItem(). Because of the way Lucene works, you can't update a document in place; instead, you have to first delete it and then add the updated version. When Sitecore.Search.Crawlers.DatabaseCrawler.UpdateItem() runs, it checks IsMatch() to see if it should delete and re-add the item. If IsMatch() returns false, the item won't be removed from the index even if it shouldn't be there in the first place. By overriding AddItem(), I was able to instruct the crawler whether the item should be added to the index after its existing documents had already been removed.
Here is what the updated class looks like:
~\Lib\Search\Indexing\CustomCrawler.cs:
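The listing itself is not reproduced here. As a minimal sketch of the kind of override described, assuming DatabaseCrawler exposes AddItem(Item) as a protected virtual method (which the description above implies) and reusing the field name from the question, it might look something like this:

```csharp
using Sitecore.Data.Items;
using Sitecore.Search.Crawlers;

namespace MyProject.Lib.Search.Indexing
{
    public class CustomCrawler : DatabaseCrawler
    {
        // Assumption: AddItem(Item) is protected virtual on DatabaseCrawler.
        // By the time this runs during an update, the item's existing
        // documents have already been removed from the index.
        protected override void AddItem(Item item)
        {
            // Only (re-)add the item when the checkbox is checked.
            if (item["include in search results"] == "1")
            {
                base.AddItem(item);
            }
        }
    }
}
```

With this approach IsMatch() is left alone, so an item whose checkbox gets unchecked is still deleted from the index on update and simply never added back.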
Alex also pointed out that some of my scalability settings were incorrect. Specifically:
- The InstanceName setting was empty, which can cause problems on ephemeral (cloud) instances where the machine name might change between executions. We changed this setting on each instance to have a constant and distinct value (e.g., CMS and CD).
- The Indexing.ServerSpecificProperties setting needs to be true so that each instance maintains its own record of when it last updated its search index.
- The EnableEventQueues setting needs to be true to prevent race conditions between the search indexing and cache flush processes.
- When in development, the Indexing.UpdateInterval should be set to a relatively small value (e.g., 00:00:15). This is not great for production environments, but it cuts down on the amount of waiting you have to do when troubleshooting search indexing problems.
Make sure the history engine is turned on for each web database, including remote publishing targets.
To manually rebuild the search indexes on CD instances, since there is no access to the Sitecore backend, I also installed RebuildDatabaseCrawlers.aspx (from this article).
I think I've figured out a halfway solution.
Here's an interesting snippet from Sitecore.Shell.Applications.Search.RebuildSearchIndex.RebuildSearchIndexForm.Builder.Build(), which is invoked by the search index rebuilder in the Control Panel application:
database.Indexes contains a set of Sitecore.Data.Indexing.Index, which do not use a database crawler to rebuild the index!
In other words, the built-in search indexer uses a completely different class when rebuilding the search index that ignores the search configuration settings in web.config entirely.
To work around this, I changed the following files:
~\App_Config\Include\Search Indexes\Website.config:
~\Lib\Search\Indexing\CustomIndex.cs:
The only caveat to this method is that it will rebuild the index for every database, not just the selected one (which I'm guessing is why Sitecore has two completely separate methods for rebuilding indexes).
Sitecore 6.2 uses both the old and the newer search API, which I believe explains the differences in how the index gets built. CMS 6.5 (soon to be released) uses only the newer API, i.e., Sitecore.Search.