This is a continuation of the project outlined in this question.
I have the following model:
class Product {
public string Id { get; set; }
public string[] Specs { get; set; }
public int CategoryId { get; set; }
}
The "Specs" array stores product specification name value pairs joined by a special character. For example if a product is colored blue the spec string would be "Color~Blue". Representing specs in this way allows querying for products having multiple spec values specified by a query. There are two principal queries that I would like to support:
- Get all products in a given category.
- Get all products in a given category which have a set of specified specs.
This works well with RavenDB. However, in addition to the products satisfying a given query I would like to return a result set which contains all spec name-value pairs for the set of products specified by the query. The spec name-value pairs should be grouped by the name and value of the spec and contain a count of products which have a given spec name-value pair. For query #1 I created the following map reduce index:
class CategorySpecGroups {
public int CategoryId { get; set; }
public string Spec { get; set; }
public int Count { get; set; }
}
public class SpecGroups_ByCategoryId : AbstractIndexCreationTask<Product, CategorySpecGroups>
{
public SpecGroups_ByCategoryId()
{
this.Map = products => from product in products
where product.Specs != null
from spec in product.Specs
select new
{
CategoryId = product.CategoryId,
Spec = spec,
Count = 1
};
this.Reduce = results => from result in results
group result by new { result.CategoryId, result.Spec } into g
select new
{
CategoryId = g.Key.CategoryId,
Spec = g.Key.Spec,
Count = g.Sum(x => x.Count)
};
}
}
I can then query this index and get all spec name-value pairs in a given category. The problem I am running into is to get the same result set but for a query which filters both by a category and a set of spec name-value pairs. When using SQL this result set would be obtained by doing a group by over a set of products filtered by category and specs. In general, this type of query is expensive but when filtering by both category and specs the product sets are normally small, though not small enough to fit into a single page - they may contain up to 1000 products. For reference, MongoDB supports a group method which can be used to achieve the same result set. This performs the ad hoc grouping server side and the performance is acceptable.
How can I get this type of result set using RavenDB?
One possible solution is to get all the products for a query and perform the grouping in memory and another option is to create a mapreduce index as above, though the challenge with this would be deducing all possible spec selections that can be made for a given category and additionally, this type of index might explode in size.
For an example, take a look at this fastener category page. The user can filter their selection by selecting attributes. When an attribute is selected it narrows the selection of products and displays the attributes within the new set of products. This type of interaction is typically called faceted search.
EDIT
In the meantime, I will be attempting a solution using Solr as they support faceted search out of the box.
EDIT 2
It appears that RavenDB also supports faceted search (which of course makes sense, indexes are stored by Lucene just like Solr). I will be exploring this and post updates.
EDIT 3
The RavenDB faceted search functionality works as expected. I store a facet setup document for each category ID which is used to calculate facets for a query within a given category. The issue I am having now is performance. For a collection of 500k products with 4500 distinct categories resulting in 4500 facet setup documents a query by category id takes about 16 seconds when also querying for facets and about 0.05 seconds when not querying for facets. The particular category tested contains about 6k products, 23 distinct facets and 2k distinct facet name-range combinations. After looking at the code in FacetedQueryRunner it appears a facets query will result in a Lucene query for every facet name-value combination to get the counts, as a well as a query for each facet name to get the terms. One problem with the implementation is that it will retrieve all the distinct terms for a given facet name regardless of the query, which in most cases will significantly reduce the number of terms for a facet and therefore reduce the number of Lucene queries. One way to improve performance here would be to store a MapReduce computed result set (as shown above) for each facet setup document which could then be queried to get all the distinct terms when further filtering by facets. The overall performance however may still be too slow.