How to filter a huge list of ids from Solr at runt

2019-08-17 05:03发布

问题:

I have an index for products is Solr. I need to serve a customized list of products for each customer such that I have to exclude some specific products for each customer. Currently I am storing this relationship of customer & excluded products in a SQL database and then filtering them in Solr using a terms query. Is there a way I can store this relationship in Solr itself so that I dont have to calculate the exclude list every time from SQL first.

Something very similar to what we can do in elasticsearch using https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-terms-query.html

Possible ways I could think of doing in Solr:

  1. Keeping a list of customers in the products index itself, and filter on that. But this will really be a pain if I have to reindex all the documents. Also the list can be huge.

  2. Another way that I could think of is maintaining a separate core for keeping documents per customer and excluded product_id and perform a join using {!join} to filter out products for a customer. Is it a scalable solution.

What should be the ideal approach for storing such kind of data in Solr.

回答1:

Are there any performance issues with the SQL DB? It is perfectly fine to query the DB and get the IDs, and send them to Solr. You would avoid complexity and data duplication. You would anyway have to do some computation to send those IDs to Solr as well.

But to answer your question, yes, you could store the excluded product IDs per customer indeed in a separate index. You would be using a multi-valued field and update using atomic updates. If you do that, make sure to keep the indexing schema simple with no analyzer used for the IDs (just use the string type without any tokenizer or filter).

You do not need to do a Solr join query. You only have to lookup the product IDs per customer (1st query) and massage them as CSV, and do the terms query with the product IDs retrieved from the index (2nd query).



回答2:

You need to find the best compromise for you

Best Query Time Performances You add a field (multi valued) to the product index : allowed_users ( or forbidden_users) depending on the cardinality ( that you want to minimise). This would require a re-indexing for the first time and an index update for each User permission change. In order to reduce the network traffic and optimise the updates you could take a look to atomic updates[1] .

Best Index Time Performances If the previous approach is not feasible in your case or doesn't satisfy you, you could try to optimise the indexing side. You can index a document in a separate collection :

<Id>
<product_id>
<user_id>

You can use the query time join to filter the collection for the current user and then get back the products to filter them on your query.

So basically, you already thought about both the ideas :)

[1] https://lucene.apache.org/solr/guide/6_6/updating-parts-of-documents.html