Suppose I have documents in this format:
product_name TEXT tags TAG score NUMERIC
[product1, [tag1, tag2, tag3], 10]
[product2, [tag2, tag3, tag4], 100]
....
I want a query to return the tags in the order of the highest sum of product score and also the top 5 of the products for each tag:
[tag3, 110, [product2, product 1]]
[tag2, 110, [product2, product 1]]
[tag4, 100, [product2]]
[tag1, 10, [product 1]]
What I have so far is storing each product/tag key separately (repeated for each tag) so for each product we have one separate doc for each tag and the id is combination of product name and tag: product_name TEXT tag TAG score NUMERIC
.
Now I can run an aggregate query to get the list of the top tags:
FT.AGGREGATE product_tags *
GROUP BY 1 @TAG
REDUCE SUM 1 @score as total_score
SORT BY 2 @total_score DESC
This will give me the top tags in order but if I want to get top 5 products for each tag I found there is only REDUCE TOLIST 1 @product_name
which will return all the products not sorted and there is REDUCE FIRST_VALUE 4 @product_name BY @score DESC
which will return only the first top product.
Is there any way to get let's say 5 top products for each tag in one query. If not is it possible to change the document storage format (or add additional one) in a way to make this kind of query possible or with as little queries as possible?
Shouldn't matter but I am using python Redisearch client.
First:
- Make sure to disable features you won't use (
NOOFFSETS
, NOHL
,
NOFREQS
, STOPWORDS 0
)
- Use
SORTABLE
for your NUMERIC
score
.
Here is the schema I used to test:
FT.CREATE product_tags NOOFFSETS NOHL NOFREQS STOPWORDS 0
SCHEMA product_name TEXT tags TAG score NUMERIC SORTABLE
You want to think of FT.AGGREGATE
as a pipeline.
The first step will be to sort the products by @score, so that later, down in the pipeline, when we REDUCE TOLIST 1 @product_name
, the list comes out sorted:
SORTBY 2 @score DESC
I think you are already doing LOAD
/APPLY
to deal with the tags, as TAG
fields would otherwise be grouped by the full comma-separated string tags-list, per product. See Allow GROUPBY on tag fields issue. So our next step is in the pipeline is:
LOAD 1 @tags
APPLY split(@tags) as TAG
We then group by @TAG, and apply the two reductions. Our products list will come out sorted.
GROUPBY 1 @TAG
REDUCE SUM 1 @score AS total_score
REDUCE TOLIST 1 @product_name AS products
Finally, we sort by @total_score
:
SORTBY 2 @total_score DESC
Here a final view of the command:
FT.AGGREGATE product_tags *
SORTBY 2 @score DESC
LOAD 1 @tags
APPLY split(@tags) as TAG
GROUPBY 1 @TAG
REDUCE SUM 1 @score AS total_score
REDUCE TOLIST 1 @product_name AS products
SORTBY 2 @total_score DESC
Here a full list of commands to illustrate the result. I used productXX
with score XX
to easily verify visually the sorting of products.
> FT.CREATE product_tags NOOFFSETS NOHL NOFREQS STOPWORDS 0 SCHEMA product_name TEXT tags TAG score NUMERIC SORTABLE
OK
> FT.ADD product_tags pt:product10 1 FIELDS product_name product10 tags tag2,tag3,tag4 score 10
OK
> FT.ADD product_tags pt:product1 1 FIELDS product_name product1 tags tag1,tag2,tag3 score 1
OK
> FT.ADD product_tags pt:product100 1 FIELDS product_name product100 tags tag2,tag3 score 100
OK
> FT.ADD product_tags pt:product5 1 FIELDS product_name product5 tags tag1,tag4 score 5
OK
> FT.SEARCH product_tags *
1) (integer) 4
2) "pt:product5"
3) 1) "product_name"
2) "product5"
3) "tags"
4) "tag1,tag4"
5) "score"
6) "5"
4) "pt:product100"
5) 1) "product_name"
2) "product100"
3) "tags"
4) "tag2,tag3"
5) "score"
6) "100"
6) "pt:product1"
7) 1) "product_name"
2) "product1"
3) "tags"
4) "tag1,tag2,tag3"
5) "score"
6) "1"
8) "pt:product10"
9) 1) "product_name"
2) "product10"
3) "tags"
4) "tag2,tag3,tag4"
5) "score"
6) "10"
> FT.AGGREGATE product_tags * SORTBY 2 @score DESC LOAD 1 @tags APPLY split(@tags) as TAG GROUPBY 1 @TAG REDUCE SUM 1 @score AS total_score REDUCE TOLIST 1 @product_name AS products SORTBY 2 @total_score DESC
1) (integer) 4
2) 1) "TAG"
2) "tag2"
3) "total_score"
4) "111"
5) "products"
6) 1) "product100"
2) "product10"
3) "product1"
3) 1) "TAG"
2) "tag3"
3) "total_score"
4) "111"
5) "products"
6) 1) "product100"
2) "product10"
3) "product1"
4) 1) "TAG"
2) "tag4"
3) "total_score"
4) "15"
5) "products"
6) 1) "product10"
2) "product5"
5) 1) "TAG"
2) "tag1"
3) "total_score"
4) "6"
5) "products"
6) 1) "product5"
2) "product1"
You are getting the full list of products sorted, not just the top 5. Complexity-wise it makes no difference, we paid the price. The impact is in buffering, network payload, and your client.
You can limit to top 5 using a Lua script:
eval "local arr = redis.call('FT.AGGREGATE', KEYS[1], '*', 'SORTBY', '2', '@score', 'DESC', 'LOAD', '1', '@tags', 'APPLY', 'split(@tags)', 'as', 'TAG', 'GROUPBY', '1', '@TAG', 'REDUCE', 'SUM', '1', '@score', 'AS', 'total_score', 'REDUCE', 'TOLIST', '1', '@product_name', 'AS', 'products', 'SORTBY', '2', '@total_score', 'DESC') \n for i=2,(arr[1]+1) do \n arr[i][6] = {unpack(arr[i][6], 1, ARGV[1])} \n end \n return arr" 1 product_tags 5
Here a friendly view of the Lua script above:
local arr = redis.call('FT.AGGREGATE', KEYS[1], ..., 'DESC')
for i=2,(arr[1]+1) do
arr[i][6] = {unpack(arr[i][6], 1, ARGV[1])}
end
return arr
We are passing one key (the index) and one argument (the limit for top products, 5 in your case): 1 product_tags 3
.
With this, we limited the impact to buffering only, saved network payload and load on your client.