Redis filter by range, sort and return 10 first

2019-04-11 03:47发布

问题:

Assume we have a simple mysql table(user) with fields:

id
rating
salary

I want to get 10 users with highest rating and salary with specified range(50-100), i.e in mysql it would be

SELECT id from user WHERE salary>50 and salary<100 ORDER by rating limit 0, 10

This runs for 20ms on 100K users table.

Assume I have same in redis: Zlist rating (rating=>user_id) Zlist salary (salary=>user_id)

All solutions I saw with redis include copying 100k salary Zlist, removing unneeded entries, and merging with 100k rating list, like

zinterstore 1 search salary
zremrange search -inf 50
zremrange search 100 +inf
zinterstore 2 search rating weights 0 1
zrange search 0 10

which is absolutely slow(why copy 100k elements to remove most of them?).

Is there any way to implement this at least comparably efficient with redis?

回答1:

The use case you describe cannot be modeled elegantly in NoSQL solutions. It isn't a Redis limitation.

Let me explain that a bit more. You are running range queries on one field, and sorting on another. This isn't something NoSQL solutions are good at. For example, Google App Engine forbids such queries. Take a look at GAE Query Restrictions and read the section "Properties in Inequality Filters Must Be Sorted before Other Sort Orders"

To get all results that match an inequality filter, a query scans the index table for the first matching row, then returns all consecutive results until it finds a row that doesn't match. For the consecutive rows to represent the complete result set, the rows must be ordered by the inequality filter before other sort orders.

Having said that, you can still efficiently run your queries, but the solution isn't going to be elegant.

  1. Create salary ranges - 0-5000, 5000-10000, 10000-15000 and so on
  2. Create sets like users_with_salary:10000-15000. This set will contain user ids who have salary in the given range.
  3. Similarly, create sets like `users_with_rating:1-2". This set will contain user ids who have ratings in the given range
  4. Now, run the following pseudo code

String userids[];
for(rating = 10; rating > 0; rating--) {
  for(salary = min_salary; salary < max_salary; salary += 5000) {
      String salary_key = "users_with_salary:" + salary + "-" + (salary+5000);
      String rating_key = "users_with_rating:" + rating + "-" + (rating+1);

      userids.append(redis.sinter(salary_key, rating_key));

      if(userids.length > 10) {
         break;
      }
   }
}

With redis 2.6 and lua scripting, you can even run this on the lua server.

In conclusion, if you want to run complex queries on your data, it is best to model it in a relational database.



回答2:

With scripting you could use "ZRANGEBYSCORE salary 50 100" to get the users where salary is between 50 and 100 and store the result into a tmp set. Assuming you store a user's rating in a hash at key "user:[id]", you could then do "SORT tmp BY user:*->rating LIMIT 0 10".

Unfortunately you can't currently SORT BY the score associated with an entry in a zset so you would need to store your rating values either only or additionally in a separate hash to use this method.

Of course, you could also use "ZINTERSTORE tmp2 2 rating tmp WEIGHTS 1 0" and then "ZRANGE tmp2 0 10" but that would be much less efficient than using SORT because it would require the overhead of sorting all of tmp2 (as it is being created) whereas SORT with LIMIT uses the partial quicksort algorithm which effectively only sorts the 10 results actually returned. You might want to keep around tmp2 so you can quickly return other users in the range though in which case storing a temporary zset of users with salary between 50 and 100 ranked by rating might make sense.

I think the SORT method I describe is actually algorithmically as good as an SQL database could achieve. Once you use an index to filter by a range on one field, I know of no way that an index on another field could be used to improve the efficiency of sorting that small result-set. I believe an SQL database would simply use partial quicksort or equivalent to sort only the results returned.