I've got millions of items ordered by a precomputed score. Each item has many boolean attributes. Let says that there is about ten thousand possible attributes totally, each item having dozen of them.
I'd like to be able to request in realtime (few milliseconds) the top n items given ~any combination of attributes.
What solution would you recommend? I am looking for something extremely scalable.
--
- We are currently looking at mongodb and array index, do you see any limitation ?
- SolR is a possible solution but we do not need text search capabilities.
This is exactly what Mongo can deal with. The fact that your attributes are boolean type helps here. A possible schema is listed below:
Then we can index on true_tags and false_tags. And it should be efficient to search with $in, $all, ... query operators.
Redis would be a perfect candidate for
Redis has a built in data structure that you can start from:
Sorted Set
=> every member of a Sorted Set is associated with score. Which for example can be ranked by score with ZRANGEBYSCORE:I encourage you to look at Sorted Set commands and get a feel for Redis, as your problem (as it is stated) asks for it. You may of course keep as many attributes as you'd like within a single Set element.
As far as MongoDB, since you mentioned millions, unless you can bent incremental queries to work for your problem, I would not expect a sub second response.
As @nickdos mentioned Solr Relevancy is a quite powerful feature, but the number of attributes will be a problem, since it would need to keep all this attributes in memory for each item. Although a dozen for each may not be that bad => just try and see.
Mongodb can handle what you want, if you stored your objects like this
Then the following query will match all the items that have att1 and attr2
but this won't match it
the query returns a cursor, if you want this cursor to be sorted, then just add the sort parameters to the query like so
Have a look at Advanced Queries to see what's possible.
Appropriate indexes can be setup as follows
And you can get performance information using
Mongo explains how many objects were scanned, how long the operation took and various other statistics.