I've got millions of items ordered by a precomputed score. Each item has many boolean attributes.
Let says that there is about ten thousand possible attributes totally, each item having dozen of them.
I'd like to be able to request in realtime (few milliseconds) the top n items given ~any combination of attributes.
What solution would you recommend? I am looking for something extremely scalable.
--
- We are currently looking at mongodb and array index, do you see any limitation ?
- SolR is a possible solution but we do not need text search capabilities.
Mongodb can handle what you want, if you stored your objects like this
{ score:2131, attributes: ["attr1", "attr2", "attr3"], ... }
Then the following query will match all the items that have att1 and attr2
c = db.mycol.find({ attributes: { $all: [ "attr1", "attr2" ] } })
but this won't match it
c = db.mycol.find({ attributes: { $all: [ "attr1", "attr4" ] } })
the query returns a cursor, if you want this cursor to be sorted, then just add the sort parameters to the query like so
c = db.mycol.find({ attributes: { $all: [ "attr1", "attr2" ] }}).sort({score:1})
Have a look at Advanced Queries to see what's possible.
Appropriate indexes can be setup as follows
db.mycol.ensureIndex({attributes:1, score:1})
And you can get performance information using
db.mycol.find({ attributes: { $all: [ "attr1" ] }}).explain()
Mongo explains how many objects were scanned, how long the operation took
and various other statistics.
This is exactly what Mongo can deal with. The fact that your attributes are boolean type helps here. A possible schema is listed below:
[
{
true_tags:[attr1, attr2, attr3, ...],
false_tags: [attr4, attr5, attr6, ...]
},
]
Then we can index on true_tags and false_tags. And it should be efficient to search with $in, $all, ... query operators.
Redis would be a perfect candidate for
- "the top n items" for "millions of items ordered by score"
Redis has a built in data structure that you can start from: Sorted Set
=> every member of a Sorted Set is associated with score. Which for example can be ranked by score with ZRANGEBYSCORE:
ZRANGEBYSCORE key min max [WITHSCORES] [LIMIT offset count]
I encourage you to look at Sorted Set commands and get a feel for Redis, as your problem (as it is stated) asks for it. You may of course keep as many attributes as you'd like within a single Set element.
As far as MongoDB, since you mentioned millions, unless you can bent incremental queries to work for your problem, I would not expect a sub second response.
As @nickdos mentioned Solr Relevancy is a quite powerful feature, but the number of attributes will be a problem, since it would need to keep all this attributes in memory for each item. Although a dozen for each may not be that bad => just try and see.