I would like to get some feedback and suggestions regarding two approaches I'm considering to implementing searchable indexes using Redis sorted sets.
Situation and objective
We currently have some key-value tables we're storing in Cassandra, and which we would like to have indexes for. For example, one table would contain records of people, and the Cassandra table would have id as its primary key, and the serialized object as the value. The object would have fields such as first_name, last_name, last_updated, and others.
What we want is to be able to do searches such as "last_name = 'Smith' AND first_name > 'Joel'" , "last_name < 'Aaronson'" , "last_name = 'Smith' AND first_name = 'Winston'" and so on. The searches should yield the ids of matches so we can then retrieve the objects from Cassandra. I'm thinking the above searches could be done with a single index, sorted lexicographically by last_name, first_name, and last_updated. If we need some searches using a different order (e.g. "first_name = 'Zeus'") we can have a similar index that would allow those (e.g. first_name, last_updated).
We are looking at using Redis for this, because we need to be able to handle a large number of writes per minute. I've read up on some common ways Redis sorted sets are used, and come up with two possible implementations:
Option 1: a single sorted set per index
For our index by last_name, first_name, last_updated, we would have a sorted set in Redis under the key indexes:people:last_name:first_name:last_updated , which would contain strings with the format last_name:first_name:last_updated:id . For example:
smith:joel:1372761839.444:0azbjZRHTQ6U8enBw6BJBw
(For the separator I might use '::' rather than ':' or something else to work better with the lexicographic ordering, but let's ignore that for now)
The items would all be given score 0 so that the sorted set will just be sorted lexicographically by the strings themselves. If I then want to do a query like "last_name = 'smith' AND first_name < 'bob'", I would need to get all the items in the list that come before 'smith:bob'.
As far as I can tell, there are the following drawbacks to this approach:
- There is no Redis function to select a range based on the string value. This feature, called ZRANGEBYLEX, has been proposed by Salvatore Sanfilippo at https://github.com/antirez/redis/issues/324 , but is not implemented, so I would have to find the endpoints using binary searches and get the range myself (perhaps using Lua, or at the application-level with Python which is the language we're using to access Redis).
- If we want to include a time-to-live for index entries, it seems the simplest way to do it would be having a regularly scheduled task which goes through the whole index and removes expired items.
Option 2: small sorted sets, sorted by last_updated
This approach would be similar, except we would have many, smaller, sorted sets, with each having a time-like value such as last_updated for the scores. For example, for the same last_name, first_name, last_updated index, we would have a sorted set for each last_name, first_name combination. For example, the key might be indexes:people:last_name=smith:first_name=joel , and it would have an entry for each person we have called Joel Smith. Each entry would have as its name the id and its score the last_updated value. E.g.:
value: 0azbjZRHTQ6U8enBw6BJBw ; score: 1372761839.444
The main advantages to this are (a) searches where we know all the fields except last_updated would be very easy, and (b) implementing a time-to-live would be very easy, using the ZREMRANGEBYSCORE.
The drawback, which seems very large to me is:
- There seems to be a lot more complexity in managing and searching this way. For example, we would need the index to keep track of all its keys (in case, for example, we want to clean up at some point) and do this in a hierarchical manner. A search such as "last_name < 'smith'" would require first looking at the list of all the last names to find those which come before smith, then for each of those looking at all the first names it contains, then for each of those getting all the items from its sorted set. In other words, a lot of components to build up and worry about.
Wrapping up
So it seems to me the first option would be better, in spite of its drawbacks. I would very much appreciate any feedback regarding these two or other possible solutions (even if they're that we should use something other than Redis).