I would like to get some feedback and suggestions regarding two approaches I'm considering to implementing searchable indexes using Redis sorted sets.
Situation and objective
We currently have some key-value tables we're storing in Cassandra, and which we would like to have indexes for. For example, one table would contain records of people, and the Cassandra table would have id as its primary key, and the serialized object as the value. The object would have fields such as first_name, last_name, last_updated, and others.
What we want is to be able to do searches such as "last_name = 'Smith' AND first_name > 'Joel'" , "last_name < 'Aaronson'" , "last_name = 'Smith' AND first_name = 'Winston'" and so on. The searches should yield the ids of matches so we can then retrieve the objects from Cassandra. I'm thinking the above searches could be done with a single index, sorted lexicographically by last_name, first_name, and last_updated. If we need some searches using a different order (e.g. "first_name = 'Zeus'") we can have a similar index that would allow those (e.g. first_name, last_updated).
We are looking at using Redis for this, because we need to be able to handle a large number of writes per minute. I've read up on some common ways Redis sorted sets are used, and come up with two possible implementations:
Option 1: a single sorted set per index
For our index by last_name, first_name, last_updated, we would have a sorted set in Redis under the key indexes:people:last_name:first_name:last_updated , which would contain strings with the format last_name:first_name:last_updated:id . For example:
smith:joel:1372761839.444:0azbjZRHTQ6U8enBw6BJBw
(For the separator I might use '::' rather than ':' or something else to work better with the lexicographic ordering, but let's ignore that for now)
The items would all be given score 0 so that the sorted set will just be sorted lexicographically by the strings themselves. If I then want to do a query like "last_name = 'smith' AND first_name < 'bob'", I would need to get all the items in the list that come before 'smith:bob'.
As far as I can tell, there are the following drawbacks to this approach:
- There is no Redis function to select a range based on the string value. This feature, called ZRANGEBYLEX, has been proposed by Salvatore Sanfilippo at https://github.com/antirez/redis/issues/324 , but is not implemented, so I would have to find the endpoints using binary searches and get the range myself (perhaps using Lua, or at the application-level with Python which is the language we're using to access Redis).
- If we want to include a time-to-live for index entries, it seems the simplest way to do it would be having a regularly scheduled task which goes through the whole index and removes expired items.
Option 2: small sorted sets, sorted by last_updated
This approach would be similar, except we would have many, smaller, sorted sets, with each having a time-like value such as last_updated for the scores. For example, for the same last_name, first_name, last_updated index, we would have a sorted set for each last_name, first_name combination. For example, the key might be indexes:people:last_name=smith:first_name=joel , and it would have an entry for each person we have called Joel Smith. Each entry would have as its name the id and its score the last_updated value. E.g.:
value: 0azbjZRHTQ6U8enBw6BJBw ; score: 1372761839.444
The main advantages to this are (a) searches where we know all the fields except last_updated would be very easy, and (b) implementing a time-to-live would be very easy, using the ZREMRANGEBYSCORE.
The drawback, which seems very large to me is:
- There seems to be a lot more complexity in managing and searching this way. For example, we would need the index to keep track of all its keys (in case, for example, we want to clean up at some point) and do this in a hierarchical manner. A search such as "last_name < 'smith'" would require first looking at the list of all the last names to find those which come before smith, then for each of those looking at all the first names it contains, then for each of those getting all the items from its sorted set. In other words, a lot of components to build up and worry about.
Wrapping up
So it seems to me the first option would be better, in spite of its drawbacks. I would very much appreciate any feedback regarding these two or other possible solutions (even if they're that we should use something other than Redis).
I strongly discourage the use of Redis for this. You'll be storing a ton of extra pointer data, and if you ever decide you want to do more complicated queries like,
SELECT WHERE first_name LIKE 'jon%'
you're going to run into trouble. You'll also need to engineer extra, very big indexes that cross multiple columns, in case you want to search for two fields at the same time. You'll essentially need to keep hacking away and reengineering a search framework. You'd be much better off using Elastic Search or Solr, or any of the other frameworks already built to do what you're trying to do. Redis is awesome and has lots of good uses. This is not one of them.Warning aside, to answer your actual question: I think you'd be best served using a variant of your first solution. Use a single sorted set per index, but just convert your letters to numbers. Convert your letters to some decimal value. You can use the ASCII value, or just assign each letter to a 1-26 value in lexicographic order, assuming you're using English. Standardize, so that each letter takes up the same numeric length (so, if 26 is your biggest number, 1 would be written "01"). Then just append these together with a decimal point in front and use that as your score per index (i.e. "hat" would be ".080120"). This will let you have a properly ordered 1-to-1 mapping between words and these numbers. When you search, convert from letters to numbers, and then you'll be able to use all of Redis' nice sorted set functions like
ZRANGEBYSCORE
without needing to rewrite them. Redis' functions are written very, very optimally, so you're much better off using them whenever possible instead of writing your own.You can check redblade, it can maintenance index automatically for you and it's written by Node.JS.
You could use my project python-stdnet for that, it does all the indexing for you. For example:
Once a the model is registered with a redis backend, you can do this:
as well as
and much more
The filtering is fast as all ids are already in sets.