How to query Cassandra by date range

Posted 2019-04-23 07:15

Question:

I have a Cassandra ColumnFamily (0.6.4) that will have new entries from users. I'd like to query Cassandra for those new entries so that I can process that data in another system.

My sense was that I could use a TimeUUIDType as the key for my entry, and then query on a KeyRange that starts either with "" as the startKey, or whatever the lastStartKey was. Is this the correct method?

How does get_range_slice actually create a range? Doesn't it have to know the data type of the key? There's no declaration of the data type of the key anywhere. In the storage_conf.xml file, you declare the type of the columns, but not of the keys. Is the key assumed to be of the same type as the columns? Or does it do some magic sniffing to guess?

I've also seen reference implementations where people store TimeUUIDType in columns. However, this seems to have scale issues as this particular key would then become "hot" since every change would have to update it.

Any pointers in this case would be appreciated.

Answer 1:

When sorting data, only the column keys (the column names) matter. Neither the stored values nor the auto-generated timestamps play any part. The CompareWith attribute is what counts here: if you set CompareWith to UTF8Type, the column keys are interpreted as UTF-8 strings, and if you set it to TimeUUIDType, they are automatically interpreted as time-based UUIDs and ordered by timestamp. You do not have to specify the data type anywhere else.

Look at the SlicePredicate and SliceRange definitions on this page: http://wiki.apache.org/cassandra/API. It is a good place to start. You might also find this article useful: http://www.sodeso.nl/?p=80. Around the third part, the author covers slice ranges in queries.
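To see why the comparator matters, here is a minimal Python sketch (not Cassandra code) that sorts the same two version-1 UUIDs once by raw bytes and once by their embedded timestamp, which is roughly what a time-aware comparator such as TimeUUIDType does. The helper `make_timeuuid`, the node value, and the two timestamps are made up for illustration.

```python
import uuid

def make_timeuuid(ts: int, node: int = 0x123456789ABC, clock_seq: int = 0) -> uuid.UUID:
    """Build a version-1 UUID from a 60-bit timestamp (hypothetical helper)."""
    time_low = ts & 0xFFFFFFFF
    time_mid = (ts >> 32) & 0xFFFF
    time_hi_version = ((ts >> 48) & 0x0FFF) | 0x1000      # set version 1
    clock_seq_hi_variant = ((clock_seq >> 8) & 0x3F) | 0x80  # RFC 4122 variant
    clock_seq_low = clock_seq & 0xFF
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                             clock_seq_hi_variant, clock_seq_low, node))

# Two timestamps chosen so that the low 32 bits roll over between them.
earlier = make_timeuuid(0x0FFFFFFFF)
later = make_timeuuid(0x100000000)

# Byte-wise order looks at time_low first, so it puts "later" in front.
by_bytes = sorted([earlier, later], key=lambda u: u.bytes)

# Time-aware order compares the reassembled timestamp.
by_time = sorted([later, earlier], key=lambda u: u.time)

print("byte order:", [str(u) for u in by_bytes])
print("time order:", [str(u) for u in by_time])
```

The two orderings disagree because a version-1 UUID stores the low bits of the timestamp first, so a plain byte comparison does not follow time.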



Answer 2:

Doug,

Writing to a single column family can sometimes create a hot spot if you are using an Order-Preserving Partitioner, but not if you are using the default Random Partitioner (unless a subset of users creates vastly more data than all the other users).

If you sorted your rows by time (using an Order-Preserving Partitioner) then you are probably even more likely to create hotspots, since you will be adding rows sequentially and a single node will be responsible for each range of the keyspace.
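A rough simulation of this effect, under simplifying assumptions: four nodes, made-up range boundaries for the order-preserving case, and a plain MD5 token scaled onto the node count for the random case. It is a toy model rather than Cassandra's actual token assignment.

```python
import bisect
import hashlib
from collections import Counter

NODES = 4

# Hypothetical split points for an order-preserving ring: node i owns the
# keys that sort between boundary i-1 and boundary i.
OPP_BOUNDARIES = ["2010-04", "2010-07", "2010-10"]

def opp_node(key: str) -> int:
    """Node index when keys are placed by their natural (lexical) order."""
    return bisect.bisect_right(OPP_BOUNDARIES, key)

def hashed_node(key: str) -> int:
    """Node index when keys are placed by a 128-bit MD5 token."""
    token = int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")
    return (token * NODES) >> 128  # scale [0, 2**128) onto [0, NODES)

# Ten minutes of time-ordered row keys, one per second.
keys = [f"2010-09-14T12:{m:02d}:{s:02d}" for m in range(10) for s in range(60)]

opp_load = Counter(opp_node(k) for k in keys)
hashed_load = Counter(hashed_node(k) for k in keys)

print("order-preserving:", dict(opp_load))
print("hashed:          ", dict(hashed_load))
```

With sequential keys, every write in the order-preserving case lands on the single node that owns the current time range, while the hashed case spreads the same keys over all nodes.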



Answer 3:

Columns and keys can be of any type, since the row key is just the first column. Conceptually, the cluster is a circular ring of hash tokens, and the partitioner hashes each key to decide where on the ring it is stored.

Beware of using dates as row keys, however: even the randomization of the default RandomPartitioner is limited, and you could end up clustering your data.

What's more, if that date is changing, you would have to delete the previous row since you can only do inserts in C*.

Here is what we know:

  • A slice range is a range of columns in a row, with a start value and an end value. It is used mostly for wide rows, since columns are ordered. Column names declared in the CF definition are indexed, however, so they can be retrieved by name.
  • A key slice is a row key together with the sliced column range, as returned by Cassandra.
  • The equivalent of a WHERE clause uses secondary indexes. You may use inequality operators there, but the statement must contain at least ONE equality clause (see also https://issues.apache.org/jira/browse/CASSANDRA-1599).
  • Using a key range is ineffective with the Random Partitioner, because the MD5 hash of your key does not preserve lexical ordering.
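The last point can be checked with a few lines of Python. This is only an illustration of MD5 scrambling the order of keys; the date-shaped keys are invented, and the real partitioner's token arithmetic is not reproduced here.

```python
import hashlib

def md5_token(key: str) -> int:
    """Interpret the MD5 digest of a key as one large integer token."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")

# Thirty row keys that are already in lexical (and chronological) order.
keys = [f"2010-09-{day:02d}" for day in range(1, 31)]

# The order in which a token-ordered ring would walk the same keys.
by_token = sorted(keys, key=md5_token)

# A lexical key range and where its members end up in token order.
wanted = [k for k in keys if "2010-09-10" <= k <= "2010-09-15"]
positions = sorted(by_token.index(k) for k in wanted)

print("token order:", by_token)
print("positions of 2010-09-10..15 in token order:", positions)
```

Keys that are neighbours lexically are generally scattered in token order, so a start key and end key no longer bracket the rows you expect.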

What you want is a Column Family based index using a wide row, with column names of type CompositeType(TimeUUID | UserID). To keep this row from becoming hot, put a meaningful key first (a "shard key"), such as the user type or the region, so that the data is split across nodes.
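As a rough in-memory model of that layout (plain Python, no Cassandra client involved): each shard key maps to one "wide row" whose columns are kept sorted by (time, user ID), and a date-range query is a slice over every shard. The class name `TimeIndex`, the region shard keys, and the use of plain datetimes in place of TimeUUIDs are all assumptions made for the sketch.

```python
import bisect
from collections import defaultdict
from datetime import datetime

class TimeIndex:
    """Toy stand-in for an index CF: shard key -> sorted (timestamp, user_id) columns."""

    def __init__(self):
        self._rows = defaultdict(list)

    def insert(self, shard: str, ts: datetime, user_id: str) -> None:
        # Keep each wide row sorted, as a comparator would.
        bisect.insort(self._rows[shard], (ts, user_id))

    def slice(self, start: datetime, end: datetime) -> list:
        """Return entries with start <= ts < end from all shards, in time order."""
        hits = []
        for columns in self._rows.values():
            lo = bisect.bisect_left(columns, (start,))
            hi = bisect.bisect_left(columns, (end,))
            hits.extend(columns[lo:hi])
        return sorted(hits)

index = TimeIndex()
t1 = datetime(2010, 9, 14, 12, 0, 0)
t2 = datetime(2010, 9, 14, 12, 5, 0)
t3 = datetime(2010, 9, 14, 12, 10, 0)
index.insert("eu", t1, "alice")
index.insert("us", t2, "bob")
index.insert("eu", t3, "carol")

# Everything new since t1, up to but not including t3.
recent = index.slice(t1, t3)
print(recent)
```

The reader pays for one slice per shard, which is the trade-off for spreading writes over several rows instead of one hot row.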

Having more data than necessary in Cassandra is not a problem; that is how it is designed. What you must ask yourself is "what do I need to query?" and then design a Column Family for it, rather than trying to fit everything into one CF as you would in an RDBMS.