Cassandra bucket splitting for partition sizing

Posted 2020-03-27 06:33

Question:

I am quite new to Cassandra; I just learned it through the DataStax courses, but I can't find much information on bucketing here or elsewhere on the Internet, and in my application I need buckets to split my data.

I have some instruments that will take measures, quite a lot of them, and partitioning the measures daily (with the day in the partition key) seems risky, as we could easily exceed the recommended 100 MB limit for a partition. Each measure concerns a specific object identified by an ID. So I would like to use buckets, but I don't know how to go about it.

I'm using Cassandra 3.7

Here is roughly what my table will look like:

CREATE TABLE measures (
  instrument_id bigint,
  day timestamp,
  bucket int,
  measure_timestamp timestamp,
  measure_id uuid,
  measure_info float,
  object_id bigint,
  PRIMARY KEY ((instrument_id, day, bucket), measure_timestamp, measure_id)
);

I thought of adding the object_id to the partition key, but then I lose the "flow of measures" made by an instrument, since what interests me is seeing all the measures made by an instrument on a specific day or over a period of time.
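For illustration, here is what I imagine a single-bucket read would look like (a minimal sketch using the DataStax Python driver; the contact point and keyspace name are placeholders):

from datetime import datetime
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])          # placeholder contact point
session = cluster.connect('my_keyspace')  # placeholder keyspace

# All the measures one instrument took on one day, from a single bucket.
read_bucket = session.prepare("""
    SELECT measure_timestamp, measure_id, measure_info, object_id
    FROM measures
    WHERE instrument_id = ? AND day = ? AND bucket = ?
""")

for row in session.execute(read_bucket, (42, datetime(2020, 3, 27), 0)):
    print(row.measure_timestamp, row.measure_info)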

  • So the question is: when I want to request all the records of a day for a specific instrument, how do I do that if there are many buckets?
  • If I want to cap a partition at 400 000 rows, how do I know, when inserting data, which bucket it has to go into?
  • Is there a way to know how many buckets there are?

Thank you very much for your help!

Answer 1:

You should start from your requirements, and then go back to your schema model. In your case: how many measures per day can each instrument make? If each one makes fewer than your 400k measures per day, then you're already done without bucketing. If an instrument can perform up to 10M measures per day, then N = 10M / 400k = 25 buckets are enough to satisfy your requirement.

Assuming N buckets, when you need to query all the measures coming from a particular instrument on a given day, you have to perform N queries, one per bucket, unless you count the measures during your writes so that you can switch buckets when a bucket is full. That is: you write the first 400k measures to bucket 0, the next 400k measures to bucket 1, and so on. You then keep track of how many buckets K you have actually written to, and perform only K queries instead of N. This gives you unbalanced buckets (and partitions), but you get your results with the smallest number of queries.

If you prefer a balanced-bucket approach instead, you can send each write to a uniformly distributed random bucket number, but then you have to perform all N queries to get all the data for a specific instrument.
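A minimal sketch of the count-based approach, using the DataStax Python driver against the measures table from the question. The per-(instrument, day) counter here is a plain in-process dict for illustration only; a real system would need that bookkeeping to survive restarts and concurrent writers (e.g. in a separate table or an external store):

import uuid
from datetime import datetime
from cassandra.cluster import Cluster

ROWS_PER_BUCKET = 400_000  # target partition size from the question

cluster = Cluster(['127.0.0.1'])          # placeholder contact point
session = cluster.connect('my_keyspace')  # placeholder keyspace

insert_measure = session.prepare("""
    INSERT INTO measures (instrument_id, day, bucket, measure_timestamp,
                          measure_id, measure_info, object_id)
    VALUES (?, ?, ?, ?, ?, ?, ?)
""")
read_bucket = session.prepare("""
    SELECT measure_timestamp, measure_id, measure_info, object_id
    FROM measures
    WHERE instrument_id = ? AND day = ? AND bucket = ?
""")

# Hypothetical in-process counter per (instrument, day) -- see caveat above.
write_counts = {}

def write_measure(instrument_id, day, ts, info, object_id):
    key = (instrument_id, day)
    n = write_counts.get(key, 0)
    bucket = n // ROWS_PER_BUCKET  # fill bucket 0 completely, then bucket 1, ...
    write_counts[key] = n + 1
    session.execute(insert_measure,
                    (instrument_id, day, bucket, ts, uuid.uuid4(), info, object_id))

def read_day(instrument_id, day):
    # K = number of buckets actually written to; query only those.
    n = write_counts.get((instrument_id, day), 0)
    k = max(1, -(-n // ROWS_PER_BUCKET))  # ceiling division
    for bucket in range(k):
        for row in session.execute(read_bucket, (instrument_id, day, bucket)):
            yield row

day = datetime(2020, 3, 27)
write_measure(42, day, datetime(2020, 3, 27, 6, 33), 1.5, 1001)
for row in read_day(42, day):
    print(row.measure_timestamp, row.measure_info)

For the balanced variant, bucket would instead be something like random.randrange(N) on every write, and read_day would always loop over all N buckets.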