Cassandra schema design sorted by time

Posted 2019-06-10 10:58

Question:

I'm new to Cassandra data modeling and I really need some advice. Here is my problem:

I need to create a new column family that will allow me to store and retrieve the most recently inserted scores:

CREATE TABLE average_score(
    audit_time timestamp PRIMARY KEY,
    pages_count int,
    score float,
)

The inserted data is not sorted by primary key (I'm using the default random partitioner). Do you have a solution? Can I specify a different partitioner just for this column family?

thanks

Answer 1:

Here is an example of an hour-partitioned series table that might clarify some things for you:

CREATE TABLE average_score(
    hour timestamp,
    audit_time timeuuid,
    pages_count int,
    score float,
    PRIMARY KEY (hour, audit_time)
)
WITH CLUSTERING ORDER BY (audit_time DESC);

  • Because it comes first, hour is our "partition" key, i.e. it will be used to physically distribute our data across the cluster. (When you write, you will have to supply this value, rounded down to the start of the current hour.)

  • audit_time is our first "clustering" key, i.e. it is used to order and identify rows in a given hour partition on a particular node. We've chosen timeuuid to prevent overwrites. (You can pull out the actual time with the dateOf function. See http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/timeuuid_functions_r.html)

  • WITH CLUSTERING ORDER BY (audit_time DESC) directs C* to store rows within a partition in descending order on disk, which is probably the right decision if you intend on using ORDER BY audit_time DESC in most of your queries. (See http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/refClstrOrdr.html)
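As a rough illustration of the "rounded down to the start of the current hour" step, here is a minimal Python sketch. The helper names (hour_bucket, insert_params, recent_hours) are hypothetical, and the CQL strings assume the table above plus the %s placeholder style of the DataStax Python driver; nothing here talks to a real cluster. Because the newest scores may straddle an hour boundary, recent_hours lists the buckets to query, newest first.

```python
from datetime import datetime, timedelta, timezone

# CQL for the table defined in the answer. The %s placeholders are an
# assumption (DataStax Python driver style); now() generates the timeuuid.
INSERT_CQL = (
    "INSERT INTO average_score (hour, audit_time, pages_count, score) "
    "VALUES (%s, now(), %s, %s)"
)
# Rows come back newest first because of CLUSTERING ORDER BY (audit_time DESC).
LATEST_CQL = (
    "SELECT dateOf(audit_time), pages_count, score "
    "FROM average_score WHERE hour = %s LIMIT %s"
)


def hour_bucket(ts: datetime) -> datetime:
    """Round a timezone-aware timestamp down to the start of its UTC hour."""
    return ts.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)


def insert_params(ts: datetime, pages_count: int, score: float) -> tuple:
    """Build the bind parameters for INSERT_CQL (hypothetical helper)."""
    return (hour_bucket(ts), pages_count, score)


def recent_hours(ts: datetime, count: int) -> list:
    """Return the `count` most recent hour buckets, newest first."""
    start = hour_bucket(ts)
    return [start - timedelta(hours=i) for i in range(count)]
```

To fetch the latest N scores, you would run LATEST_CQL against each bucket from recent_hours in turn until N rows have been collected.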

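Caveat: Although we've partitioned the data fairly granularly, every write during a given hour lands on the same partition (and therefore on the same replica nodes), so you will still have a bit of a write hotspot if you're just appending new scores as they are registered.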

Cheers!

P.S. If you're still stuck, the DataStax CQL documentation is a great resource for data modeling help.



Answer 2:

You could store all values in one single row, using the timestamp as the column name (in theory; in practice you will probably choose some maximum column count per row). That way your data can be traversed in time order just by traversing the columns.

Always remember: the row key (the partition key, in CQL terms) in Cassandra is NOT for sorting but for partitioning data among the nodes of the cluster.

See also http://rubyscale.com/blog/2011/03/06/basic-time-series-with-cassandra/ for further information.
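To make the distinction concrete, here is a toy in-memory model in Python of the wide-row idea: the row key only locates the row, while the column names (timestamps) provide the ordering. The WideRowStore class and its methods are purely hypothetical and are not Cassandra code; in particular, this sketch sorts on read, whereas Cassandra keeps the columns of a row sorted in storage.

```python
from collections import defaultdict


class WideRowStore:
    """Toy model of a wide row: row key -> {column name (timestamp): score}."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def insert(self, row_key, timestamp, score):
        # Writing the same column name twice overwrites the earlier value.
        self._rows[row_key][timestamp] = score

    def latest(self, row_key, limit):
        """Return up to `limit` (timestamp, score) pairs, newest first."""
        columns = self._rows.get(row_key, {})
        # Ordering comes from the column names, not from the row key.
        return sorted(columns.items(), reverse=True)[:limit]
```

Inserting columns out of order and then calling latest still yields them in descending time order, which is the behaviour the answer relies on.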



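Answer 3:

Well, you can use an ordered partitioner, but that is not recommended. (Also note that the partitioner is configured for the whole cluster, not per column family.)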

Alternative Solution

If the score belongs to an entity (e.g. a user), use the combination of the entity and the date (the day part of your timestamp) as your row key, i.e. the partition key.

Use the timestamp (a TimeUUID in Cassandra) as the column name and the score as its value. When creating the column family, use "reversed=True", since you need the most recent entries first.

Then you can just make a slice query to get your desired results.
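A minimal Python sketch of how such a row key could be built is shown below. The "entity:day" string format, the function names, and the example id "user42" are all hypothetical; the answer does not prescribe a particular encoding. Days are taken in UTC so that every writer derives the same key for the same instant.

```python
from datetime import datetime, timedelta, timezone


def row_key(entity_id: str, ts: datetime) -> str:
    """Combine the entity id with the UTC day of a timezone-aware timestamp."""
    day = ts.astimezone(timezone.utc).date()
    return f"{entity_id}:{day.isoformat()}"


def recent_row_keys(entity_id: str, ts: datetime, days: int) -> list:
    """Row keys to slice for the last `days` days, newest day first."""
    return [row_key(entity_id, ts - timedelta(days=i)) for i in range(days)]
```

For example, row_key("user42", ...) for a timestamp on 10 June 2019 gives "user42:2019-06-10"; a slice query on that row, with reversed column order, then returns that day's scores newest first.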