We are using Cassandra as the data historian for our fleet management solution. We have a table in Cassandra , which stores the details of journey made by the vehicle. The table structure is as given below
CREATE TABLE journeydetails(
bucketid text,
vehicleid text,
starttime timestamp,
stoptime timestamp,
travelduration bigint,
PRIMARY KEY (bucketid,vehicleid,starttime,travelduration)
);
Where:
- bucketid :- partition key which is a combination of month and year
- vehicleid : -unique id of the vehicle
- starttime :- start time of the journey
- endtime :- endtime of the journey
- travelduration:- duration of travel in milliseconds
We would like to run the following query - get all the travels of a vehicle - 1234567 between 2015-12-1 and 2015-12-3 whose travel duration is greater than 30 minutes
When I run this query:
select * from journeydetails where bucketid in('2015-12') and vehicleid in('1234567')
and starttime > '2015-12-1 00:00:00' and starttime < '2015-12-3 23:59:59'
and travelduration > 1800000;
I get this result:
InvalidRequest: code=2200 [Invalid query] message="Clustering column "travelduration"
cannot be restricted (preceding column "starttime" is restricted by a non-EQ relation)
Does anyone have a recommendation on how to fix this issue?
select * from journeydetails where bucketid in('2015-12') and vehicleid in('1234567')
and starttime > '2015-12-1 00:00:00' and starttime < '2015-12-3 23:59:59'
and travelduration > 1800000;
That's not going to work. The reason goes back to how Cassandra stores data on-disk. The idea with Cassandra is that it is very efficient at returning a single row with a precise key, or at returning a continuous range of rows from the disk.
Your rows are partitioned by bucketid
, and then sorted on disk by vehicleid
, starttime
, and travelduration
. Because you are already executing a range query (non-EQ relation) on starttime
, you cannot restrict the key that follows. This is because the travelduration
restriction may disqualify some of the rows in your range condition. This would result in an inefficient, non-continuous read. Cassandra is designed to protect you from writing queries (such as this), which may have unpredictable performance.
Here are two alternatives:
1- If you could restrict all of your key columns prior to travelduration
(with an equals relation), then you could apply a your greater-than condition:
select * from journeydetails where bucketid='2015-12' and vehicleid='1234567'
and starttime='2015-12-1 00:00:00' and travelduration > 1800000;
Of course, restricting on an exact starttime
may not be terribly useful.
2- Another approach would be to omit travelduration
altogether, and then your original query would work.
select * from journeydetails where bucketid='2015-12' and vehicleid='1234567'
and starttime > '2015-12-1 00:00:00' and starttime < '2015-12-3 23:59:59';
Unfortunately, Cassandra does not offer a large degree of query flexibility. Many people have found success using a solution like Spark (alongside Cassandra) to achieve this level of reporting.
And just a side note, but don't use IN
unless you have to. Querying with IN
is similar to using a secondary index, in that Cassandra has to talk to several nodes to satisfy your query. Calling it with a single item probably isn't too big of a deal. But IN
is one of those old RDBMS habits that you should really break before getting too deep into Cassandra.