I am doing time series data modelling where I have a start date and end date of events. I need to query on that data model like the following:
Select * from tablename where startdate>'2012-08-09' and enddate<'2012-09-09'
I referred to the following link on cql where clause but I couldn't achieve this. Any way to do that? I can also change the data model or any cql tweaks. I am using Cassandra 2.1.
I had to solve a similar problem in one of my former positions. This is one way in which you could accomplish this...
There are two modeling problems preventing this query from working. First of all, to run a range query, you first need to limit your query with a partition key. With time series data the best idea is to create something called a time bucket. For this example I'll partition the data by month, with a partition key called
monthbucket
.The other problem, is that you can only run a range query on a single column/key value. This becomes problematic when you want to query by both a start and end date. One solution, is to store each row in the table twice, and create an additional clustering key to hold the value of whether the row is the beginning row or the end row. I'll just call this column
beginend
.Given those notes, I'll create a table that looks like this:
eventDate
in DESCending order.eventid
in this case).After INSERTing some rows, let's just query by a partition key of September, 2015:
Similar to your example, let's say that I want to query events between September 18th and September 24th:
As you can see, I should get three rows: A beginning and an end row for "Talk Like A Pirate Day" and a beginning row for the 2015 Cassandra Summit.
As with all data modeling approaches, there are trade-offs to be made. In this case to model for querying on both dates, the trade-off is that you have to duplicate your rows. And of course, to be able to range query at all, you have to decide on a good partition key (
monthbucket
) that offers relevant data and the required query flexibility. In any case, give it a try and see if you can make it work for your use case.Edit to answer questions:
That's where you'd need to figure out the best time bucket for your application. Think about your most-common queries, and model off of that. Now you don't want to store too much in a single row (bucket), because that will kill your data distribution. So try to find a happy medium between query flexibility and data distribution.
In this particular case with
monthBucket
you'd have to execute a query for each individual month. The application that I designed this solution for never looked at an entire years' worth of events at once. If that's a query pattern you need to support, then you'll need to make your time bucket a little bigger.Nope. Duplicates would need to be handled/ignored at the application level. Cassandra CQL does have a DISTINCT keyword, but it only functions on partition keys.
No, Cassandra does not have a way to JOIN tables together. And application-side joins are possible, but don't perform well and are technically an anti-pattern.
Handling data on the application-side (whether joining or filtering) is typically not a good idea. But the key is moderation. If you query 20 events and have to ignore dupes for some of them, that's not too big of a deal. But querying 20,000,000 events and applying an application-side process at that volume is not going to scale well at all. Again, this is where you have to look at the options available, and decide what will work for your application.