Cassandra data model for time series

Posted 2019-04-12 07:43

I am working on a Cassandra data model for storing time series (I'm a Cassandra newbie). I have two applications: intraday stock data and sensor data.

The stock data will be saved with a time resolution of one minute. Seven data fields make up one timeframe: Symbol, Datetime, Open, High, Low, Close, Volume.

I will query the data mostly by Symbol and Date, e.g. give me all data for AAPL between 2013-01-01 and 2013-01-31, ordered by Datetime. The recommendation for Cassandra queries is to query whole columns. So you could create five rows with the keys Open, High, Low, Close, Volume, and a column of its own for each Symbol and Minute, e.g. "AAPL:2013-01-04T130400Z". This would result in a table of five rows and n*nT columns (n: number of symbols, nT: number of minutes).

Most of the time I will query date ranges, i.e. all minutes of a day. So I could rearrange the data to have columns named "AAPL:2013-01-04" and rows OpenT130400Z, HighT130400Z, LowT130400Z, CloseT130400Z, VolumeT130400Z. This would result in a table with n*nD columns (n: number of symbols, nD: number of days) and 5*nM rows (nM: number of minutes/entries per day).

To sum up: I have columns that hold the information for a whole day for one symbol.

I have found a description of how to deal with time series data in Cassandra here: http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra. But I don't really get whether they use the hour (1332960000) as a column name or as a row key. I understood that they use the hour as the row key and the small timesteps as columns, so they would have a fixed number of columns per row. But wouldn't that have disadvantages for reading, because I would have to do a range query on keys? Am I right?

Second question: if I have sensor data that is much more fine-grained than 1-minute stock data (say I have to save timesteps with a resolution of microseconds), how would I deal with this? If I use columns to save a composite of sensor channel and hour, and rows for the microseconds since the last hour, this would result in 3,600,000,000 rows and n*nH columns (n: number of sensors, nH: number of hours). I could not use the microseconds since the last hour for columns, because I have 3.6 billion points, which is higher than the allowed number of 2 billion columns.

Did I get that right? What do you think about this problem? How would you solve it?

Thank you!

Best, Malte

1 Answer

迷人小祖宗 · 2019-04-12 08:23

So I have a suggestion for your first question about the stock data. A naive implementation might look like this:

Row key: the stock symbol (e.g. AAPL)

Column format:

Name: the current datetime, granular to a minute

Value: a composite column of Open, High, Low, Close, Volume

So you would have something like:

AAPL = [2013-05-02-15:38:00 | 441.78:448.59:440.63:445.52:15066146] ... [2013-05-02-15:39:00 | 441.78:448.59:440.63:445.52:15066146] ... [2013-05-02-15:40:00 | 441.78:448.59:440.63:445.52:15066146]

That would give you roughly half a million columns in one year, so it might be OK for maybe four years; I wouldn't attempt to get anywhere near the 2 billion limit. What you could do instead is define a splitting factor on the row key. It all depends on your usage pattern, but a simple one might be the year, so the column family entry would use a composite row key like the one below, which guarantees that you always have fewer than a million columns per row.

AAPL:2013 = [05-02-15:38:00 | 441.78:448.59:440.63:445.52:15066146] ... [05-02-15:39:00 | 441.78:448.59:440.63:445.52:15066146] ... [05-02-15:40:00 | 441.78:448.59:440.63:445.52:15066146]
