How to model Cassandra DB for Time Series, server

Posted 2019-07-26 20:51

Question:

My name is Daniel. I'm a newcomer account-wise but a long-time lurker. I decided to learn Apache Cassandra for my next "let's write some code while the kids are sleeping" project.

What I'm writing is a neat little API that will do reads and writes against a Cassandra database. I had a lot of the DB layout figured out in MongoDB, but for me it's time to move on and grow as an engineer :)

Mission: I will collect metrics from the servers in my rack; an agent will send a payload of metrics every minute. I have the API part pretty much figured out and will use JWT tokens to sign the payloads. The type of data I will store can be seen below: cpuload, cpuusage, memusage, diskusage, etc.

The part where I am confused with Cassandra is how to write the actual model. I understand that the storage engine sort of writes it all as a time series on disk for me, making reads quite amazing. I know anything I would whip together now would work for my lab, since it's just 30 machines, but I'm trying to understand how these things are done properly and how it could be done for a real-life scenario like Server Density, Datadog, "insert your preferred server monitoring service". :)

But how do you more experienced engineers design a schema like this?

Usage scenarios for the database:

  • Write payloads every minute through the API (let's imagine that's at least 100k writes per minute, for the sake of learning something useful).
  • Read the assets associated with one's userid:

    • pull most recent data (3h)
    • pull most recent data (daily)
    • pull most recent data (weekly)
    • pull most recent data (monthly)
    • etc.
  • Generate monthly PDF reports showing uptime and such.

Should I insert rows containing the full payload, or am I better off inserting them on a per-service basis: timeuid|cpuusage?

Per-service row

CREATE TABLE metrics(
    id uuid PRIMARY KEY,
    assetid int,
    serviceType text,
    metricValue int
)

All in one

CREATE TABLE metrics(
    id uuid PRIMARY KEY,
    assetid int,
    cpuload int,
    cpuusage int,
    memusage int,
    diskusage int
)

In Mongo I would preallocate the buckets and also keep a quick-read average inside the document, so in the web GUI I could simply show the average stats for predefined time periods.

Examples for dumbasses are highly appreciated. I hope you can decipher my rather poor English.

I just found this link in the SO suggestions: "Cassandra data model for time series". I guess that is something that applies to me as well.

Sincerely, Daniel Olsson

Answer 1:

For your data model, I would suggest adding time as a clustering column:

CREATE TABLE metrics(
    id uuid,
    time timeuuid,
    assetid int,
    cpuload int,
    cpuusage int,
    memusage int,
    diskusage int,
    PRIMARY KEY (id, time)
) WITH CLUSTERING ORDER BY (time DESC);

Use descending order to keep the latest metrics first. You can then query using the LIMIT clause to get the most recent hour:

SELECT * FROM metrics WHERE id = <UUID> LIMIT 60

Or day:

SELECT * FROM metrics WHERE id = <UUID> LIMIT 1440
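
The LIMIT values above follow from the one-payload-per-minute rate in the question. Here is a small Python sketch of that arithmetic, assuming exactly one row per minute with no gaps. The helper name `rows_for_window` is made up for illustration.

```python
from datetime import timedelta

# Assumption: the agent writes exactly one row per minute, as in the question.
SAMPLE_INTERVAL = timedelta(minutes=1)

def rows_for_window(window: timedelta) -> int:
    """Number of rows to request with LIMIT to cover `window` of history."""
    return int(window / SAMPLE_INTERVAL)

limits = {
    "3 hours": rows_for_window(timedelta(hours=3)),
    "1 day": rows_for_window(timedelta(days=1)),
    "1 week": rows_for_window(timedelta(weeks=1)),
    "30 days": rows_for_window(timedelta(days=30)),
}

for label, limit in limits.items():
    print(f"{label}: LIMIT {limit}")
```

If an agent misses payloads, a plain LIMIT reaches further back in time than intended. A range restriction on the clustering column bounds the window by time instead; for a timeuuid column, CQL provides the minTimeuuid() and maxTimeuuid() functions for that.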

Depending upon how long you plan to keep the data, you may want to add a column for year, month, or days to the table to limit your partition size. For example, if you wish to keep data for 3 months, a month column can be added to partition your keys by id and month:

CREATE TABLE metrics(
    id uuid,
    time timeuuid,
    month text,
    assetid int,
    cpuload int,
    cpuusage int,
    memusage int,
    diskusage int,
    PRIMARY KEY ((id, month), time)
) WITH CLUSTERING ORDER BY (time DESC);

If you keep data for several years, use year + month or a date value.
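
With a month in the partition key, the client has to compute the bucket on every write, and a read may need more than one bucket. A rough Python sketch follows. The "YYYY-MM" format for the text `month` column and the helper names `month_bucket` and `buckets_between` are assumptions for illustration, not something the schema above prescribes.

```python
from datetime import datetime, timezone
from typing import List

def month_bucket(ts: datetime) -> str:
    """Value for the text `month` column, e.g. '2019-07' (assumed format)."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m")

def buckets_between(start: datetime, end: datetime) -> List[str]:
    """All month buckets touched by [start, end], newest first."""
    start = start.astimezone(timezone.utc)
    end = end.astimezone(timezone.utc)
    year, month = end.year, end.month
    buckets = []
    while (year, month) >= (start.year, start.month):
        buckets.append(f"{year:04d}-{month:02d}")
        # Step back one calendar month, wrapping January to December.
        year, month = (year - 1, 12) if month == 1 else (year, month - 1)
    return buckets

# A 3-hour window that crosses a month boundary touches two partitions.
end = datetime(2019, 8, 1, 1, 0, tzinfo=timezone.utc)
start = datetime(2019, 7, 31, 22, 0, tzinfo=timezone.utc)
print(month_bucket(end))
print(buckets_between(start, end))
```

At one row per minute, a single (id, month) partition holds at most 60 × 24 × 31 = 44,640 rows, so a "most recent" read issues one query per bucket, newest first, and stops once it has enough rows.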

Regarding your final question about separate tables versus a single table: Cassandra supports sparse columns, so you can make multiple inserts into a common table, one for each metric, without updating any data. However, it's always faster to write just once per row.
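
To picture the sparse-column behaviour, here is a toy in-memory model in Python. It is not the Cassandra driver and says nothing about on-disk storage; it only mimics the fact that a CQL INSERT is an upsert, so several writes to the same primary key end up as one row. All key values are invented.

```python
# Toy in-memory stand-in for one table; NOT the Cassandra driver.
# Rows are keyed by primary key, and each row holds only the columns written.
table = {}

def insert(primary_key, **columns):
    """Mimic a CQL INSERT: writes to the same primary key merge into one row."""
    table.setdefault(primary_key, {}).update(columns)

# (id, month, time) with made-up values
key = ("asset-1", "2019-07", "2019-07-26T20:51:00Z")

# One write per metric: three inserts, still a single row.
insert(key, cpuload=3)
insert(key, memusage=41)
insert(key, diskusage=72)

# One write for the whole payload: one insert, one row.
other = ("asset-1", "2019-07", "2019-07-26T20:52:00Z")
insert(other, cpuload=4, memusage=40, diskusage=72)

print(len(table))
print(table[key])
```

Both approaches produce the same kind of row; the per-metric variant simply costs three writes where the full-payload variant costs one.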

You may need separate tables if you have to query different metrics by an alternative key, for example disk usage by id and disk name. You'd need a separate table or a materialized view to support that query pattern.

Finally, your schema defines an assetid, but it isn't part of the primary key, so with your current schema you can't query by assetid.
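
The reason is easier to see with a toy model in which partitions live in a dictionary keyed by the partition key. This is a simplified Python illustration with invented values, not how Cassandra is implemented.

```python
from collections import defaultdict

# Toy model: partition key -> rows. Values are invented for illustration.
partitions = defaultdict(list)
partitions["id-a"].append({"assetid": 7, "cpuload": 3})
partitions["id-b"].append({"assetid": 9, "cpuload": 5})
partitions["id-c"].append({"assetid": 7, "cpuload": 4})

def by_partition_key(pk):
    """Direct lookup: the kind of read a query on the partition key gets."""
    return partitions.get(pk, [])

def by_assetid_scan(assetid):
    """Has to visit every partition, because assetid is not part of the key."""
    return [
        row
        for rows in partitions.values()
        for row in rows
        if row["assetid"] == assetid
    ]

print(by_partition_key("id-a"))
print(by_assetid_scan(7))
```

Cassandra rejects the second kind of query unless it is given ALLOW FILTERING or a secondary index, which is why making assetid part of the partition key is the usual fix when assets are what you read by.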