I just wonder whether ClickHouse can be used for storing time-series data in a case like this: a schema with the columns "some_entity_id", "timestamp", "metric1", "metric2", "metric3", ..., "metricN", where each new column containing a metric name can be added to the table dynamically, while adding an entry with that metric name.
I have not found any information about dynamically extending tables in the official documentation.
So, can this case be implemented in ClickHouse?
UPD:
After some benchmarks we found that ClickHouse writes new data faster than our current time-series storage, but reads data much more slowly.
It would probably be better to modify your schema to have 4 columns:
"some_entity_id", "timestamp", "metric_name", "metric_value"
You can include "metric_name" in the MergeTree index to improve performance when searching for a specific metric of an entity. Test with and without it to see whether it helps the kinds of queries you make.
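A minimal sketch of that narrow schema, assuming a modern MergeTree table; the table name and partitioning choice are illustrative, not part of the original answer:

```sql
-- Hypothetical narrow ("one row per metric") schema.
CREATE TABLE metrics
(
    some_entity_id String,
    timestamp      DateTime,
    metric_name    String,
    metric_value   Float64
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(timestamp)
ORDER BY (some_entity_id, metric_name, timestamp);
```

With this layout, new metrics require no schema change at all; they are simply new values of "metric_name".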
There is more than one way to use ClickHouse as a time-series database.
My personal preference is to use one string array for metric names and one Float64 array for metric values.
This is a sample time series table:
CREATE TABLE ts1(
entity String,
ts UInt64, -- timestamp, milliseconds from January 1 1970
m Array(String), -- names of the metrics
v Array(Float64), -- values of the metrics
d Date MATERIALIZED toDate(round(ts/1000)), -- auto generate date from ts column
dt DateTime MATERIALIZED toDateTime(round(ts/1000)) -- auto generate date time from ts column
) ENGINE = MergeTree(d, entity, 8192)
Here we are loading two metrics (load, temperature) for an entity (cpu):
INSERT INTO ts1(entity, ts, m, v)
VALUES ('cpu', 1509232010254, ['load','temp'], [0.85, 68])
And querying cpu load:
SELECT
entity,
dt,
ts,
v[indexOf(m, 'load')] AS load
FROM ts1
WHERE entity = 'cpu'
┌─entity─┬──────────────────dt─┬────────────ts─┬─load─┐
│ cpu    │ 2017-10-28 23:06:50 │ 1509232010254 │ 0.85 │
└────────┴─────────────────────┴───────────────┴──────┘
Get data as array of tuples:
SELECT
entity,
dt,
ts,
arrayMap((mm, vv) -> (mm, vv), m, v) AS metrics
FROM ts1
┌─entity─┬──────────────────dt─┬────────────ts─┬─metrics─────────────────────┐
│ cpu    │ 2017-10-28 23:06:50 │ 1509232010254 │ [('load',0.85),('temp',68)] │
└────────┴─────────────────────┴───────────────┴─────────────────────────────┘
Get data as rows of tuples:
SELECT
entity,
dt,
ts,
arrayJoin(arrayMap((mm, vv) -> (mm, vv), m, v)) AS metric
FROM ts1
┌─entity─┬──────────────────dt─┬────────────ts─┬─metric────────┐
│ cpu    │ 2017-10-28 23:06:50 │ 1509232010254 │ ('load',0.85) │
│ cpu    │ 2017-10-28 23:06:50 │ 1509232010254 │ ('temp',68)   │
└────────┴─────────────────────┴───────────────┴───────────────┘
Get rows with the metric you want:
SELECT
entity,
dt,
ts,
arrayJoin(arrayMap((mm, vv) -> (mm, vv), m, v)) AS metrics
FROM ts1
WHERE metrics.1 = 'load'
┌─entity─┬──────────────────dt─┬────────────ts─┬─metrics───────┐
│ cpu    │ 2017-10-28 23:06:50 │ 1509232010254 │ ('load',0.85) │
└────────┴─────────────────────┴───────────────┴───────────────┘
Get metric names and values as columns:
SELECT
entity,
dt,
ts,
arrayJoin(arrayMap((mm, vv) -> (mm, vv), m, v)) AS metric,
metric.1 AS metric_name,
metric.2 AS metric_value
FROM ts1
┌─entity─┬──────────────────dt─┬────────────ts─┬─metric────────┬─metric_name─┬─metric_value─┐
│ cpu    │ 2017-10-28 23:06:50 │ 1509232010254 │ ('load',0.85) │ load        │         0.85 │
│ cpu    │ 2017-10-28 23:06:50 │ 1509232010254 │ ('temp',68)   │ temp        │           68 │
└────────┴─────────────────────┴───────────────┴───────────────┴─────────────┴──────────────┘
Since CH has lots of useful date and time functions, along with higher order functions and tuples, I think it's almost a natural time-series database.
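As one sketch of combining those pieces (assuming the ts1 table defined above), here is a hypothetical query that averages the 'load' metric per minute, skipping rows that don't carry it:

```sql
-- Hypothetical aggregation over the ts1 table above:
-- average 'load' per minute for one entity.
SELECT
    entity,
    toStartOfMinute(dt) AS minute,
    avg(v[indexOf(m, 'load')]) AS avg_load
FROM ts1
WHERE entity = 'cpu' AND has(m, 'load')   -- skip rows without the metric
GROUP BY entity, minute
ORDER BY minute
```

The has() guard matters because indexOf returns 0 for a missing metric, and v[0] would yield the default value (0) rather than excluding the row.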
Edit:
WARNING
After using this method myself with several tables, I have observed that querying columns with the Array(Tuple(String, String, String)) definition seems to crash the database on large tables (1+ billion rows), so take this with a grain of salt. What I have described here may well be undefined behavior, but I've gotten no official word on it from the devs yet.
Original answer:
You can ALTER tables, but not dynamically.
Also, once a column is added, you always need to insert new values into it, though you can always have a "default" value.
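For reference, adding a column with a default looks like this (the column name metric_n and its default are illustrative):

```sql
-- Hypothetical: add a new metric column with a default value,
-- so existing rows and inserts that omit it read as 0.
ALTER TABLE ts1 ADD COLUMN metric_n Float64 DEFAULT 0;
```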
That being said... I found myself needing to insert values dynamically, and there is a "hack" for doing so, namely using this column type:
Array(Tuple(String, String))
This basically means you can have an array of as many values as you want, and insert into it tuples of ("description", "value").
So for one row your array could be:
[('metric_1','val1'), ('metric_2','val2')]
for the other:
[('metric_1','val3'), ('metric_3','val4'), ('metric_4','val5')]
The idea here is that you can convert the values from string to any other type, so in essence you could store any type you want in there.
What if you need to know the type for every operation, and the type can differ? Well:
Array(Tuple(String, String, String))
and in the tuples store ("name", "type", "value").
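A sketch of reading typed values back out of such a column; the table t and the column attrs are hypothetical names for illustration:

```sql
-- Hypothetical: unroll (name, type, value) tuples and cast
-- the string value back to a number where the type says so.
SELECT
    arrayJoin(attrs) AS a,       -- attrs is Array(Tuple(String, String, String))
    a.1 AS name,
    a.2 AS type,
    toFloat64OrZero(a.3) AS numeric_value  -- meaningful only when type is numeric
FROM t
```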
This is the closest thing I could think of that might fit what you want. Of course, you should take a look at the array manipulation functions to see whether they offer what you need (they are quite versatile; you can more or less do with an array everything you can do with a row of the table itself).
What's the downside?
Well, speed.
This will make querying very slow. Depending on what you want to do, this may or may not be a problem for you. If you filter your data well enough and hardly ever need to query more than a few dozen, or at most a few hundred, million rows (and have good enough machines to handle the queries), then these dynamic array extensions could work.
Did you see https://clickhouse.yandex/reference_en.html#ALTER ?
It works only with the *MergeTree ClickHouse table engines.