complex db2/sql query with time-sampling, group, m

Posted 2019-08-25 04:53

Question:

I have data in a table (named TESTING) in a dashDB database on IBM Bluemix (Db2 Warehouse on Cloud) that looks like this:

ID     TIMESTAMP                  NAME     VALUE
abc    2017-12-21 19:55:38.762    test1    123
abc    2017-12-21 19:55:42.762    test2    456
abc    2017-12-21 19:57:38.762    test1    789
abc    2017-12-21 19:58:38.762    test3    345
def    2017-12-21 19:59:38.762    test1    678

I am looking for a query that:

  1. samples the data (for each NAME) onto a given time grid (e.g. 1-minute timestamps)
  2. averages the VALUEs that fall into the same time range (the same minute); empty minutes should be NULL

For 1. and 2., something like this works (but only for a single NAME):

    with dummy(temporaer) as (
      select TIMESTAMP('2017-12-01') from SYSIBM.SYSDUMMY1
      union all
      select temporaer + 1 MINUTES from dummy where temporaer < TIMESTAMP('2018-02-01')
    )
    select temporaer, avg(VALUE) as test1 from dummy
    LEFT OUTER JOIN TESTING ON temporaer=date_trunc('minute', TIMESTAMP) and ID='abc' and NAME='test1'
    group by temporaer
    ORDER BY temporaer ASC;
  3. joins all the different NAMEs column-wise into a matrix (see the pivot sketch below this list), like:

    TIMESTAMP               test1    test2    test3
    2017-12-01 00:00:00     null     null     null
    ...
    2017-12-21 19:55:00     123      456      null
    2017-12-21 19:56:00     null     null     null
    2017-12-21 19:57:00     789      null     null
    2017-12-21 19:58:00     678      null     345
    ...
    2018-01-31 23:59:00     null     null     null
    
  4. the query result should be exported as a CSV file, or returned as a CSV string
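
(For item 3, a conditional-aggregation pivot on top of the minute grid is one possible shape. A minimal sketch, assuming only the three NAMEs from the sample data; every additional NAME needs its own CASE expression:)

    with dummy(temporaer) as (
      select TIMESTAMP('2017-12-01') from SYSIBM.SYSDUMMY1
      union all
      select temporaer + 1 MINUTES from dummy where temporaer < TIMESTAMP('2018-02-01')
    )
    select temporaer,
           avg(case when NAME = 'test1' then VALUE end) as test1,
           avg(case when NAME = 'test2' then VALUE end) as test2,
           avg(case when NAME = 'test3' then VALUE end) as test3
    from dummy
    LEFT OUTER JOIN TESTING ON temporaer = date_trunc('minute', TIMESTAMP) and ID = 'abc'
    group by temporaer
    ORDER BY temporaer ASC;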

Does anybody know how this could be done in one query, or in some other simple and fast way? Or is it necessary to store the data in a different table format - can you give me a hint?

Answer 1:

Here is a code snippet that does the job, but it takes a very long time:

WITH
    -- get all distinct names in table:
    header(names) AS (SELECT DISTINCT name
                      FROM FIELDTEST
                      WHERE ID='7b9bbe44d45d8f2ac324849a4951da54' AND REGEXP_LIKE(trim(VALUE),'^\d+(\.\d*)?$') AND DATE(TIMESTAMP)>='2017-12-19' AND DATE(TIMESTAMP)<'2017-12-24'),

    -- select data (names, numeric values only) from the table, with timestamps truncated to a coarser interval (here: minutes):
    dummie(time, names, values) AS (SELECT date_trunc('minute', TIMESTAMP), NAME, VALUE
                                    FROM FIELDTEST
                                    WHERE ID='7b9bbe44d45d8f2ac324849a4951da54' AND REGEXP_LIKE(trim(VALUE),'^\d+(\.\d*)?$')),

    -- generate a range of times from date to date in defined steps:
    dummy(time, rangeEnd) AS (SELECT a, a + 1 MINUTE
                                           FROM (VALUES(TIMESTAMP('2017-12-19'))) D(a)
                                           UNION ALL
                                           SELECT rangeEnd, rangeEnd + 1 MINUTE
                                           FROM dummy
                                           WHERE rangeEnd < TIMESTAMP('2017-12-24')),

    -- add each name (from header) to each time/row (in dummy):
    dumpy(time, names) AS (SELECT Dummy.time, Header.names
                                FROM Dummy
                                LEFT OUTER JOIN Header
                                ON Dummy.time IS NOT NULL),

    -- average the values by name and time interval, aligned to the times in dummy:
    dummj(time, names, avgvalues) AS (SELECT Dummy.time, Dummie.names, AVG(Dummie.values)
                                      FROM Dummy
                                      LEFT OUTER JOIN Dummie
                                      ON Dummie.time = Dummy.time
                                      GROUP BY Dummie.names, Dummy.time),

    -- join the averaged values (by time and name) to the times and names in dumpy (-9999 where no value exists):
    testo(time, names, avgvalues) AS (SELECT Dumpy.time, Dumpy.names, COALESCE(Dummj.avgvalues,-9999)
                                    FROM Dumpy
                                    LEFT OUTER JOIN Dummj
                                    ON Dummj.time = Dumpy.time AND Dummj.names = Dumpy.names),

    -- collapse the many rows into fewer rows with delimited strings:
    test(time, names, avgvalues) AS (SELECT time, LISTAGG(names,';') WITHIN GROUP(ORDER BY names), LISTAGG(avgvalues,';') WITHIN GROUP(ORDER BY names)
                                     FROM Testo
                                     GROUP BY time)

SELECT * FROM test ORDER BY time ASC, names ASC;

The performance problem is in the "testo" subquery. Does anybody have an idea what the problem is here, or know how to improve the query?
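
One way to see where the time goes is to capture the access plan for the statement. A minimal sketch, assuming the EXPLAIN tables have not been created yet and that you have the privileges to create them (with NULL arguments they end up in the SYSTOOLS schema); the full WITH ... SELECT above can be prefixed with EXPLAIN PLAN FOR in exactly the same way as the reduced statement shown here:

-- one-time setup: create the EXPLAIN tables (NULL, NULL = default table space, SYSTOOLS schema)
CALL SYSPROC.SYSINSTALLOBJECTS('EXPLAIN', 'C',
                               CAST(NULL AS VARCHAR(128)), CAST(NULL AS VARCHAR(128)));

-- capture the access plan (shown for a reduced query against FIELDTEST):
EXPLAIN PLAN FOR
SELECT date_trunc('minute', TIMESTAMP), NAME, COUNT(*)
FROM FIELDTEST
WHERE ID = '7b9bbe44d45d8f2ac324849a4951da54'
GROUP BY date_trunc('minute', TIMESTAMP), NAME;

-- the captured plan lands in the EXPLAIN_* tables and can be formatted with db2exfmt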



Answer 2:

Well, one problem I see is that you keep using functions on columns, but that shouldn't be too big a drain if id is reasonably unique. If this query is very common, it may also be worth it to permanently build and index the range table. Hmm, you probably need several indices (starting with FieldTest.id), but you might also try this version:

-- let's name things properly, too, to keep them straight.
WITH
    -- generate a range of times from date to date in defined steps:
   Range (rangeStart, rangeEnd) AS (SELECT a, a + 1 MINUTE
                                    FROM (VALUES(TIMESTAMP('2017-12-19'))) D(a)
                                    UNION ALL
                                    SELECT rangeEnd, rangeEnd + 1 MINUTE
                                    FROM Range
                                    WHERE rangeEnd < TIMESTAMP('2017-12-24')),
    -- get all distinct names in table:
    Header(names) AS (SELECT DISTINCT name
                      FROM FieldTest
                      WHERE ID = '7b9bbe44d45d8f2ac324849a4951da54' 
                            -- just make the white space check part of the regex
                            AND REGEXP_LIKE(VALUE, '^\s*\d+(\.\d*)?\s*$') 
                            AND timestamp >= TIMESTAMP('2017-12-19')
                            AND timestamp < TIMESTAMP('2017-12-24')),
    -- I'm assuming the (id, name) tuple is unique, which means we don't need to repeat the regex later
    Data (rangeStart, name, averaged) AS (SELECT Range.rangeStart, Header.names, COALESCE(AVG(FieldTest.value), -9999)
                                          FROM Range
                                          CROSS JOIN Header
                                          LEFT JOIN FieldTest
                                                 ON FieldTest.id = '7b9bbe44d45d8f2ac324849a4951da54'
                                                    AND FieldTest.name = Header.names
                                                    AND FieldTest.timestamp >= Range.rangeStart
                                                    AND FieldTest.timestamp < Range.rangeEnd
                                          GROUP BY Range.rangeStart, Header.names)
-- I can't recall if DB2 allows using the new column name this way, you may need to wrap this again
SELECT rangeStart,
       -- collapse the many rows into fewer rows with delimited strings:
       LISTAGG(name,';') WITHIN GROUP(ORDER BY name) AS names,
       LISTAGG(averaged,';') WITHIN GROUP(ORDER BY name)
FROM Data
GROUP BY rangeStart
ORDER BY rangeStart, names

(not tested)
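
Regarding the idea of permanently building and indexing the range table (and indexing FieldTest itself): a minimal sketch; the table and index names are made up, and the column order of the composite index is only a guess at what the join needs (ID equality, name equality, timestamp range):

-- materialize the minute range once instead of recomputing it in every query:
CREATE TABLE MinuteRange (rangeStart TIMESTAMP NOT NULL, rangeEnd TIMESTAMP NOT NULL);

INSERT INTO MinuteRange (rangeStart, rangeEnd)
WITH Range(rangeStart, rangeEnd) AS (SELECT a, a + 1 MINUTE
                                     FROM (VALUES(TIMESTAMP('2017-12-19'))) D(a)
                                     UNION ALL
                                     SELECT rangeEnd, rangeEnd + 1 MINUTE
                                     FROM Range
                                     WHERE rangeEnd < TIMESTAMP('2017-12-24'))
SELECT rangeStart, rangeEnd FROM Range;

CREATE UNIQUE INDEX MinuteRange_start ON MinuteRange (rangeStart, rangeEnd);

-- support the probes on FieldTest in the LEFT JOIN above:
CREATE INDEX FieldTest_id_name_ts ON FieldTest (ID, NAME, TIMESTAMP);

With the range materialized, the recursive Range CTE in the query can be replaced by a plain join against MinuteRange.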



Answer 3:

The CROSS JOIN was definitely a nice hint. Although I was not able to implement the LEFT JOIN exactly as you suggested, I found a workaround, which - I am sure - still leaves room for improvement, but is acceptable for me at the moment (it saves roughly a factor of 30 in time compared to my first query). Here is the actual code:

WITH
    -- generate a range of times from date to date in defined steps:
    TimeRange(rangeStart, rangeEnd) AS (SELECT a, a + 1 MINUTE
                                        FROM (VALUES(TIMESTAMP('2017-12-19'))) D(a)
                                        UNION ALL
                                        SELECT rangeEnd, rangeEnd + 1 MINUTE
                                        FROM TimeRange
                                        WHERE rangeEnd < TIMESTAMP('2017-12-24')),

    -- get all distinct names in table:
    Header(names) AS (SELECT DISTINCT name
                      FROM FIELDTEST
                      WHERE ID = '7b9bbe44d45d8f2ac324849a4951da54'
                            AND REGEXP_LIKE(VALUE, '^\s*\d+(\.\d*)?\s*$')
                            AND timestamp >= TIMESTAMP('2017-12-19')
                            AND timestamp < TIMESTAMP('2017-12-24')),

    -- select data (names, numeric values only) from the table, with timestamps truncated to a coarser interval (here: minutes):
    rawData(time, names, values) AS (SELECT date_trunc('minute', TIMESTAMP), NAME, VALUE
                                     FROM FIELDTEST
                                     WHERE ID = '7b9bbe44d45d8f2ac324849a4951da54'
                                        AND REGEXP_LIKE(VALUE, '^\s*\d+(\.\d*)?\s*$')),

    -- I'm assuming the (id, name) tuple is unique, which means we don't need to repeat the regex later
    Data(rangeStart, name, averaged) AS (SELECT TimeRange.rangeStart, Header.names, COALESCE(AVG(rawData.values), -9999)
                                         FROM TimeRange
                                         CROSS JOIN Header
                                         LEFT JOIN rawData
                                            ON rawData.names = Header.names
                                                AND rawData.time = TimeRange.rangeStart
                                         GROUP BY TimeRange.rangeStart, Header.names),

    test(time, names, avgvalues) AS (SELECT Data.rangeStart,
                                            LISTAGG(Data.name,';') WITHIN GROUP(ORDER BY name),
                                            LISTAGG(Data.averaged,';') WITHIN GROUP(ORDER BY name)
                                    FROM Data
                                    GROUP BY Data.rangeStart)

-- build my own delimited export-string:
SELECT CONCAT(CONCAT(SUBSTR(REPLACE(time,'.',':'),1,19),';'), REPLACE(CAST(avgvalues AS VARCHAR(3980)),'-9999',''))
FROM test
UNION ALL
SELECT CONCAT(CAST('TIME;' AS VARCHAR(5)), CAST(LISTAGG(names,';') WITHIN GROUP(ORDER BY names) AS VARCHAR(3980)))
FROM Header;
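
If an actual CSV file is needed (item 4 of the original question) rather than a delimited result set, one option on Db2 is the EXPORT command through the ADMIN_CMD procedure. A minimal sketch with a made-up file path, shown for a reduced query (the full statement above can be embedded the same way, with its single quotes doubled); note that the file is written on the database server, which may not be directly reachable on a managed Db2 Warehouse on Cloud instance:

-- server-side export to a comma-delimited file without character delimiters:
CALL SYSPROC.ADMIN_CMD(
  'EXPORT TO /tmp/fieldtest.csv OF DEL MODIFIED BY NOCHARDEL
   SELECT NAME, COUNT(*)
   FROM FIELDTEST
   WHERE ID = ''7b9bbe44d45d8f2ac324849a4951da54''
   GROUP BY NAME');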