Ordered count of consecutive repeats / duplicates

Posted 2019-07-02 16:04

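I highly doubt I'm doing this in the most efficient manner, which is why I tagged this plpgsql. I need to run this on 2 billion rows for a thousand measurement systems.

You have measurement systems that often report the previous value when they lose connectivity. They lose connectivity in short spurts often, but sometimes for a long time. You need to aggregate, but when you do, you need to look at how long a value was repeating and apply various filters based on that information. Say you are measuring mpg on a car, but it stays stuck at 20 mpg for an hour, then moves to 20.1 and so on. You'll want to evaluate the accuracy while it is stuck. You could also add alternative rules that look for when the car is on the highway; with window functions you can generate the "state" of the car and have something to group on. Without further ado: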

--here's my data: different systems, the time of measurement, and the actual measurement
--the raw data also has a flag for whether or not a row is a repeat (hence the included window function)
select * into temporary table cumulative_repeat_calculator_data
FROM
    (
    select 
    system_measured, time_of_measurement, measurement, 
    case when 
     measurement = lag(measurement,1) over (partition by system_measured order by time_of_measurement asc) 
     then 1 else 0 end as repeat
    FROM
    (
    SELECT 5 as measurement, 1 as time_of_measurement, 1 as system_measured
    UNION
    SELECT 150 as measurement, 2 as time_of_measurement, 1 as system_measured
    UNION
    SELECT 5 as measurement, 3 as time_of_measurement, 1 as system_measured
    UNION
    SELECT 5 as measurement, 4 as time_of_measurement, 1 as system_measured
    UNION
    SELECT 5 as measurement, 1 as time_of_measurement, 2 as system_measured
    UNION
    SELECT 5 as measurement, 2 as time_of_measurement, 2 as system_measured
    UNION
    SELECT 5 as measurement, 3 as time_of_measurement, 2 as system_measured
    UNION
    SELECT 5 as measurement, 4 as time_of_measurement, 2 as system_measured
    UNION
    SELECT 150 as measurement, 5 as time_of_measurement, 2 as system_measured
    UNION
    SELECT 5 as measurement, 6 as time_of_measurement, 2 as system_measured
    UNION
    SELECT 5 as measurement, 7 as time_of_measurement, 2 as system_measured
    UNION
    SELECT 5 as measurement, 8 as time_of_measurement, 2 as system_measured
    ) as data
) as data;

--unfortunately you can't have window functions within window functions, so I had to break it down into a subquery
--what we need is something to partition on, the 'state' of the system if you will, so I ran a running total of the nonrepeats
--this creates a column that stays the same while your data is repeating - aka something you can partition/group on
select * into temporary table cumulative_repeat_calculator_step_1
FROM
    (
    select 
    *,
    sum(case when repeat = 0 then 1 else 0 end) over (partition by system_measured order by time_of_measurement asc) as cumlative_sum_of_nonrepeats_by_system
    from cumulative_repeat_calculator_data
    order by system_measured, time_of_measurement
) as data;

--finally, the query. I didn't bother showing my desired output, because this (finally) got it
--I wanted a sequential count of repeats that restarts when it stops repeating, and starts with the first repeat
--what you can do now is take the average measurement under some condition based on how long it was repeating, for example  
select *, 
case when repeat = 0 then 0
else
row_number() over (partition by cumlative_sum_of_nonrepeats_by_system, system_measured order by time_of_measurement) - 1
end as ordered_repeat
from cumulative_repeat_calculator_step_1
order by system_measured, time_of_measurement

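So what would you do differently in order to run this on a huge table, or what alternative tools would you use? I'm thinking plpgsql because I suspect this needs to be done in-database, or during the data insertion process, although I generally work with the data after it's loaded. Is there any way to get this in one sweep without resorting to subqueries?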

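I have tested one alternative method, but it still relies on a subquery, and I think the approach above is faster. For that method you create a "starts and stops" table with start_timestamp, end_timestamp, system. Then you join it to the larger table, and if the timestamp falls between those two values you classify the row as being in that state, which is essentially an alternative to cumlative_sum_of_nonrepeats_by_system. But when you do this, you join on 1=1 across thousands of devices and thousands or millions of "events". Do you think that's a better way to go?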
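For reference, the "starts and stops" alternative might look roughly like the sketch below. The table name state_ranges and the column state_id are hypothetical, and the ranges are derived here from cumulative_repeat_calculator_step_1 only to have sample data; in practice they would come from wherever the events are recorded. The join is restricted to the same system, which is an assumption that goes beyond the plain 1=1 join described above.

```sql
-- Hypothetical "starts and stops" table: one row per run of identical readings.
CREATE TEMP TABLE state_ranges AS
SELECT system_measured
     , cumlative_sum_of_nonrepeats_by_system AS state_id   -- hypothetical label
     , min(time_of_measurement) AS start_timestamp
     , max(time_of_measurement) AS end_timestamp
FROM   cumulative_repeat_calculator_step_1
GROUP  BY system_measured, cumlative_sum_of_nonrepeats_by_system;

-- Range join: each measurement gets the state whose interval contains it.
SELECT d.system_measured, d.time_of_measurement, d.measurement, s.state_id
FROM   cumulative_repeat_calculator_data d
JOIN   state_ranges s
  ON   s.system_measured = d.system_measured
 AND   d.time_of_measurement BETWEEN s.start_timestamp AND s.end_timestamp
ORDER  BY d.system_measured, d.time_of_measurement;
```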

Answer 1:

Test case

First, a more useful way to present your data - or even better, in an sqlfiddle, ready to play with:

CREATE TEMP TABLE data(
   system_measured int
 , time_of_measurement int
 , measurement int
);

INSERT INTO data VALUES
 (1, 1, 5)
,(1, 2, 150)
,(1, 3, 5)
,(1, 4, 5)
,(2, 1, 5)
,(2, 2, 5)
,(2, 3, 5)
,(2, 4, 5)
,(2, 5, 150)
,(2, 6, 5)
,(2, 7, 5)
,(2, 8, 5);

Simplified query

Since it remains unclear, I am assuming only the above as given.
Next, I simplified your query to arrive at:

WITH x AS (
   SELECT *, CASE WHEN lag(measurement) OVER (PARTITION BY system_measured
                               ORDER BY time_of_measurement) = measurement
                  THEN 0 ELSE 1 END AS step
   FROM   data
   )
   , y AS (
   SELECT *, sum(step) OVER(PARTITION BY system_measured
                            ORDER BY time_of_measurement) AS grp
   FROM   x
   )
SELECT * ,row_number() OVER (PARTITION BY system_measured, grp
                             ORDER BY time_of_measurement) - 1 AS repeat_ct
FROM   y
ORDER  BY system_measured, time_of_measurement;
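The question mentions filtering on how long a value kept repeating. As a sketch of that step (not part of the query above), the same grp column can be aggregated into one row per run. The column names run_start, run_end and repeats, and the threshold of 2, are assumptions chosen for illustration.

```sql
WITH x AS (
   SELECT *, CASE WHEN lag(measurement) OVER (PARTITION BY system_measured
                               ORDER BY time_of_measurement) = measurement
                  THEN 0 ELSE 1 END AS step
   FROM   data
   )
   , y AS (
   SELECT *, sum(step) OVER (PARTITION BY system_measured
                             ORDER BY time_of_measurement) AS grp
   FROM   x
   )
SELECT system_measured, grp
     , min(measurement)         AS measurement  -- constant within a run
     , min(time_of_measurement) AS run_start
     , max(time_of_measurement) AS run_end
     , count(*) - 1             AS repeats      -- rows after the first one
FROM   y
GROUP  BY system_measured, grp
HAVING count(*) - 1 >= 2                        -- hypothetical threshold
ORDER  BY system_measured, run_start;
```

With the test data above, this keeps two runs, both for system 2: times 1 to 4 (3 repeats) and times 6 to 8 (2 repeats).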

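Now, while it is all nice and shiny to use pure SQL, this will be much faster with a plpgsql function, because it can do the job in a single table scan where this query needs at least three scans.

Faster with a plpgsql function: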

CREATE OR REPLACE FUNCTION x.f_repeat_ct()
  RETURNS TABLE (
    system_measured int
  , time_of_measurement int
  , measurement int, repeat_ct int
  )  LANGUAGE plpgsql AS
$func$
DECLARE
   r    data;     -- table name serves as record type
   r0   data;
BEGIN

-- SET LOCAL work_mem = '1000 MB';  -- uncomment and adapt if needed, see below!

repeat_ct := 0;   -- init

FOR r IN
   SELECT * FROM data d ORDER BY d.system_measured, d.time_of_measurement
LOOP
   IF  r.system_measured = r0.system_measured
       AND r.measurement = r0.measurement THEN
      repeat_ct := repeat_ct + 1;   -- same system and value: increment count
   ELSE
      repeat_ct := 0;               -- start new count
   END IF;

   RETURN QUERY SELECT r.*, repeat_ct;

   r0 := r;                         -- remember last row
END LOOP;

END
$func$;

Call:

SELECT * FROM x.f_repeat_ct();

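Be sure to table-qualify your column names at all times in this kind of plpgsql function, because the output parameters use the same names. An unqualified reference conflicts with them: depending on the plpgsql.variable_conflict setting it either raises an ambiguity error (the default) or resolves to the output parameter.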

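Billions of rows

If you have billions of rows, you may want to split this operation up. I quote the manual here:

Note: The current implementation of RETURN NEXT and RETURN QUERY stores the entire result set before returning from the function, as discussed above. That means that if a PL/pgSQL function produces a very large result set, performance might be poor: data will be written to disk to avoid memory exhaustion, but the function itself will not return until the entire result set has been generated. A future version of PL/pgSQL might allow users to define set-returning functions that do not have this limitation. Currently, the point at which data begins being written to disk is controlled by the work_mem configuration variable. Administrators who have sufficient memory to store larger result sets in memory should consider increasing this parameter.

Consider computing rows for one system at a time, or set a high enough value for work_mem to cope with the load. See the manual for more about work_mem.

One way would be to set a very high value for work_mem with SET LOCAL in your function, which is only effective for the current transaction. I added a commented line in the function. Do not set it very high globally, as this could nuke your server. Read the manual.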
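Computing one system at a time could be sketched as below. The function name f_repeat_ct_for_system and the parameter _system are hypothetical; the body is a variation of the function above that filters on a single system and uses RETURN NEXT. It is a sketch under those assumptions, not a tuned implementation.

```sql
-- Hypothetical per-system variant of the function above.
CREATE OR REPLACE FUNCTION f_repeat_ct_for_system(_system int)
  RETURNS TABLE (
    system_measured int
  , time_of_measurement int
  , measurement int
  , repeat_ct int
  )  LANGUAGE plpgsql AS
$func$
DECLARE
   r    data;     -- current row; table name serves as record type
   r0   data;     -- previous row; NULL on the first iteration
BEGIN
   repeat_ct := 0;

   FOR r IN
      SELECT * FROM data d
      WHERE  d.system_measured = _system
      ORDER  BY d.time_of_measurement
   LOOP
      IF r.measurement = r0.measurement THEN
         repeat_ct := repeat_ct + 1;   -- same value again: increment count
      ELSE
         repeat_ct := 0;               -- new value (or first row): restart
      END IF;

      -- copy the current row into the output columns and emit it
      system_measured     := r.system_measured;
      time_of_measurement := r.time_of_measurement;
      measurement         := r.measurement;
      RETURN NEXT;

      r0 := r;                         -- remember last row
   END LOOP;
END
$func$;

-- One call per system keeps each materialized result set small.
SELECT * FROM f_repeat_ct_for_system(2);
```

The caller would then loop over the distinct system_measured values, for example from the client or a driving script, and store or aggregate each partial result before moving on.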
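If you prefer not to touch the function body, the same effect can be had from the calling side by wrapping the call in a transaction. A minimal sketch, with the value taken from the commented line in the function as a placeholder to adapt to your RAM:

```sql
BEGIN;
SET LOCAL work_mem = '1000MB';   -- placeholder value; applies to this transaction only
SELECT * FROM x.f_repeat_ct();
COMMIT;                          -- work_mem falls back to the session setting

SHOW work_mem;                   -- confirms the previous value is in effect again
```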



Source: Ordered count of consecutive repeats / duplicates