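I highly doubt I'm doing this in the most efficient way, which is why I tagged this plpgsql. I need to run this on 2 billion rows for a thousand measurement systems.
You have measurement systems that often report the previous value when they lose connectivity, and they lose connectivity frequently, usually in short spurts but sometimes for a long time. You need to aggregate, but when you do, you need to look at how long a value was repeating and apply different filters based on that information. Say you are measuring mpg on a car, but the reading is stuck at 20 mpg for an hour and then moves around to 20.1 and so on. You'll want to evaluate the accuracy while it is stuck. You could also put in some alternative rules that look for when the car is on the highway, and with window functions you can generate the "state" of the car and have something to group on. Without further ado: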
--here's my data: you have different systems, the time of measurement, and the actual measurement
--the raw data also records whether or not each row is a repeat (hence the included window function)
select * into temporary table cumulative_repeat_calculator_data
FROM
(
select
system_measured, time_of_measurement, measurement,
case when
measurement = lag(measurement,1) over (partition by system_measured order by time_of_measurement asc)
then 1 else 0 end as repeat
FROM
(
SELECT 5 as measurement, 1 as time_of_measurement, 1 as system_measured
UNION
SELECT 150 as measurement, 2 as time_of_measurement, 1 as system_measured
UNION
SELECT 5 as measurement, 3 as time_of_measurement, 1 as system_measured
UNION
SELECT 5 as measurement, 4 as time_of_measurement, 1 as system_measured
UNION
SELECT 5 as measurement, 1 as time_of_measurement, 2 as system_measured
UNION
SELECT 5 as measurement, 2 as time_of_measurement, 2 as system_measured
UNION
SELECT 5 as measurement, 3 as time_of_measurement, 2 as system_measured
UNION
SELECT 5 as measurement, 4 as time_of_measurement, 2 as system_measured
UNION
SELECT 150 as measurement, 5 as time_of_measurement, 2 as system_measured
UNION
SELECT 5 as measurement, 6 as time_of_measurement, 2 as system_measured
UNION
SELECT 5 as measurement, 7 as time_of_measurement, 2 as system_measured
UNION
SELECT 5 as measurement, 8 as time_of_measurement, 2 as system_measured
) as data
) as data;
--unfortunately you can't have window functions within window functions, so I had to break it down into subqueries
--what we need is something to partition on, the 'state' of the system if you will, so I ran a running total of the nonrepeats
--this creates a column that stays the same while your data is repeating - aka something you can partition/group on
select * into temporary table cumulative_repeat_calculator_step_1
FROM
(
select
*,
sum(case when repeat = 0 then 1 else 0 end) over (partition by system_measured order by time_of_measurement asc) as cumlative_sum_of_nonrepeats_by_system
from cumulative_repeat_calculator_data
order by system_measured, time_of_measurement
) as data;
--finally, the query. I didn't bother showing my desired output, because this (finally) produces it
--I wanted a sequential count of repeats that restarts when the value stops repeating, and starts at 1 with the first repeat
--what you can do now is, for example, take the average measurement under some condition based on how long it was repeating
select *,
case when repeat = 0 then 0
else
row_number() over (partition by cumlative_sum_of_nonrepeats_by_system, system_measured order by time_of_measurement) - 1
end as ordered_repeat
from cumulative_repeat_calculator_step_1
order by system_measured, time_of_measurement;
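As a sketch of the kind of filter the last comment describes, the three steps above can be written as one statement with CTEs and then aggregated. The inline `VALUES` rows mirror the sample data, while the `state_id` column name and the `ordered_repeat < 2` cutoff are assumptions chosen only for illustration. This merely restates the same nested window passes in one statement; it is not claimed to reduce the number of sorts or scans.

```sql
-- Sketch only: same logic as the temp-table steps, written as CTEs.
WITH data (system_measured, time_of_measurement, measurement) AS (
    VALUES (1, 1, 5), (1, 2, 150), (1, 3, 5), (1, 4, 5),
           (2, 1, 5), (2, 2, 5), (2, 3, 5), (2, 4, 5),
           (2, 5, 150), (2, 6, 5), (2, 7, 5), (2, 8, 5)
),
flagged AS (
    -- repeat = 1 when the value equals the previous value for the same system
    SELECT *,
           CASE WHEN measurement = lag(measurement) OVER (
                         PARTITION BY system_measured
                         ORDER BY time_of_measurement)
                THEN 1 ELSE 0 END AS repeat
    FROM data
),
grouped AS (
    -- running count of non-repeats: constant within one run of identical values
    SELECT *,
           sum(CASE WHEN repeat = 0 THEN 1 ELSE 0 END) OVER (
               PARTITION BY system_measured
               ORDER BY time_of_measurement) AS state_id  -- assumed name
    FROM flagged
),
numbered AS (
    -- 0 for the first row of a run, then 1, 2, ... for each further repeat
    SELECT *,
           row_number() OVER (
               PARTITION BY system_measured, state_id
               ORDER BY time_of_measurement) - 1 AS ordered_repeat
    FROM grouped
)
SELECT system_measured,
       avg(measurement) AS avg_measurement,
       count(*)         AS rows_kept
FROM numbered
WHERE ordered_repeat < 2   -- assumed cutoff: drop rows stuck for 2+ repeats
GROUP BY system_measured
ORDER BY system_measured;
```

The first row of every run is always a non-repeat, so `row_number() - 1` already yields 0 there and the `case when repeat = 0` guard from the original query is not needed in this form.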
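So, what would you do differently in order to run this on a huge table, or what alternative tools would you use? I'm thinking of plpgsql because I suspect this needs to be done in-database, or during the data insertion process, although I generally work with the data after it's loaded. Is there any way to get this in a single scan without resorting to subqueries?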
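I have tested one alternative method, but it still relies on a subquery, and I think this is faster. For that method you create a "starts and stops" table with start_timestamp, end_timestamp, and system. Then you join it to the larger table, and if the timestamp falls between those two values, you classify the row as being in that state, which is essentially an alternative to cumlative_sum_of_nonrepeats_by_system. But when you do this, you are joining on 1 = 1 for thousands of devices and thousands of "events". Do you think that's a better way to go?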
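To make the "starts and stops" alternative concrete, here is a minimal sketch. The table names `measurements_demo` and `starts_and_stops_demo` and their sample rows are hypothetical. Note one departure from the description: this version matches on the system as well as the timestamp range, rather than joining on 1 = 1 and filtering afterwards.

```sql
-- Hypothetical demo tables; names and rows are assumptions for illustration.
CREATE TEMPORARY TABLE measurements_demo (
    system_measured     integer,
    time_of_measurement integer,
    measurement         integer
);
INSERT INTO measurements_demo VALUES
    (1, 1, 5), (1, 2, 150), (1, 3, 5), (1, 4, 5);

-- One row per "state": the period during which a system kept reporting one value.
CREATE TEMPORARY TABLE starts_and_stops_demo (
    system_measured integer,
    start_timestamp integer,
    end_timestamp   integer
);
INSERT INTO starts_and_stops_demo VALUES
    (1, 1, 1), (1, 2, 2), (1, 3, 4);

-- Range join: each measurement picks up the state whose period contains it.
-- start_timestamp then plays the role of cumlative_sum_of_nonrepeats_by_system.
SELECT m.system_measured,
       m.time_of_measurement,
       m.measurement,
       s.start_timestamp AS state_start,
       row_number() OVER (
           PARTITION BY m.system_measured, s.start_timestamp
           ORDER BY m.time_of_measurement) - 1 AS ordered_repeat
FROM measurements_demo AS m
JOIN starts_and_stops_demo AS s
  ON  s.system_measured = m.system_measured
  AND m.time_of_measurement BETWEEN s.start_timestamp AND s.end_timestamp
ORDER BY m.system_measured, m.time_of_measurement;
```

This assumes the periods for one system never overlap; if they did, a measurement would match more than one state and be counted twice.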