Efficient way to Fill Time-Series per group

I was looking for a way to fill a time series data set by time, per group. The very very inefficient way I was using was to split the data set per group and apply a custom time-series fill function (create sequence between max and min, and merge) in all elements of that list. Needless to say, this operations would not go pass the splitting.

My dataset looks like,

    source                 grp cnt
 1:     83 2017-06-06 13:00:00   1
 2:     83 2017-06-06 23:00:00   1
 3:     83 2017-06-07 03:00:00   1
 4:     83 2017-06-07 07:00:00   2
 5:     83 2017-06-07 13:00:00   1
 6:     83 2017-06-07 19:00:00   1
 7:     83 2017-06-08 00:00:00   1
 8:     83 2017-06-08 14:00:00   1
 9:     83 2017-06-08 15:00:00   1
10:     83 2017-06-08 20:00:00   1
11:    137 2017-06-04 02:00:00   1
12:    137 2017-06-04 05:00:00   1
13:    137 2017-06-04 23:00:00   1
...

My attempt was to use tidyverse methods by utilising the complete function, i.e.

library(tidyverse)

d1 %>% 
 group_by(source) %>% 
 complete(source, grp = seq(min(grp), max(grp), by = 'hour'))

However, after about 40-45 seconds, a progress bar appeared (apparently a neat feature in some tidyverse functions - I suspect complete in this case) which estimated 9 hours to completion. My dataset is very very big and this is not the lightest operation, so something really efficient is what I am looking for.

DATA

#dput(d1)
structure(list(source = c("83", "83", "83", "83", "83", "83", 
"83", "83", "83", "83", "137", "137", "137", "137", "137", "137", 
"137", "137", "137", "137", "137", "137", "137", "137"), grp = structure(c(1496743200, 
1496779200, 1496793600, 1496808000, 1496829600, 1496851200, 1496869200, 
1496919600, 1496923200, 1496941200, 1496530800, 1496541600, 1496606400, 
1496617200, 1496649600, 1496696400, 1496808000, 1496844000, 1496876400, 
1496962800, 1497880800, 1497888000, 1497978000, 1497996000), class = c("POSIXct", 
"POSIXt"), tzone = ""), cnt = c(1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
)), .Names = c("source", "grp", "cnt"), row.names = c(NA, -24L
), class = "data.frame")

标签： r data.table tidyverse

2条回答

啃猪蹄的小仙女

2楼-- · 2019-01-15 17:14

This can be done using zoo as well. This is an order of magnitude faster than the code and data in the question but not as fast as the data.table solution although there exists the possibility of speeding it iup further if the last line of code shown below is not needed.

We read d1 into a zoo object z splitting it to give a multivariate time series having a column for each source. We then merge that with a zero width series having all the times and fortify that back to a data frame using the melt=TRUE argument to get a long form data.frame. If a wide form multivariate zoo series can be used then you could skip the last line in which case it would then be even faster.

library(zoo)

z <- read.zoo(d1, split = 1, index = 2) # wide form
zz <- merge(z, zoo(, seq(start(z), end(z), "hour"))) # expand
fortify(zz, melt = TRUE) # convert to long form data.frame

0人赞添加讨论(0) 举报

混吃等死

3楼-- · 2019-01-15 17:15

It appears that data.table is really much faster than the tidyverse option. So merely translating the above into data.table(compliments of @Frank) completed the operation in little under 3 minutes.

library(data.table)

mDT = setDT(d1)[, .(grp = seq(min(grp), max(grp), by = "hour")), by = source]
new_D <- d1[mDT, on = names(mDT)]

new_D <- new_D[, cnt := replace(cnt, is.na(cnt), 0)] #If needed

0人赞添加讨论(0) 举报

Efficient way to Fill Time-Series per group

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间