Why is the time complexity of this loop non-linear, and why is it so slow? The loop takes ~38s for N=50k and ~570s for N=200k. Is there a faster way to do this? Rprof() seems to indicate that writing to memory is very slow.
df <- data.frame(replicate(5, runif(200000)))
df[, 1:3] <- round(df[, 1:3])

Rprof(line.profiling = TRUE); timer <- proc.time()
x <- df; N <- nrow(df); i <- 1
# rows whose X1:X3 values equal those of the following row
ind <- df[1:(N - 1), 1:3] == df[2:N, 1:3]
rind <- which(apply(ind, 1, all))
N <- length(rind)
while (i <= N) {
    # fold this row's X4 into the next row, recompute its X5, and mark this row for removal
    x$X4[rind[i] + 1] <- x$X4[rind[i] + 1] + x$X4[rind[i]]
    x$X5[rind[i] + 1] <- x$X4[rind[i] + 1] * x$X3[rind[i] + 1]
    x$X5[rind[i] + 1] <- trunc(x$X5[rind[i] + 1] * 10^8) / 10^8
    x$X1[rind[i]] <- NA
    i <- i + 1
}
x <- na.omit(x)
proc.time() - timer; Rprof(NULL)
summaryRprof(lines = "show")
The purpose of this algorithm is to iterate over the data frame and combine adjacent rows that match on certain elements. That is, it removes one of the rows and adds some of that row's values to the other row. The resulting data frame should have n fewer rows, where n is the number of matching adjacent rows in the original data frame. Every time a pair of rows is combined, the indices of the source data frame and the new data frame get out of sync by 1, since one row is removed/omitted from the new frame; in the original version of this code, i kept track of the position in the source data frame and q kept track of the position in the new data frame (in the updated code above, the matching rows are precomputed in rind, so q is no longer needed).
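As a small illustration of the intended behaviour (the values below are made up), if rows 1 and 2 match on X1:X3, row 2 absorbs row 1's X4, its X5 is recomputed, and row 1 is dropped:

# Illustrative toy input: rows 1 and 2 match on X1:X3
toy <- data.frame(X1 = c(1, 1, 0), X2 = c(0, 0, 1), X3 = c(1, 1, 1),
                  X4 = c(0.2, 0.3, 0.4), X5 = c(0.2, 0.3, 0.4))
# After combining: row 2 ends up with X4 = 0.3 + 0.2 = 0.5 and
# X5 = trunc(0.5 * 1 * 10^8) / 10^8 = 0.5; row 1 is removed, leaving 2 rows.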
The code above is updated thanks to @joran's comment. The performance is improved substantially, to ~5.5s for N=50k and ~88s for N=200k. However, the time complexity is still non-linear, which I can't fathom. I need to run this at N = 1 million or more, so it's still not great speed.
Following is just a rewrite of @Martin Morgan's answer, utilizing the fast subsetting of data.table. It is around 3x faster than the data.frame approach.
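A minimal sketch of such a rewrite, doing the sequential X4 update on a plain vector and then assigning columns by reference with set() (the function name f_dt and the details are illustrative assumptions, not the answer's exact code):

library(data.table)

f_dt <- function(df) {
    dt <- as.data.table(df)
    n  <- nrow(dt)
    # adjacent rows that agree on X1:X3
    same  <- with(dt, X1[-n] == X1[-1] & X2[-n] == X2[-1] & X3[-n] == X3[-1])
    rind  <- which(same)
    rind1 <- rind + 1L
    # sequential accumulation on a plain numeric vector
    X4 <- dt[["X4"]]
    for (i in seq_along(rind))
        X4[rind1[i]] <- X4[rind1[i]] + X4[rind[i]]
    # write the results back by reference
    set(dt, j = "X4", value = X4)
    set(dt, i = rind1, j = "X5",
        value = trunc(X4[rind1] * dt[["X3"]][rind1] * 10^8) / 10^8)
    dt[-rind]    # drop the rows that were merged into their successors
}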
Only the X4 column update depends on previous values, so the loop can be mostly 'vectorized' (with a little bit of optimization, avoiding the addition of 1 to rind in each iteration), as in the sketch below.
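A sketch of that partial vectorization, assuming x and rind as computed in the question (the helper name rind1 is an assumption):

rind1 <- rind + 1L                       # successor rows, computed once up front
for (i in seq_along(rind))               # only X4 needs the element-by-element update
    x$X4[rind1[i]] <- x$X4[rind1[i]] + x$X4[rind[i]]
x$X5[rind1] <- trunc(x$X4[rind1] * x$X3[rind1] * 10^8) / 10^8
x$X1[rind]  <- NA
x <- na.omit(x)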
X4 is a numeric value, and the update can be made more efficient by updating it as a plain vector rather than as a column of a data.frame. For comparison, we have the variants sketched below.
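The following sketch shows the vector-update idea and wraps each variant in a function for comparison (the names f0, f1 and f2 are illustrative, not the answer's exact code):

## f0: the original while-loop, updating data.frame columns cell by cell
f0 <- function(df) {
    x <- df; N <- nrow(df); i <- 1
    ind <- df[1:(N - 1), 1:3] == df[2:N, 1:3]
    rind <- which(apply(ind, 1, all))
    N <- length(rind)
    while (i <= N) {
        x$X4[rind[i] + 1] <- x$X4[rind[i] + 1] + x$X4[rind[i]]
        x$X5[rind[i] + 1] <- x$X4[rind[i] + 1] * x$X3[rind[i] + 1]
        x$X5[rind[i] + 1] <- trunc(x$X5[rind[i] + 1] * 10^8) / 10^8
        x$X1[rind[i]] <- NA
        i <- i + 1
    }
    na.omit(x)
}

## f1: loop only over the X4 update, vectorize the rest
##     (the data.frame column is still updated inside the loop)
f1 <- function(df) {
    x <- df
    ind <- df[-nrow(df), 1:3] == df[-1, 1:3]
    rind <- which(apply(ind, 1, all))
    rind1 <- rind + 1L
    for (i in seq_along(rind))
        x$X4[rind1[i]] <- x$X4[rind1[i]] + x$X4[rind[i]]
    x$X5[rind1] <- trunc(x$X4[rind1] * x$X3[rind1] * 10^8) / 10^8
    x$X1[rind] <- NA
    na.omit(x)
}

## f2: as f1, but update X4 as a plain vector and write it back once
f2 <- function(df) {
    x <- df
    ind <- df[-nrow(df), 1:3] == df[-1, 1:3]
    rind <- which(apply(ind, 1, all))
    rind1 <- rind + 1L
    X4 <- x$X4
    for (i in seq_along(rind))
        X4[rind1[i]] <- X4[rind1[i]] + X4[rind[i]]
    x$X4 <- X4
    x$X5[rind1] <- trunc(X4[rind1] * x$X3[rind1] * 10^8) / 10^8
    x$X1[rind] <- NA
    na.omit(x)
}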
The results are the same
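For example, with the sketch functions above (a modest N keeps the slow f0 tolerable):

df <- data.frame(replicate(5, runif(10000)))
df[, 1:3] <- round(df[, 1:3])

identical(f0(df), f1(df))    # expected TRUE
identical(f0(df), f2(df))    # expected TRUE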
The speedup is substantial (using library(microbenchmark)).
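A benchmark along those lines might look like this (a small N is used so the slow loop finishes quickly; timings are machine-dependent):

library(microbenchmark)

df <- data.frame(replicate(5, runif(10000)))
df[, 1:3] <- round(df[, 1:3])

microbenchmark(f0(df), f1(df), f2(df), times = 5)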
The reason for the difference can be seen when R has been compiled with memory profiling enabled, so that tracemem() reports every copy of a traced object.
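A sketch of the kind of experiment involved (tracemem() prints one line for each copy made of the traced object; the exact number of lines depends on the R version):

df <- data.frame(replicate(5, runif(10000)))
df[, 1:3] <- round(df[, 1:3])
x <- df

## Updating one cell of a data.frame column: each assignment prints copy messages
tracemem(x)
x$X4[1] <- x$X4[1] + 1
x$X4[2] <- x$X4[2] + 1
untracemem(x)

## Updating a plain numeric vector: the first assignment copies the column it still
## shares with x; after that the vector is modified in place and nothing is printed
X4 <- x$X4
tracemem(X4)
X4[1] <- X4[1] + 1
X4[2] <- X4[2] + 1
untracemem(X4)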
Each line of that tracemem output indicates a memory copy, so updating a cell in a data frame incurs 5 copies of the outer structure or the vector itself. In contrast, a vector can be updated without any copies.
(The first assignment is expensive because it represents the duplication of the data.frame column; subsequent updates are to X4, only X4 refers to the vector being updated, and the vector does not need to be duplicated.) The data.frame implementation does seem to scale non-linearly.
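This can be checked with a small scaling experiment, for example (illustrative, using the sketch functions above; absolute times are machine-dependent):

## Time the data.frame-updating and vector-updating variants at doubling N
sizes <- c(10000, 20000, 40000, 80000)
timings <- sapply(sizes, function(n) {
    df <- data.frame(replicate(5, runif(n)))
    df[, 1:3] <- round(df[, 1:3])
    c(f1 = system.time(f1(df))[["elapsed"]],
      f2 = system.time(f2(df))[["elapsed"]])
})
colnames(timings) <- sizes
timings    # expectation: f1 grows roughly quadratically, f2 roughly linearly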
The reason is apparent in the tracemem output above -- updating a row triggers a copy of the entire column. So the algorithm scales as the number of rows to update times the number of rows in a column, approximately quadratic.
f4a() appears to scale linearly. One could try to be clever about vectorizing the loop, but is it now necessary?
A tuned version of the data processing part of the function uses negative indexing (e.g., -nrow(df)) to remove rows from the data frame, rowSums() instead of apply(), and unname() so that subset operations don't carry around unused names; a sketch of such a tuned variant follows.
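A minimal sketch along those lines (the function name f5 and the exact details are assumptions, not the answer's code):

f5 <- function(df) {
    x <- df
    ## unname() drops dimnames so that subsetting doesn't drag row/column names along;
    ## negative indexing (-nrow(m), -1) drops the last / first row for the comparison
    m <- unname(as.matrix(df[, 1:3]))
    rind  <- which(rowSums(m[-nrow(m), ] == m[-1, ]) == 3L)
    rind1 <- rind + 1L
    X4 <- x$X4
    for (i in seq_along(rind))
        X4[rind1[i]] <- X4[rind1[i]] + X4[rind[i]]
    x$X4 <- X4
    x$X5[rind1] <- trunc(X4[rind1] * x$X3[rind1] * 10^8) / 10^8
    x[-rind, ]    # negative indexing removes the merged rows directly
}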
Compared to the data.table solution suggested by @Khashaa, the base R version performs favorably. (The pre-tuning version in f4a takes about 760ms, so more than twice as slow.)
The results from the data.table implementation are not correct, and I'm not enough of a data.table wizard (barely a data.table user) to know what the correct formulation is.
Compiling (does the benefit come exclusively from the for loop?) increases speed by about 20%.
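A sketch of that step with the base compiler package (again using the illustrative f5 from above; recent R versions byte-compile functions automatically via the JIT, so the explicit step matters less there):

library(compiler)
library(microbenchmark)

f5c <- cmpfun(f5)    # byte-compile; the explicit for loop is the main beneficiary

df <- data.frame(replicate(5, runif(200000)))
df[, 1:3] <- round(df[, 1:3])

microbenchmark(f5(df), f5c(df), times = 5)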