I am working with RMongoDB and I need to fill an empty data.frame with the values of a query. The results are quite long, about 2 million documents (rows).
While doing performance tests, I found out that the time for writing values to a row increases with the dimension of the data.frame. Maybe this is a well-known issue and I am the last one to notice it.
Some example code:

set.seed(20140430)
## small data.frame: 2,000 rows
nreg <- 2e3
dfres <- as.data.frame(matrix(rep(NA, nreg*7), nrow = nreg, ncol = 7))
## time a single row assignment, then 10 assignments at random positions
system.time(dfres[1e3, ] <- c(1:5, "a", "b"))
summary(replicate(10, system.time(dfres[sample(1:nreg, 1), ] <- c(1:5, "a", "b"))[3]))

## large data.frame: 2,000,000 rows
nreg <- 2e6
dfres <- as.data.frame(matrix(rep(NA, nreg*7), nrow = nreg, ncol = 7))
system.time(dfres[1e3, ] <- c(1:5, "a", "b"))
summary(replicate(10, system.time(dfres[sample(1:nreg, 1), ] <- c(1:5, "a", "b"))[3]))
On my machine, the assignment into the 2-million-row data.frame takes about 0.4 seconds, which is a lot of time if I want to fill the whole dataset that way. Here is a second simulation to illustrate the issue.
## measure average assignment time as the number of rows grows
nreg <- seq(2e1, 2e7, length.out = 10)
te <- NULL
for (i in nreg) {
  dfres <- as.data.frame(matrix(rep(NA, i*7), nrow = i, ncol = 7))
  ## average elapsed time of 10 single-row assignments at random positions
  te <- c(te, mean(replicate(10, {r <- sample(1:i, 1); system.time(dfres[r, ] <- c(1:5, "a", "b"))[3]})))
}
plot(nreg, te, xlab = "Number of rows", ylab = "Avg. time for 10 random assignments [sec]", type = "o")
#rm(nreg,dfres,te)
Question: Why does this happen? Is there a quicker way to fill the data.frame in memory?
Let's start with "columns" first, see what goes on, and then return to rows.
R versions < 3.1.0 (unnecessarily) copy the entire data.frame when you operate on it. For example, adding a "new" column results in a copy of all the "old" columns (you can tell because their memory addresses change), and the attributes are copied as well. What bites most is that these copies are deep copies, as opposed to shallow copies.
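The original example code didn't survive the formatting here; below is a minimal sketch of that kind of demonstration, using .Internal(inspect(...)) to print the memory address of each column (the answer's exact code may have differed).

df <- data.frame(x = 1:5, y = 6:10)
.Internal(inspect(df))   # note the addresses of columns x and y
df$new <- 11:15          # add a "new" column
.Internal(inspect(df))   # R < 3.1.0: x and y now have new addresses (deep copies)
                         # R >= 3.1.0: x and y keep their old addresses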
However, R v3.1.0 brought welcome changes in that the "old" columns are no longer deep copied. All credit to the R core dev team.
Running the same operation under R v3.1.0, you can see that the columns x and y aren't changed at all (their addresses stay the same, so they never show up as copied). This is a huge (and welcome) improvement! So far, we've looked at the issue of adding columns in R < 3.1.0 and in v3.1.0.
Now, coming to your question: what about the "rows"? Let's consider older versions of R first and then come back to R v3.1.0.

Once again we see that, in older versions of R, changing a value in column y results in column x being copied as well. In R v3.1.0, the same operation copies just column y. Once again, great improvements in R v3.1.0: R's copy-on-modify has gotten wiser.
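As before, the example code was lost in formatting; here is a minimal sketch of the kind of demonstration meant, again assuming the same inspect-based approach as above.

df <- data.frame(x = 1:5, y = 6:10)
.Internal(inspect(df))  # note the addresses of columns x and y
df[3L, "y"] <- 42L      # modify a single row of column y
.Internal(inspect(df))  # R < 3.1.0: both x and y get new addresses
                        # R v3.1.0: only y (the changed column) is copied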
data.table, on the other hand, goes one step further: we can sub-assign by reference, using either := or set(). I'll demonstrate using set() here.
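A minimal sketch of set() on data shaped like yours (the character column type is an assumption, matching the coercion your base-R code performs when it assigns c(1:5,"a","b")):

library(data.table)

nreg <- 2e6
dt <- as.data.table(matrix(NA_character_, nrow = nreg, ncol = 7))

## sub-assign one row by reference; no column is copied at all
r <- sample(nreg, 1)
system.time(set(dt, i = r, j = 1L:7L, value = as.list(c(1:5, "a", "b"))))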
Now, here's a comparison with base R and data.table on your data, with 2,000 to 20,000,000 rows in multiples of 10, against R v3.0.3 and v3.1.0 separately. You can find the code here.

Plot for comparison against R v3.0.3:
Plot for comparison against R v3.1.0:
The min, median, and max timings for R v3.0.3, R v3.1.0, and data.table on 20 million rows with 10 replications are:
This clearly shows the improvement in R v3.1.0, but it also shows that the column being changed is still copied, and that copy still consumes some time; this is overcome through sub-assignment by reference in data.table.

HTH