According to Creating an R dataframe row-by-row, it's not ideal to append to a data.frame using rbind, as it creates a copy of the whole data.frame each time. How do I accumulate data in R resulting in a data.frame without incurring this penalty? The intermediate format doesn't need to be a data.frame.
Well, I am very surprised that nobody has mentioned the conversion to a matrix yet...
Comparing with the dt.colon and dt.set functions defined by Ari B. Friedman, the conversion to a matrix has the best running time (slightly quicker than dt.colon). All assignments inside a matrix are done by reference, so no unnecessary memory copy is performed in this code.
CODE:
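A minimal sketch of the matrix approach (illustrative sizes and column names, not the original benchmark code):

```r
# Pre-allocate a numeric matrix, fill it in place, convert once at the end
n   <- 1e4
mat <- matrix(NA_real_, nrow = n, ncol = 2,
              dimnames = list(NULL, c("x", "y")))
for (i in seq_len(n)) {
  mat[i, 1] <- i        # element assignment does not copy the whole matrix
  mat[i, 2] <- i^2
}
df <- as.data.frame(mat)  # single conversion to data.frame at the end
```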
RESULT:
Pros of using a matrix: filling happens in place, so it is fast and avoids copying the whole object, and the result can be converted to a data.frame in one step at the end.
Con of using a matrix: all elements must share a single type (typically numeric), so mixed-type columns cannot be accumulated this way.
First approach
I tried accessing each element of a pre-allocated data.frame:
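The pattern looks roughly like this (sizes here are illustrative):

```r
# Pre-allocate the data.frame, then assign cell by cell
n  <- 1e4
df <- data.frame(x = numeric(n), y = numeric(n))
tracemem(df)                  # print a message whenever df is duplicated
for (i in seq_len(n)) {
  df[i, "x"] <- i
  df[i, "y"] <- i^2
}
```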
But tracemem goes crazy (i.e. the data.frame is being copied to a new address each time).
Alternative approach (doesn't work either)
One approach (not sure it's faster, as I haven't benchmarked it yet) is to create a list of data.frames, then stack them all together. Unfortunately, in creating the list I think you will be hard-pressed to pre-allocate. For instance:
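A rough sketch of the idea, with do.call(rbind, ...) standing in as one way to stack (not necessarily the original code):

```r
# Build a list of one-row data.frames, then stack them once at the end
n   <- 1000
lst <- vector("list", n)      # attempt to pre-allocate the list
tracemem(lst)                 # report duplications of the list
for (i in seq_len(n)) {
  lst[[i]] <- data.frame(x = i, y = i^2)  # tracemem may report copies of the list here
}
final <- do.call(rbind, lst)
```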
In other words, replacing an element of the list causes the list to be copied. I assume the whole list, but it's possible it's only that element of the list. I'm not intimately familiar with the details of R's memory management.
Probably the best approach
As with many speed- or memory-limited processes these days, the best approach may well be to use data.table instead of a data.frame. Since data.table has the := assign-by-reference operator, it can update without re-copying. But as @MatthewDowle points out, set() is the appropriate way to do this inside a loop, and doing so makes it faster still (results shown below).
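A sketch of both variants (sizes and column names are illustrative):

```r
library(data.table)

n  <- 1e4
dt <- data.table(x = numeric(n), y = numeric(n))   # pre-allocate

# := assigns by reference, row by row
for (i in seq_len(n)) {
  dt[i, `:=`(x = i, y = i^2)]
}

# set() skips the [.data.table overhead, so it is faster inside a loop
for (i in seq_len(n)) {
  set(dt, i, j = "x", value = i)
  set(dt, i, j = "y", value = i^2)
}
```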
Benchmarking
With the loop run 10,000 times, data.table is almost a full order of magnitude faster:
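A sketch of how such a timing comparison can be set up (the microbenchmark package is assumed; exact numbers depend on the machine):

```r
library(data.table)
library(microbenchmark)

n <- 1e3
microbenchmark(
  data.frame = {
    df <- data.frame(x = numeric(n))
    for (i in seq_len(n)) df[i, "x"] <- i
  },
  data.table = {
    dt <- data.table(x = numeric(n))
    for (i in seq_len(n)) dt[i, x := i]
  },
  times = 5
)
```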
And a comparison of := with set(): note that n here is 10^6, not 10^5 as in the benchmarks plotted above. So there's an order of magnitude more work, and the result is measured in milliseconds rather than seconds. Impressive indeed.

You could also have an empty list object whose elements are filled with data.frames; then collect the results at the end with sapply or similar. An example can be found here. This will not incur the penalties of growing an object.
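A sketch of that pattern, using do.call(rbind, ...) as one way to collect at the end (the helper data here is made up for illustration):

```r
# Fill a pre-created list with data.frames, combine once at the end
n       <- 1000
results <- vector("list", n)
for (i in seq_len(n)) {
  results[[i]] <- data.frame(x = i, y = i^2)
}
final <- do.call(rbind, results)   # single combine step, no repeated growing
```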
I like RSQLite for that matter: dbWriteTable(..., append = TRUE) statements while collecting, and a dbReadTable statement at the end. If the data is small enough, one can use the ":memory:" file; if it is big, the hard disk.
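A sketch of that workflow (table and column names are illustrative):

```r
library(RSQLite)

con <- dbConnect(SQLite(), ":memory:")   # or a file path for large data

for (i in 1:100) {
  chunk <- data.frame(x = i, y = i^2)
  if (!dbExistsTable(con, "results")) {
    dbWriteTable(con, "results", chunk)                 # first chunk creates the table
  } else {
    dbWriteTable(con, "results", chunk, append = TRUE)  # later chunks are appended
  }
}

final <- dbReadTable(con, "results")     # collect everything at the end
dbDisconnect(con)
```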
Of course, it cannot compete in terms of speed. But it might look better if the data.frames have more than one row, and you do not need to know the number of rows in advance.

This post suggests stripping off the data.frame/tibble class attributes using as.list, assigning list elements in place the usual way, and then converting the result back to a data.frame/tibble at the end. The computational complexity of this method grows linearly, but at a very small rate of less than 10e-6. The original article includes a plot illustrating this.
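A minimal sketch of that class-stripping approach (assuming the final number of rows n is known):

```r
# Drop the data.frame class, fill the underlying list, restore the class at the end
n   <- 1e4
df  <- data.frame(x = numeric(n), y = numeric(n))
lst <- as.list(df)            # plain list of column vectors, no data.frame overhead
for (i in seq_len(n)) {
  lst$x[i] <- i
  lst$y[i] <- i^2
}
df <- as.data.frame(lst)      # convert back to a data.frame once
```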