Is there a way to select a subset from objects (data frames, matrices, vectors) without making a copy of selected data?
I work with quite large data sets, but never change them. However, for convenience I often select subsets of the data to operate on. Making a copy of a large subset each time is very memory inefficient, but both normal indexing and `subset` (and thus the `xapply()` family of functions) create copies of the selected data. So I'm looking for functions or data structures that can overcome this issue.
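To make the cost concrete, here is a minimal base R sketch. The vector and its size are made up for illustration. It only shows that ordinary indexing allocates a new object holding the selected elements, rather than a view onto the original.

```r
# Hypothetical example data; the size is arbitrary.
x <- runif(1e6)

# Ordinary indexing materialises the selected elements in a new vector.
half <- x[seq_len(5e5)]

size_x    <- as.numeric(object.size(x))
size_half <- as.numeric(object.size(half))

# The subset takes roughly half the memory of the original, in addition to it.
print(c(full = size_x, subset = size_half))

# Writing to the subset leaves the original untouched, as expected for a copy.
half[1] <- -1
print(x[1] >= 0)
```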
Some possible approaches that may fit my needs and hopefully are implemented in some R packages:
- copy-on-write mechanism, i.e. data structures that are copied only when you add or rewrite existing elements;
- immutable data structures, that only require recreating indexing information for the data structure, but not its content (like making substring from the string by only creating small object that holds length and a pointer to the same char array);
- `xapply()` analogues that do not create subsets.
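For reference, base R already applies copy-on-write to whole objects, just not to subsets. The sketch below (variable names are illustrative) shows that assigning a vector to a second name copies nothing until one of the names is written to. `tracemem()` is only available when R was built with memory profiling, so the calls are guarded.

```r
x <- runif(1e6)
y <- x                      # no data copied yet: both names refer to one vector

# tracemem() needs an R build with memory profiling enabled.
traced <- capabilities("profmem")
if (traced) tracemem(x)     # report when this vector gets duplicated

y[1] <- -1                  # the first write to y forces the actual copy
if (traced) untracemem(x)

# x keeps its original value; only y was changed.
print(c(x = x[1], y = y[1]))
```

This is the behaviour the first bullet asks for, but it covers only full-object assignment. An expression like `x[1:10]` still allocates a new vector.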
Try package `ref`. Specifically, its `refdata` class.

What you might be missing about `data.table` is that when grouping (`by=` parameter) the subsets of data are not copied, so that's fast. [Well, technically they are, but into a shared area of memory which is reused for each group, and copied using memcpy, which is much faster than R's for loops in C.]

`:=` in `data.table` is one way to modify a `data.table` in place. `data.table` departs from usual R programming style in that it is not copied-on-write. The user has to call `copy()` explicitly to copy a (potentially very large) table, even within a function.

You're right that there isn't a mechanism like `refdata` built into `data.table`. I see what you mean and it would be a nice feature. `refdata` should work on a `data.table`, though, and you might be fine with `data.frame` (but be sure to monitor copies with `tracemem(DF)`).

There is also `idata.frame` (immutable `data.frame`) in package `plyr` you could try.
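A small sketch of the `data.table` points above, assuming the `data.table` package is installed. The table and its column names (`g`, `v`, `v2`, `v3`) are made up for illustration.

```r
library(data.table)

# Hypothetical example table.
DT <- data.table(g = rep(c("a", "b"), each = 3), v = 1:6)

# Grouped aggregation: data.table handles the per-group subsets internally.
sums <- DT[, .(total = sum(v)), by = g]
print(sums)

# := adds a column by reference, so DT is still the same object afterwards.
addr_before <- address(DT)
DT[, v2 := v * 2L]
addr_after <- address(DT)
print(identical(addr_before, addr_after))

# Plain assignment only gives the same table a second name;
# copy() is needed to get an independent table.
alias    <- DT
snapshot <- copy(DT)
DT[, v3 := 0L]

print("v3" %in% names(alias))     # the alias sees the new column
print("v3" %in% names(snapshot))  # the explicit copy does not
```

Because there is no copy-on-write here, a function that uses `:=` on its argument changes the caller's table unless it calls `copy()` first.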