I'm implementing an algorithm that involves lots of adding and removing things from sets. In R, this is slow because as far as I know, adding or removing things from a vector is slow, since the entire vector has to be re-allocated. Is there a way do do it more efficiently?
Edit: My current solution is to use a boolean vector of the same length as the list of things that can be in the set, and using that as a membership table.
Chapter 2 of The R Inferno has some interesting comments on this, including perdiodic growing objects to reduce memory fragmentation and allocation overhead.
If you know what the ultimate size of the set is, then the method you suggest is probably the best - ie subset
from the whole universe using an approprate membership vector. Difficult to know whats best without seeing exactly what you are trying to do though.
If you can, initializing a vector so that it has length equal to its maximum length during the algorithm may help.
e.g.
vec <- rep(NA,10)
vec[1:3] <- 1:3
vec[4:5] <- 4:5
vec[6:10] <- 6:10
rather than
vec <- 1:3
vec <- c(vec,4:5)
vec <- c(vec,6:10)
compare
> system.time({vec <- rep(NA,10^4); for (i in 1:(10^4)) vec[i] <- i })
user system elapsed
0.043 0.001 0.044
to
> system.time({vec <- NULL; for (i in 1:(10^4)) vec <- c(vec,i) })
user system elapsed
0.249 0.089 0.335
It's hard to say what you want. Maybe you really want stack commands like push and pop. The following isn't that. But it's a fast solution.
Allocate a vector large enough to hold all of your items of the type you need. Set every value to NA. Adding items is simple. Removing items is setting them to NA again. Using the vector is just na.omit(myVec)
myVec <- numeric (maxLength) # a vector of maximum length
is.na(myVec) <- 1:maxLength # set every item in myVec to NA
myVec[c(2,6,20)] <- 5 # add some values
na.omit(myVec)
#This will also work if you can initialize all of your values to something that you know you won't need.
Yes, there are more efficient ways.
It comes down to how you're using the data; your use case. Are you taking out the data in in the same order you put it in, or in reversed order, or in random order, or in a sorted order?
For FIFO, for a fixed size array use a circular buffer, or for a fully dynamic size, use a deque (pronounced deck). (This is probably what you want.)
For FILO, use a stack.
For grabbing the data at random, consider using a 1 column matrix you never resize. Resizing is slow.
If you need an ordered set (eg c(3,2,5) -> c(2,3,5)
), look into a tree or a heap.
- A tree is good for comparisons, eg grabbing all elements in a set that is < 5.5.
- A heap is good for grabbing just the largest or smallest elements.