I have quite large List named items (>= 1,000,000 items) and some condition denoted by <cond> that selects items to be deleted and <cond> is true for many (maybe half) of items on my list.
My goal is to efficiently remove items selected by <cond> and retain all other items, source list may be modified, new list may be created - best way to do it should be chosen considering performance.
Here is my testing code:
System.out.println("preparing items");
List<Integer> items = new ArrayList<Integer>(); // Integer is for demo
for (int i = 0; i < 1000000; i++) {
items.add(i * 3); // just for demo
}
System.out.println("deleting items");
long startMillis = System.currentTimeMillis();
items = removeMany(items);
long endMillis = System.currentTimeMillis();
System.out.println("after remove: items.size=" + items.size() +
" and it took " + (endMillis - startMillis) + " milli(s)");
and naive implementation:
public static <T> List<T> removeMany(List<T> items) {
int i = 0;
Iterator<T> iter = items.iterator();
while (iter.hasNext()) {
T item = iter.next();
// <cond> goes here
if (/*<cond>: */i % 2 == 0) {
iter.remove();
}
i++;
}
return items;
}
As you can see I used item index modulo 2 == 0 as remove condition (<cond>) - just for demonstation purposes.
What better version of removeMany
may be provided and why this better version is actually better?
Try implementing recursion into your algorithm.
I would make a new
List
to add the items to, since removing an item from the middle of a List is quite expensive.EDIT: I haven't tested this, so there may very well be small syntax errors.
Second EDIT: Using a
LinkedList
is better when you don't need random access but fast add times.BUT...
The constant factor for
ArrayList
is smaller than that forLinkedList
(Ref). Since you can make a reasonable guess of how many elements will be removed (you said "about half" in your question), adding an element to the end of anArrayList
is O(1) as long as you don't have to re-allocate it. So, if you can make a reasonable guess, I would expect theArrayList
to be marginally faster than theLinkedList
in most cases. (This applies to the code I have posted. In your naive implementatation, I thinkLinkedList
will be faster).Since speed is the most important metric, there's the possibility of using more memory and doing less recreation of lists (as mentioned in my comment). Actual performance impact would be fully dependent on how the functionality is used, though.
The algorithm assumes that at least one of the following is true:
Disclaimer: There's prolly syntax errors - I didn't try compiling anything.
First, subclass the ArrayList
Then we need the helper classes:
and, of course, we need the condition interface:
And a condition with which to test
and we're finally ready for some test code
Maybe a List isn't the optimal data structure for you? Can you change it? Perhaps you can use a tree where the items are sorted into in a way that deleting one node removes all items that meet the condition? Or that at least speed up your operations?
In your simplistic example using two lists (one with the items where i % 2 != 0 is true and one with the items where i % 2 != 0 is false) could serve well. But this is of course very domain dependent.
As others have said, your first inclination should be to just build up a second list.
But, if you'd like to also try out editing the list in-place, the efficient way to do that is to use
Iterables.removeIf()
from Guava. If its argument is a list, it coalesces the retained elements toward the front then simply chops off the end -- much faster than removing() interior elements one by one.I would imagine that building a new list, rather than modifying the existing list, would be more performant - especially when the number of items is as large as you indicate. This assumes, your list is an
ArrayList
, not aLinkedList
. For a non-circularLinkedList
, insertion is O(n), but removal at an existing iterator position is O(1); in which case your naive algorithm should be sufficiently performant.Unless the list is a
LinkedList
, the cost of shifting the list each time you callremove()
is likely one of the most expensive parts of the implementation. For array lists, I would consider using: