How to efficiently (performance) remove many items

Posted 2020-05-17 03:59

I have a quite large List named items (>= 1,000,000 items) and a condition, denoted by <cond>, that selects the items to be deleted; <cond> is true for many (maybe half) of the items on my list.

My goal is to efficiently remove the items selected by <cond> and retain all other items. The source list may be modified or a new list may be created; the best way should be chosen with performance in mind.

Here is my testing code:

    System.out.println("preparing items");
    List<Integer> items = new ArrayList<Integer>(); // Integer is for demo
    for (int i = 0; i < 1000000; i++) {
        items.add(i * 3); // just for demo
    }

    System.out.println("deleting items");
    long startMillis = System.currentTimeMillis();
    items = removeMany(items);
    long endMillis = System.currentTimeMillis();

    System.out.println("after remove: items.size=" + items.size() + 
            " and it took " + (endMillis - startMillis) + " milli(s)");

and a naive implementation:

    public static <T> List<T> removeMany(List<T> items) {
        int i = 0;
        Iterator<T> iter = items.iterator();
        while (iter.hasNext()) {
            T item = iter.next();
            // <cond> goes here
            if (/*<cond>: */i % 2 == 0) {
                iter.remove();
            }
            i++;
        }
        return items;
    }

As you can see, I used item index modulo 2 == 0 as the remove condition (<cond>), just for demonstration purposes.

What better version of removeMany can be provided, and why is that version actually better?

12 answers
chillily
#2 · 2020-05-17 04:53

OK, it's time for test results of the proposed approaches. Here are the approaches I have tested (the name of each approach is also the class name in my sources):

  1. NaiveRemoveManyPerformer - ArrayList with iterator and remove - first and naive implementation given in my question.
  2. BetterNaiveRemoveManyPerformer - ArrayList with backward iteration and removal from end to front.
  3. LinkedRemoveManyPerformer - naive iterator and remove but working on LinkedList. Disadvantage: works only for LinkedList.
  4. CreateNewRemoveManyPerformer - ArrayList is made as a copy (only retained elements are added), iterator is used to traverse input ArrayList.
  5. SmartCreateNewRemoveManyPerformer - better CreateNewRemoveManyPerformer - initial size (capacity) of result ArrayList is set to final list size. Disadvantage: you must know final size of list when starting.
  6. FasterSmartCreateNewRemoveManyPerformer - even better (?) SmartCreateNewRemoveManyPerformer - use item index (items.get(idx)) instead of iterator.
  7. MagicRemoveManyPerformer - works in place (no list copy) for ArrayList and compacts holes (removed items) from beginning with items from end of the list. Disadvantage: this approach changes order of items in list.
  8. ForwardInPlaceRemoveManyPerformer - works in place for ArrayList - move retaining items to fill holes, finally subList is returned (no final removal or clear).
  9. GuavaArrayListRemoveManyPerformer - Google Guava Iterables.removeIf used for ArrayList - almost the same as ForwardInPlaceRemoveManyPerformer but does final removal of items at the end of list.

Full source code is given at the end of this answer.

Tests were performed with different list sizes (from 10,000 items to 10,000,000 items) and different remove factors (specifying how many items must be removed from the list).

As I wrote in comments on other answers, I expected that copying items from one ArrayList to a second ArrayList would be faster than iterating a LinkedList and just removing items. Sun's Java documentation says that the constant factor of ArrayList is low compared to that of the LinkedList implementation, but surprisingly that did not hold for my problem.

In practice, LinkedList with simple iteration and removal has the best performance in most cases (this approach is implemented in LinkedRemoveManyPerformer). Usually only MagicRemoveManyPerformer is comparable to LinkedRemoveManyPerformer; the other approaches are significantly slower. Guava's GuavaArrayListRemoveManyPerformer is slower than similar hand-made code (because my code does not remove the unnecessary items at the end of the list).

Example results for removing 500,000 items from 1,000,000 source items:

  1. NaiveRemoveManyPerformer: test not performed - I'm not that patient, but it performs worse than BetterNaiveRemoveManyPerformer.
  2. BetterNaiveRemoveManyPerformer: 226080 milli(s)
  3. LinkedRemoveManyPerformer: 69 milli(s)
  4. CreateNewRemoveManyPerformer: 246 milli(s)
  5. SmartCreateNewRemoveManyPerformer: 112 milli(s)
  6. FasterSmartCreateNewRemoveManyPerformer: 202 milli(s)
  7. MagicRemoveManyPerformer: 74 milli(s)
  8. ForwardInPlaceRemoveManyPerformer: 69 milli(s)
  9. GuavaArrayListRemoveManyPerformer: 118 milli(s)

Example results for removing 1 item from 1,000,000 source items (first item is removed):

  1. BetterNaiveRemoveManyPerformer: 34 milli(s)
  2. LinkedRemoveManyPerformer: 41 milli(s)
  3. CreateNewRemoveManyPerformer: 253 milli(s)
  4. SmartCreateNewRemoveManyPerformer: 108 milli(s)
  5. FasterSmartCreateNewRemoveManyPerformer: 71 milli(s)
  6. MagicRemoveManyPerformer: 43 milli(s)
  7. ForwardInPlaceRemoveManyPerformer: 73 milli(s)
  8. GuavaArrayListRemoveManyPerformer: 78 milli(s)

Example results for removing 333,334 items from 1,000,000 source items:

  1. BetterNaiveRemoveManyPerformer: 253206 milli(s)
  2. LinkedRemoveManyPerformer: 69 milli(s)
  3. CreateNewRemoveManyPerformer: 245 milli(s)
  4. SmartCreateNewRemoveManyPerformer: 111 milli(s)
  5. FasterSmartCreateNewRemoveManyPerformer: 203 milli(s)
  6. MagicRemoveManyPerformer: 69 milli(s)
  7. ForwardInPlaceRemoveManyPerformer: 72 milli(s)
  8. GuavaArrayListRemoveManyPerformer: 102 milli(s)

Example results for removing 1,000,000 (all) items from 1,000,000 source items (all items are removed but with one-by-one processing, if you know a priori that all items are to be removed, list should be simply cleared):

  1. BetterNaiveRemoveManyPerformer: 58 milli(s)
  2. LinkedRemoveManyPerformer: 88 milli(s)
  3. CreateNewRemoveManyPerformer: 95 milli(s)
  4. SmartCreateNewRemoveManyPerformer: 91 milli(s)
  5. FasterSmartCreateNewRemoveManyPerformer: 48 milli(s)
  6. MagicRemoveManyPerformer: 61 milli(s)
  7. ForwardInPlaceRemoveManyPerformer: 49 milli(s)
  8. GuavaArrayListRemoveManyPerformer: 133 milli(s)

My final conclusions: use a hybrid approach. For a LinkedList, simple iteration and removal is best. For an ArrayList, it depends on whether item order matters: if it does, use ForwardInPlaceRemoveManyPerformer; if the order may change, the best choice is MagicRemoveManyPerformer. If the remove factor is known a priori (you know how many items will be removed vs. retained), further conditionals can select an approach that performs even better in that particular situation, but a known remove factor is not the usual case. Google Guava's Iterables.removeIf is such a hybrid solution, with slightly different assumptions (the original list must be modified, a new one cannot be created, and item order always matters). These are the most common assumptions, so removeIf is the best choice in most real-life cases.
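As a rough illustration of the hybrid idea, here is a minimal sketch that picks the strategy from the list type. It is not part of the benchmark above: the class and method names are made up for this example, it assumes Java 8 or later for java.util.function.Predicate, and it uses the RandomAccess marker interface as a stand-in for "ArrayList-like". Item order is preserved in both branches.

```java
import java.util.Iterator;
import java.util.List;
import java.util.RandomAccess;
import java.util.function.Predicate;

// Hypothetical helper, not taken from the benchmark sources.
class HybridRemoval {

    // Removes every element matching shouldRemove, keeping the order of the rest.
    static <T> void removeMatching(List<T> items, Predicate<? super T> shouldRemove) {
        if (items instanceof RandomAccess) {
            // ArrayList-like: move retained elements toward the front...
            int dst = 0;
            for (int src = 0; src < items.size(); src++) {
                T item = items.get(src);
                if (!shouldRemove.test(item)) {
                    items.set(dst++, item);
                }
            }
            // ...then drop the stale tail in a single range removal.
            items.subList(dst, items.size()).clear();
        } else {
            // LinkedList-like: unlinking through the iterator costs
            // constant time per removed element.
            Iterator<T> it = items.iterator();
            while (it.hasNext()) {
                if (shouldRemove.test(it.next())) {
                    it.remove();
                }
            }
        }
    }
}
```

Calling HybridRemoval.removeMatching(items, n -> n % 2 == 0) would then work the same way for an ArrayList<Integer> and a LinkedList<Integer>.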

Notice also that all good approaches (naive is not good!) are good enough: any one of them should do just fine in a real application, but the naive approach must be avoided.
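On Java 8 or later there is also a built-in option, Collection.removeIf, which postdates the measurements above and was not benchmarked here. As far as I know, ArrayList overrides it to compact the backing array in one pass rather than shifting on every removal, so it should behave like the in-place approaches, but measure it on your own data. Note that the predicate only sees the element value, not its index. A minimal sketch (class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only; the names are not from the benchmark sources.
class RemoveIfExample {

    // Removes all multiples of three in place and returns the same list.
    static List<Integer> withoutMultiplesOfThree(List<Integer> items) {
        items.removeIf(n -> n % 3 == 0);
        return items;
    }

    public static void main(String[] args) {
        List<Integer> items = new ArrayList<>();
        for (int i = 0; i < 1_000_000; i++) {
            items.add(i);
        }
        long startMillis = System.currentTimeMillis();
        withoutMultiplesOfThree(items);
        long endMillis = System.currentTimeMillis();
        System.out.println("remaining: " + items.size()
                + ", took " + (endMillis - startMillis) + " milli(s)");
    }
}
```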

Finally, my source code for testing.

package WildWezyrListRemovalTesting;

import com.google.common.base.Predicate;
import com.google.common.collect.Iterables;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

public class RemoveManyFromList {

    public static abstract class BaseRemoveManyPerformer {

        protected String performerName() {
            return getClass().getSimpleName();
        }

        protected void info(String msg) {
            System.out.println(performerName() + ": " + msg);
        }

        protected void populateList(List<Integer> items, int itemCnt) {
            for (int i = 0; i < itemCnt; i++) {
                items.add(i);
            }
        }

        protected boolean mustRemoveItem(Integer itemVal, int itemIdx, int removeFactor) {
            if (removeFactor == 0) {
                return false;
            }
            return itemIdx % removeFactor == 0;
        }

        protected abstract List<Integer> removeItems(List<Integer> items, int removeFactor);

        protected abstract List<Integer> createInitialList();

        public void testMe(int itemCnt, int removeFactor) {
            List<Integer> items = createInitialList();
            populateList(items, itemCnt);
            long startMillis = System.currentTimeMillis();
            items = removeItems(items, removeFactor);
            long endMillis = System.currentTimeMillis();
            int chksum = 0;
            for (Integer item : items) {
                chksum += item;
            }
            info("removing took " + (endMillis - startMillis)
                    + " milli(s), itemCnt=" + itemCnt
                    + ", removed items: " + (itemCnt - items.size())
                    + ", remaining items: " + items.size()
                    + ", checksum: " + chksum);
        }
    }
    private List<BaseRemoveManyPerformer> rmps =
            new ArrayList<BaseRemoveManyPerformer>();

    public void addPerformer(BaseRemoveManyPerformer rmp) {
        rmps.add(rmp);
    }
    private Runtime runtime = Runtime.getRuntime();

    private void runGc() {
        for (int i = 0; i < 5; i++) {
            runtime.gc();
        }
    }

    public void testAll(int itemCnt, int removeFactor) {
        runGc();
        for (BaseRemoveManyPerformer rmp : rmps) {
            rmp.testMe(itemCnt, removeFactor);
        }
        runGc();
        System.out.println("\n--------------------------\n");
    }

    public static class NaiveRemoveManyPerformer
            extends BaseRemoveManyPerformer {

        @Override
        public List<Integer> removeItems(List<Integer> items, int removeFactor) {
            if (items.size() > 300000 && items instanceof ArrayList) {
                info("this removeItems is too slow, returning without processing");
                return items;
            }
            int i = 0;
            Iterator<Integer> iter = items.iterator();
            while (iter.hasNext()) {
                Integer item = iter.next();
                if (mustRemoveItem(item, i, removeFactor)) {
                    iter.remove();
                }
                i++;
            }
            return items;
        }

        @Override
        public List<Integer> createInitialList() {
            return new ArrayList<Integer>();
        }
    }

    public static class BetterNaiveRemoveManyPerformer
            extends NaiveRemoveManyPerformer {

        @Override
        public List<Integer> removeItems(List<Integer> items, int removeFactor) {
//            if (items.size() > 300000 && items instanceof ArrayList) {
//                info("this removeItems is too slow, returning without processing");
//                return items;
//            }

            for (int i = items.size(); --i >= 0;) {
                Integer item = items.get(i);
                if (mustRemoveItem(item, i, removeFactor)) {
                    items.remove(i);
                }
            }
            return items;
        }
    }

    public static class LinkedRemoveManyPerformer
            extends NaiveRemoveManyPerformer {

        @Override
        public List<Integer> createInitialList() {
            return new LinkedList<Integer>();
        }
    }

    public static class CreateNewRemoveManyPerformer
            extends NaiveRemoveManyPerformer {

        @Override
        public List<Integer> removeItems(List<Integer> items, int removeFactor) {
            List<Integer> res = createResultList(items, removeFactor);
            int i = 0;

            for (Integer item : items) {
                if (mustRemoveItem(item, i, removeFactor)) {
                    // no-op
                } else {
                    res.add(item);
                }
                i++;
            }

            return res;
        }

        protected List<Integer> createResultList(List<Integer> items, int removeFactor) {
            return new ArrayList<Integer>();
        }
    }

    public static class SmartCreateNewRemoveManyPerformer
            extends CreateNewRemoveManyPerformer {

        @Override
        protected List<Integer> createResultList(List<Integer> items, int removeFactor) {
            int newCapacity = removeFactor == 0 ? items.size()
                    : (int) (items.size() * (removeFactor - 1L) / removeFactor + 1);
            //System.out.println("newCapacity=" + newCapacity);
            return new ArrayList<Integer>(newCapacity);
        }
    }

    public static class FasterSmartCreateNewRemoveManyPerformer
            extends SmartCreateNewRemoveManyPerformer {

        @Override
        public List<Integer> removeItems(List<Integer> items, int removeFactor) {
            List<Integer> res = createResultList(items, removeFactor);

            for (int i = 0; i < items.size(); i++) {
                Integer item = items.get(i);
                if (mustRemoveItem(item, i, removeFactor)) {
                    // no-op
                } else {
                    res.add(item);
                }
            }

            return res;
        }
    }

    public static class ForwardInPlaceRemoveManyPerformer
            extends NaiveRemoveManyPerformer {

        @Override
        public List<Integer> removeItems(List<Integer> items, int removeFactor) {
            int j = 0; // destination idx
            for (int i = 0; i < items.size(); i++) {
                Integer item = items.get(i);
                if (mustRemoveItem(item, i, removeFactor)) {
                    // no-op
                } else {
                    if (j < i) {
                        items.set(j, item);
                    }
                    j++;
                }
            }

            return items.subList(0, j);
        }
    }

    public static class MagicRemoveManyPerformer
            extends NaiveRemoveManyPerformer {

        @Override
        public List<Integer> removeItems(List<Integer> items, int removeFactor) {
            for (int i = 0; i < items.size(); i++) {
                if (mustRemoveItem(items.get(i), i, removeFactor)) {
                    Integer retainedItem = removeSomeFromEnd(items, removeFactor, i);
                    if (retainedItem == null) {
                        items.remove(i);
                        break;
                    }
                    items.set(i, retainedItem);
                }
            }

            return items;
        }

        private Integer removeSomeFromEnd(List<Integer> items, int removeFactor, int lowerBound) {
            for (int i = items.size(); --i > lowerBound;) {
                Integer item = items.get(i);
                items.remove(i);
                if (!mustRemoveItem(item, i, removeFactor)) {
                    return item;
                }
            }
            return null;
        }
    }

    public static class GuavaArrayListRemoveManyPerformer
            extends BaseRemoveManyPerformer {

        @Override
        protected List<Integer> removeItems(List<Integer> items, final int removeFactor) {
            Iterables.removeIf(items, new Predicate<Integer>() {

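                @Override
                public boolean apply(Integer input) {
                    // The predicate sees only the value; populateList stores
                    // value == index, so the value doubles as the index here.
                    return mustRemoveItem(input, input, removeFactor);
                }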
            });

            return items;
        }

        @Override
        protected List<Integer> createInitialList() {
            return new ArrayList<Integer>();
        }
    }

    public void testForOneItemCnt(int itemCnt) {
        testAll(itemCnt, 0);
        testAll(itemCnt, itemCnt);
        testAll(itemCnt, itemCnt - 1);
        testAll(itemCnt, 3);
        testAll(itemCnt, 2);
        testAll(itemCnt, 1);
    }

    public static void main(String[] args) {
        RemoveManyFromList t = new RemoveManyFromList();
        t.addPerformer(new NaiveRemoveManyPerformer());
        t.addPerformer(new BetterNaiveRemoveManyPerformer());
        t.addPerformer(new LinkedRemoveManyPerformer());
        t.addPerformer(new CreateNewRemoveManyPerformer());
        t.addPerformer(new SmartCreateNewRemoveManyPerformer());
        t.addPerformer(new FasterSmartCreateNewRemoveManyPerformer());
        t.addPerformer(new MagicRemoveManyPerformer());
        t.addPerformer(new ForwardInPlaceRemoveManyPerformer());
        t.addPerformer(new GuavaArrayListRemoveManyPerformer());

        t.testForOneItemCnt(1000);
        t.testForOneItemCnt(10000);
        t.testForOneItemCnt(100000);
        t.testForOneItemCnt(200000);
        t.testForOneItemCnt(300000);
        t.testForOneItemCnt(500000);
        t.testForOneItemCnt(1000000);
        t.testForOneItemCnt(10000000);
    }
}
唯我独甜
#3 · 2020-05-17 04:53

One thing you could try is to use a LinkedList instead of an ArrayList, as with an ArrayList all other elements need to be copied if elements are removed from within the list.
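To put a rough number on that copying, the sketch below counts how many elements an ArrayList would have to shift when every even-indexed element is removed one call at a time, as in the question. It only does the arithmetic and does not touch a real list; the class and method names are made up for this illustration.

```java
// Illustrative arithmetic only; no list is actually modified.
class ShiftCount {

    // Shifts needed when removing even-indexed elements front to back.
    static long forwardShifts(int n) {
        long shifts = 0;
        int size = n; // current list size
        int pos = 0;  // current position of the next element to remove
        for (int original = 0; original < n; original += 2) {
            shifts += size - pos - 1; // everything after pos moves left by one
            size--;
            pos++; // step over the retained neighbour that slid into pos
        }
        return shifts;
    }

    // Shifts needed when removing even-indexed elements back to front.
    static long backwardShifts(int n) {
        long shifts = 0;
        int retainedAfter = 0; // retained elements to the right of index i
        for (int i = n - 1; i >= 0; i--) {
            if (i % 2 == 0) {
                shifts += retainedAfter; // only retained elements still sit there
            } else {
                retainedAfter++;
            }
        }
        return shifts;
    }

    public static void main(String[] args) {
        System.out.println("forward:  " + forwardShifts(1_000_000));
        System.out.println("backward: " + backwardShifts(1_000_000));
    }
}
```

For n = 1,000,000 the forward count works out to n²/4 = 250,000,000,000 shifted elements, and removing from the back only halves that to about n²/8, so both stay quadratic.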

Luminary・发光体
#4 · 2020-05-17 04:54

Removing a lot of elements from an ArrayList is an O(n^2) operation. I would recommend simply using a LinkedList that's more optimized for insertion and removal (but not for random access). LinkedList has a bit of a memory overhead.

If you do need to keep ArrayList, then you are better off creating a new list.

Update: Comparing with creating a new list:

Reusing the same list, the main cost comes from deleting the node and updating the appropriate pointers in LinkedList. This is a constant-time operation for any node.

When constructing a new list, the main cost comes from creating the list and initializing array entries. Both are cheap operations. You might also incur the cost of resizing the new list's backing array, assuming that the final array is larger than half of the incoming array.

So if you were to remove only one element, then LinkedList approach is probably faster. If you were to delete all nodes except for one, probably the new list approach is faster.
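For comparison, the new-list approach can be sketched as below. This is only an illustration under a few assumptions: Java 8 or later for Predicate, hypothetical class, method, and parameter names, and a caller who can supply a rough guess of how many elements survive (expectedRetained) so the backing array does not have to grow repeatedly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical helper illustrating the copy-into-a-new-list approach.
class CopyRetained {

    // Returns a new list with the elements that do NOT match shouldRemove.
    // expectedRetained is only a capacity hint: a wrong guess costs extra
    // array growth or unused space, never correctness.
    static <T> List<T> copyRetained(List<T> items,
                                    Predicate<? super T> shouldRemove,
                                    int expectedRetained) {
        List<T> result = new ArrayList<>(Math.max(expectedRetained, 0));
        for (T item : items) {
            if (!shouldRemove.test(item)) {
                result.add(item);
            }
        }
        return result; // the source list is left untouched
    }
}
```

A call such as CopyRetained.copyRetained(items, n -> n % 2 == 0, items.size() / 2) leaves the source list as it was, which costs memory for both lists until the old one is garbage collected.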

There are more complications when you bring memory management and GC. I'd like to leave these out.

The best option is to implement the alternatives yourself and benchmark the results when running your typical load.

爱情/是我丢掉的垃圾
#5 · 2020-05-17 04:57

Rather than muddying my first answer, which is already rather long, here's a second, related option: you can create your own ArrayList, and flag things as "removed". This algorithm makes the following assumptions:

  • it's better to waste time (lower speed) during construction than to do the same during the removal operation. In other words, it moves the speed penalty from one location to another.
  • it's better to waste memory now, and time garbage collecting after the result is computed, rather than spend the time up front (you're always stuck with time garbage collecting...).
  • once removal begins, elements will never be added to the list (otherwise there are issues with re-allocating the flags object)

Also, this is, again, not tested, so there are probably syntax errors.

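import java.util.ArrayList;
import java.util.List;

public class FlaggedList<E> extends ArrayList<E> {
  private static final Boolean IN = Boolean.TRUE;   // not removed
  private static final Boolean OUT = Boolean.FALSE; // removed
  private final List<Boolean> flags;
  private int removed = 0;

  public FlaggedList() { this(1000000); }
  public FlaggedList(int estimate) {
    super(estimate);
    flags = new ArrayList<Boolean>(estimate);
  }

  // Every element appended through add(E) starts out flagged as present.
  @Override
  public boolean add(E e) {
    flags.add(IN);
    return super.add(e);
  }

  // Flags the element instead of shifting the backing array. Renamed from
  // remove(int), which would clash with ArrayList.remove(int) returning E.
  public void markRemoved(int idx) {
    if (flags.set(idx, OUT)) { // true only if the element was still present
      removed++;
    }
  }

  public boolean isRemoved(int idx) { return !flags.get(idx); }
}

and the iterator (more work may be needed to keep it synchronized with the list, and only forward iteration is implemented this time):

import java.util.Iterator;
import java.util.NoSuchElementException;

// Walks only the elements that are not flagged as removed.
public class FlaggedListIterator<E> implements Iterator<E> {
  private final FlaggedList<E> list;
  private int idx = 0;

  public FlaggedListIterator(FlaggedList<E> list) {
    this.list = list;
  }

  @Override
  public boolean hasNext() {
    // Skip flagged elements without stepping past a live one.
    while (idx < list.size() && list.isRemoved(idx)) {
      idx++;
    }
    return idx < list.size();
  }

  @Override
  public E next() {
    if (!hasNext()) throw new NoSuchElementException();
    return list.get(idx++);
  }
}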
神经病院院长
#6 · 2020-05-17 05:01

I'm sorry, but all these answers are missing the point, I think: You probably don't have to, and probably shouldn't, use a List.

If this kind of "query" is common, why not build an ordered data structure that eliminates the need to traverse all the data nodes? You don't tell us enough about the problem, but given the example you provide, a simple tree could do the trick. There's an insertion overhead per item, but you can very quickly find the subtree containing nodes that match <cond>, and you therefore avoid most of the comparisons you're doing now.

Furthermore:

  • Depending on the exact problem, and the exact data structure you set up, you can speed up deletion -- if the nodes you want to kill do reduce to a subtree or something of the sort, you just drop that subtree, rather than updating a whole slew of list nodes.

  • Each time you remove a list item, you are updating pointers (e.g. lastNode.next and nextNode.prev or something), but if it turns out you also want to remove the nextNode, then the pointer update you just caused is thrown away by a new update.
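As a minimal sketch of that idea, suppose (and this is purely an assumption, since the question does not say what <cond> is) that the elements are unique and <cond> is a range test on the sort key, such as "value below some threshold". A java.util.TreeSet then lets you clear a range view instead of testing every element. The class and method names are illustrative.

```java
import java.util.NavigableSet;
import java.util.TreeSet;

// Illustrative only: assumes unique elements and a range-shaped condition.
class RangeRemoval {

    // Removes every value strictly below threshold by clearing a range view
    // of the sorted set; the view is backed by the set itself.
    static void dropBelow(NavigableSet<Integer> items, int threshold) {
        items.headSet(threshold, false).clear();
    }

    public static void main(String[] args) {
        NavigableSet<Integer> items = new TreeSet<>();
        for (int i = 0; i < 10; i++) {
            items.add(i * 3);
        }
        dropBelow(items, 15);
        System.out.println(items); // [15, 18, 21, 24, 27]
    }
}
```

Finding the boundary is logarithmic, but clearing the view may still have to unlink each removed element, so this mainly saves the per-element comparisons, not necessarily all of the removal cost.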

我命由我不由天
#7 · 2020-05-17 05:03

Use Apache Commons Collections. Specifically this function. This is implemented in essentially the same way that people are suggesting that you implement it (i.e. create a new list and then add to it).
