How to efficiently (performance) remove many items

2020-05-17 03:59发布

I have quite large List named items (>= 1,000,000 items) and some condition denoted by <cond> that selects items to be deleted and <cond> is true for many (maybe half) of items on my list.

My goal is to efficiently remove items selected by <cond> and retain all other items, source list may be modified, new list may be created - best way to do it should be chosen considering performance.

Here is my testing code:

    System.out.println("preparing items");
    List<Integer> items = new ArrayList<Integer>(); // Integer is for demo
    for (int i = 0; i < 1000000; i++) {
        items.add(i * 3); // just for demo
    }

    System.out.println("deleting items");
    long startMillis = System.currentTimeMillis();
    items = removeMany(items);
    long endMillis = System.currentTimeMillis();

    System.out.println("after remove: items.size=" + items.size() + 
            " and it took " + (endMillis - startMillis) + " milli(s)");

and naive implementation:

public static <T> List<T> removeMany(List<T> items) {
    int i = 0;
    Iterator<T> iter = items.iterator();
    while (iter.hasNext()) {
        T item = iter.next();
        // <cond> goes here
        if (/*<cond>: */i % 2 == 0) {
            iter.remove();
        }
        i++;
    }
    return items;
}

As you can see I used item index modulo 2 == 0 as remove condition (<cond>) - just for demonstation purposes.

What better version of removeMany may be provided and why this better version is actually better?

12条回答
Lonely孤独者°
2楼-- · 2020-05-17 04:37

Try implementing recursion into your algorithm.

查看更多
做自己的国王
3楼-- · 2020-05-17 04:42

I would make a new List to add the items to, since removing an item from the middle of a List is quite expensive.

public static List<T> removeMany(List<T> items) {
    List<T> tempList = new ArrayList<T>(items.size()/2); //if about half the elements are going to be removed
    Iterator<T> iter = items.iterator();
    while (item : items) {
        // <cond> goes here
        if (/*<cond>: */i % 2 != 0) {
            tempList.add(item);
        }
    }
    return tempList;
}

EDIT: I haven't tested this, so there may very well be small syntax errors.

Second EDIT: Using a LinkedList is better when you don't need random access but fast add times.

BUT...

The constant factor for ArrayList is smaller than that for LinkedList (Ref). Since you can make a reasonable guess of how many elements will be removed (you said "about half" in your question), adding an element to the end of an ArrayList is O(1) as long as you don't have to re-allocate it. So, if you can make a reasonable guess, I would expect the ArrayList to be marginally faster than the LinkedList in most cases. (This applies to the code I have posted. In your naive implementatation, I think LinkedList will be faster).

查看更多
We Are One
4楼-- · 2020-05-17 04:42

Since speed is the most important metric, there's the possibility of using more memory and doing less recreation of lists (as mentioned in my comment). Actual performance impact would be fully dependent on how the functionality is used, though.

The algorithm assumes that at least one of the following is true:

  • all elements of the original list do not need to be tested. This could happen if we're really looking for the first N elements that match our condition, rather than all elements that match our condition.
  • it's more expensive to copy the list into new memory. This could happen if the original list uses more than 50% of allocated memory, so working in-place could be better or if memory operations turn out to be slower (that would be an unexpected result).
  • the speed penalty of removing elements from the list is too large to accept all at once, but spreading that penalty across multiple operations is acceptable, even if the overall penalty is larger than taking it all at once. This is like taking out a $200K mortgage: paying $1000 per month for 30 years is affordable on a monthly basis and has the benefits of owning a home and equity, even though the overall payment is 360K over the life of the loan.

Disclaimer: There's prolly syntax errors - I didn't try compiling anything.

First, subclass the ArrayList

public class ConditionalArrayList extends ArrayList {

  public Iterator iterator(Condition condition)
  { 
    return listIterator(condition);
  }

  public ListIterator listIterator(Condition condition)
  {
    return new ConditionalArrayListIterator(this.iterator(),condition); 
  }

  public ListIterator listIterator(){ return iterator(); }
  public iterator(){ 
    throw new InvalidArgumentException("You must specify a condition for the iterator"); 
  }
}

Then we need the helper classes:

public class ConditionalArrayListIterator implements ListIterator
{
  private ListIterator listIterator;
  Condition condition;

  // the two following flags are used as a quick optimization so that 
  // we don't repeat tests on known-good elements unnecessarially.
  boolean nextKnownGood = false;
  boolean prevKnownGood = false;

  public ConditionalArrayListIterator(ListIterator listIterator, Condition condition)
  {
    this.listIterator = listIterator;
    this.condition = condition;
  }

  public void add(Object o){ listIterator.add(o); }

  /**
   * Note that this it is extremely inefficient to 
   * call hasNext() and hasPrev() alternatively when
   * there's a bunch of non-matching elements between
   * two matching elements.
   */
  public boolean hasNext()
  { 
     if( nextKnownGood ) return true;

     /* find the next object in the list that 
      * matches our condition, if any.
      */
     while( ! listIterator.hasNext() )
     {
       Object next = listIterator.next();
       if( condition.matches(next) ) {
         listIterator.set(next);
         nextKnownGood = true;
         return true;
       }
     }

     nextKnownGood = false;
     // no matching element was found.
     return false;
  }

  /**
   *  See hasPrevious for efficiency notes.
   *  Copy & paste of hasNext().
   */
  public boolean hasPrevious()
  { 
     if( prevKnownGood ) return true;

     /* find the next object in the list that 
      * matches our condition, if any.
      */
     while( ! listIterator.hasPrevious() )
     {
       Object prev = listIterator.next();
       if( condition.matches(prev) ) {
         prevKnownGood = true;
         listIterator.set(prev);
         return true;
       }
     }

     // no matching element was found.
     prevKnwonGood = false;
     return false;
  }

  /** see hasNext() for efficiency note **/
  public Object next()
  {
     if( nextKnownGood || hasNext() ) 
     { 
       prevKnownGood = nextKnownGood;
       nextKnownGood = false;
       return listIterator.next();
     }

     throw NoSuchElementException("No more matching elements");
  }

  /** see hasNext() for efficiency note; copy & paste of next() **/
  public Object previous()
  {
     if( prevKnownGood || hasPrevious() ) 
     { 
       nextKnownGood = prevKnownGood;
       prevKnownGood = false;
       return listIterator.previous();                        
     }
     throw NoSuchElementException("No more matching elements");
  }

  /** 
   * Note that nextIndex() and previousIndex() return the array index
   * of the value, not the number of results that this class has returned.
   * if this isn't good for you, just maintain your own current index and
   * increment or decriment in next() and previous()
   */
  public int nextIndex(){ return listIterator.previousIndex(); }
  public int previousIndex(){ return listIterator.previousIndex(); }

  public remove(){ listIterator.remove(); }
  public set(Object o) { listIterator.set(o); }
}

and, of course, we need the condition interface:

/** much like a comparator... **/
public interface Condition
{
  public boolean matches(Object obj);
}

And a condition with which to test

public class IsEvenCondition {
{
  public boolean matches(Object obj){ return (Number(obj)).intValue() % 2 == 0;
}

and we're finally ready for some test code


    Condition condition = new IsEvenCondition();

    System.out.println("preparing items");
    startMillis = System.currentTimeMillis();
    List<Integer> items = new ArrayList<Integer>(); // Integer is for demo
    for (int i = 0; i < 1000000; i++) {
        items.add(i * 3); // just for demo
    }
    endMillis = System.currentTimeMillis();
    System.out.println("It took " + (endmillis-startmillis) + " to prepare the list. ");

    System.out.println("deleting items");
    startMillis = System.currentTimeMillis();
    // we don't actually ever remove from this list, so 
    // removeMany is effectively "instantaneous"
    // items = removeMany(items);
    endMillis = System.currentTimeMillis();
    System.out.println("after remove: items.size=" + items.size() + 
            " and it took " + (endMillis - startMillis) + " milli(s)");
    System.out.println("--> NOTE: Nothing is actually removed.  This algorithm uses extra"
                       + " memory to avoid modifying or duplicating the original list.");

    System.out.println("About to iterate through the list");
    startMillis = System.currentTimeMillis();
    int count = iterate(items, condition);
    endMillis = System.currentTimeMillis();
    System.out.println("after iteration: items.size=" + items.size() + 
            " count=" + count + " and it took " + (endMillis - startMillis) + " milli(s)");
    System.out.println("--> NOTE: this should be somewhat inefficient."
                       + " mostly due to overhead of multiple classes."
                       + " This algorithm is designed (hoped) to be faster than "
                       + " an algorithm where all elements of the list are used.");

    System.out.println("About to iterate through the list");
    startMillis = System.currentTimeMillis();
    int total = addFirst(30, items, condition);
    endMillis = System.currentTimeMillis();
    System.out.println("after totalling first 30 elements: total=" + total + 
            " and it took " + (endMillis - startMillis) + " milli(s)");

...

private int iterate(List<Integer> items, Condition condition)
{
  // the i++ and return value are really to prevent JVM optimization
  // - just to be safe.
  Iterator iter = items.listIterator(condition);
  for( int i=0; iter.hasNext()); i++){ iter.next(); }
  return i;
}

private int addFirst(int n, List<Integer> items, Condition condition)
{
  int total = 0;
  Iterator iter = items.listIterator(condition);
  for(int i=0; i<n;i++)
  {
    total += ((Integer)iter.next()).intValue();
  }
}

查看更多
Animai°情兽
5楼-- · 2020-05-17 04:44

Maybe a List isn't the optimal data structure for you? Can you change it? Perhaps you can use a tree where the items are sorted into in a way that deleting one node removes all items that meet the condition? Or that at least speed up your operations?

In your simplistic example using two lists (one with the items where i % 2 != 0 is true and one with the items where i % 2 != 0 is false) could serve well. But this is of course very domain dependent.

查看更多
Animai°情兽
6楼-- · 2020-05-17 04:52

As others have said, your first inclination should be to just build up a second list.

But, if you'd like to also try out editing the list in-place, the efficient way to do that is to use Iterables.removeIf() from Guava. If its argument is a list, it coalesces the retained elements toward the front then simply chops off the end -- much faster than removing() interior elements one by one.

查看更多
ゆ 、 Hurt°
7楼-- · 2020-05-17 04:52

I would imagine that building a new list, rather than modifying the existing list, would be more performant - especially when the number of items is as large as you indicate. This assumes, your list is an ArrayList, not a LinkedList. For a non-circular LinkedList, insertion is O(n), but removal at an existing iterator position is O(1); in which case your naive algorithm should be sufficiently performant.

Unless the list is a LinkedList, the cost of shifting the list each time you call remove() is likely one of the most expensive parts of the implementation. For array lists, I would consider using:

public static <T> List<T> removeMany(List<T> items) {
    List<T> newList = new ArrayList<T>(items.size());
    Iterator<T> iter = items.iterator();
    while (iter.hasNext()) {
        T item = iter.next();
        // <cond> goes here
        if (/*<cond>: */i++ % 2 != 0) {
            newList.add(item);
        }
    }
    return newList;
}
查看更多
登录 后发表回答