What's the fastest way to compare two large li

2020-07-30 03:28发布

问题:

I'm in need of a method to quickly return the number of differences between two large lists. The contents of each list item is either 1 or 0 (single integers), and the amount of items in each list will always be 307200.

This is a sample of my current code:

list1 = <list1> # should be a list of integers containing 1's or 0's
list2 = <list2> # same rule as above, in a slightly different order

diffCount = 0
for index, item in enumerate(list1):
    if item != list2[index]:
        diffCount += 1

percent = float(diffCount) / float(307200)

The above works but it is way too slow for my purposes. What I would like to know is if there is a quicker way to obtain the number of differences between lists, or the percentage of items that differ?

I have looked at a few similar threads on this site but they all seem to work slightly different from what I want, and the set() examples don't seem to work for my purposes. :P

回答1:

You can get at least another 10X speedup if you use NumPy arrays instead of lists.

import random
import time
import numpy as np
list1 = [random.choice((0,1)) for x in xrange(307200)]
list2 = [random.choice((0,1)) for x in xrange(307200)]
a1 = np.array(list1)
a2 = np.array(list2)

def foo1():
    start = time.clock()
    counter = 0
    for i in xrange(307200):
        if list1[i] != list2[i]:
            counter += 1
    print "%d, %f" % (counter, time.clock()-start)

def foo2():
    start = time.clock()
    ct = np.sum(a1!=a2)
    print "%d, %f" % (ct, time.clock()-start)

foo1() #153490, 0.096215
foo2() #153490, 0.010224


回答2:

If possible, use Paul/JayP's answer of using numpy (with xor), if you can only use python's stdlib, itertools' izip in a list comprehension seems the fastest:

import random
import time
import numpy
import itertools
list1 = [random.choice((0,1)) for x in xrange(307200)]
list2 = [random.choice((0,1)) for x in xrange(307200)]
a1 = numpy.array(list1)
a2 = numpy.array(list2)

def given():
  diffCount = 0
  for index, item in enumerate(list1):
      if item != list2[index]:
          diffCount += 1
  return diffCount

def xrange_iter():
  counter = 0
  for i in xrange(len(list1)):
    if list1[i] != list2[i]:
      counter += 1
  return counter

def np_not_eq():
  return numpy.sum(a1!=a2)

def np_xor():
  return numpy.sum(a1^a2)

def np_not_eq_plus_array():
  arr1 = numpy.array(list1)
  arr2 = numpy.array(list2)
  return numpy.sum(arr1!=arr2)

def np_xor_plus_array():
  arr1 = numpy.array(list1)
  arr2 = numpy.array(list2)
  return numpy.sum(arr1^arr2)

def enumerate_comprehension():
  return len([0 for i,x in enumerate(list1) if x != list2[i]])

def izip_comprehension():
  return len([0 for a,b in itertools.izip(list1, list2) if a != b])

def zip_comprehension():
  return len([0 for a,b in zip(list1, list2) if a != b])

def functional():
  return sum(map(lambda (a,b): a^b, zip(list1,list2)))

def bench(func):
  diff = []
  for i in xrange(100):
    start = time.clock()
    result = func()
    stop = time.clock()
    diff.append(stop - start)
  print "%25s -- %d, %f" % (func.__name__, result, sum(diff)/float(len(diff)))

bench(given)
bench(xrange_iter)
bench(np_not_eq)
bench(np_xor)
bench(np_not_eq_plus_array)
bench(np_xor_plus_array)
bench(enumerate_comprehension)
bench(zip_comprehension)
bench(izip_comprehension)
bench(functional)

I got this (on Python 2.7.1, Snow Leopard):

                    given -- 153618, 0.046746
              xrange_iter -- 153618, 0.049081
                np_not_eq -- 153618, 0.003069
                   np_xor -- 153618, 0.001869
     np_not_eq_plus_array -- 153618, 0.081671
        np_xor_plus_array -- 153618, 0.080536
  enumerate_comprehension -- 153618, 0.037587
        zip_comprehension -- 153618, 0.083983
       izip_comprehension -- 153618, 0.034506
               functional -- 153618, 0.117359


回答3:

I don't actually know if this is faster, but you might experiment with some of the "functional" methods python offers. It's usually better for loops to be run by internal, hand-coded subroutines.

Something like this:

diffCount = sum(map(lambda (a,b): a^b, zip(list1,list2)))


回答4:

>>> import random
>>> import time
>>> list1 = [random.choice((0,1)) for x in xrange(307200)]
>>> list2 = [random.choice((0,1)) for x in xrange(307200)]
>>> def foo():
...   start = time.clock()
...   counter = 0
...   for i in xrange(307200):
...     if list1[i] != list2[i]:
...       counter += 1
...   print "%d, %f" % (counter, time.clock()-start)
...
>>> foo()
153901, 0.068593

Is 7 hundredths of a second too slow for your application?



回答5:

I would also try the following stdlib-only method:

import itertools as it, operator as op

def list_difference_count(list1, list2):
    return sum(it.starmap(op.ne, it.izip(list1, list2)))

>>> list_difference_count([1, 2, 3, 4], [1, 2, 1, 2])
2

This works correctly for len(list1) == len(list2). If you are sure that the list items are always integers, you can substitute op.xor for op.ne (could improve performance).

The percentage of difference is: float(list_difference_count(l1, l2))/len(l1).