I'm in need of a method to quickly return the number of differences between two large lists. The contents of each list item is either 1 or 0 (single integers), and the amount of items in each list will always be 307200.
This is a sample of my current code:
list1 = <list1> # should be a list of integers containing 1's or 0's
list2 = <list2> # same rule as above, in a slightly different order
diffCount = 0
for index, item in enumerate(list1):
if item != list2[index]:
diffCount += 1
percent = float(diffCount) / float(307200)
The above works but it is way too slow for my purposes. What I would like to know is if there is a quicker way to obtain the number of differences between lists, or the percentage of items that differ?
I have looked at a few similar threads on this site but they all seem to work slightly different from what I want, and the set() examples don't seem to work for my purposes. :P
You can get at least another 10X speedup if you use NumPy arrays instead of lists.
import random
import time
import numpy as np
list1 = [random.choice((0,1)) for x in xrange(307200)]
list2 = [random.choice((0,1)) for x in xrange(307200)]
a1 = np.array(list1)
a2 = np.array(list2)
def foo1():
start = time.clock()
counter = 0
for i in xrange(307200):
if list1[i] != list2[i]:
counter += 1
print "%d, %f" % (counter, time.clock()-start)
def foo2():
start = time.clock()
ct = np.sum(a1!=a2)
print "%d, %f" % (ct, time.clock()-start)
foo1() #153490, 0.096215
foo2() #153490, 0.010224
If possible, use Paul/JayP's answer of using numpy (with xor), if you can only use python's stdlib, itertools' izip in a list comprehension seems the fastest:
import random
import time
import numpy
import itertools
list1 = [random.choice((0,1)) for x in xrange(307200)]
list2 = [random.choice((0,1)) for x in xrange(307200)]
a1 = numpy.array(list1)
a2 = numpy.array(list2)
def given():
diffCount = 0
for index, item in enumerate(list1):
if item != list2[index]:
diffCount += 1
return diffCount
def xrange_iter():
counter = 0
for i in xrange(len(list1)):
if list1[i] != list2[i]:
counter += 1
return counter
def np_not_eq():
return numpy.sum(a1!=a2)
def np_xor():
return numpy.sum(a1^a2)
def np_not_eq_plus_array():
arr1 = numpy.array(list1)
arr2 = numpy.array(list2)
return numpy.sum(arr1!=arr2)
def np_xor_plus_array():
arr1 = numpy.array(list1)
arr2 = numpy.array(list2)
return numpy.sum(arr1^arr2)
def enumerate_comprehension():
return len([0 for i,x in enumerate(list1) if x != list2[i]])
def izip_comprehension():
return len([0 for a,b in itertools.izip(list1, list2) if a != b])
def zip_comprehension():
return len([0 for a,b in zip(list1, list2) if a != b])
def functional():
return sum(map(lambda (a,b): a^b, zip(list1,list2)))
def bench(func):
diff = []
for i in xrange(100):
start = time.clock()
result = func()
stop = time.clock()
diff.append(stop - start)
print "%25s -- %d, %f" % (func.__name__, result, sum(diff)/float(len(diff)))
bench(given)
bench(xrange_iter)
bench(np_not_eq)
bench(np_xor)
bench(np_not_eq_plus_array)
bench(np_xor_plus_array)
bench(enumerate_comprehension)
bench(zip_comprehension)
bench(izip_comprehension)
bench(functional)
I got this (on Python 2.7.1, Snow Leopard):
given -- 153618, 0.046746
xrange_iter -- 153618, 0.049081
np_not_eq -- 153618, 0.003069
np_xor -- 153618, 0.001869
np_not_eq_plus_array -- 153618, 0.081671
np_xor_plus_array -- 153618, 0.080536
enumerate_comprehension -- 153618, 0.037587
zip_comprehension -- 153618, 0.083983
izip_comprehension -- 153618, 0.034506
functional -- 153618, 0.117359
I don't actually know if this is faster, but you might experiment with some of the "functional" methods python offers. It's usually better for loops to be run by internal, hand-coded subroutines.
Something like this:
diffCount = sum(map(lambda (a,b): a^b, zip(list1,list2)))
>>> import random
>>> import time
>>> list1 = [random.choice((0,1)) for x in xrange(307200)]
>>> list2 = [random.choice((0,1)) for x in xrange(307200)]
>>> def foo():
... start = time.clock()
... counter = 0
... for i in xrange(307200):
... if list1[i] != list2[i]:
... counter += 1
... print "%d, %f" % (counter, time.clock()-start)
...
>>> foo()
153901, 0.068593
Is 7 hundredths of a second too slow for your application?
I would also try the following stdlib-only method:
import itertools as it, operator as op
def list_difference_count(list1, list2):
return sum(it.starmap(op.ne, it.izip(list1, list2)))
>>> list_difference_count([1, 2, 3, 4], [1, 2, 1, 2])
2
This works correctly for len(list1) == len(list2)
. If you are sure that the list items are always integers, you can substitute op.xor
for op.ne
(could improve performance).
The percentage of difference is: float(list_difference_count(l1, l2))/len(l1)
.