I have two algorithms of finding primes, in Python. The inner loop of each one seems to be executed the same number of times, and is equally simple. However, one of them takes 10 times as much as the other. My question is:
Why? Is this some quirk of Python that can be optimized away (how?), or am I missing something else?
The problem I am solving is essentially from http://www.spoj.pl/problems/PRIME1/. In my case, I have N = 10 ** 9, delta = 10 ** 5, and I want to find all primes between N-delta and delta. I also have smallprimes
, a pre-made list of all primes less than or equal to square root of N. The first algorithm is very simple:
def range_f1(lo, hi, smallprimes):
"""Finds all primes p with lo <= p <= hi.
smallprimes is the sorted list of all primes up to (at least) square root of hi.
hi & lo might be large, but hi-lo+1 miust fit into a long."""
primes =[]
for i in xrange(hi-lo+1):
n = lo + i
isprime = True
for p in smallprimes:
if n % p == 0:
isprime = False
break
if isprime:
primes.append(n)
return primes
Calling range_f1(N-delta,N,smallprimes)
takes a long time (about 10 seconds). The inner loop is called 195170 times. I also have a version of this algorithm that replaces the loop with a list comprehension (That is the one I actually use for profiling; see the end of the question) but that is no faster.
The second version is (an ugly implementation of) the sieve of Eratosthenes:
def range_sieve(lo, hi, smallprimes):
"""Parameters as before"""
# So ugly! But SO FAST!! How??
delta = hi-lo+1
iamprime = [True] * delta # iamprime[i] says whether lo + i is prime
if lo<= 1:
iamprime[1 - lo] = False
def sillyfun(): # For profiling & speed comparison
pass
for p in smallprimes:
rem = lo % p
if rem == 0:
iamprime[0] = False
for i in xrange(p - rem, delta, p):
iamprime[i] = False
sillyfun()
if p >= lo and p <= hi:
iamprime[p - lo] = True
return [p + lo for (p, ami) in enumerate(iamprime) if ami]
This is about 10 times as fast, takes less than 2 seconds. However, the inner loop (sillyfun()) is called 259982 times, more than in the last case. I am at a loss to explain why this is fast.
I thought that maybe the reason is because the inner loop of the first algorithm contains modular arithmetic, while the second only has an assignment. However, the following seems to imply that assignment is no faster than modular arithmetic:
>>> from timeit import timeit
>>> timeit("10**9 % 2341234")
0.23445186469234613
>>> timeit("a[5000]=False", "a = [True] * 10000")
0.47924750212666822
Here's the (less readable) version of the first algorithm I actually use:
def range_f2(lo, hi, smallprimes):
primes =[]
for i in xrange(hi-lo+1):
n = lo + i
try:
(1 for p in smallprimes if n % p ==0).next()
except StopIteration:
primes.append(n)
return primes
Here are the result of calling the profiler for range_f2() (note the number of time generating expression is evaluated):
>>> from cProfile import run as prof
>>> prof("range_f2(N-delta,N,sp)")
200005 function calls in 13.866 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 13.866 13.866 <string>:1(<module>)
195170 12.632 0.000 12.632 0.000 prime1.py:101(<genexpr>)
1 1.224 1.224 13.865 13.865 prime1.py:90(range_f2)
4832 0.009 0.000 0.009 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
Here is the profiler result for range_sieve():
>>> prof("range_sieve(N-delta,N,sp)")
259985 function calls in 1.370 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.003 0.003 1.370 1.370 <string>:1(<module>)
1 0.877 0.877 1.367 1.367 prime1.py:39(range_sieve)
259982 0.490 0.000 0.490 0.000 prime1.py:51(sillyfun)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
Finally,here is he complete code that generates the lists of small primes (in a very silly way) so that you can check what results you get: http://pastebin.com/7sfN4mG4
UPDATE By popular demand, the profiling data for the first chunk of code. No data on how many times the inner loop is executed, but it seems pretty clear it's the same as the third.
>>> prof("range_f1(N-delta,N,sp)")
4835 function calls in 14.184 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 14.184 14.184 <string>:1(<module>)
1 14.162 14.162 14.183 14.183 prime1.py:69(range_f1)
4832 0.021 0.000 0.021 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
The difference is an algorithmic one. In the first version, trial division, you test each candidate against all small primes - that you don't stop when the small prime exceeds
candidate ** 0.5
doesn't matter for therange(10**9 - 10**5 ,10**9)
if smallprimes has a good upper bound, but it would if the length of the range were much larger in relation to the upper bound. For composites, that doesn't incur much cost, since most of them have at least one small prime divisor. But for primes, you have to go all the way toN**0.5
. There are roughly10**5/log(10**9)
primes in that interval, each of them is trial-divided by about10**4.5/log(10**4.5)
primes, so that makes about1.47*10**7
trial divisions against primes.On the other hand, with the sieve, you only cross off composites, each composite is crossed off as many times as it has prime divisors, while primes aren't crossed off at all (so primes are free). The number of prime divisors of
n
is bounded by (a multiple of)log(n)
(that's a crude upper bound, usually greatly overestimating), so that gives an upper bound of10**5*log(10**9)
(times a small constant) crossings-off, about2*10**6
. In addition to that, the crossing off may be less work than a division (don't know about Python, it is for C arrays). So you're doing less work with the sieve.Edit: collected the actual numbers for
10**9-10**5
to10**9
.The sieve does only 259987 crossings-off, you see that the crude upper bound above is overestimating by a large factor. The trial division needs more than 20 million divisions, 16433632 of those for primes (
x/log x
underestimates the number of primes, forx = 10**4.5
by about 10%), 3435183 are used for the 3297 numbers in that range whose smallest prime factor is larger thann**(1/3)
.