So I've written a small script to download pictures from a website. It goes through every 7-character alphanumeric value, where the first character is always a digit. The problem is that if I stop the script and start it up again, I have to start all over.
Can I seed itertools.product somehow with the last value I got, so I don't have to go through them all again?
Thanks for any input.
Here is part of the code:
numbers = '0123456789'
alnum = numbers + 'abcdefghijklmnopqrstuvwxyz'
len7 = itertools.product(numbers, alnum, alnum, alnum, alnum, alnum, alnum)  # length 7
for p in itertools.chain(len7):
    currentid = ''.join(p)
    # semi-static vars
    url = 'http://mysite.com/images/'
    url += currentid
    # need to get the real url because of the redirect
    print "Trying " + url
    req = urllib2.Request(url)
    res = openaurl(req)
    if res == "continue":
        continue
    finalurl = res.geturl()
    # ok, we have the full url; now check whether it is real
    try:
        file = urllib2.urlopen(finalurl)
    except urllib2.HTTPError, e:
        print e.code
        continue  # no response to read, so skip this id
    im = cStringIO.StringIO(file.read())
    img = Image.open(im)
    writeimage(img)
Once you get a fair way along the iterator, it's going to take a while to get to the spot using dropwhile.
You probably should adapt a recipe like this so that you can save the state with a pickle between runs.
Make sure that only one instance of your script can run at a time, or you will need something more elaborate, such as a server process that hands out the ids to the scripts.
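As a rough sketch of the "save the state with a pickle" part only (not the recipe referred to above), something like the following could store whatever describes your position, for example the last id processed or a counter. The file name and the helper names `save_state` / `load_state` are made up for illustration, and `os.replace` assumes Python 3.3 or later.

```python
import os
import pickle

STATE_FILE = 'product_state.pickle'  # hypothetical file name


def load_state(path=STATE_FILE, default=None):
    """Return the saved state, or `default` if no checkpoint exists yet."""
    try:
        with open(path, 'rb') as f:
            return pickle.load(f)
    except (IOError, OSError):
        return default


def save_state(state, path=STATE_FILE):
    """Write to a temporary file first, then rename it over the old checkpoint,
    so an interrupted write does not leave a half-written state file."""
    tmp = path + '.tmp'
    with open(tmp, 'wb') as f:
        pickle.dump(state, f)
    os.replace(tmp, path)
```

Calling `save_state(currentid)` after each successful download and `load_state()` at startup would be one way to use it. It does not guard against two copies of the script running at once.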
here's a solution based on pypy's library code (thanks to agf's suggestion in the comments).
the state is available via the `.state` attribute and can be reset via `.goto(state)`, where `state` is an index into the sequence (starting at 0). there's a demo at the end (you need to scroll down, i'm afraid). this is way faster than discarding values.
you should test it more - i may have made a dumb mistake - but the idea is quite simple, so you should be able to fix it :o) you're free to use my changes; no idea what the original pypy licence is.
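for readers who only want the shape of the idea, here is a minimal sketch of an iterator with a `.state` attribute and a `.goto(state)` method. it is an illustration written for this purpose, not the pypy-derived code, and the class name `ResumableProduct` is made up. it assumes the inputs are finite sequences.

```python
class ResumableProduct(object):
    """Iterate like itertools.product, but expose and restore the position."""

    def __init__(self, *seqs):
        self.seqs = [tuple(s) for s in seqs]
        self.total = 1
        for s in self.seqs:
            self.total *= len(s)
        self.state = 0  # index of the next tuple to be produced

    def goto(self, state):
        """Jump so that the next tuple produced is the one at index `state`."""
        self.state = state

    def __iter__(self):
        return self

    def __next__(self):
        if self.state >= self.total:
            raise StopIteration
        index, item = self.state, []
        # the last sequence varies fastest, so decode from the right
        for s in reversed(self.seqs):
            index, pos = divmod(index, len(s))
            item.append(s[pos])
        self.state += 1
        return tuple(reversed(item))

    next = __next__  # Python 2 spelling of the same method
```

saving `p.state` before stopping and calling `p.goto(saved)` after restarting resumes at the same tuple without producing the earlier ones.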
also, `state` isn't really the full state - it doesn't include the original arguments - it's just an index into the sequence. maybe it would have been better to call it index, but there are already indici[sic]es in the code...
update
here's a simpler version that is the same idea but works by transforming a sequence of numbers. so you just `imap` it over `count(n)` to get the sequence offset by `n`.
(the downside here is that if you want to stop and restart you need to have kept track yourself of how many you have used)
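a sketch of what "transforming a sequence of numbers" might look like: treat the position as a mixed-radix number and decode it into one element per input sequence. the names `nth_product` and `product_from` are invented here, and a generator expression stands in for `imap` so the same code runs on python 2 and 3.

```python
import itertools


def nth_product(index, *seqs):
    """Return the tuple that itertools.product(*seqs) yields at position `index`."""
    if index < 0:
        raise IndexError("index must be non-negative")
    item = []
    # the last sequence varies fastest, so peel "digits" off from the right
    for seq in reversed(seqs):
        index, pos = divmod(index, len(seq))
        item.append(seq[pos])
    if index:
        raise IndexError("index past the end of the product")
    return tuple(reversed(item))


def product_from(start, *seqs):
    """Lazily yield the tuples of product(*seqs), beginning at position `start`."""
    total = 1
    for seq in seqs:
        total *= len(seq)
    # count(start) supplies start, start + 1, ...; stop once the product is exhausted
    indices = itertools.takewhile(lambda i: i < total, itertools.count(start))
    return (nth_product(i, *seqs) for i in indices)
```

wrapping the loop in `enumerate(product_from(n, ...), n)` is one way to keep track of how many values have been used.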
If your input sequences don't have any duplicate values, this may be faster than `dropwhile` to advance `product`, as it calculates the correct point to resume iteration instead of comparing every dropped value.
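One way to calculate the resume point directly from the last value, sketched here as an assumption about the approach rather than a definitive implementation: for each position, working from right to left, keep the prefix of the last value fixed, take the elements that come after the last value's element at that position, and let `itertools.product` generate the tails. The function name `resume_after` is made up, and the use of `index()` is why the sequences must not contain duplicates.

```python
import itertools


def resume_after(last, *seqs):
    """Yield the tuples of product(*seqs) that come strictly after `last`.

    Assumes each sequence supports index() and slicing and has no duplicates.
    """
    for i in range(len(seqs) - 1, -1, -1):
        prefix = tuple(last[:i])            # positions left of i stay as in `last`
        pos = seqs[i].index(last[i])        # where `last` sits in sequence i
        remaining = seqs[i][pos + 1:]       # elements not yet used at position i
        # every later position runs through its full sequence again
        for tail in itertools.product(remaining, *seqs[i + 1:]):
            yield prefix + tail
```

With the sequences from the question, `resume_after(last_id, numbers, alnum, alnum, alnum, alnum, alnum, alnum)` would continue with the id that follows `last_id`, where `last_id` is the last 7-character string processed.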