I've built a crawler that has to run over about 5M pages (by incrementing the URL ID) and then parse the pages that contain the info I need.
After running an algorithm over 200K of the URLs and saving the good and bad results, I found that I was wasting a lot of time: I could see that there are a few recurring subtrahends which I can use to predict the next valid URL.
You can spot the subtrahends quite quickly (a small example of the first few "good" IDs):
510000011 # +8
510000029 # +18
510000037 # +8
510000045 # +8
510000052 # +7
510000060 # +8
510000078 # +18
510000086 # +8
510000094 # +8
510000102 # +8
510000110 # etc.
510000128
510000136
510000144
510000151
510000169
510000177
510000185
510000193
510000201
After crawling about 200K URLs, which gave me only 14K good results, I knew I was wasting my time and needed to optimize, so I ran some statistics and built a function that checks the URLs while increasing the ID by 8/18/17/8 (the top returning subtrahends), etc.
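(For illustration, here's roughly how such statistics can be gathered; good_ids is a hypothetical list holding the known-good IDs from the earlier run:)

from collections import Counter

# tally the deltas between consecutive good IDs by frequency
good_ids = [510000011, 510000029, 510000037, 510000045, 510000052, 510000060]
deltas = Counter(b - a for a, b in zip(good_ids, good_ids[1:]))
print(deltas.most_common())  # -> [(8, 3), (18, 1), (7, 1)]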
This is the function:
import time

def checkNextID(ID):
    global numOfRuns, curRes, lastResult
    while ID < lastResult:
        try:
            numOfRuns += 1
            if numOfRuns % 10 == 0:
                time.sleep(3)  # sleep every 10 iterations
            if isValid(ID + 8):
                parseHTML(curRes)
                checkNextID(ID + 8)
                return 0
            if isValid(ID + 18):
                parseHTML(curRes)
                checkNextID(ID + 18)
                return 0
            if isValid(ID + 7):
                parseHTML(curRes)
                checkNextID(ID + 7)
                return 0
            if isValid(ID + 17):
                parseHTML(curRes)
                checkNextID(ID + 17)
                return 0
            if isValid(ID + 6):
                parseHTML(curRes)
                checkNextID(ID + 6)
                return 0
            if isValid(ID + 16):
                parseHTML(curRes)
                checkNextID(ID + 16)
                return 0
            else:
                checkNextID(ID + 1)
                return 0
        except Exception as e:
            print("something went wrong: " + str(e))
What it basically does: checkNextID(ID) is given the first ID I know contains the data, minus 8, so the first iteration will match the first "if isValid" clause (isValid(ID + 8) will return True).
lastResult is a variable which saves the last known URL ID, so we'll run until ID reaches lastResult.
isValid() is a function that takes an ID plus one of the subtrahends and returns True if the URL contains what I need, saving a soup object of the page to a global variable named curRes; it returns False if the URL doesn't contain the data I need.
parseHTML() is a function that takes the soup object (curRes), parses the data I need, saves the data to a CSV, and then returns True.
If isValid() returns True, we call parseHTML() and then try to check the next ID plus the subtrahends (by calling checkNextID(ID + subtrahend)); if none of them returns what I'm looking for, I increase the ID by 1 and check again until I find the next valid URL.
You can see the rest of the code here.
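(For context, simplified sketches of those two helpers; the URL pattern, selector, and CSV layout below are placeholders, not the real code:)

import csv
import requests
from bs4 import BeautifulSoup

BASE_URL = "http://www.example.com/item/%d"  # placeholder URL pattern

def isValid(ID):
    global curRes
    soup = BeautifulSoup(requests.get(BASE_URL % ID).text, "html.parser")
    if soup.find("div", {"class": "item-data"}):  # placeholder selector
        curRes = soup  # save the soup object for parseHTML()
        return True
    return False

def parseHTML(soup):
    row = [cell.get_text() for cell in soup.find_all("td")]  # placeholder parse
    with open("results.csv", "a") as f:
        csv.writer(f).writerow(row)
    return True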
After running the code I got about ~950 good results, and suddenly an exception was raised:
"something went wrong: maximum recursion depth exceeded while calling a Python object"
I could see in Wireshark that the script got stuck on ID 510009541 (I started my script at 510000003); the script tried getting the URL with that ID a few times before I noticed the error and stopped it.
I was really excited to see that I got the same results, but 25x-40x faster than my old script and with fewer HTTP requests. It's very precise: I missed only 1 result per 1000 good results, which is fine by me, since it's impossible to run it 5M times. My old script ran for 30 hours and got 14-15K results, while my new script gave me ~960 results in 5-10 minutes.
I read about stack limitations, but there must be a way to implement the algorithm I'm after in Python (I can't go back to my old "algorithm"; it will never end).
Thanks!
You can increase the capacity of the stack with the following:
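For example (the value here is arbitrary; raising the limit only postpones the crash rather than fixing the algorithm):

import sys
sys.setrecursionlimit(10000)  # default is 1000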
Python doesn't have great support for recursion because of its lack of TRE (Tail Recursion Elimination).
This means that each call to your recursive function creates a new stack frame, and because there is a limit on stack depth (1000 by default, which you can check with sys.getrecursionlimit()), your program will end up crashing when it hits that limit. (You can of course change it using sys.setrecursionlimit(), but it's not recommended.)
As the other answer has already given you a much nicer way to solve this in your case (replacing the recursion with a simple loop), there is another solution if you still want to use recursion: use one of the many recipes for implementing TRE in Python, like this one.
N.B.: My answer is meant to give you more insight into why you get the error; I'm not advising you to use TRE, because, as I already explained, in your case a loop will be much better and easier to read.
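For the curious, here is a minimal trampoline-style sketch of the TRE idea (an illustration of the general technique, not the linked recipe): instead of recursing, the function returns a thunk, and a small driver loop keeps calling thunks, so the Python call stack never grows.

def trampoline(fn, *args):
    # keep calling the returned thunks until we get a non-callable result
    result = fn(*args)
    while callable(result):
        result = result()
    return result

def countdown(n):
    if n == 0:
        return "done"
    return lambda: countdown(n - 1)  # return a thunk instead of recursing

print(trampoline(countdown, 100000))  # prints "done", no recursion error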
Instead of doing recursion, the parts of the code with checkNextID(ID + 18) and similar could be replaced with ID += 18. Then, if you remove all instances of return 0, it should do the same thing but as a simple loop. You should then put a return 0 at the end and make your variables non-global. This turns the recursion into a loop:
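A sketch of the transformed function (isValid(), parseHTML() and curRes are as in the question; lastResult is now passed in instead of being global):

import time

def checkNextID(ID, lastResult):
    numOfRuns = 0
    while ID < lastResult:
        try:
            numOfRuns += 1
            if numOfRuns % 10 == 0:
                time.sleep(3)  # sleep every 10 iterations
            # try the common subtrahends first, then fall back to +1
            for step in (8, 18, 7, 17, 6, 16):
                if isValid(ID + step):
                    parseHTML(curRes)  # curRes was set by isValid()
                    ID += step
                    break
            else:
                ID += 1  # no subtrahend matched; advance by one and retry
        except Exception as e:
            # note: ID is not advanced here, so a persistent error on one
            # ID will keep retrying it, same as the original recursion
            print("something went wrong: " + str(e))
    return 0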