I've recently taken a look at the python-requests module and I'd like to write a simple web crawler with it. Given a collection of start URLs, I want to write a Python function that searches the page content of the start URLs for other URLs and then calls the same function again as a callback with the new URLs as input, and so on. At first I thought that event hooks would be the right tool for this purpose, but their documentation is quite sparse. On another page I read that functions used for event hooks have to return the same object that was passed to them, so event hooks are apparently not feasible for this kind of task. Or I simply didn't get it right...
Here is some pseudocode of what I want to do (borrowed from a pseudo Scrapy spider):
import lxml.html

def parse(response):
    for url in lxml.html.parse(response.url).xpath('//@href'):
        yield Request(url=url, callback=parse)
Can someone give me some insight into how to do this with python-requests? Are event hooks the right tool for that, or do I need something different? (Note: Scrapy is not an option for me, for various reasons.) Thanks a lot!
Here is how I would do it:
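Something along these lines (a rough sketch; the seen set, the depth limit and the http-prefix filter are illustrative choices of mine, not anything grequests requires):

import grequests  # importing grequests gevent-monkeypatches the stdlib
import lxml.html
from urllib.parse import urljoin


def crawl(urls, seen, depth=2):
    """Fetch one "level" of URLs concurrently, then recurse on the links found."""
    if depth <= 0 or not urls:
        return
    seen.update(urls)
    # grequests.map() sends all requests of this level concurrently via gevent
    responses = grequests.map(grequests.get(u) for u in urls)
    found = set()
    for response in responses:
        if response is None:          # request failed / timed out
            continue
        tree = lxml.html.fromstring(response.content)
        for href in tree.xpath('//a/@href'):
            url = urljoin(response.url, href)   # resolve relative links
            if url.startswith('http') and url not in seen:
                found.add(url)
        # ... do something with the page here (store it, parse it further) ...
    # recurse on the newly discovered links
    crawl(found, seen, depth - 1)


crawl({'http://example.com'}, seen=set())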
I haven't tested the code, but the general idea is there. Note that I am using grequests instead of requests for a performance boost. grequests is basically gevent + requests, and in my experience it is much faster for this sort of task because you retrieve the links asynchronously with gevent.

Edit: here is the same algorithm without using recursion:
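A sketch of that version (the deque-based queue and the max_pages cap are again just illustrative choices):

import grequests
import lxml.html
from collections import deque
from urllib.parse import urljoin


def crawl(start_urls, max_pages=100):
    seen = set(start_urls)
    queue = deque(start_urls)
    while queue and len(seen) < max_pages:
        # pop everything currently queued and fetch it as one concurrent batch
        batch = [queue.popleft() for _ in range(len(queue))]
        for response in grequests.map(grequests.get(u) for u in batch):
            if response is None:      # request failed / timed out
                continue
            tree = lxml.html.fromstring(response.content)
            for href in tree.xpath('//a/@href'):
                url = urljoin(response.url, href)
                if url.startswith('http') and url not in seen:
                    seen.add(url)
                    queue.append(url)
    return seen


crawl(['http://example.com'])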