How to use python-requests and event hooks to write a web crawler?

Posted 2019-03-16 18:20

I've recently taken a look at the python-requests module and I'd like to write a simple web crawler with it. Given a collection of start URLs, I want to write a Python function that searches the webpage content of the start URLs for other URLs and then calls the same function again as a callback with the new URLs as input, and so on. At first I thought that event hooks would be the right tool for this purpose, but their documentation is quite sparse. On another page I read that functions used for event hooks have to return the same object that was passed to them, so event hooks are apparently not feasible for this kind of task. Or maybe I simply didn't get it right...
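For reference, a minimal sketch of how a response hook is wired up in requests (the URL below is just a placeholder): the hook receives the Response object, and returning None leaves the response unchanged.

import requests

def log_url(response, *args, **kwargs):
    # The 'response' hook fires once the response has arrived.
    print(response.url)

# 'response' is the only hook event supported by current requests versions.
requests.get('http://example.com', hooks={'response': log_url})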

Here is some pseudocode of what I want to do (borrowed from a pseudo Scrapy spider):

import lxml.html    

def parse(response):
    for url in lxml.html.parse(response.url).xpath('//@href'):
        yield Request(url=url, callback=parse)

Can someone give me an insight on how to do it with python-requests? Are event hooks the right tool for that or do I need something different? (Note: Scrapy is not an option for me due to various reasons.) Thanks a lot!

1 Answer
手持菜刀,她持情操
#2 · 2019-03-16 18:37

Here is how I would do it:

import grequests
from bs4 import BeautifulSoup


def get_urls_from_response(r):
    # Extract all href values from the page's <a> tags.
    soup = BeautifulSoup(r.text, 'html.parser')
    urls = [link.get('href') for link in soup.find_all('a')]
    return urls


def print_url(response, *args, **kwargs):
    # Response hook: called as soon as each response arrives.
    print(response.url)


def recursive_urls(urls):
    """
    Given a list of starting urls, recursively finds all descendant urls.
    """
    if len(urls) == 0:
        return
    rs = [grequests.get(url, hooks={'response': print_url}) for url in urls]
    responses = grequests.map(rs)
    url_lists = [get_urls_from_response(response)
                 for response in responses if response is not None]
    urls = sum(url_lists, [])  # flatten list of lists into a list
    recursive_urls(urls)

I haven't tested the code but the general idea is there.

Note that I am using grequests instead of requests for a performance boost. grequests is basically gevent + requests, and in my experience it is much faster for this sort of task, because you retrieve the links asynchronously with gevent.
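If you have not used grequests before, here is a minimal sketch of the pattern the code above relies on (the URLs are placeholders): grequests.get only builds an unsent request, and grequests.map sends the whole batch concurrently, returning None for any request that failed.

import grequests

pending = [grequests.get(u) for u in ['http://example.com', 'http://example.org']]
for response in grequests.map(pending):
    if response is not None:  # failed requests come back as None
        print(response.status_code, response.url)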


Edit: here is the same algorithm without using recursion:

import grequests
from bs4 import BeautifulSoup


def get_urls_from_response(r):
    # Extract all href values from the page's <a> tags.
    soup = BeautifulSoup(r.text, 'html.parser')
    urls = [link.get('href') for link in soup.find_all('a')]
    return urls


def print_url(response, *args, **kwargs):
    # Response hook: called as soon as each response arrives.
    print(response.url)


def recursive_urls(urls):
    """
    Given a list of starting urls, iteratively finds all descendant urls.
    """
    while urls:
        rs = [grequests.get(url, hooks={'response': print_url}) for url in urls]
        responses = grequests.map(rs)
        url_lists = [get_urls_from_response(response)
                     for response in responses if response is not None]
        urls = sum(url_lists, [])  # flatten list of lists into a list

if __name__ == "__main__":
    recursive_urls(["INITIAL_URLS"])