Example code for Scrapy process_links and process_request

Posted 2020-07-14 11:59

Question:

I am new to Scrapy and I was hoping someone could give me good example code showing when process_links and process_request are most useful. I see that process_links is used to filter URLs, but I don't know how to code it.

Thank you.

Answer 1:

You mean the process_links and process_request arguments of scrapy.spiders.Rule, which is most commonly used in a scrapy.spiders.CrawlSpider.

They do pretty much what their names say; in other words, they act as a sort of middleware between the time a link is extracted and the time it is processed/downloaded.
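For context, here is a minimal sketch of how the two hooks are attached to a Rule inside a CrawlSpider. The spider name, start URL, and link pattern are placeholder assumptions; the two strings refer to spider methods like the examples further below (Rule also accepts plain callables here):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ExampleSpider(CrawlSpider):
    name = 'example'                      # placeholder name
    start_urls = ['https://example.com']  # placeholder start URL

    rules = (
        Rule(
            LinkExtractor(allow=r'/items/'),    # placeholder pattern
            process_links='process_links',      # spider method name (or a callable)
            process_request='process_request',  # spider method name (or a callable)
        ),
    )

    # process_links and process_request are defined as in the examples below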

process_links sits between when a link is extracted and when it is turned into a request. There are pretty cool use cases for this; just to name a few common ones (both appear in the example below):

  1. Filter out some links you don't like.
  2. Fix urls manually to avoid bad requests or unnecessary redirects.

example:

def process_links(self, links):
    # "links" is the list of Link objects extracted by the LinkExtractor
    for link in links:
        # 1. filter out links you don't like
        if 'foo' in link.text:
            continue  # skip every link whose text contains "foo"
        # 2. fix the url up front to avoid an unnecessary redirect
        link.url = link.url + '/'
        yield link

process_request sits between the moment a request is created and the moment it is downloaded. It shares some use cases with process_links but can also do some other cool stuff, like:

  1. Modify headers (e.g. cookies).
  2. Change details like the callback, depending on some keywords in the url.

example:

def process_request(self, request):
    # (note: since Scrapy 2.0 this hook is called with (request, response))
    # 1. modify headers, e.g. set a cookie
    request = request.replace(headers={'Cookie': 'foobar'})
    # 2. change the callback depending on keywords in the url
    if 'foo' in request.url:
        return request.replace(callback=self.parse_foo)
    elif 'bar' in request.url:
        return request.replace(callback=self.parse_bar)
    return request

You probably won't use them often, but on some occasions these two can be really convenient and easy shortcuts.



Tags: python scrapy