Scrapy yield a Request, parse in the callback, but

Posted 2019-08-16 17:54

I'm trying to test some web pages in Scrapy. My idea is to yield a Request for each URL that satisfies a condition, count the number of certain items on the page, and then, back in the original condition, return True/False depending...

Here is some code to show what I mean:

def filter_categories(self):
    if condition:
        test = yield Request(url=link, callback = self.test_page, dont_filter=True)
        return (test, None)

def test_page(self, response):
    ... parse the response...
    return True/False depending

I have tried passing an item in the request, but no matter what I do, the return line gets triggered before test_page is ever called...

So I guess my question becomes: is there any way to pass data back to the filter_categories method synchronously, so that I can use the result of test_page to decide whether my test is satisfied?

Any other ideas are also welcome.

2 answers
兄弟一词,经得起流年.
#2 · 2019-08-16 18:21

If I understood you correctly, you want to yield a scrapy.Request only for URLs where the condition is True. Am I right? Here is an example:

def parse(self, response):
    if self.test_page(response):
        item = Item()
        item['url'] = 'xpath or css'
        yield item
    if condition:
        yield Request(url=new_link, callback = self.parse, dont_filter=True)


def test_page(self, response):
    ... parse the response...
    return True/False depending

If you give more info, I'll try to help more.

Here is part of my code:

def parse(self, response):
    # Dispatch to the right parser based on the URL.
    if 'tag' in response.url:
        return self.parse_tag(response)
    if 'company' in response.url:
        return self.parse_company(response)

def parse_tag(self, response):
    news_list = response.xpath("..//div[contains(@class, 'block block-thumb ')]")
    company = response.meta['company']
    for i in news_list:
        item = Item()
        item['date'] = i.xpath("./div/div/time/@datetime").extract_first()
        item['title'] = i.xpath("./div/h2/a/text()").extract_first()
        item['description'] = i.xpath("./div/p//text()").extract_first()
        item['url'] = i.xpath("./div/h2/a/@href").extract_first()

        item.update(self.get_common_items(company))

        item['post_id'] = response.meta['post_id']

        if item['title']:
            # Carry the partly filled item to the next callback via meta.
            yield scrapy.Request(item['url'], callback=self.parse_tags, meta={'item': item})

    # Follow pagination, passing the same meta along.
    has_next = response.xpath("//div[contains(@class, 'river-nav')]//li[contains(@class, 'next')]/a/@href").extract_first()
    if has_next:
        next_url = 'https://example.com' + has_next + '/'
        yield scrapy.Request(next_url, callback=self.parse_tag,
                             meta=response.meta)

def parse_tags(self, response):
    # Last callback in the chain: finish the item and emit it.
    item = response.meta['item']
    item['tags'] = response.xpath(".//div[@class='accordion recirc-accordion']//ul//li[not(contains(@class, 'active'))]//a/text()").extract()

    yield item
\"骚年 ilove
#3 · 2019-08-16 18:26

Take a look at the inline_requests package, which should let you achieve this.
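
Some background on why the return line in the question fires before test_page runs: a yield expression only hands the request to whoever is iterating the generator, and it evaluates to whatever that caller later passes back in with send() (None if nothing is sent). A decorator in the style of inline_requests can use this to feed a response back into the method. The sketch below shows only that mechanism in plain Python. FakeRequest, fake_download and drive are hypothetical stand-ins invented for illustration, not Scrapy or inline_requests APIs, and a real implementation has to work asynchronously through the engine.

```python
from dataclasses import dataclass


@dataclass
class FakeRequest:
    """Hypothetical stand-in for scrapy.Request."""
    url: str


def fake_download(request):
    """Hypothetical stand-in for a download; returns a canned page body."""
    pages = {
        "https://example.com/a": "item item item",
        "https://example.com/b": "",
    }
    return pages.get(request.url, "")


def filter_categories(link):
    # The yield expression evaluates to whatever the driver sends back in.
    body = yield FakeRequest(url=link)
    # Count the items on the page and report whether the test passed.
    return body.count("item") > 0


def drive(gen):
    """Run a generator, answering each yielded request with its 'response'."""
    try:
        request = next(gen)  # run up to the first yield
        while True:
            # Resume the generator, making the yield evaluate to the body.
            request = gen.send(fake_download(request))
    except StopIteration as stop:
        return stop.value  # the generator's return value


print(drive(filter_categories("https://example.com/a")))
print(drive(filter_categories("https://example.com/b")))
```

Scrapy itself iterates callbacks without sending anything back, which is why the yield in the question evaluates to None and execution falls straight through to the return.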

Another solution is to not insist on returning the result from the original method (filter_categories in your case). Instead, chain the requests, pass data along with the meta attribute of each request, and return the result from the last parse method in the chain (test_page in your case).
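
A rough sketch of that chaining idea, written so it runs without Scrapy installed. The Request and Response classes, the PAGES table and the run loop are hypothetical stand-ins made up for this sketch, and check_page plays the role of test_page from the question. The point is only the shape: the first method schedules the page and carries context in meta, and the last callback is the one that emits the verdict.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Request:
    """Hypothetical stand-in for scrapy.Request (url, callback, meta only)."""
    url: str
    callback: Callable
    meta: dict = field(default_factory=dict)


@dataclass
class Response:
    """Hypothetical stand-in for a Scrapy response."""
    url: str
    text: str
    meta: dict


PAGES = {  # canned page bodies, invented for the sketch
    "https://example.com/cat/1": "item item",
    "https://example.com/cat/2": "",
}


def filter_categories(links):
    # First step of the chain: schedule each page, carrying context in meta.
    for link in links:
        yield Request(link, callback=check_page, meta={"category": link})


def check_page(response):
    # Last step of the chain: the verdict is emitted here, not returned upstream.
    count = response.text.count("item")
    yield {"category": response.meta["category"], "passed": count > 0}


def run(start):
    """Tiny engine loop: call callbacks for requests, collect everything else."""
    queue, results = deque(start), []
    while queue:
        req = queue.popleft()
        response = Response(req.url, PAGES.get(req.url, ""), req.meta)
        for out in req.callback(response):
            (queue if isinstance(out, Request) else results).append(out)
    return results


results = run(filter_categories(PAGES))
print(results)
```

In a real spider the same shape would be a yield of scrapy.Request with callback= and meta= arguments, and the callback would read response.meta, as in the parse_tag / parse_tags code in the answer above.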
