Scrapy - Crawl Multiple Pages Per Item

Posted 2019-04-13 12:25

I am trying to crawl a few extra pages per item to grab some location information.

At the end of the item, before returning, I check whether we need to crawl extra pages to grab the information. These pages contain some location details and are fetched with a simple GET request.

I.e. http://site.com.au/MVC/Offer/GetLocationDetails/?locationId=3761&companyId=206

The above link returns either a select element with more location pages to crawl, or a dd/dt list with the address details. Either way I need to extract the address info and append it to item['locations'].

So far I have (at the end of the parse block):

return self.fetchLocations(locations_selector, company_id, item)

where locations_selector contains a list of locationIds.

Then I have

def fetchLocations(self, locations, company_id, item):
    for location in locations:
        if len(location) > 1:
            yield Request(
                "http://site.com.au/MVC/Offer/GetLocationDetails/?locationId=" + location + "&companyId=" + company_id,
                callback=self.parseLocation,
                meta={'company_id': company_id, 'item': item})

And finally

def parseLocation(self,response):
    hxs = HtmlXPathSelector(response)
    item = response.meta['item']

    dl = hxs.select("//dl")
    if len(dl)>0:
        address = hxs.select("//dl[1]/dd").extract()
        loc = {'address':remove_entities(replace_escape_chars(replace_tags(address[0], token=' '), replace_by=''))}
        yield loc

    locations_select = hxs.select("//select/option/@value").extract()
    if len(locations_select)>0:
        yield self.fetchLocations(locations_select, response.meta['company_id'], item)

I can't seem to get this working.

Tags: python scrapy
1 Answer

The star, answered 2019-04-13 13:04

This is your code:

def parseLocation(self,response):
    hxs = HtmlXPathSelector(response)
    item = response.meta['item']

    dl = hxs.select("//dl")
    if len(dl)>0:
        address = hxs.select("//dl[1]/dd").extract()
        loc = {'address':remove_entities(replace_escape_chars(replace_tags(address[0], token=' '), replace_by=''))}
        yield loc

    locations_select = hxs.select("//select/option/@value").extract()
    if len(locations_select)>0:
        yield self.fetchLocations(locations_select, response.meta['company_id'], item)

Callbacks must yield either requests to other pages or items. In the code above I see requests yielded, but no items: you have yield loc, but loc is a plain dict, not an Item subclass. There is a second problem on the last line: yield self.fetchLocations(...) yields the generator object itself as a single value, rather than yielding each Request it produces; you need to loop over the generator (or delegate to it) and yield the requests one by one.
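To make the second pitfall concrete, here is a minimal pure-Python sketch (no Scrapy needed, and the names make_requests / broken_callback / fixed_callback are hypothetical stand-ins for fetchLocations and parseLocation): yielding a generator emits one generator object, while delegating with yield from (or a for loop) emits each element.

```python
def make_requests(ids):
    # Stand-in for fetchLocations: yields one "request" per location id.
    for i in ids:
        yield "request-%s" % i

def broken_callback():
    # Mirrors "yield self.fetchLocations(...)": the generator object
    # itself is yielded as a single element.
    yield make_requests([1, 2])

def fixed_callback():
    # Delegates to the generator, so each request is yielded individually.
    # On Python 2 (as in this old Scrapy code), use:
    #     for req in make_requests([1, 2]): yield req
    yield from make_requests([1, 2])

print(list(broken_callback()))  # a single generator object, not requests
print(list(fixed_callback()))   # ['request-1', 'request-2']
```

The same delegation is needed in parseLocation so Scrapy's scheduler receives Request objects rather than a generator it cannot dispatch.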
