Scrapy item extraction scope issue

Posted 2019-10-21 13:37

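I'm having scope issues returning a Scrapy item (players) to my pipeline. I'm fairly certain I know what the problem is, but I'm not sure how to integrate the solution into my code. I'm also certain that the code for the pipeline to process the item is now written correctly. It's just that I declared the players item inside the parseRoster() function, so I know its scope is limited to that function.

My question is: where do I need to declare the players item in my code for it to be visible to my pipeline? My goal is to get this data into my database. I would assume it belongs in the main loop of my code, and if that's the case, how can I return both item and my newly declared players item?

My code is below: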

import scrapy

# TeamStats and Player are the project's Item classes, defined elsewhere (e.g. in items.py).


class NbastatsSpider(scrapy.Spider):
    name = "nbaStats"

    # Only start_urls is set, not allowed_domains, because that caused issues
    # when navigating to the team roster pages.
    start_urls = [
        "http://espn.go.com/nba/teams"
    ]

    def parse(self, response):
        for division in response.xpath('//div[@id="content"]//div[contains(@class, "mod-teams-list-medium")]'):
            for team in division.xpath('.//div[contains(@class, "mod-content")]//li'):
                item = TeamStats()
                item['division'] = division.xpath('.//div[contains(@class, "mod-header")]/h4/text()').extract()[0]
                item['team'] = team.xpath('.//h5/a/text()').extract()[0]
                item['rosterurl'] = "http://espn.go.com" + team.xpath('.//div/span[2]/a[3]/@href').extract()[0]

                # Carry the team item over to the roster callback via request.meta.
                request = scrapy.Request(item['rosterurl'], callback=self.parseWPNow)
                request.meta['play'] = item
                yield request

    def parseWPNow(self, response):
        item = response.meta['play']
        # parseRoster is a generator; Scrapy iterates over whatever the callback returns.
        return self.parseRoster(item, response)

    def parseRoster(self, item, response):
        for player in response.xpath("//td[@class='sortcell']"):
            # Create a new item per row so that earlier yields are not overwritten.
            players = Player()
            players['name'] = player.xpath("a/text()").extract()[0]
            players['position'] = player.xpath("following-sibling::td[1]").extract()[0]
            players['age'] = player.xpath("following-sibling::td[2]").extract()[0]
            players['height'] = player.xpath("following-sibling::td[3]").extract()[0]
            players['weight'] = player.xpath("following-sibling::td[4]").extract()[0]
            players['college'] = player.xpath("following-sibling::td[5]").extract()[0]
            players['salary'] = player.xpath("following-sibling::td[6]").extract()[0]
            yield players
        item['playerurl'] = response.xpath("//td[@class='sortcell']/a").extract()
        yield item

Answer 1:

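According to the relevant part of Scrapy's data flow:

"The Engine sends scraped Items (returned by the Spider) to the Item Pipeline and Requests (returned by the Spider) to the Scheduler."

In other words, return or yield your item instances from the spider, then handle them in the process_item() method of your pipeline. Since you have multiple item classes, distinguish them with the isinstance() built-in function: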

def process_item(self, item, spider):
    if isinstance(item, TeamStats):
        # process team stats
        pass

    if isinstance(item, Player):
        # process player
        pass

    # A pipeline must return the item so later pipelines can receive it.
    return item
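
To make the flow concrete, here is a minimal sketch of how the yielded items might reach a database. It is not taken from the question's project. The TeamStats and Player classes below are plain dict stand-ins for the real scrapy.Item subclasses, so the sketch runs without Scrapy installed. SqlitePipeline, parse_roster, run_demo, the table layout, and the sample rows are all hypothetical, and an in-memory SQLite database is assumed in place of the asker's actual database.

```python
import sqlite3


# Stand-ins for the project's scrapy.Item subclasses (assumption: the real
# TeamStats and Player classes support dict-style access, as scrapy.Item does).
class TeamStats(dict):
    pass


class Player(dict):
    pass


def parse_roster(team, rows):
    """Mimics a spider callback: yields one Player per row, then the team."""
    for row in rows:
        player = Player()  # a fresh item per row, local to this function
        player["name"] = row["name"]
        player["position"] = row["position"]
        yield player
    yield team


class SqlitePipeline:
    """Hypothetical pipeline that routes items to tables by their class."""

    def open_spider(self, spider):
        # An in-memory database keeps the sketch self-contained.
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE teams (team TEXT, division TEXT)")
        self.conn.execute("CREATE TABLE players (name TEXT, position TEXT)")

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        if isinstance(item, TeamStats):
            self.conn.execute(
                "INSERT INTO teams VALUES (?, ?)",
                (item["team"], item["division"]),
            )
        elif isinstance(item, Player):
            self.conn.execute(
                "INSERT INTO players VALUES (?, ?)",
                (item["name"], item["position"]),
            )
        return item  # pass the item on to any later pipeline


def run_demo():
    """Feeds the callback's output through the pipeline by hand."""
    # Made-up sample data, for illustration only.
    team = TeamStats(team="Example Team", division="Example Division")
    rows = [
        {"name": "Player A", "position": "G"},
        {"name": "Player B", "position": "C"},
    ]
    pipeline = SqlitePipeline()
    pipeline.open_spider(spider=None)
    for item in parse_roster(team, rows):
        pipeline.process_item(item, spider=None)
    player_count = pipeline.conn.execute("SELECT COUNT(*) FROM players").fetchone()[0]
    team_count = pipeline.conn.execute("SELECT COUNT(*) FROM teams").fetchone()[0]
    pipeline.close_spider(spider=None)
    return player_count, team_count
```

The point of the sketch is that the Player item never needs to be declared anywhere the pipeline can "see" it: every object the callback yields is handed to process_item(), whichever function created it. In a real project Scrapy makes these calls itself once the pipeline is enabled through the ITEM_PIPELINES setting; run_demo() only imitates that hand-off.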


Source: Scrapy item extraction scope issue