How to collect data from multiple pages into a single item

Posted 2019-03-15 23:39

I am trying to scrape data from a site. The data is structured as multiple objects, each with a set of fields. For example, people with names, ages, and occupations.

My problem is that this data is split across two levels of the website.
The first page is, say, a list of names and ages with a link to each person's profile page.
The profile page lists that person's occupation.

I already have a Scrapy spider in Python that collects the data from the top layer and crawls through multiple pages of the listing.
But how can I collect the data from the inner pages while keeping it linked to the correct object?

Currently, I want the JSON output structured as

    [{"name": "name", "age": "age", "occupation": "occupation"},
     {"name": "name", "age": "age", "occupation": "occupation"}, ...]

Can the parse function reach across pages like that?

1 Answer
小情绪 Triste · 2019-03-16 00:00

Here is one way to deal with this: only yield/return the item once it has all of its attributes. Chain the requests and carry the partially filled item along in the request's meta:

yield Request(page1,
              callback=self.page1_data)

def page1_data(self, response):
    i = TestItem()
    # extract the fields available on this page,
    # e.g. with response.xpath(...)
    i['name'] = 'name'
    i['age'] = 'age'
    url_profile_page = 'url to the profile page'

    # carry the partially filled item along in the request's meta
    yield Request(url_profile_page,
                  meta={'item': i},
                  callback=self.profile_page)

def profile_page(self, response):
    old_item = response.request.meta['item']
    # parse the other fields (e.g. occupation)
    # and assign them to old_item

    yield old_item
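To see the data flow on its own, here is a plain-Python sketch of the same hand-off, with no Scrapy dependency and with made-up page data: each "request" is just a URL plus the partially built item, standing in for `Request(url, meta={'item': i}, callback=...)`, and the canned dicts stand in for the two page levels.

```python
# Hypothetical canned responses standing in for the two page levels.
LIST_PAGE = {"people": [{"name": "Alice", "age": 30, "profile": "/alice"},
                        {"name": "Bob", "age": 25, "profile": "/bob"}]}
PROFILES = {"/alice": {"occupation": "engineer"},
            "/bob": {"occupation": "teacher"}}

def parse_list(response):
    # first callback: fill in what the listing page knows,
    # then hand the partial item off with the follow-up "request"
    for person in response["people"]:
        item = {"name": person["name"], "age": person["age"]}
        yield (person["profile"], item)

def parse_profile(response, item):
    # second callback: complete the item, then emit it
    item["occupation"] = response["occupation"]
    yield item

def crawl():
    items = []
    for url, partial in parse_list(LIST_PAGE):
        # the scheduler would fetch url; here we look the response up directly
        for finished in parse_profile(PROFILES[url], partial):
            items.append(finished)
    return items

print(crawl())
```

Each final item keeps the name/age from the listing page linked to the occupation from its own profile page, which is exactly what the meta dictionary buys you in the real spider.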