How to collect data from multiple pages into a single item

Posted 2019-03-15 23:39

I am trying to scrape data from a site. The data is structured as multiple objects, each with a set of fields. For example, people with names, ages, and occupations.

My problem is that this data is split across two levels of the website.
The first page is, say, a list of names and ages with a link to each person's profile page.
The profile page lists that person's occupation.

I already have a Scrapy spider in Python that collects the data from the top layer and crawls through multiple pages of the listing.
But how can I collect the data from the inner pages while keeping it linked to the correct object?

Currently, I want the JSON output structured as

    [{"name": "name", "age": "age", "occupation": "occupation"},
     {"name": "name", "age": "age", "occupation": "occupation"}, ...]

Can the parse function reach across pages like that?

1 Answer
小情绪 Triste · 2019-03-16 00:00

Here is one way to deal with this: only yield/return the item once it has all of its attributes. Chain the requests and carry the partially filled item along in the request's meta:

yield Request(page1,
              callback=self.page1_data)

def page1_data(self, response):
    i = TestItem()
    # extract the fields available on this page,
    # e.g. with response.xpath(...)
    i['name'] = 'name'
    i['age'] = 'age'
    url_profile_page = 'url to the profile page'

    # carry the partially filled item along in the request's meta
    yield Request(url_profile_page,
                  meta={'item': i},
                  callback=self.profile_page)

def profile_page(self, response):
    old_item = response.request.meta['item']
    # parse the other fields (e.g. occupation)
    # and assign them to old_item

    yield old_item
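To see the data flow on its own, here is a plain-Python sketch of the same hand-off, with no Scrapy dependency and with made-up page data: each "request" is just a URL plus the partially built item, standing in for `Request(url, meta={'item': i}, callback=...)`, and the canned dicts stand in for the two page levels.

```python
# Hypothetical canned responses standing in for the two page levels.
LIST_PAGE = {"people": [{"name": "Alice", "age": 30, "profile": "/alice"},
                        {"name": "Bob", "age": 25, "profile": "/bob"}]}
PROFILES = {"/alice": {"occupation": "engineer"},
            "/bob": {"occupation": "teacher"}}

def parse_list(response):
    # first callback: fill in what the listing page knows,
    # then hand the partial item off with the follow-up "request"
    for person in response["people"]:
        item = {"name": person["name"], "age": person["age"]}
        yield (person["profile"], item)

def parse_profile(response, item):
    # second callback: complete the item, then emit it
    item["occupation"] = response["occupation"]
    yield item

def crawl():
    items = []
    for url, partial in parse_list(LIST_PAGE):
        # the scheduler would fetch url; here we look the response up directly
        for finished in parse_profile(PROFILES[url], partial):
            items.append(finished)
    return items

print(crawl())
```

Each final item keeps the name/age from the listing page linked to the occupation from its own profile page, which is exactly what the meta dictionary buys you in the real spider.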