I am trying to scrape data from a site. The data is structured as multiple objects, each with a set of fields. For example, people with names, ages, and occupations.
My problem is that this data is split across two levels in the website.
The first page is, say, a list of names and ages with a link to each person's profile page.
Their profile page lists their occupation.
I already have a spider written with Scrapy in Python which can collect the data from the top layer and crawl through multiple pages of results.
But, how can I collect the data from the inner pages while keeping it linked to the appropriate object?
Currently, I have the output structured as JSON, like this:
[{"name": "name", "age": "age", "occupation": "occupation"},
 {"name": "name", "age": "age", "occupation": "occupation"}] etc.
Can the parse function reach across pages like that?
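Here is one way to deal with this: yield (or return) each item only once, when it has all of its attributes. The first callback passes the partially filled item along with the request for the profile page, and the callback for that page adds the occupation and yields the completed item.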