I have this situation:
I want to crawl product details starting from a product detail page that describes the product (Page A). This page contains a link to a page that lists the sellers of this product (Page B), and each seller there has a link to another page (Page C) that contains the seller's details. Here is an example schema:
Page A:
- product_name
- link to sellers of this product (Page B)
Page B:
- list of sellers, each one containing:
  - seller_name
  - selling_price
  - link to the seller details page (Page C)
Page C:
- seller_address
This is the JSON I want to obtain after crawling:
{
    "product_name": "product1",
    "sellers": [
        {
            "seller_name": "seller1",
            "seller_price": 100,
            "seller_address": "address1"
        },
        (...)
    ]
}
What I have tried: passing the product information from the first parse method to the second parse method in the meta object. This works fine with 2 levels, but I have 3, and I want a single item.
Is this possible in Scrapy?
EDIT:
As requested, here is a minified example of what I am trying to do. I know it won't work as expected, but I cannot figure out how to make it return only one composed object:
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'examplespider'
    allowed_domains = ["example.com"]
    start_urls = [
        'http://example.com/products/product1'
    ]

    def parse(self, response):
        # assume this object was obtained after
        # some xpath processing
        product_name = 'product1'
        link_to_sellers = 'http://example.com/products/product1/sellers'

        yield scrapy.Request(link_to_sellers, callback=self.parse_sellers, meta={
            'product': {
                'product_name': product_name,
                'sellers': []
            }
        })

    def parse_sellers(self, response):
        product = response.meta['product']

        # assume this object was obtained after
        # some xpath processing
        sellers = [
            {
                'seller_name': 'seller1',
                'seller_price': 100,
                'seller_detail_url': 'http://example.com/sellers/seller1',
            },
            {
                'seller_name': 'seller2',
                'seller_price': 100,
                'seller_detail_url': 'http://example.com/sellers/seller2',
            },
            {
                'seller_name': 'seller3',
                'seller_price': 100,
                'seller_detail_url': 'http://example.com/sellers/seller3',
            }
        ]

        for seller in sellers:
            product['sellers'].append(seller)
            yield scrapy.Request(seller['seller_detail_url'], callback=self.parse_seller,
                                 meta={'seller': seller})

    def parse_seller(self, response):
        seller = response.meta['seller']

        # assume this object was obtained after
        # some xpath processing
        seller_address = 'seller_address1'
        seller['seller_address'] = seller_address

        yield seller
I think a pipeline could help.

Assuming the yielded seller item is in the following format (which can be done by some trivial modification of the code):
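(The exact keys below are an assumption mirroring the JSON from the question; the point is that each yielded item also carries the product_name it belongs to.)

{
    'product_name': 'product1',
    'seller': {
        'seller_name': 'seller1',
        'seller_price': 100,
        'seller_address': 'address1',
    },
}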
A pipeline like the following will collect sellers by their product_name and export them to a file named 'items.jl' after crawling (note this is just a sketch of the idea, so it is not guaranteed to work):
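(A minimal sketch, assuming the item format shown above; the class name and the plain-dict bookkeeping are illustrative.)

import json
from collections import defaultdict


class CollectSellersPipeline(object):

    def open_spider(self, spider):
        # product_name -> list of seller dicts seen so far
        self.products = defaultdict(list)

    def process_item(self, item, spider):
        # Group every incoming seller under its product.
        self.products[item['product_name']].append(item['seller'])
        return item

    def close_spider(self, spider):
        # Write one composed object per product, one JSON object per line.
        with open('items.jl', 'w') as f:
            for product_name, sellers in self.products.items():
                composed = {'product_name': product_name, 'sellers': sellers}
                f.write(json.dumps(composed) + '\n')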
By the way, you need to modify your settings.py to make the pipeline effective, as described in the Scrapy documentation.
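For instance, something along these lines (the 'myproject.pipelines' path is a placeholder for wherever the pipeline class actually lives in your project):

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.CollectSellersPipeline': 300,
}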
You need to change your logic a bit, so that it queries only one seller's address at a time, and only once that request completes do you query the next seller.
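A minimal sketch of that idea, reworking the spider from the question (the request_next_seller helper name and the hard-coded address are illustrative; parse() stays as in the question):

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'examplespider'
    # allowed_domains, start_urls and parse() as in the question

    def parse_sellers(self, response):
        product = response.meta['product']
        # ... build product['sellers'] exactly as before (via xpath) ...
        return self.request_next_seller(product, index=0)

    def request_next_seller(self, product, index):
        # Fetch one seller's detail page at a time; parse_seller advances
        # to the next seller once this one has completed.
        if index < len(product['sellers']):
            seller = product['sellers'][index]
            return scrapy.Request(
                seller['seller_detail_url'],
                callback=self.parse_seller,
                meta={'product': product, 'index': index})
        # No sellers left: the single composed item is complete.
        return product

    def parse_seller(self, response):
        product = response.meta['product']
        index = response.meta['index']
        # assume this was obtained after some xpath processing
        product['sellers'][index]['seller_address'] = 'seller_address1'
        return self.request_next_seller(product, index + 1)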
But this is still not a great approach: a seller may be selling multiple items, so when you reach another item sold by the same seller, your request for the seller's address will get rejected by the dupe filter. You can fix that by adding dont_filter=True to the request, but that would mean too many unnecessary hits to the website. So you need to add DB handling directly in your code to check whether you already have a seller's details: if yes, use them; if not, fetch them.
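As an illustration of that idea only, here is an in-memory cache on the spider, fitted into the chained sketch above (a real database lookup would take the place of the dict, and seller_cache is a name introduced just for this sketch):

# inside ExampleSpider, replacing the two methods from the sketch above
seller_cache = {}  # seller_detail_url -> seller_address

def request_next_seller(self, product, index):
    while index < len(product['sellers']):
        seller = product['sellers'][index]
        cached = self.seller_cache.get(seller['seller_detail_url'])
        if cached is None:
            # Not seen yet: fetch the detail page once.
            return scrapy.Request(
                seller['seller_detail_url'],
                callback=self.parse_seller,
                meta={'product': product, 'index': index})
        # Already known: reuse the address, no extra request needed.
        seller['seller_address'] = cached
        index += 1
    return product

def parse_seller(self, response):
    product = response.meta['product']
    index = response.meta['index']
    seller_address = 'seller_address1'  # assume xpath processing
    self.seller_cache[response.url] = seller_address
    product['sellers'][index]['seller_address'] = seller_address
    return self.request_next_seller(product, index + 1)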