How to get a single item across many sites in scrapy

Published 2019-06-12 17:07

I have this situation:

I want to crawl product details from a specific product detail page (Page A). This page contains a link to a page that lists the sellers of the product (Page B), and each seller there links to another page (Page C) containing the seller's details. Here is an example schema:

Page A:

  • product_name
  • link to sellers of this product (Page B)

Page B:

  • list of sellers, each one containing:
    • seller_name
  • seller_price
    • link to the seller details page (Page C)

Page C:

  • seller_address

This is the JSON I want to obtain after crawling:

{
  "product_name": "product1",
  "sellers": [
    {
      "seller_name": "seller1",
      "seller_price": 100,
      "seller_address": "address1",
    },
    (...)
  ]
}

What I have tried: passing the product information from the parse method to the second parse method in the meta object. This works fine across 2 levels, but I have 3, and I want a single item.

Is this possible in scrapy?

EDIT:

As requested, here is a minimal example of what I am trying to do. I know it won't work as expected, but I cannot figure out how to make it return only one composed object:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'examplespider'
    allowed_domains = ["example.com"]

    start_urls = [
        'http://example.com/products/product1'
    ]

    def parse(self, response):

        # assume this object was obtained after
        # some xpath processing
        product_name = 'product1'
        link_to_sellers = 'http://example.com/products/product1/sellers'

        yield scrapy.Request(link_to_sellers, callback=self.parse_sellers, meta={
            'product': {
                'product_name': product_name,
                'sellers': []
            }
        })

    def parse_sellers(self, response):
        product = response.meta['product']

        # assume this object was obtained after
        # some xpath processing
        sellers = [
            {
                'seller_name': 'seller1',
                'seller_price': 100,
                'seller_detail_url': 'http://example.com/sellers/seller1',
            },
            {
                'seller_name': 'seller2',
                'seller_price': 100,
                'seller_detail_url': 'http://example.com/sellers/seller2',
            },
            {
                'seller_name': 'seller3',
                'seller_price': 100,
                'seller_detail_url': 'http://example.com/sellers/seller3',
            }
        ]

        for seller in sellers:
            product['sellers'].append(seller)
            yield scrapy.Request(seller['seller_detail_url'], callback=self.parse_seller, meta={'seller': seller})

    def parse_seller(self, response):
        seller = response.meta['seller']

        # assume this object was obtained after
        # some xpath processing
        seller_address = 'seller_address1'

        seller['seller_address'] = seller_address

        yield seller

Tags: python scrapy
2 Answers
神经病院院长 · 2019-06-12 17:56

I think a pipeline could help.

Assuming the yielded seller is in the following format (which can be done with some trivial modification of the code):

seller = {
    'product_name': 'product1',
    'seller': {
        'seller_name': 'seller1',
        'seller_price': 100,
        'seller_address': 'address1',
    }
}
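
For instance, parse_seller in the spider above could be changed along these lines (a minimal sketch; it assumes product_name is also carried through meta from the earlier callbacks):

def parse_seller(self, response):
    seller = response.meta['seller']

    # assume this was obtained after some xpath processing
    seller['seller_address'] = 'seller_address1'

    # drop the helper URL and yield one flat item per seller;
    # the pipeline below regroups these by product_name
    seller.pop('seller_detail_url', None)
    yield {
        'product_name': response.meta['product_name'],
        'seller': seller,
    }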

A pipeline like the following will collect sellers by their product_name and export them to a file named 'items.jl' after crawling. (Note: this is just a sketch of the idea, so it is not guaranteed to work.)

import json


class CollectorPipeline(object):

    def __init__(self):
        self.collection = {}

    def open_spider(self, spider):
        self.collection = {}

    def close_spider(self, spider):
        # dump one composed product per line once the crawl finishes
        with open("items.jl", "w") as fp:
            for _, product in self.collection.items():
                fp.write(json.dumps(product))
                fp.write("\n")

    def process_item(self, item, spider):
        # group incoming seller items under their product_name
        product = self.collection.get(item["product_name"], dict())
        product["product_name"] = item["product_name"]
        sellers = product.get("sellers", list())
        sellers.append(item["seller"])
        product["sellers"] = sellers
        self.collection[item["product_name"]] = product

        return item

BTW, you need to modify your settings.py to make the pipeline effective, as described in the Scrapy documentation.
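
For example, enabling it in settings.py looks roughly like this (the module path is a placeholder for your own project):

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.CollectorPipeline': 300,
}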

狗以群分 · 2019-06-12 18:01

You need to change your logic a bit, so that it queries only one seller's address at a time, and only once that completes does it query the next seller.

def parse_sellers(self, response):
    meta = response.meta

    # assume this object was obtained after
    # some xpath processing
    sellers = [
        {
            'seller_name': 'seller1',
            'seller_price': 100,
            'seller_detail_url': 'http://example.com/sellers/seller1',
        },
        {
            'seller_name': 'seller2',
            'seller_price': 100,
            'seller_detail_url': 'http://example.com/sellers/seller2',
        },
        {
            'seller_name': 'seller3',
            'seller_price': 100,
            'seller_detail_url': 'http://example.com/sellers/seller3',
        }
    ]

    if sellers:
        current_seller = sellers.pop()
        meta['pending_sellers'] = sellers
        meta['current_seller'] = current_seller
        yield scrapy.Request(current_seller['seller_detail_url'], callback=self.parse_seller, meta=meta)
    else:
        # nothing to fetch; emit the composed product carried in meta
        yield meta['product']


    # for seller in sellers:
    #     product['sellers'].append(seller)
    #     yield scrapy.Request(seller['seller_detail_url'], callback=self.parse_seller, meta={'seller': seller})

def parse_seller(self, response):
    meta = response.meta
    current_seller = meta['current_seller']
    sellers = meta['pending_sellers']
    # assume this object was obtained after
    # some xpath processing
    seller_address = 'seller_address1'

    current_seller['seller_address'] = seller_address

    meta['product']['sellers'].append(current_seller)
    if sellers:
        current_seller = sellers.pop()
        meta['pending_sellers'] = sellers
        meta['current_seller'] = current_seller

        yield scrapy.Request(current_seller['seller_detail_url'], callback=self.parse_seller, meta=meta)
    else:
        yield meta['product']

But this is still not a great approach, the reason being that a seller may be selling multiple products. When you reach another product sold by the same seller, your request for the seller's address will get rejected by the dupe filter. You can fix that by adding dont_filter=True to the request, but that would mean too many unnecessary hits to the website.
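
For completeness, that variant is just an extra keyword argument on the request (not recommended here, for the reason above):

yield scrapy.Request(current_seller['seller_detail_url'], callback=self.parse_seller,
                     meta=meta, dont_filter=True)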

So you need to add DB handling directly in the code to check whether you already have a seller's details: if yes, use them; if not, fetch them.
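
A minimal sketch of that idea, using an in-memory dict on the spider keyed by the seller detail URL (a real crawl would back this with a database; next_seller is a hypothetical helper, and parse_sellers would seed pending_sellers and call it the same way):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'examplespider'

    # seller_detail_url -> seller_address; swap for a DB lookup in practice
    seller_cache = {}

    def next_seller(self, meta):
        # advance the pending_sellers chain, skipping sellers whose
        # address is already cached
        while meta['pending_sellers']:
            seller = meta['pending_sellers'].pop()
            url = seller['seller_detail_url']
            if url in self.seller_cache:
                seller['seller_address'] = self.seller_cache[url]
                meta['product']['sellers'].append(seller)
            else:
                meta['current_seller'] = seller
                return scrapy.Request(url, callback=self.parse_seller, meta=meta)
        return None  # nothing left to fetch

    def parse_seller(self, response):
        meta = response.meta
        seller = meta['current_seller']

        # assume this was obtained after some xpath processing
        seller['seller_address'] = 'seller_address1'

        self.seller_cache[seller['seller_detail_url']] = seller['seller_address']
        meta['product']['sellers'].append(seller)

        request = self.next_seller(meta)
        if request is not None:
            yield request
        else:
            yield meta['product']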
