How to get a single item across many sites in scrapy

Published 2019-06-12 17:07

I have this situation:

I want to crawl product details from a specific product detail page (Page A). This page contains a link to a page that lists the sellers of the product (Page B), and each seller there links to another page (Page C) containing the seller's details. Here is an example schema:

Page A:

  • product_name
  • link to sellers of this product (Page B)

Page B:

  • list of sellers, each one containing:
    • seller_name
  • seller_price
    • link to the seller details page (Page C)

Page C:

  • seller_address

This is the JSON I want to obtain after crawling:

{
  "product_name": "product1",
  "sellers": [
    {
      "seller_name": "seller1",
      "seller_price": 100,
      "seller_address": "address1",
    },
    (...)
  ]
}

What I have tried: passing the product information from the parse method to the second parse method in the meta object. This works fine across 2 levels, but I have 3, and I want a single item.

Is this possible in scrapy?

EDIT:

As requested, here is a minimal example of what I am trying to do. I know it won't work as expected, but I cannot figure out how to make it return only one composed object:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'examplespider'
    allowed_domains = ["example.com"]

    start_urls = [
        'http://example.com/products/product1'
    ]

    def parse(self, response):

        # assume this object was obtained after
        # some xpath processing
        product_name = 'product1'
        link_to_sellers = 'http://example.com/products/product1/sellers'

        yield scrapy.Request(link_to_sellers, callback=self.parse_sellers, meta={
            'product': {
                'product_name': product_name,
                'sellers': []
            }
        })

    def parse_sellers(self, response):
        product = response.meta['product']

        # assume this object was obtained after
        # some xpath processing
        sellers = [
            {
                'seller_name': 'seller1',
                'seller_price': 100,
                'seller_detail_url': 'http://example.com/sellers/seller1',
            },
            {
                'seller_name': 'seller2',
                'seller_price': 100,
                'seller_detail_url': 'http://example.com/sellers/seller2',
            },
            {
                'seller_name': 'seller3',
                'seller_price': 100,
                'seller_detail_url': 'http://example.com/sellers/seller3',
            }
        ]

        for seller in sellers:
            product['sellers'].append(seller)
            yield scrapy.Request(seller['seller_detail_url'], callback=self.parse_seller, meta={'seller': seller})

    def parse_seller(self, response):
        seller = response.meta['seller']

        # assume this object was obtained after
        # some xpath processing
        seller_address = 'seller_address1'

        seller['seller_address'] = seller_address

        yield seller

Tags: python scrapy
2 Answers
神经病院院长 · 2019-06-12 17:56

I think a pipeline could help.

Assuming the yielded seller is in the following format (which can be done with some trivial modification of the code):

seller = {
    'product_name': 'product1',
    'seller': {
        'seller_name': 'seller1',
        'seller_price': 100,
        'seller_address': 'address1',
    }
}
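
For instance, parse_seller in the spider above could be changed along these lines (a minimal sketch; it assumes product_name is also carried through meta from the earlier callbacks):

def parse_seller(self, response):
    seller = response.meta['seller']

    # assume this was obtained after some xpath processing
    seller['seller_address'] = 'seller_address1'

    # drop the helper URL and yield one flat item per seller;
    # the pipeline below regroups these by product_name
    seller.pop('seller_detail_url', None)
    yield {
        'product_name': response.meta['product_name'],
        'seller': seller,
    }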

A pipeline like the following will collect sellers by their product_name and export them to a file named 'items.jl' after crawling. (Note: this is just a sketch of the idea, so it is not guaranteed to work.)

import json


class CollectorPipeline(object):

    def __init__(self):
        self.collection = {}

    def open_spider(self, spider):
        self.collection = {}

    def close_spider(self, spider):
        # dump one composed product per line once the crawl finishes
        with open("items.jl", "w") as fp:
            for _, product in self.collection.items():
                fp.write(json.dumps(product))
                fp.write("\n")

    def process_item(self, item, spider):
        # group incoming seller items under their product_name
        product = self.collection.get(item["product_name"], dict())
        product["product_name"] = item["product_name"]
        sellers = product.get("sellers", list())
        sellers.append(item["seller"])
        product["sellers"] = sellers
        self.collection[item["product_name"]] = product

        return item

BTW, you need to modify your settings.py to make the pipeline effective, as described in the Scrapy documentation.
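
For example, enabling it in settings.py looks roughly like this (the module path is a placeholder for your own project):

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.CollectorPipeline': 300,
}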

狗以群分 · 2019-06-12 18:01

You need to change your logic a bit, so that it queries only one seller's address at a time, and only once that completes does it query the next seller.

def parse_sellers(self, response):
    meta = response.meta

    # assume this object was obtained after
    # some xpath processing
    sellers = [
        {
            'seller_name': 'seller1',
            'seller_price': 100,
            'seller_detail_url': 'http://example.com/sellers/seller1',
        },
        {
            'seller_name': 'seller2',
            'seller_price': 100,
            'seller_detail_url': 'http://example.com/sellers/seller2',
        },
        {
            'seller_name': 'seller3',
            'seller_price': 100,
            'seller_detail_url': 'http://example.com/sellers/seller3',
        }
    ]

    if sellers:
        current_seller = sellers.pop()
        meta['pending_sellers'] = sellers
        meta['current_seller'] = current_seller
        yield scrapy.Request(current_seller['seller_detail_url'], callback=self.parse_seller, meta=meta)
    else:
        # nothing to fetch; emit the composed product carried in meta
        yield meta['product']


    # for seller in sellers:
    #     product['sellers'].append(seller)
    #     yield scrapy.Request(seller['seller_detail_url'], callback=self.parse_seller, meta={'seller': seller})

def parse_seller(self, response):
    meta = response.meta
    current_seller = meta['current_seller']
    sellers = meta['pending_sellers']
    # assume this object was obtained after
    # some xpath processing
    seller_address = 'seller_address1'

    current_seller['seller_address'] = seller_address

    meta['product']['sellers'].append(current_seller)
    if sellers:
        current_seller = sellers.pop()
        meta['pending_sellers'] = sellers
        meta['current_seller'] = current_seller

        yield scrapy.Request(current_seller['seller_detail_url'], callback=self.parse_seller, meta=meta)
    else:
        yield meta['product']

But this is still not a great approach, the reason being that a seller may be selling multiple products. When you reach another product sold by the same seller, your request for the seller's address will get rejected by the dupe filter. You can fix that by adding dont_filter=True to the request, but that would mean too many unnecessary hits to the website.
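
For completeness, that variant is just an extra keyword argument on the request (not recommended here, for the reason above):

yield scrapy.Request(current_seller['seller_detail_url'], callback=self.parse_seller,
                     meta=meta, dont_filter=True)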

So you need to add DB handling directly in the code to check whether you already have a seller's details: if yes, use them; if not, fetch them.
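
A minimal sketch of that idea, using an in-memory dict on the spider keyed by the seller detail URL (a real crawl would back this with a database; next_seller is a hypothetical helper, and parse_sellers would seed pending_sellers and call it the same way):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'examplespider'

    # seller_detail_url -> seller_address; swap for a DB lookup in practice
    seller_cache = {}

    def next_seller(self, meta):
        # advance the pending_sellers chain, skipping sellers whose
        # address is already cached
        while meta['pending_sellers']:
            seller = meta['pending_sellers'].pop()
            url = seller['seller_detail_url']
            if url in self.seller_cache:
                seller['seller_address'] = self.seller_cache[url]
                meta['product']['sellers'].append(seller)
            else:
                meta['current_seller'] = seller
                return scrapy.Request(url, callback=self.parse_seller, meta=meta)
        return None  # nothing left to fetch

    def parse_seller(self, response):
        meta = response.meta
        seller = meta['current_seller']

        # assume this was obtained after some xpath processing
        seller['seller_address'] = 'seller_address1'

        self.seller_cache[seller['seller_detail_url']] = seller['seller_address']
        meta['product']['sellers'].append(seller)

        request = self.next_seller(meta)
        if request is not None:
            yield request
        else:
            yield meta['product']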
