Scrapy: how to use items in a spider and how to send items to pipelines


Question:

I am new to scrapy and my task is simple:

For a given e-commerce website:

  • crawl all website pages

  • look for product pages

  • if the URL points to a product page

  • Create an Item

  • Process the item to store it in a database

I created the spider, but the products are just printed to a plain file.

My question is about the project structure: how do I use items in the spider, and how do I send items to pipelines?

I can't find a simple example of a project using items and pipelines.

Answer 1:

  • How to use items in my spider?

Well, the main purpose of items is to store the data you crawled. scrapy.Item objects are basically dictionaries. To declare your items, you have to create a class and add scrapy.Field attributes to it:

import scrapy

class Product(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()

You can now use it in your spider by importing your Product.
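
Because an Item behaves like a dictionary, you can fill and read it with the usual dict syntax. Here is a minimal sketch using the Product class above (the import path and the values are assumptions):

from myproject.items import Product  # assumption: the project package is named myproject

product = Product()
product['url'] = 'http://www.example.com/product/1'  # hypothetical URL
product['title'] = 'Sample product'
print(product['title'])  # dict-style access
print(dict(product))     # an Item converts to a plain dict
# product['price'] = 10  # would raise KeyError: only declared Fields are allowed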

For more advanced usage, check the Items page in the Scrapy documentation.

  • How to send items to the pipeline?

First, you need to tell your spider to use your custom pipeline.

In the settings.py file:

ITEM_PIPELINES = {
    'myproject.pipelines.CustomPipeline': 300,
}
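
The number is an order value in the 0-1000 range: pipelines with lower values run first, so you can chain several of them. A sketch (the second pipeline name below is hypothetical):

ITEM_PIPELINES = {
    'myproject.pipelines.CustomPipeline': 300,   # runs first
    'myproject.pipelines.StoragePipeline': 800,  # hypothetical: runs after CustomPipeline
}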

You can now write your pipeline and play with your item.

In the pipelines.py file:

from scrapy.exceptions import DropItem

class CustomPipeline(object):
    def __init__(self):
        # Create your database connection here
        self.connection = None

    def process_item(self, item, spider):
        # Here you can index your item; drop invalid ones
        if not item.get('title'):
            raise DropItem('Missing title in %s' % item)
        return item
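
To match the "store it in a database" part of the question, here is a minimal sketch of a pipeline writing items into SQLite. The open_spider and close_spider hooks are standard pipeline methods; the file name and table layout are assumptions:

import sqlite3

class SQLitePipeline(object):
    def open_spider(self, spider):
        # Called once when the spider opens: set up the connection
        self.connection = sqlite3.connect('products.db')  # hypothetical file name
        self.connection.execute(
            'CREATE TABLE IF NOT EXISTS products (url TEXT, title TEXT)'
        )

    def close_spider(self, spider):
        # Called once when the spider closes: commit and clean up
        self.connection.commit()
        self.connection.close()

    def process_item(self, item, spider):
        # Insert one row per crawled product
        self.connection.execute(
            'INSERT INTO products (url, title) VALUES (?, ?)',
            (item['url'], item['title'])
        )
        return item

Register it in ITEM_PIPELINES the same way as CustomPipeline above.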

Finally, in your spider, you need to yield your item once it is filled.

spider.py example:

import scrapy
from myproject.items import Product

class MySpider(scrapy.Spider):
    name = "test"
    start_urls = [
        'http://www.example.com',
    ]

    def parse(self, response):
        doc = Product()
        doc['url'] = response.url
        doc['title'] = response.xpath('//div/p/text()').get()
        yield doc  # Will go through your pipelines
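
Since the original task is to crawl the whole site and only build items on product pages, a sketch using Scrapy's CrawlSpider and LinkExtractor may also help. The domain and the /product/ URL pattern are assumptions about the target site:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from myproject.items import Product

class ProductSpider(CrawlSpider):
    name = 'products'
    allowed_domains = ['example.com']  # assumption
    start_urls = ['http://www.example.com']

    rules = (
        # URLs matching /product/ are parsed as products...
        Rule(LinkExtractor(allow=r'/product/'), callback='parse_product'),
        # ...every other internal link is just followed
        Rule(LinkExtractor(), follow=True),
    )

    def parse_product(self, response):
        doc = Product()
        doc['url'] = response.url
        doc['title'] = response.xpath('//h1/text()').get()  # assumption: title lives in an <h1>
        yield doc  # goes through ITEM_PIPELINES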

Hope this helps; here is the documentation for pipelines: Item Pipeline