I am new to Scrapy and my task is simple: for a given e-commerce website, scrape its products.
I created the spider, but the products are just printed to a plain file.
My question is about the project structure: how do I use items in my spider, and how do I send items to pipelines? I can't find a simple example of a project using items and pipelines.
- How to use items in my spider?
Well, the main purpose of items is to store the data you crawled. scrapy.Item objects behave basically like dictionaries. To declare your items, create a class that subclasses scrapy.Item and add scrapy.Field attributes to it:
import scrapy

class Product(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
You can now use it in your spider by importing your Product class.
For more advanced information, check the docs here.
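As a quick illustration of the dictionary-like behavior, here is a minimal sketch (the field names come from the Product class above; the values are made up):

from myproject.items import Product

product = Product(url='http://www.exemple.com', title='Some title')
product['title'] = 'Another title'  # dict-style assignment works
print(product['url'])               # dict-style access works too
print(dict(product))                # convert to a plain dict if needed

Note that assigning to a field that was not declared on the item raises a KeyError, which helps catch typos early.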
- How to send items to pipelines?
First, you need to tell your spider to use your custom pipeline.
In the settings.py file:
ITEM_PIPELINES = {
    'myproject.pipelines.CustomPipeline': 300,
}
You can now write your pipeline and play with your item. The number (300 here) controls the order in which pipelines run: values usually range from 0 to 1000, and lower values run first.
In the pipelines.py file:
from scrapy.exceptions import DropItem

class CustomPipeline(object):
    def __init__(self):
        # Create your database connection here
        pass

    def process_item(self, item, spider):
        # Here you can index your item
        return item
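The DropItem import above hints at a common pattern: raising DropItem inside process_item discards the item so it never reaches the pipelines that run after this one. Here is a hedged sketch (the pipeline name and the 'title' requirement are just assumptions for illustration):

from scrapy.exceptions import DropItem

class RequiredFieldsPipeline(object):
    """Hypothetical pipeline that drops items missing a title."""
    def process_item(self, item, spider):
        if not item.get('title'):
            # Dropped items are logged by Scrapy and skipped by later pipelines
            raise DropItem('Missing title in %s' % item.get('url'))
        return item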
Finally, in your spider, you need to yield
your item once it is filled.
spider.py example:
import scrapy

from myproject.items import Product

class MySpider(scrapy.Spider):
    name = "test"
    start_urls = [
        'http://www.exemple.com',
    ]

    def parse(self, response):
        doc = Product()
        doc['url'] = response.url
        # extract_first() returns a string instead of a SelectorList
        doc['title'] = response.xpath('//div/p/text()').extract_first()
        yield doc  # Will go through your pipeline
Hope this helps. Here is the doc for pipelines: Item Pipeline
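To check the wiring end to end, you can run the spider from the project root; the -o flag is optional and simply dumps the yielded items to a file in addition to passing them through your pipeline:

scrapy crawl test -o products.json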