Portia/Scrapy - how to replace or add values to ou

2019-06-11 05:24发布

just 2 quick doubts:

1- I want my final JSON file to replace the text extract (for example text extracted is ADD TO CART but I want to change to IN STOCK in my final JSON. Is it possible?

2- I also would like to add some custom data to my final JSON file that is not in the website, for example "Store name"... so every product that I scrape will have the store name after it. Is it possible?

I am using both Portia and Scrapy so your suggestions are welcome in both platforms.

My Scrapy spider code is below:

import scrapy
from __future__ import absolute_import
from scrapy import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from scrapy.loader.processors import Identity
from scrapy.spiders import Rule
from ..utils.spiders import BasePortiaSpider
from ..utils.starturls import FeedGenerator, FragmentGenerator
from ..utils.processors import Item, Field, Text, Number, Price, Date, Url, 
Image, Regex
from ..items import PortiaItem


class Advent(BasePortiaSpider):
    name = "advent"
    allowed_domains = [u'www.adventgames.com.au']
    start_urls = [u'http://www.adventgames.com.au/c/4504822/1/all-games-a---k.html',
                  {u'url': u'http://www.adventgames.com.au/Listing/Category/?categoryId=4504822&page=[1-5]',
                   u'fragments': [{u'valid': True,
                                   u'type': u'fixed',
                                   u'value': u'http://www.adventgames.com.au/Listing/Category/?categoryId=4504822&page='},
                                  {u'valid': True,
                                   u'type': u'range',
                                   u'value': u'1-5'}],
                   u'type': u'generated'}]
    rules = [
        Rule(
            LinkExtractor(
                allow=('.*'),
                deny=()
            ),
            callback='parse_item',
            follow=True
        )
    ]
    items = [
        [
            Item(
                PortiaItem,
                None,
                u'.DataViewCell > form > table',
                [
                    Field(
                        u'Title',
                        'tr:nth-child(1) > td > .DataViewItemProductTitle > a *::text',
                        []),
                    Field(
                        u'Price',
                        'tr:nth-child(1) > td > .DataViewItemOurPrice *::text',
                        []),
                    Field(
                        u'Img_src',
                        'tr:nth-child(1) > td > .DataViewItemThumbnailImage > div > a > img::attr(src)',
                        []),
                    Field(
                        u'URL',
                        'tr:nth-child(1) > td > .DataViewItemProductTitle > a::attr(href)',
                        []),
                    Field(
                        u'Stock',
                        'tr:nth-child(2) > td > .DataViewItemAddToCart > .wButton::attr(value)',
                        [])])]]

1条回答
来,给爷笑一个
2楼-- · 2019-06-11 05:37

I have never used the items class variable, it looks very unreadable and difficult to understand.

I would suggest you to have a callback method and parse it like this

def my_callback_func(self, response):

    myitem = PortiaItem()


    for item in response.css(".DataViewCell > form > table"):

        item['Title'] = item.css('tr:nth-child(1) > td > .DataViewItemProductTitle > a *::text').extract_first()

        item['Stock'] = item.css('tr:nth-child(2) > td > .DataViewItemAddToCart > .wButton::attr(value)').extract_first()

        if item['Stock'] == "ADD TO CART":

            item['is_available'] = "YES"

        ...... and so on

        yield item
查看更多
登录 后发表回答