Stripping additional items in XPath

I'm trying to scrape the items from this website.

The items are Brand, Model and Price. Because of the complexity of the page structure, the spider uses two XPath selectors.

The Brand and Model items come from one XPath, and the Price comes from a different XPath. I'm using the ( | ) operator, as @har07 suggested. The XPaths were tested individually for each item, and they worked and extracted the needed items correctly. However, after joining the two XPaths, the price item started picking up additional items, such as commas, and the prices aren't matched with the Brand/Model items when outputting to CSV.

This is how the parse fragment of the spider looks:

def parse(self, response):
    sel = Selector(response)
    # Union of two node sets: the brand/model cells and the price spans
    titles = sel.xpath('//table[@border="0"]//td[@class="compact"] | //table[@border="0"]//td[@class="cl-price-cont"]//span[4]')

    items = []
    for t in titles:
        item = AltaItem()
        # Raw strings keep the regex escapes (\w, \s) intact
        item["brand"] = t.xpath('div[@class="cl-prod-name"]/a/text()').re(r'^([\w\-]+)')
        item["model"] = t.xpath('div[@class="cl-prod-name"]/a/text()').re(r'\s+(.*)$')
        item["price"] = t.xpath('text()').extract()

        items.append(item)

    return items

And this is what the CSV looks like after scraping:

[screenshot of the CSV output]

Any suggestions on how to fix this?

Thank you.

Tags: python regex xpath scrapy

1 answer

走好不送

#2 · 2019-05-10 19:11

Basically, the issue is caused by your titles XPath. It goes too deep into the tree, to the point where you need to join two XPaths in order to scrape both the brand/model fields and the price field. The union then returns each name cell and each price span as a separate node, so every pass through the loop sees either a brand/model or a price, never both.

Changing the titles XPath to a single XPath that selects the repeating element containing both brand/model and price (the table row), and adjusting the brand, model and price XPaths accordingly, means that you no longer get mismatches where the brand and model are in one item and the price is in the next item.

def parse(self, response):
    sel = Selector(response)
    # One node per product row, so brand, model and price stay together
    titles = sel.xpath('//table[@class="table products cl"]//tr[@valign="middle"]')
    items = []
    for t in titles:
        item = AltaItem()
        # All three XPaths are relative to the current row
        item["brand"] = t.xpath('td[@class="compact"]/div[@class="cl-prod-name"]/a/text()').re(r'^([\w\-]+)')
        item["model"] = t.xpath('td[@class="compact"]/div[@class="cl-prod-name"]/a/text()').re(r'\s+(.*)$')
        item["price"] = t.xpath('td[@class="cl-price-cont"]//span[4]/text()').extract()
        items.append(item)
    return items
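
To see the row-scoped idea in isolation, here is a minimal sketch that runs without Scrapy. It uses the standard library's ElementTree in place of Scrapy's Selector, and the markup, the product names and the prices are all made up for illustration; the real page's structure is only assumed to resemble it. The helper name parse_rows is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; NOT the real page.
SAMPLE = """
<table class="table products cl">
  <tr valign="middle">
    <td class="compact"><div class="cl-prod-name"><a>Acme X-100 Pro</a></div></td>
    <td class="cl-price-cont"><span>a</span><span>b</span><span>c</span><span>199</span></td>
  </tr>
  <tr valign="middle">
    <td class="compact"><div class="cl-prod-name"><a>Globex Z5</a></div></td>
    <td class="cl-price-cont"><span>a</span><span>b</span><span>c</span><span>349</span></td>
  </tr>
</table>
"""

def parse_rows(markup):
    """Return one dict per product row, with every field read relative to that row."""
    root = ET.fromstring(markup)
    rows = []
    # Iterate over the repeating element (the row), not over the cells inside it
    for tr in root.findall('.//tr[@valign="middle"]'):
        name = tr.findtext(
            'td[@class="compact"]/div[@class="cl-prod-name"]/a', default=""
        ).strip()
        price = tr.findtext('td[@class="cl-price-cont"]/span[4]', default="").strip()
        # Same regexes as in the spider: first word is the brand, the rest is the model
        brand = re.match(r'^([\w\-]+)', name)
        model = re.search(r'\s+(.*)$', name)
        rows.append({
            "brand": brand.group(1) if brand else "",
            "model": model.group(1) if model else "",
            "price": price,
        })
    return rows

if __name__ == "__main__":
    for row in parse_rows(SAMPLE):
        print(row)
```

Because each dict is built from a single row, a price can only ever land next to the brand and model from that same row. Note that the fields here are plain strings, whereas .re() and .extract() in the Scrapy version return lists.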
