Stripping the additional items in XPath

Posted 2019-05-10 18:32

I'm trying to scrape the items from this website.

The items are Brand, Model and Price. Because of the complexity of the page structure, the spider uses two XPath selectors.

The Brand and Model items come from one XPath, and the Price comes from a different one. I'm joining them with the ( | ) operator, as @har07 suggested. I tested the XPaths individually for each item, and they worked and extracted the needed items correctly. However, after joining the two XPaths, the price item started picking up additional items, such as commas, and the prices are no longer matched with the Brand/Model items in the CSV output.

This is how the parse fragment of the spider looks:

def parse(self, response):
    sel = Selector(response)
    titles = sel.xpath('//table[@border="0"]//td[@class="compact"] | //table[@border="0"]//td[@class="cl-price-cont"]//span[4]')

    items = []
    for t in titles:
        item = AltaItem()
        item["brand"] = t.xpath('div[@class="cl-prod-name"]/a/text()').re(r'^([\w\-]+)')
        item["model"] = t.xpath('div[@class="cl-prod-name"]/a/text()').re(r'\s+(.*)$')
        item["price"] = t.xpath('text()').extract()

        items.append(item)

    return items

And this is what the CSV looks like after scraping:

[Screenshot of the CSV output; image not available]

Any suggestions on how to fix this?

Thank you.

1 answer
走好不送
#2 · 2019-05-10 19:11

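Basically, the issue is caused by your titles XPath. It goes too deep into the tree, to the point where you have to join two XPaths in order to scrape both the brand/model field and the price field. Each node matched by the joined XPath then becomes its own iteration of the loop, so a brand/model cell and its price never land in the same item.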
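To see why the joined XPath misaligns the columns, here is a minimal sketch using only the standard library. The markup is hypothetical and merely borrows the class names from the question. ElementTree does not support the `|` operator, so the union is emulated by walking the tree in document order, which is the order an XPath union returns its nodes in. This is an illustration under those assumptions, not the spider itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; only the class names are taken from the question.
HTML = """
<table border="0">
  <tr valign="middle">
    <td class="compact"><div class="cl-prod-name"><a>Acme X-100</a></div></td>
    <td class="cl-price-cont"><span>a</span><span>b</span><span>c</span><span>199</span></td>
  </tr>
  <tr valign="middle">
    <td class="compact"><div class="cl-prod-name"><a>Bolt Z9 Pro</a></div></td>
    <td class="cl-price-cont"><span>a</span><span>b</span><span>c</span><span>249</span></td>
  </tr>
</table>
"""

root = ET.fromstring(HTML.strip())

# The two node sets that the question joins with "|".
name_cells = root.findall(".//td[@class='compact']")
price_spans = root.findall(".//td[@class='cl-price-cont']/span[4]")

# Emulate the union: keep every node from either set, in document order.
wanted = {id(el) for el in name_cells + price_spans}
union = [el for el in root.iter() if id(el) in wanted]

rows = []
for node in union:
    # Only the <td class="compact"> nodes contain the product link ...
    link = node.find("div[@class='cl-prod-name']/a")
    name = link.text if link is not None else None
    # ... and only the <span> nodes carry the price as their own text.
    price = (node.text or "").strip() or None
    rows.append((name, price))

for row in rows:
    print(row)
```

Each product ends up spread over two loop iterations: one with a name and no price, the next with a price and no name. In the real page, `text()` on the `td` probably also returns whitespace-only text nodes, which would explain the stray commas once the list is written to CSV.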

Changing the titles XPath to a single XPath that selects the repeating element containing both the brand/model and the price (and changing the brand, model and price XPaths accordingly) means that you no longer get mismatches where the brand and model are in one item and the price is in the next item.

def parse(self, response):
    sel = Selector(response)
    # Select one node per product row, so each iteration sees both cells.
    titles = sel.xpath('//table[@class="table products cl"]//tr[@valign="middle"]')
    items = []
    for t in titles:
        item = AltaItem()
        # All three XPaths are relative to the current row.
        item["brand"] = t.xpath('td[@class="compact"]/div[@class="cl-prod-name"]/a/text()').re(r'^([\w\-]+)')
        item["model"] = t.xpath('td[@class="compact"]/div[@class="cl-prod-name"]/a/text()').re(r'\s+(.*)$')
        item["price"] = t.xpath('td[@class="cl-price-cont"]//span[4]/text()').extract()
        items.append(item)
    return items
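The row-based approach can be tried outside Scrapy with a small standalone sketch. The markup and the `parse_rows` helper below are hypothetical; the standard library's ElementTree stands in for Scrapy's selectors, and plain `re` calls stand in for `.re()`, so treat this as an approximation of the answer rather than a drop-in replacement.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical markup; only the class names are taken from the answer.
HTML = """
<table class="table products cl">
  <tr valign="middle">
    <td class="compact"><div class="cl-prod-name"><a>Acme X-100</a></div></td>
    <td class="cl-price-cont"><span>a</span><span>b</span><span>c</span><span>199</span></td>
  </tr>
  <tr valign="middle">
    <td class="compact"><div class="cl-prod-name"><a>Bolt Z9 Pro</a></div></td>
    <td class="cl-price-cont"><span>a</span><span>b</span><span>c</span><span>249</span></td>
  </tr>
</table>
"""


def parse_rows(html):
    """Return one dict per product row, with brand, model and price together."""
    root = ET.fromstring(html.strip())
    items = []
    # Iterate over the repeating row, not over the individual cells.
    for row in root.findall(".//tr[@valign='middle']"):
        link = row.find("td[@class='compact']/div[@class='cl-prod-name']/a")
        span = row.find("td[@class='cl-price-cont']/span[4]")
        title = (link.text or "").strip() if link is not None else ""
        # Same patterns as the spider: first word is the brand, the rest the model.
        brand = re.match(r"[\w\-]+", title)
        model = re.search(r"\s+(.*)$", title)
        items.append({
            "brand": brand.group(0) if brand else None,
            "model": model.group(1) if model else None,
            "price": (span.text or "").strip() if span is not None else None,
        })
    return items


items = parse_rows(HTML)
for item in items:
    print(item)
```

Because every lookup is relative to a single `tr`, a missing price or name leaves a `None` in that row's own item and does not shift the values of the rows that follow.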