XPath and Scrapy - Scraping links when the depth a

I am using Scrapy's SitemapSpider to go through a list of Shopify stores, pulling all of the products from their respective collections with XPath. Normally this wouldn't be difficult. However, the HTML of the collection pages varies from site to site in a couple of ways. Here are the points needed to understand what I'm trying to do:

All product links are inside div elements
The number of div ancestors my a tag(s) have is inconsistent
The depth of the a tag(s) inside the div element is inconsistent
There can be either one or two a tags with an href inside the div element, depending on the site. If there are two, their href values are identical
The class names of the div elements are inconsistent, so I've removed them for simplicity

So the code containing my desired product links can have multiple a tags in a div element at inconsistent depths like this:

<!-- Product One -->

<div>
  <div>
    <div>
      <a href="/product_1">
      </a>
      
    </div>

    <a href="/product_1">
    </a>
  </div>
</div>

<!-- Product Two -->

<div>
  <div>
    <div>
      <a href="/product_2">
      </a>
      
    </div>

    <a href="/product_2">
    </a>
  </div>
</div>

<!-- Product Three-->

<div>
  <div>
    <div>
      <a href="/product_3">
      </a>
      
    </div>

    <a href="/product_3">
    </a>
  </div>
</div>

Or it can be on the complete opposite end of the spectrum, having one a tag inside a div element at a depth of one like this:

<div>
  <a href="/product_1">
  </a>
  
</div>

<div>
  <a href="/product_2">
  </a>
 
</div>

<div>
  <a href="/product_3">
  </a>
  
</div>

So I figured I would select the very first div element that has a tags containing the keyword "product", extracting only the href from the first a tag in the div element.

    <div> <!-- I want to select this div element -->
      <div>
        <div>
          <a href="/product_1">
          </a>
          
        </div>

        <a href="/product_1">
        </a>
      </div>
    </div>

The code I have right now looks like this:

product_links = response.xpath('//div//a[contains(@href, "product")][1]/@href').extract()

I'm still receiving duplicate values though so obviously it's not doing what I want it to.

If anyone actually read all of that, absolutely any help would be appreciated!

Tags: python xpath scrapy

1 Answer

一纸荒年 Trace。

#2 · 2019-07-23 04:44

Since your problem is mainly about duplicates in the response, convert the extracted list into a set. This leaves a single instance of each value.

Without using set:

>>> response.xpath('//div//a[contains(@href, "product")]/@href').extract()
[u'/product_1', u'/product_1', u'/product_2', u'/product_2', u'/product_3', u'/product_3']

Using set:

>>> set(response.xpath('//div//a[contains(@href, "product")]/@href').extract())
set([u'/product_3', u'/product_2', u'/product_1'])
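One caveat: a set does not keep the order in which the links appear on the page. If order matters, a minimal sketch (assuming Python 3.7+, where dicts preserve insertion order) is to deduplicate with dict.fromkeys instead. The hrefs list below is a stand-in for the result of .extract(), not real scraped data.

```python
# Stand-in for the list returned by .extract() in the example above.
hrefs = [
    '/product_1', '/product_1',
    '/product_2', '/product_2',
    '/product_3', '/product_3',
]

# Dict keys are unique and keep insertion order (Python 3.7+),
# so this drops repeats while keeping the first occurrence of each link.
unique_links = list(dict.fromkeys(hrefs))
print(unique_links)
```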

If the question is only about a single div, the best course is to use extract_first(), which extracts only the first matched element. A further benefit is that it avoids an IndexError and returns None when no element matches the selection.

Before:

>>> response.xpath('//div//a[contains(@href, "product")]/@href').extract()
[u'/product_1', u'/product_1']

So, it should be:

>>> response.xpath('//div//a[contains(@href, "product")]/@href').extract_first()
u'/product_1'
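As for why the [1] in the question's expression still returns duplicates: in XPath 1.0 a positional predicate on a step is evaluated relative to each parent, so a[...][1] keeps every a that is the first match among its own siblings. In the nested layout, the two identical links sit under different parents, so both count as "first". Wrapping the path in parentheses, as in (//div//a[contains(@href, "product")])[1], applies the position to the whole result instead, which yields only the first link on the page. The sketch below illustrates the per-parent behaviour using only the standard library; since xml.etree.ElementTree supports just a subset of XPath (no contains()), it uses the simpler path .//a[1], and the root wrapper element is an assumption added so the snippet parses as XML.

```python
import xml.etree.ElementTree as ET

# "Product One" from the question, wrapped in a hypothetical <root>
# element so the standard-library parser accepts it as well-formed XML.
SNIPPET = """
<root>
  <div>
    <div>
      <div>
        <a href="/product_1"></a>
      </div>
      <a href="/product_1"></a>
    </div>
  </div>
</root>
"""

root = ET.fromstring(SNIPPET)

# a[1] is checked per parent: each <a> that is the first <a> child of
# its own parent is kept, so both anchors are returned.
per_parent = [a.get("href") for a in root.findall(".//a[1]")]

# find() returns only the first match in document order, which is the
# effect the parenthesised form has in a full XPath engine.
first_in_document = root.find(".//a").get("href")

print(per_parent)
print(first_in_document)
```

Note that the parenthesised form returns one link for the whole page, not one per product, so for a collection page the deduplication shown earlier is still the simpler route.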
