XPath and Scrapy - Scraping links when the depth a

Posted 2019-07-23 04:32

I am using Scrapy's SitemapSpider to go through a list of Shopify stores, and I am pulling all of the products from their respective collections with XPath. Normally this wouldn't be difficult. However, the HTML of the collection pages varies from site to site in a couple of ways. Here are the points needed to understand what I'm trying to do:

  • All product links are inside div elements
  • The number of div ancestors my a tag(s) have is inconsistent
  • The depth of the a tag(s) inside the div element is inconsistent
  • There can be either one or two a tags with an href inside the div element, depending on the site. If there are two, their href values are identical
  • The class names of the div elements are inconsistent, so I've removed them for simplicity

So the code containing my desired product links can have multiple a tags in a div element at inconsistent depths like this:

<!-- Product One -->

<div>
  <div>
    <div>
      <a href="/product_1">
      </a>
      
    </div>

    <a href="/product_1">
    </a>
  </div>
</div>

<!-- Product Two -->

<div>
  <div>
    <div>
      <a href="/product_2">
      </a>
      
    </div>

    <a href="/product_2">
    </a>
  </div>
</div>

<!-- Product Three-->

<div>
  <div>
    <div>
      <a href="/product_3">
      </a>
      
    </div>

    <a href="/product_3">
    </a>
  </div>
</div>

Or it can be on the complete opposite end of the spectrum, having one a tag inside a div element at a depth of one like this:

<div>
  <a href="/product_1">
  </a>
  
</div>

<div>
  <a href="/product_2">
  </a>
 
</div>

<div>
  <a href="/product_3">
  </a>
  
</div>

So I figured I would select the outermost div element that contains a tags whose href includes the keyword "product", and extract only the href of the first a tag inside that div element.

    <div> <!-- I want to select this div element -->
      <div>
        <div>
          <a href="/product_1">
          </a>
          
        </div>

        <a href="/product_1">
        </a>
      </div>
    </div>

The code I have right now looks like this:

product_links = response.xpath('//div//a[contains(@href, "product")][1]/@href').extract()

I'm still receiving duplicate values, though, so it's clearly not doing what I want.

If anyone actually read all of that, absolutely any help would be appreciated!

1 answer
一纸荒年 Trace。
#2 · 2019-07-23 04:44

Since your problem is mainly about duplicates in the response, convert the extracted list into a set. A set keeps a single instance of each value.

Without using a set:

>>> response.xpath('//div//a[contains(@href, "product")]/@href').extract()
[u'/product_1', u'/product_1', u'/product_2', u'/product_2', u'/product_3', u'/product_3']

Using a set:

>>> set(response.xpath('//div//a[contains(@href, "product")]/@href').extract())
set([u'/product_3', u'/product_2', u'/product_1'])
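
Note that the set output above no longer follows page order. If the order matters, one possible variation (a minimal sketch, assuming Python 3.7+, where dicts keep insertion order) is to deduplicate through dict.fromkeys instead. The helper name unique_in_order and the hard-coded hrefs list are hypothetical; the list stands in for the result of .extract().

```python
def unique_in_order(links):
    """Drop repeated links while keeping the first-seen order."""
    # dict keys are unique and (Python 3.7+) remember insertion order
    return list(dict.fromkeys(links))


# Stand-in for response.xpath(...).extract() from the example above
hrefs = ['/product_1', '/product_1', '/product_2',
         '/product_2', '/product_3', '/product_3']

product_links = unique_in_order(hrefs)
print(product_links)
```

Inside a spider callback the same helper could wrap the .extract() call directly.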

If you only need the link from a single div, use extract_first() to return only the first matched element. A further benefit is that, unlike indexing the extracted list with [0], it avoids an IndexError and returns None when no element matches the selection.

Before:

>>> response.xpath('//div//a[contains(@href, "product")]/@href').extract()
[u'/product_1', u'/product_1']

So it should be:

>>> response.xpath('//div//a[contains(@href, "product")]/@href').extract_first()
u'/product_1'