I am using Scrapy's SitemapSpider to go through a list of Shopify stores. I am pulling all of the products from their respective collections with XPath. Normally, this wouldn't be difficult to do. However, the HTML of the collection pages varies from site to site in a couple of ways. I'll try to summarize the points that are necessary to understand what exactly I'm trying to do:
- All product links are inside div elements
- The number of div ancestors my a tag(s) have is inconsistent
- The depth of the a tag(s) inside the div element is inconsistent
- There can be either one or two a tags with an href inside the div element; it varies from site to site. If there are two, they are identical
- The class names of the div elements are inconsistent, so I've removed them for simplicity
So the code containing my desired product links can have multiple a tags in a div element at inconsistent depths like this:
<!-- Product One -->
<div>
  <div>
    <div>
      <a href="/product_1">
      </a>
    </div>
    <a href="/product_1">
    </a>
  </div>
</div>
<!-- Product Two -->
<div>
  <div>
    <div>
      <a href="/product_2">
      </a>
    </div>
    <a href="/product_2">
    </a>
  </div>
</div>
<!-- Product Three -->
<div>
  <div>
    <div>
      <a href="/product_3">
      </a>
    </div>
    <a href="/product_3">
    </a>
  </div>
</div>
Or it can be on the complete opposite end of the spectrum, having one a tag inside a div element at a depth of one like this:
<div>
  <a href="/product_1">
  </a>
</div>
<div>
  <a href="/product_2">
  </a>
</div>
<div>
  <a href="/product_3">
  </a>
</div>
So I figured I would select the very first div element that has a tags containing the keyword "product", extracting only the href from the first a tag in the div element.
<div> <!-- I want to select this div element -->
  <div>
    <div>
      <a href="/product_1">
      </a>
    </div>
    <a href="/product_1">
    </a>
  </div>
</div>
The code I have right now looks like this:
product_links = response.xpath('//div//a[contains(@href, "product")][1]/@href').extract()
I'm still receiving duplicate values though so obviously it's not doing what I want it to.
If anyone actually read all of that, absolutely any help would be appreciated!
Since your problem is mainly about duplicates in the response, convert the extracted list into a set, which keeps a single instance of each value. Without a set, the list contains every matching href, repeats included. Using a set, each href appears only once.
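A minimal sketch of the deduplication step, assuming product_links holds the list that extract() returned. The sample hrefs are made up to mirror the markup in the question. A plain set drops the repeats but does not keep the original order, so dict.fromkeys is shown as an order-preserving alternative; that alternative is an addition here, not something the answer itself prescribes.

```python
# Hypothetical result of response.xpath(...).extract() on a page where
# every product is linked twice (sample data, not from a real store).
product_links = [
    "/product_1", "/product_1",
    "/product_2", "/product_2",
    "/product_3", "/product_3",
]

# Without a set: the duplicates are still there.
print(len(product_links))

# Using a set: one instance of each href, but the order is not guaranteed.
unique_links = set(product_links)
print(unique_links)

# Order-preserving alternative: dict keys are unique and keep insertion
# order (guaranteed since Python 3.7).
ordered_unique_links = list(dict.fromkeys(product_links))
print(ordered_unique_links)
```

If the order of products on the page matters for later processing, the dict.fromkeys variant is probably the safer choice of the two.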
If the question is only about a single div, then the best course is to use extract_first(), which extracts only the first matched element. The benefit of using it is that it avoids an IndexError and returns None when it doesn't find any element matching the selection. So, where the code previously took the first item from the extract() result, it should call extract_first() instead.
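The difference can be sketched without a running spider. The helper first_or_none below is a hypothetical stand-in that mimics what extract_first() does on a list of already extracted strings. In a real spider you would call extract_first() on the selector itself, as in the commented line, which assumes response is a Scrapy response and uses an XPath adapted from the question.

```python
def first_or_none(values, default=None):
    """Hypothetical stand-in mimicking SelectorList.extract_first()."""
    return values[0] if values else default


# In a spider, the equivalent call would look like this:
#   link = response.xpath('//div//a[contains(@href, "product")]/@href').extract_first()

matches = ["/product_1", "/product_1"]  # made-up extract() result
no_matches = []                         # extract() result when nothing matched

first = first_or_none(matches)      # first href only
missing = first_or_none(no_matches)  # None instead of an exception

# Indexing the empty list directly is what raises the IndexError.
try:
    no_matches[0]
    raised = False
except IndexError:
    raised = True

print(first, missing, raised)
```

Note that extract_first() returns one value per call, so it fits the single-div case; for a whole collection page the deduplication approach is the one that applies.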