Scrapy XPath - Can't get text within span

Posted 2019-07-22 20:09

I'm trying to reach the address information on a site. Here's an example of my code:

companytype_list = sel.xpath('''.//li[@class="type"]/p/text()''').extract()
headquarters_list = sel.xpath('''.//li[@class="vcard hq"]/p/span[3]/text()''').extract()
companysize_list = sel.xpath('''.//li[@class="company-size"]/p/text()''').extract()

And here's an example of how addresses are formatted on the site:

<li class="type">
    <h4>Type</h4>
    <p>
        Privately Held
    </p>
</li>
<li class="vcard hq">
    <h4>Headquarters</h4>
    <p class="adr" itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
        <span class="street-address" itemprop="streetAddress">Kornhamnstorg 49</span>
        <span class="street-address" itemprop="streetAddress"></span>
        <span class="locality" itemprop="addressLocality">Stockholm,</span>
        <abbr class="region" title="Stockholm" itemprop="addressRegion">Stockholm</abbr>
        <span class="postal-code" itemprop="postalCode">S-11127</span>
        <span class="country-name" itemprop="addressCountry">Sweden</span>
    </p>
</li>
<li class="company-size">
    <h4>Company Size</h4>
    <p>
        11-50 employees
    </p>

But when I run the Scrapy script, I get "IndexError: list index out of range" for the address (vcard hq). I've tried rewriting the XPath to get the data, but it still doesn't work. The rest of the spider works fine. Am I missing something?

2 Answers
放荡不羁爱自由
#2 · 2019-07-22 20:46

Your example works fine as posted, so I suspect your XPath expression fails on a different page or on a different part of the HTML.

The problem is the positional index (span[3]) in the headquarters_list XPath expression. With an index, you depend heavily on:

1. The total number of span elements

2. The exact order of the span elements

In general, positional indexes make XPath expressions more fragile and more likely to fail, so I would avoid them whenever possible. In your example, span[3] actually selects the locality of the address. That span can just as easily be referenced by its class name, which makes the expression much more robust:

//li[@class="vcard hq"]/p/span[@class='locality']/text()
仙女界的扛把子
#3 · 2019-07-22 20:51

Here is my testing code according to your problem description:

# -*- coding: utf-8 -*-
from scrapy.selector import Selector


html_text = """
<li class="type">
    <h4>Type</h4>
    <p>
        Privately Held
    </p>
</li>
<li class="vcard hq">
    <h4>Headquarters</h4>
    <p class="adr" itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
        <span class="street-address" itemprop="streetAddress">Kornhamnstorg 49</span>
        <span class="street-address" itemprop="streetAddress"></span>
        <span class="locality" itemprop="addressLocality">Stockholm,</span>
        <abbr class="region" title="Stockholm" itemprop="addressRegion">Stockholm</abbr>
        <span class="postal-code" itemprop="postalCode">S-11127</span>
        <span class="country-name" itemprop="addressCountry">Sweden</span>
    </p>
</li>
<li class="company-size">
    <h4>Company Size</h4>
    <p>
        11-50 employees
    </p>
"""


sel = Selector(text=html_text)

companytype_list = sel.xpath(
    '''.//li[@class="type"]/p/text()''').extract()
headquarters_list = sel.xpath(
    '''.//li[@class="vcard hq"]/p/span[3]/text()''').extract()
companysize_list = sel.xpath(
    '''.//li[@class="company-size"]/p/text()''').extract()

It doesn't raise any exception, so chances are that some web pages have a different structure, and those are the ones causing the error.

It's good practice not to use positional indexes directly in XPath rules. dron22's answer gives an awesome explanation.
