Scrapy XPath - Can't get text within span

I'm trying to extract the address information from a site. Here's an excerpt of my code:

companytype_list = sel.xpath('''.//li[@class="type"]/p/text()''').extract()
headquarters_list = sel.xpath('''.//li[@class="vcard hq"]/p/span[3]/text()''').extract()
companysize_list = sel.xpath('''.//li[@class="company-size"]/p/text()''').extract()

And here's an example of how addresses are formatted on the site:

<li class="type">
    <h4>Type</h4>
    <p>
        Privately Held
    </p>
</li>
<li class="vcard hq">
    <h4>Headquarters</h4>
    <p class="adr" itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
        <span class="street-address" itemprop="streetAddress">Kornhamnstorg 49</span>
        <span class="street-address" itemprop="streetAddress"></span>
        <span class="locality" itemprop="addressLocality">Stockholm,</span>
        <abbr class="region" title="Stockholm" itemprop="addressRegion">Stockholm</abbr>
        <span class="postal-code" itemprop="postalCode">S-11127</span>
        <span class="country-name" itemprop="addressCountry">Sweden</span>
    </p>
</li>
<li class="company-size">
    <h4>Company Size</h4>
    <p>
        11-50 employees
    </p>

But when I run the Scrapy script, I get an IndexError: list index out of range for the address (vcard hq). I've tried rewriting the code to get the data, but it still doesn't work. The rest of the spider works fine. Am I missing something?

Tags: python xpath web-scraping scrapy

2 answers

放荡不羁爱自由

Post #2 · 2019-07-22 20:46

Your example works fine, so I suspect your XPath expressions fail on another page or on a different part of the HTML.

The problem is the use of an index (span[3]) in the headquarters_list XPath expression. With an index, the expression depends heavily on:

1. The total number of span elements

2. The exact order of the span elements

In general, indexes make XPath expressions more fragile and more likely to fail, so I would avoid them whenever possible. In your example you are actually selecting the locality of the address. That span can just as easily be referenced by its class name, which makes the expression much more robust:

//li[@class="vcard hq"]/p/span[@class='locality']/text()
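To make the difference concrete, here is a minimal sketch that runs both kinds of expression against two hypothetical variants of the headquarters block. It uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's Selector, so it runs without extra dependencies. The markup is simplified to well-formed XML, and PAGE_A, PAGE_B, and the texts helper are names made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical page variants. PAGE_B omits the empty street-address span,
# so every following span moves up by one position.
PAGE_A = """
<ul>
  <li class="vcard hq">
    <p class="adr">
      <span class="street-address">Kornhamnstorg 49</span>
      <span class="street-address"></span>
      <span class="locality">Stockholm,</span>
      <span class="postal-code">S-11127</span>
    </p>
  </li>
</ul>
"""

PAGE_B = """
<ul>
  <li class="vcard hq">
    <p class="adr">
      <span class="street-address">Kornhamnstorg 49</span>
      <span class="locality">Stockholm,</span>
      <span class="postal-code">S-11127</span>
    </p>
  </li>
</ul>
"""

BY_INDEX = ".//li[@class='vcard hq']/p/span[3]"
BY_CLASS = ".//li[@class='vcard hq']/p/span[@class='locality']"


def texts(markup, path):
    """Return the text of every element matched by path."""
    root = ET.fromstring(markup)
    return [el.text for el in root.findall(path)]


for name, page in (("A", PAGE_A), ("B", PAGE_B)):
    print(name, "by index:", texts(page, BY_INDEX))
    print(name, "by class:", texts(page, BY_CLASS))
```

On the second variant the index-based path silently returns the postal code, while the class-based path still returns the locality. On a page with fewer than three spans the index-based path would match nothing, and indexing the resulting empty list is one way to end up with the IndexError from the question.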


仙女界的扛把子

Post #3 · 2019-07-22 20:51

Here is my testing code according to your problem description:

# -*- coding: utf-8 -*-
from scrapy.selector import Selector


html_text = """
<li class="type">
    <h4>Type</h4>
    <p>
        Privately Held
    </p>
</li>
<li class="vcard hq">
    <h4>Headquarters</h4>
    <p class="adr" itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
        <span class="street-address" itemprop="streetAddress">Kornhamnstorg 49</span>
        <span class="street-address" itemprop="streetAddress"></span>
        <span class="locality" itemprop="addressLocality">Stockholm,</span>
        <abbr class="region" title="Stockholm" itemprop="addressRegion">Stockholm</abbr>
        <span class="postal-code" itemprop="postalCode">S-11127</span>
        <span class="country-name" itemprop="addressCountry">Sweden</span>
    </p>
</li>
<li class="company-size">
    <h4>Company Size</h4>
    <p>
        11-50 employees
    </p>
"""


sel = Selector(text=html_text)

companytype_list = sel.xpath(
    '''.//li[@class="type"]/p/text()''').extract()
headquarters_list = sel.xpath(
    '''.//li[@class="vcard hq"]/p/span[3]/text()''').extract()
companysize_list = sel.xpath(
    '''.//li[@class="company-size"]/p/text()''').extract()

It doesn't raise any exception, so chances are that some pages have a different structure, and that is what causes the error.

It's good practice not to use indexes directly in XPath expressions. dron22's answer explains why very well.
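The question doesn't show the line that raises the IndexError, so the following is only a sketch under the assumption that the spider indexes into the extracted list (for example headquarters_list[0]) on a page where the expression matched nothing. The helper first_or_default is a hypothetical name, not part of Scrapy, and the two lists are stand-ins for what .extract() might return.

```python
def first_or_default(values, default=None):
    """Return the first stripped, non-empty string in values, else default."""
    for value in values:
        text = value.strip()
        if text:
            return text
    return default


# Stand-ins for what .extract() might return on two different pages.
headquarters_list = ["Stockholm,"]
missing_list = []

print(first_or_default(headquarters_list))
print(first_or_default(missing_list, default="unknown"))
```

Scrapy's selector lists also provide extract_first(), which returns None instead of raising when nothing matches, so a helper like this is mainly useful when you also want to strip whitespace or skip empty strings.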
