Scrapy: Parsing list items onto separate lines

I tried to adapt the answer to this question to my problem, but without success.

Here's some example HTML:

<div id="provider-region-addresses">
<h3>Contact details</h3>
<h2 class="toggler nohide">Auckland</h2>
    <dl class="clear">
        <dt>More information</dt>
            <dd>North Shore Hospital</dd><dt>Physical address</dt>
                <dd>124 Shakespeare Rd, Takapuna, Auckland 0620</dd><dt>Postal address</dt>
                <dd>Private Bag 93503, Takapuna, Auckland 0740</dd><dt>Postcode</dt>
                <dd>0740</dd><dt>District/town</dt>

                <dd>
                North Shore, Takapuna</dd><dt>Region</dt>
                <dd>Auckland</dd><dt>Phone</dt>
                <dd>(09) 486 8996</dd><dt>Fax</dt>
                <dd>(09) 486 8342</dd><dt>Website</dt>
                <dd><a target="_blank" href="http://www.healthpoint.co.nz/default,61031.sm">http://www.healthpoint.co.nz/default,61031...</a></dd>
    </dl>
    <h2 class="toggler nohide">Auckland</h2>
    <dl class="clear">
        <dt>Physical address</dt>
                <dd>Helensville</dd><dt>Postal address</dt>
                <dd>PO Box 13, Helensville 0840</dd><dt>Postcode</dt>
                <dd>0840</dd><dt>District/town</dt>

                <dd>
                Rodney, Helensville</dd><dt>Region</dt>
                <dd>Auckland</dd><dt>Phone</dt>
                <dd>(09) 420 9450</dd><dt>Fax</dt>
                <dd>(09) 420 7050</dd><dt>Website</dt>
                <dd><a target="_blank" href="http://www.healthpoint.co.nz/default,61031.sm">http://www.healthpoint.co.nz/default,61031...</a></dd>
    </dl>
    <h2 class="toggler nohide">Auckland</h2>
    <dl class="clear">
        <dt>Physical address</dt>
                <dd>Warkworth</dd><dt>Postal address</dt>
                <dd>PO Box 505, Warkworth 0941</dd><dt>Postcode</dt>
                <dd>0941</dd><dt>District/town</dt>

                <dd>
                Rodney, Warkworth</dd><dt>Region</dt>
                <dd>Auckland</dd><dt>Phone</dt>
                <dd>(09) 422 2700</dd><dt>Fax</dt>
                <dd>(09) 422 2709</dd><dt>Website</dt>
                <dd><a target="_blank" href="http://www.healthpoint.co.nz/default,61031.sm">http://www.healthpoint.co.nz/default,61031...</a></dd>
    </dl>
    <h2 class="toggler nohide">Auckland</h2>
    <dl class="clear">
        <dt>More information</dt>
            <dd>Waitakere Hospital</dd><dt>Physical address</dt>
                <dd>55-75 Lincoln Rd, Henderson, Auckland 0610</dd><dt>Postal address</dt>
                <dd>Private Bag 93115, Henderson, Auckland 0650</dd><dt>Postcode</dt>
                <dd>0650</dd><dt>District/town</dt>

                <dd>
                Waitakere, Henderson</dd><dt>Region</dt>
                <dd>Auckland</dd><dt>Phone</dt>
                <dd>(09) 839 0000</dd><dt>Fax</dt>
                <dd>(09) 837 6634</dd><dt>Website</dt>
                <dd><a target="_blank" href="http://www.healthpoint.co.nz/default,61031.sm">http://www.healthpoint.co.nz/default,61031...</a></dd>
    </dl>
    <h2 class="toggler nohide">Auckland</h2>
    <dl class="clear">
        <dt>More information</dt>
            <dd>Hibiscus Coast Community Health Centre</dd><dt>Physical address</dt>
                <dd>136 Whangaparaoa Rd, Red Beach 0932</dd><dt>Postcode</dt>
                <dd>0932</dd><dt>District/town</dt>

                <dd>
                Rodney, Red Beach</dd><dt>Region</dt>
                <dd>Auckland</dd><dt>Phone</dt>
                <dd>(09) 427 0300</dd><dt>Fax</dt>
                <dd>(09) 427 0391</dd><dt>Website</dt>
                <dd><a target="_blank" href="http://www.healthpoint.co.nz/default,61031.sm">http://www.healthpoint.co.nz/default,61031...</a></dd>
    </dl>
    </div>


And here's my spider:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from webhealth.items1 import WebhealthItem1

class WebhealthSpider(BaseSpider):

    name = "webhealth_content1"

    download_delay = 5

    allowed_domains = ["webhealth.co.nz"]
    start_urls = [
        "http://auckland.webhealth.co.nz/provider/service/view/914136/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        results = hxs.select('//*[@id="content"]/div[1]')
        items1 = []
        for result in results:
            item = WebhealthItem1()
            item['url'] = result.select('//dl/a/@href').extract()
            item['practice'] = result.select('//h1/text()').extract()
            item['hours'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"Contact hours")]/following-sibling::dd[1]/text()').extract())
            item['more_hours'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"More information")]/following-sibling::dd[1]/text()').extract())
            item['physical_address'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"Physical address")]/following-sibling::dd[1]/text()').extract())
            item['postal_address'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"Postal address")]/following-sibling::dd[1]/text()').extract())
            item['postcode'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"Postcode")]/following-sibling::dd[1]/text()').extract())
            item['district_town'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"District/town")]/following-sibling::dd[1]/text()').extract())
            item['region'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"Region")]/following-sibling::dd[1]/text()').extract())
            item['phone'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"Phone")]/following-sibling::dd[1]/text()').extract())
            item['website'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"Website")]/following-sibling::dd[1]/a/@href').extract())
            item['email'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"Email")]/following-sibling::dd[1]/a/text()').extract())
            items1.append(item)
        return items1

From here, how do I parse list items onto separate lines, with the corresponding //h1/text() value in the name field? Currently I'm getting a list of every match for each XPath, all in one cell. Is it to do with the way that I am declaring the XPaths?

Thanks

Tags: parsing xpath scrapy

1 answer

迷人小祖宗

Reply #2 · 2020-02-11 09:27

First, you are using results = hxs.select('//*[@id="content"]/div[1]') so

    results = hxs.select('//*[@id="content"]/div[1]')
    for result in results:
        ...

will loop over only one div: the first child div of <div id="content" class="clear">.

What you need is to loop over every <dl class="clear">...</dl> within this //*[@id="content"]/div[1] (it would probably be easier to maintain with //*[@id="content"]/div[@class="content"]):

        results = hxs.select('//*[@id="content"]/div[@class="content"]/div/dl')

Second, in each loop iteration, you are using absolute XPath expressions (//div...)

result.select('//div/dl/dt[contains(text(), "...")]/following-sibling::dd[1]/text()')

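This selects every dd that follows a matching dt anywhere in the document, because an expression starting with // is evaluated from the document root node, no matter which result you call select() on.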

Look at this section in Scrapy docs for details.

You need to use relative XPath expressions, that is, relative within each result scope representing each dl, like dt[contains(text(),"Contact hours")]/following-sibling::dd[1]/text() or ./dt[contains(text(), "Contact hours")]/following-sibling::dd[1]/text().
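To see the scoping difference in isolation, here is a minimal sketch that uses Python's standard-library xml.etree.ElementTree instead of Scrapy selectors, so it runs on its own. The markup is a trimmed, well-formed stand-in for the page above, and dd_after is a hypothetical helper written for this illustration, not part of Scrapy.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for the page (two locations only); an assumption for illustration.
HTML = """
<div id="provider-region-addresses">
  <dl class="clear">
    <dt>Physical address</dt><dd>Helensville</dd>
    <dt>Phone</dt><dd>(09) 420 9450</dd>
  </dl>
  <dl class="clear">
    <dt>Physical address</dt><dd>Warkworth</dd>
    <dt>Phone</dt><dd>(09) 422 2700</dd>
  </dl>
</div>
"""


def dd_after(scope, label):
    """Text of each <dd> that directly follows a <dt> containing `label`,
    looking only at <dl> elements at or below `scope` (hypothetical helper)."""
    found = []
    for dl in scope.iter("dl"):  # iter() also yields `scope` itself if it is a <dl>
        children = list(dl)
        for dt, dd in zip(children, children[1:]):
            if dt.tag == "dt" and dd.tag == "dd" and label in (dt.text or ""):
                found.append((dd.text or "").strip())
    return found


root = ET.fromstring(HTML)

# Searching from the root mixes every location into one list, which is
# the effect of the absolute //div/dl/... expressions.
all_phones = dd_after(root, "Phone")

# Searching inside each <dl> keeps one value per location, which is
# the effect of the relative dt[...] expressions.
per_location = [dd_after(dl, "Phone") for dl in root.iter("dl")]
```

Here all_phones holds both phone numbers in a single list, while per_location holds one single-element list per dl.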

The "practice" field, however, can still use the absolute XPath expression //h1/text(). You could also set a practice variable once and reuse it in each WebhealthItem1() instance:

        ...
        practice = hxs.select('//h1/text()').extract()
        for result in results:
            item = WebhealthItem1()
            ...
            item['practice'] = practice

Here's what your spider would look like with these changes:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from webhealth.items1 import WebhealthItem1

class WebhealthSpider(BaseSpider):

    name = "webhealth_content1"

    download_delay = 5

    allowed_domains = ["webhealth.co.nz"]
    start_urls = [
        "http://auckland.webhealth.co.nz/provider/service/view/914136/"
        ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        practice = hxs.select('//h1/text()').extract()
        items1 = []

        results = hxs.select('//*[@id="content"]/div[@class="content"]/div/dl')
        for result in results:
            item = WebhealthItem1()
            #item['url'] = result.select('//dl/a/@href').extract()
            item['practice'] = practice
            item['hours'] = map(unicode.strip,
                result.select('dt[contains(., "Contact hours")]/following-sibling::dd[1]/text()').extract())
            item['more_hours'] = map(unicode.strip,
                result.select('dt[contains(., "More information")]/following-sibling::dd[1]/text()').extract())
            item['physical_address'] = map(unicode.strip,
                result.select('dt[contains(., "Physical address")]/following-sibling::dd[1]/text()').extract())
            item['postal_address'] = map(unicode.strip,
                result.select('dt[contains(., "Postal address")]/following-sibling::dd[1]/text()').extract())
            item['postcode'] = map(unicode.strip,
                result.select('dt[contains(., "Postcode")]/following-sibling::dd[1]/text()').extract())
            item['district_town'] = map(unicode.strip,
                result.select('dt[contains(., "District/town")]/following-sibling::dd[1]/text()').extract())
            item['region'] = map(unicode.strip,
                result.select('dt[contains(., "Region")]/following-sibling::dd[1]/text()').extract())
            item['phone'] = map(unicode.strip,
                result.select('dt[contains(., "Phone")]/following-sibling::dd[1]/text()').extract())
            item['website'] = map(unicode.strip,
                result.select('dt[contains(., "Website")]/following-sibling::dd[1]/a/@href').extract())
            item['email'] = map(unicode.strip,
                result.select('dt[contains(., "Email")]/following-sibling::dd[1]/a/text()').extract())
            items1.append(item)
        return items1
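Because the loop now builds one item per dl, each location becomes its own record. The following sketch shows that "one row per dl" shape outside Scrapy, using only the standard library (xml.etree.ElementTree and csv). The markup is a trimmed stand-in for the page, "Example practice" is a made-up placeholder for the //h1/text() value, and rows_from and FIELDS are hypothetical names. It also strips whitespace with str.strip(), since map(unicode.strip, ...) only works on Python 2.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Trimmed stand-in for the page; an assumption for illustration.
HTML = """
<div>
  <dl class="clear">
    <dt>Physical address</dt><dd>Helensville</dd>
    <dt>Postcode</dt><dd>0840</dd>
    <dt>Phone</dt><dd>
      (09) 420 9450</dd>
  </dl>
  <dl class="clear">
    <dt>Physical address</dt><dd>Warkworth</dd>
    <dt>Postcode</dt><dd>0941</dd>
    <dt>Phone</dt><dd>(09) 422 2700</dd>
  </dl>
</div>
"""

# Columns to export (hypothetical selection).
FIELDS = ["practice", "Physical address", "Postcode", "Phone"]


def rows_from(root, practice):
    """Build one dict per <dl>, mapping each <dt> label to the <dd> after it."""
    rows = []
    for dl in root.iter("dl"):
        row = {"practice": practice}  # same value repeated on every row
        children = list(dl)
        for dt, dd in zip(children, children[1:]):
            if dt.tag == "dt" and dd.tag == "dd":
                row[(dt.text or "").strip()] = (dd.text or "").strip()
        rows.append(row)
    return rows


root = ET.fromstring(HTML)
rows = rows_from(root, practice="Example practice")  # placeholder name

# Write the rows to an in-memory CSV: one line per location.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS,
                        extrasaction="ignore", lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

Printing csv_text gives a header line followed by one line for Helensville and one for Warkworth, each carrying the same practice value.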

I also created a Cloud9 IDE project with this code. You can play with it at https://c9.io/redapple/so_19309960
