I tried to adapt the answer to this question to my issue, but without success.
Here's some example HTML:
<div id="provider-region-addresses">
<h3>Contact details</h3>
<h2 class="toggler nohide">Auckland</h2>
<dl class="clear">
<dt>More information</dt>
<dd>North Shore Hospital</dd><dt>Physical address</dt>
<dd>124 Shakespeare Rd, Takapuna, Auckland 0620</dd><dt>Postal address</dt>
<dd>Private Bag 93503, Takapuna, Auckland 0740</dd><dt>Postcode</dt>
<dd>0740</dd><dt>District/town</dt>
<dd>
North Shore, Takapuna</dd><dt>Region</dt>
<dd>Auckland</dd><dt>Phone</dt>
<dd>(09) 486 8996</dd><dt>Fax</dt>
<dd>(09) 486 8342</dd><dt>Website</dt>
<dd><a target="_blank" href="http://www.healthpoint.co.nz/default,61031.sm">http://www.healthpoint.co.nz/default,61031...</a></dd>
</dl>
<h2 class="toggler nohide">Auckland</h2>
<dl class="clear">
<dt>Physical address</dt>
<dd>Helensville</dd><dt>Postal address</dt>
<dd>PO Box 13, Helensville 0840</dd><dt>Postcode</dt>
<dd>0840</dd><dt>District/town</dt>
<dd>
Rodney, Helensville</dd><dt>Region</dt>
<dd>Auckland</dd><dt>Phone</dt>
<dd>(09) 420 9450</dd><dt>Fax</dt>
<dd>(09) 420 7050</dd><dt>Website</dt>
<dd><a target="_blank" href="http://www.healthpoint.co.nz/default,61031.sm">http://www.healthpoint.co.nz/default,61031...</a></dd>
</dl>
<h2 class="toggler nohide">Auckland</h2>
<dl class="clear">
<dt>Physical address</dt>
<dd>Warkworth</dd><dt>Postal address</dt>
<dd>PO Box 505, Warkworth 0941</dd><dt>Postcode</dt>
<dd>0941</dd><dt>District/town</dt>
<dd>
Rodney, Warkworth</dd><dt>Region</dt>
<dd>Auckland</dd><dt>Phone</dt>
<dd>(09) 422 2700</dd><dt>Fax</dt>
<dd>(09) 422 2709</dd><dt>Website</dt>
<dd><a target="_blank" href="http://www.healthpoint.co.nz/default,61031.sm">http://www.healthpoint.co.nz/default,61031...</a></dd>
</dl>
<h2 class="toggler nohide">Auckland</h2>
<dl class="clear">
<dt>More information</dt>
<dd>Waitakere Hospital</dd><dt>Physical address</dt>
<dd>55-75 Lincoln Rd, Henderson, Auckland 0610</dd><dt>Postal address</dt>
<dd>Private Bag 93115, Henderson, Auckland 0650</dd><dt>Postcode</dt>
<dd>0650</dd><dt>District/town</dt>
<dd>
Waitakere, Henderson</dd><dt>Region</dt>
<dd>Auckland</dd><dt>Phone</dt>
<dd>(09) 839 0000</dd><dt>Fax</dt>
<dd>(09) 837 6634</dd><dt>Website</dt>
<dd><a target="_blank" href="http://www.healthpoint.co.nz/default,61031.sm">http://www.healthpoint.co.nz/default,61031...</a></dd>
</dl>
<h2 class="toggler nohide">Auckland</h2>
<dl class="clear">
<dt>More information</dt>
<dd>Hibiscus Coast Community Health Centre</dd><dt>Physical address</dt>
<dd>136 Whangaparaoa Rd, Red Beach 0932</dd><dt>Postcode</dt>
<dd>0932</dd><dt>District/town</dt>
<dd>
Rodney, Red Beach</dd><dt>Region</dt>
<dd>Auckland</dd><dt>Phone</dt>
<dd>(09) 427 0300</dd><dt>Fax</dt>
<dd>(09) 427 0391</dd><dt>Website</dt>
<dd><a target="_blank" href="http://www.healthpoint.co.nz/default,61031.sm">http://www.healthpoint.co.nz/default,61031...</a></dd>
</dl>
</div>
And here's my spider:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from webhealth.items1 import WebhealthItem1

class WebhealthSpider(BaseSpider):
    name = "webhealth_content1"
    download_delay = 5
    allowed_domains = ["webhealth.co.nz"]
    start_urls = [
        "http://auckland.webhealth.co.nz/provider/service/view/914136/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        results = hxs.select('//*[@id="content"]/div[1]')
        items1 = []
        for result in results:
            item = WebhealthItem1()
            item['url'] = result.select('//dl/a/@href').extract()
            item['practice'] = result.select('//h1/text()').extract()
            item['hours'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"Contact hours")]/following-sibling::dd[1]/text()').extract())
            item['more_hours'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"More information")]/following-sibling::dd[1]/text()').extract())
            item['physical_address'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"Physical address")]/following-sibling::dd[1]/text()').extract())
            item['postal_address'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"Postal address")]/following-sibling::dd[1]/text()').extract())
            item['postcode'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"Postcode")]/following-sibling::dd[1]/text()').extract())
            item['district_town'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"District/town")]/following-sibling::dd[1]/text()').extract())
            item['region'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"Region")]/following-sibling::dd[1]/text()').extract())
            item['phone'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"Phone")]/following-sibling::dd[1]/text()').extract())
            item['website'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"Website")]/following-sibling::dd[1]/a/@href').extract())
            item['email'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"Email")]/following-sibling::dd[1]/a/text()').extract())
            items1.append(item)
        return items1
From here, how do I parse the list items onto separate lines, with the corresponding //h1/text() value in the name field? Currently I'm getting a list of every match for each XPath, all in one cell. Is it to do with the way that I am declaring the XPaths?
Thanks
First, you are using
results = hxs.select('//*[@id="content"]/div[1]')
so the for loop will run over one div only, the first child div of <div id="content" class="clear">.
What you need is to loop on every <dl class="clear">...</dl> within this //*[@id="content"]/div[1] (it would probably be easier to maintain with //*[@id="content"]/div[@class="content"]).
Second, in each loop iteration, you are using absolute XPath expressions (//div...). These select every dd following a dt that matches the text content, starting from the document root node rather than from the current result. Look at this section in Scrapy docs for details.
You need to use relative XPath expressions, relative within each result scope representing each dl, like
dt[contains(text(),"Contact hours")]/following-sibling::dd[1]/text()
or
./dt[contains(text(), "Contact hours")]/following-sibling::dd[1]/text()
The "practice" field however can still use an absolute XPath expression, //h1/text(), but you could also have a variable practice set once, and use it in each WebhealthItem1() instance.
Here's what your spider would look like with these changes:
I also created a Cloud9 IDE project with this code. You can play with it at https://c9.io/redapple/so_19309960
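The scoping idea can also be tried outside Scrapy. What follows is only a minimal sketch, not the spider itself: it swaps Scrapy selectors for the standard library's xml.etree.ElementTree, whose XPath subset has no following-sibling axis or contains(), so each dt is paired with its dd by position instead. The PAGE markup, the "Example practice" heading and the parse_page helper are made up for illustration; only the dl blocks mirror the HTML in the question.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed stand-in for the page. The wrapper div and the h1 text
# are hypothetical; the <dl> blocks follow the structure shown in the question.
PAGE = """
<div id="content">
  <h1>Example practice</h1>
  <div id="provider-region-addresses">
    <dl class="clear">
      <dt>Physical address</dt>
      <dd>Helensville</dd><dt>Phone</dt>
      <dd>(09) 420 9450</dd>
    </dl>
    <dl class="clear">
      <dt>Physical address</dt>
      <dd>Warkworth</dd><dt>Phone</dt>
      <dd>(09) 422 2700</dd>
    </dl>
  </div>
</div>
"""


def parse_page(markup):
    """Return one dict per <dl>, each carrying the page-level practice name."""
    root = ET.fromstring(markup)
    # Looked up once for the whole page and reused in every record.
    practice = root.findtext(".//h1", default="").strip()
    records = []
    # One loop iteration, and therefore one record, per <dl>.
    for dl in root.iterfind(".//dl[@class='clear']"):
        # "dt" and "dd" are relative paths: they only match children of this <dl>.
        labels = [(dt.text or "").strip() for dt in dl.findall("dt")]
        values = [(dd.text or "").strip() for dd in dl.findall("dd")]
        record = {"practice": practice}
        record.update(zip(labels, values))
        records.append(record)
    return records


records = parse_page(PAGE)

# For contrast: a document-wide query returns every <dd> on the page in one
# list, which is the "everything in one cell" effect described in the question.
all_values = [(dd.text or "").strip() for dd in ET.fromstring(PAGE).iter("dd")]

for record in records:
    print(record)
```

Each dict corresponds to one output row, so the two addresses end up on separate lines with the same practice value. In the Scrapy version the same effect comes from looping over the dl selectors and using the relative dt[...]/following-sibling::dd[1] expressions inside the loop.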