Scrapy xpath returns an empty list although tag an

Posted 2019-09-11 06:53

In my parse function, here is the code I have written:

hs = Selector(response)
links = hs.xpath(".//*[@id='requisitionListInterface.listRequisition']")
items = []
for x in links:
    item = CrawlsiteItem()
    item["title"] = x.xpath('.//*[contains(@title, "View this job description")]/text()').extract()
    items.append(item)
return items

and title returns an empty list.

I select a node by its id into links, and then, within links, I want to get a list of the values of all elements whose title contains "View this job description".

Please help me fix the error in the code.

Tags: xpath scrapy
1 answer
劳资没心,怎么记你
#2 · 2019-09-11 07:29

If you fetch the URL you provided with curl "https://cognizant.taleo.net/careersection/indapac_itbpo_ext_career/moresearch.ftl?lang=en", you get back a page that is very different from the one you see in your browser. Your XPath matches the following <a> element, which has no text node for text() to select:

<a id="requisitionListInterface.reqTitleLinkAction" 
    title="View this job description"
    href="#"
    onclick="javascript:setEvent(event);requisition_openRequisitionDescription('requisitionListInterface','actOpenRequisitionDescription',_ftl_api.lstVal('requisitionListInterface', 'requisitionListInterface.listRequisition', 'requisitionListInterface.ID5645', this),_ftl_api.intVal('requisitionListInterface', 'requisitionListInterface.ID5649', this));return ftlUtil_followLink(this);">
</a>

This is because the site loads the displayed information with an XHR request (you can look this up in Chrome, for example) and then updates the page dynamically with the returned data.

For the information you want to extract, you should find this XHR request (this is not hard, because it is the only one) and call it from your scraper. From the resulting dataset you can then extract the required data. You just have to write a parser that goes through this pipe-separated format, splits it into job postings, and extracts the fields you need, such as position, id, date and location.
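A minimal sketch of such a parser follows. It rests on assumptions that are not confirmed here: that every posting occupies a fixed number of consecutive pipe-separated fields, and that they come in the order position, id, date, location. The function name parse_postings, the FIELDS tuple and the sample string are all hypothetical; the real layout has to be checked against the actual XHR response.

```python
from typing import Dict, List, Sequence

# Hypothetical field order; verify it against the real XHR response.
FIELDS = ("position", "id", "date", "location")


def parse_postings(payload: str, fields: Sequence[str] = FIELDS) -> List[Dict[str, str]]:
    """Split a pipe-separated payload into one dict per job posting.

    Assumes each posting is exactly len(fields) consecutive tokens.
    A trailing incomplete group of tokens is ignored.
    """
    tokens = [token.strip() for token in payload.split("|")]
    size = len(fields)
    postings = []
    for start in range(0, len(tokens) - size + 1, size):
        chunk = tokens[start:start + size]
        postings.append(dict(zip(fields, chunk)))
    return postings


# Made-up placeholder data, only to show the shape of the result.
sample = "Position A|ID-1|2019-01-01|City X|Position B|ID-2|2019-01-02|City Y"
postings = parse_postings(sample)
for posting in postings:
    print(posting)
```

In a spider, the payload would be the body of the XHR response (for example response.text in the callback for that request) instead of the sample string.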
