Scrapy: Extracting data from source and its links

Edited question to link to original:

Scrapy getting data from links within table

From the link https://www.tdcj.state.tx.us/death_row/dr_info/trottiewillielast.html

I am trying to get info from the main table as well as the data behind the other two links in each table row. I managed to pull data from one link, but I am stuck on following the second link and appending its data to the same item, so that everything ends up on one line.

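from urlparse import urljoin  # Python 2; on Python 3 use: from urllib.parse import urljoin

import scrapy
from scrapy.item import Item, Field

from texasdeath.items import DeathItem

# Contents of texasdeath/items.py, shown inline for reference
class DeathItem(Item):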
    firstName = Field()
    lastName = Field()
    Age = Field()
    Date = Field()
    Race = Field()
    County = Field()
    Message = Field()
    Passage = Field()

class DeathSpider(scrapy.Spider):
    name = "death"
    allowed_domains = ["tdcj.state.tx.us"]
    start_urls = [
        "http://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html"
    ]

    def parse(self, response):
        sites = response.xpath('//table/tbody/tr')
        for site in sites:
            item = DeathItem()

            item['firstName'] = site.xpath('td[5]/text()').extract()
            item['lastName'] = site.xpath('td[4]/text()').extract()
            item['Age'] = site.xpath('td[7]/text()').extract()
            item['Date'] = site.xpath('td[8]/text()').extract()
            item['Race'] = site.xpath('td[9]/text()').extract()
            item['County'] = site.xpath('td[10]/text()').extract()

            url = urljoin(response.url, site.xpath("td[3]/a/@href").extract_first())
            url2 = urljoin(response.url, site.xpath("td[2]/a/@href").extract_first())
            if url.endswith("html"):
                request = scrapy.Request(url, meta={"item": item,"url2" : url2}, callback=self.parse_details)
                yield request
            else:
                yield item
    def parse_details(self, response):
        item = response.meta["item"]
        url2 = response.meta["url2"]
        item['Message'] = response.xpath("//p[contains(text(), 'Last Statement')]/following-sibling::p/text()").extract()
        request = scrapy.Request(url2, meta={"item": item}, callback=self.parse_details2)
        return request

    def parse_details2(self, response):
        item = response.meta["item"]
        item['Passage'] = response.xpath("//p/text()").extract_first()
        return item

I understand how to pass data to a request through meta, but I am still unclear about the flow, and at this point I am unsure whether this is even possible. I have looked at several examples, including the ones below:

using scrapy extracting data inside links

How can i use multiple requests and pass items in between them in scrapy python

Ideally the output mirrors the main table, with the two link columns replaced by the data found behind each link.

Appreciate any help or direction.

Tags: python xpath scrapy scrapy-spider

1 answer

ら.Afraid

Reply #2 · 2019-09-14 13:46

The problem in this case is in this piece of code

if url.endswith("html"):
    yield scrapy.Request(url, meta={"item": item}, callback=self.parse_details)
else:
    yield item

if url2.endswith("html"):
    yield scrapy.Request(url2, meta={"item": item}, callback=self.parse_details2)
else:
    yield item

Each request you yield is scheduled and handled on its own, like a new "thread" that takes its own course, so parse_details won't be able to see what is being done in parse_details2. The way I would do it is to call one from within the other, like this:

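# As in the question: td[3] is followed first (parse_details), td[2] second (parse_details2)
url = urljoin(response.url, site.xpath("td[3]/a/@href").extract_first())
url2 = urljoin(response.url, site.xpath("td[2]/a/@href").extract_first())

if url.endswith("html"):
    # First link exists: fetch it and carry the item and the second URL along
    request = scrapy.Request(url, callback=self.parse_details)
    request.meta['item'] = item
    request.meta['url2'] = url2
    yield request
elif url2.endswith("html"):
    # Only the second link exists: go straight to it
    request = scrapy.Request(url2, callback=self.parse_details2)
    request.meta['item'] = item
    yield request
else:
    # No links at all: the item is already complete
    yield item


def parse_details(self, response):
    item = response.meta["item"]
    url2 = response.meta["url2"]
    # 'Message' is the field declared on DeathItem; an undeclared key would raise KeyError
    item['Message'] = response.xpath("//p[contains(text(), 'Last Statement')]/following-sibling::p/text()").extract()
    if url2.endswith("html"):
        # Chain the second request so parse_details2 finishes and yields the item
        request = scrapy.Request(url2, callback=self.parse_details2)
        request.meta['item'] = item
        yield request
    else:
        yield item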
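
To make the flow concrete, here is a small pure-Python simulation of the chaining pattern above. It is only a sketch and not Scrapy itself: FakeRequest, FakeResponse, crawl, and the PAGES dictionary are hypothetical stand-ins invented for illustration, and the page contents are made up. The point is that the item is yielded exactly once, by the last callback in the chain, after each earlier callback has added its part and scheduled the next request.

```python
from collections import deque


class FakeRequest:
    """Hypothetical stand-in for a request: a URL, a callback, and a meta dict."""
    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta or {}


class FakeResponse:
    """Hypothetical stand-in for a response; meta is copied over from the request."""
    def __init__(self, url, meta, body):
        self.url = url
        self.meta = meta
        self.body = body


# Made-up "pages": one listing row that points at two detail pages.
PAGES = {
    "list.html": [{"lastName": "Doe", "statement": "last.html", "info": "info.html"}],
    "last.html": "statement text",
    "info.html": "offender info text",
}


def parse(response):
    for row in response.body:
        item = {"lastName": row["lastName"]}
        # Do not yield the item yet: hand it to the first detail callback.
        yield FakeRequest(row["statement"], parse_details,
                          meta={"item": item, "url2": row["info"]})


def parse_details(response):
    item = response.meta["item"]
    item["Message"] = response.body
    # Still incomplete: schedule the second detail page and pass the item on.
    yield FakeRequest(response.meta["url2"], parse_details2, meta={"item": item})


def parse_details2(response):
    item = response.meta["item"]
    item["Passage"] = response.body
    # Last step in the chain: only now is the item complete.
    yield item


def crawl(start_url, start_callback):
    """Tiny scheduler: run callbacks, queue any requests, collect any items."""
    queue = deque([FakeRequest(start_url, start_callback)])
    items = []
    while queue:
        request = queue.popleft()
        response = FakeResponse(request.url, request.meta, PAGES[request.url])
        for result in request.callback(response):
            if isinstance(result, FakeRequest):
                queue.append(result)
            else:
                items.append(result)
    return items


items = crawl("list.html", parse)
print(items)
```

Each row produces a single item carrying the table fields plus both Message and Passage, which is the "one line per row" result the question asks for.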

This code hasn't been tested thoroughly, so please comment as you test it.
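
One edge case worth checking while testing is a row with no link in a cell. extract_first() returns None when the XPath matches nothing, and joining that onto response.url may simply give back the listing page's own URL, which also ends in "html". A minimal sketch of a guard, assuming a hypothetical helper named resolve_link that is not part of Scrapy or of the original code:

```python
from urllib.parse import urljoin  # on Python 2: from urlparse import urljoin


def resolve_link(base_url, href):
    """Return an absolute URL for href, or None when the cell had no link."""
    if not href:
        return None
    return urljoin(base_url, href)


BASE = "http://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html"

with_link = resolve_link(BASE, "dr_info/trottiewillielast.html")
without_link = resolve_link(BASE, None)

print(with_link)
print(without_link)
```

With a helper like this, the spider could test `if url:` instead of `url.endswith("html")`, so a missing link never turns into a request for the listing page.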
