Scrapy: Extracting data from source and its links

Posted 2019-09-14 12:59

Edited question to link to original:

Scrapy getting data from links within table

From the link https://www.tdcj.state.tx.us/death_row/dr_info/trottiewillielast.html

I am trying to get the information from the main table as well as the data behind the other two links in each table row. I managed to pull data from one of them, but my question is how to follow the other link and append its data to the same item, so that everything ends up on one line of output.

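from urllib.parse import urljoin  # Python 3; on Python 2 this was "from urlparse import urljoin"

import scrapy
from scrapy import Field, Item

# Normally this class lives in texasdeath/items.py and is imported with
# "from texasdeath.items import DeathItem"; it is shown inline here.
class DeathItem(Item):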
    firstName = Field()
    lastName = Field()
    Age = Field()
    Date = Field()
    Race = Field()
    County = Field()
    Message = Field()
    Passage = Field()

class DeathSpider(scrapy.Spider):
    name = "death"
    allowed_domains = ["tdcj.state.tx.us"]
    start_urls = [
        "http://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html"
    ]

    def parse(self, response):
        sites = response.xpath('//table/tbody/tr')
        for site in sites:
            item = DeathItem()

            item['firstName'] = site.xpath('td[5]/text()').extract()
            item['lastName'] = site.xpath('td[4]/text()').extract()
            item['Age'] = site.xpath('td[7]/text()').extract()
            item['Date'] = site.xpath('td[8]/text()').extract()
            item['Race'] = site.xpath('td[9]/text()').extract()
            item['County'] = site.xpath('td[10]/text()').extract()

            url = urljoin(response.url, site.xpath("td[3]/a/@href").extract_first())
            url2 = urljoin(response.url, site.xpath("td[2]/a/@href").extract_first())
            if url.endswith("html"):
                request = scrapy.Request(url, meta={"item": item,"url2" : url2}, callback=self.parse_details)
                yield request
            else:
                yield item

    # The callbacks must be methods of DeathSpider so that
    # self.parse_details and self.parse_details2 resolve.
    def parse_details(self, response):
        item = response.meta["item"]
        url2 = response.meta["url2"]
        item['Message'] = response.xpath("//p[contains(text(), 'Last Statement')]/following-sibling::p/text()").extract()
        request = scrapy.Request(url2, meta={"item": item}, callback=self.parse_details2)
        return request

    def parse_details2(self, response):
        item = response.meta["item"]
        item['Passage'] = response.xpath("//p/text()").extract_first()
        return item

I understand how to pass arguments to a request through meta, but I am still unclear on the flow, and at this point I am unsure whether this is even possible. I have looked at several examples, including the ones below:

using scrapy extracting data inside links

How can i use multiple requests and pass items in between them in scrapy python

The output should mirror the main table, with each row also carrying the data pulled from both of its links.

Appreciate any help or direction.

1 answer
ら.Afraid
#2 · 2019-09-14 13:46

The problem in this case is in this piece of code:

if url.endswith("html"):
    yield scrapy.Request(url, meta={"item": item}, callback=self.parse_details)
else:
    yield item

if url2.endswith("html"):
    yield scrapy.Request(url2, meta={"item": item}, callback=self.parse_details2)
else:
    yield item

Each request you yield is scheduled on its own, like a separate "thread" that takes its own course, so parse_details won't be able to see what is being done in parse_details2. With the code above, each callback ends up yielding the item on its own rather than producing one combined row. The way I would do it is to call one from within the other, like this:

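# td[3] links to the last statement and td[2] to the offender information,
# matching the question's code.
url = urljoin(response.url, site.xpath("td[3]/a/@href").extract_first())
url2 = urljoin(response.url, site.xpath("td[2]/a/@href").extract_first())

if url.endswith("html"):
    # Fetch the first page and carry the second URL along in meta.
    request = scrapy.Request(url, callback=self.parse_details)
    request.meta['item'] = item
    request.meta['url2'] = url2
    yield request
elif url2.endswith("html"):
    # Only the second page is usable, so go straight to it.
    request = scrapy.Request(url2, callback=self.parse_details2)
    request.meta['item'] = item
    yield request
else:
    yield item


def parse_details(self, response):
    item = response.meta["item"]
    url2 = response.meta["url2"]
    # 'Message' is a field declared on DeathItem; assigning an undeclared key raises KeyError.
    item['Message'] = response.xpath("//p[contains(text(), 'Last Statement')]/following-sibling::p/text()").extract()
    if url2.endswith("html"):
        # Chain the second request so the item is yielded only once, by parse_details2.
        request = scrapy.Request(url2, callback=self.parse_details2)
        request.meta['item'] = item
        yield request
    else:
        yield item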
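
For intuition about why chaining gives one row while two independent requests give two, here is a toy model in plain Python. It is not Scrapy: the crawl function, the (callback, item) tuples standing in for requests, and the "Doe" placeholder values are all made up for illustration. It only mimics the idea that a callback can yield either a finished item or a follow-up request.

```python
from collections import deque


def crawl(start_callback):
    """Toy stand-in for a crawler loop (an assumption, not Scrapy's engine).

    A callback yields either a finished item (a dict) or a follow-up
    request, modelled here as a (callback, item) tuple.
    """
    queue = deque([(start_callback, None)])
    exported = []
    while queue:
        func, item = queue.popleft()
        for result in func(item):
            if isinstance(result, dict):
                exported.append(dict(result))  # snapshot at export time
            else:
                queue.append(result)
    return exported


# Pattern 1: two independent requests that share one item.
def parse_independent(_):
    item = {"lastName": "Doe"}  # placeholder row
    yield (message_only, item)
    yield (passage_only, item)


def message_only(item):
    item["Message"] = "last statement text"
    yield item  # exported before the other page has been handled


def passage_only(item):
    item["Passage"] = "offender info text"
    yield item  # exported a second time


# Pattern 2: the first callback hands the item on to the second.
def parse_chained(_):
    item = {"lastName": "Doe"}  # placeholder row
    yield (message_then_next, item)


def message_then_next(item):
    item["Message"] = "last statement text"
    yield (passage_then_finish, item)  # pass the item along, do not export yet


def passage_then_finish(item):
    item["Passage"] = "offender info text"
    yield item  # exported once, with both fields filled in


independent = crawl(parse_independent)
chained = crawl(parse_chained)
print(len(independent), "rows from independent requests")
print(len(chained), "row from chained requests:", chained[0])
```

In the independent pattern the row is exported twice and the first copy is missing a field; in the chained pattern it is exported once, after both pages have been handled.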

The spider code above hasn't been tested thoroughly, so please comment with what you find as you test it.
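
One thing that may be worth checking while testing: when a table cell has no link, extract_first() returns None, and urljoin then returns the base URL unchanged, which still ends with "html". The sketch below shows that behaviour with the standard library only and suggests a small guard. The helper name absolute_link is hypothetical, not part of Scrapy or of the original code.

```python
from urllib.parse import urljoin


def absolute_link(base_url, href):
    """Return an absolute URL for href, or None when the cell had no link.

    Hypothetical helper: it only wraps urljoin with a check for a missing href.
    """
    if not href:
        return None
    return urljoin(base_url, href)


base = "http://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html"

# Pitfall: with a missing href, urljoin hands back the base URL itself,
# so a plain endswith("html") check would still pass.
joined_missing = urljoin(base, None)
print(joined_missing)

# With the guard, a missing link is reported as None instead.
guarded_missing = absolute_link(base, None)
guarded_present = absolute_link(base, "dr_info/trottiewillielast.html")
print(guarded_missing)
print(guarded_present)
```

With a helper like this, the spider could test "if url and url.endswith('html')" before building a request, so that rows without a link are yielded as they are instead of re-requesting the listing page.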
