I am trying to scrape data from an HTML table, Texas Death Row.
I am able to pull the existing data from the table using the spider script below:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from texasdeath.items import DeathItem


class DeathSpider(BaseSpider):
    name = "death"
    allowed_domains = ["tdcj.state.tx.us"]
    start_urls = [
        "https://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//table/tbody/tr')
        for site in sites:
            item = DeathItem()
            item['firstName'] = site.select('td[5]/text()').extract()
            item['lastName'] = site.select('td[4]/text()').extract()
            item['Age'] = site.select('td[7]/text()').extract()
            item['Date'] = site.select('td[8]/text()').extract()
            item['Race'] = site.select('td[9]/text()').extract()
            item['County'] = site.select('td[10]/text()').extract()
            yield item

The problem is that there are also links in the table that I want to follow, so that the data from the linked pages is appended to my items.
The Scrapy tutorial seems to have a guide on how to pull data from within a directory, but I am having trouble figuring out how to get the data from the main page and also return data from the links within the table.
Instead of yielding an item, yield a Request and pass the item inside meta. This is covered in the documentation.
Sample implementation of a spider that would follow the "Offender Information" link if it leads to the offender "details" page (sometimes it leads to an image; in that case the spider outputs what it has at that moment):