Have python script using scrapy , which scrapes the data from a website, allocates it to 3 fields and then generates a .csv. Works ok but with one major problem. All fields contain all of the data, rather than it being separated out for each table row. I'm sure this is due to my loop not working and when it finds the xpath it just grabs all the data for every row before moving on to get data for the other 2 fields, instead of creating seperate rows
def parse(self, response):
hxs = HtmlXPathSelector(response)
divs = hxs.select('//tr[@class="someclass"]')
for div in divs:
item = TestBotItem()
item['var1'] = div.select('//table/tbody/tr[*]/td[2]/p/span[2]/text()').extract()
item['var2'] = div.select('//table/tbody/tr[*]/td[3]/p/span[2]/text()').extract()
item['var3'] = div.select('//table/tbody/tr[*]/td[4]/p/text()').extract()
return item
The tr with the * increases in number with each entry on the website I need to crawl, and the other two paths slot in below. How do I edit this so it grabs the first set of data for say //table/tbody/tr[3] only, stores it for all three fields and then moves on to //table/tbody/tr[4] etc??
Update
Works correctly, however I'm trying to add some validation to the pipelines.py file to drop any records where var1 is more than 100%. I'm certain my code below is wrong, and also does "yield" instead of "return" stop the pipeline being used?
from scrapy.exceptions import DropItem
class TestbotPipeline(object):
def process_item(self, item, spider):
if item('var1') > 100%:
return item
else:
raise Dropitem(item)