I have already created a spider that collects a list of company names with matching phone numbers and saves it to a CSV file.
I now want to scrape data from another site, using the phone numbers in the CSV file as POST data. The spider should keep posting to the same start URL, scraping the data that each phone number produces, until there are no more numbers left in the CSV file.
This is what I have so far:
    from scrapy.spider import BaseSpider
    from scrapy.http import Request
    from scrapy.http import FormRequest
    from scrapy.selector import HtmlXPathSelector
    from scrapy import log
    import sys
    from scrapy.shell import inspect_response
    from btw.items import BtwItem
    import csv

    class BtwSpider(BaseSpider):
        name = "btw"
        allowed_domains = ["siteToScrape.com"]
        start_urls = ["http://www.siteToScrape.com/broadband/broadband_checker"]

        def parse(self, response):
            phoneNumbers = ['01253873647','01253776535','01142726749']
            return [FormRequest.from_response(response,formdata={'broadband_checker[phone]': phoneNumbers[1]},callback=self.after_post)]

        def after_post(self, response):
            hxs = HtmlXPathSelector(response)
            sites = hxs.select('//div[@id="results"]')
            items = []
            for site in sites:
                item = BtwItem()
                fttcText = site.select("div[@class='content']/div[@id='btfttc']/ul/li/text()").extract()
                # Now we will change the text to be a boolean value
                if fttcText[0].count('not') > 0:
                    fttcEnabled=0
                else:
                    fttcEnabled=1
                item['fttcAvailable'] = fttcEnabled
                items.append(item)
            return items
At the moment I am just trying to get this to loop through a list (phoneNumbers), but I have not managed to get even that working. Once I know how to do that, I will be able to read the numbers from a CSV file by myself. In its current state the spider only uses the phone number at index 1 in the list.
Assuming you have a phones.csv file with phones in it:

Here's your spider:
Here's what was scraped after running it:
The idea is to scrape the main page using start_requests, then read the csv file line-by-line in the callback and yield new Requests for each phone number (csv row). Additionally, pass phone_number to the callback through the meta dictionary in order to write it to the Item field (I think you need this to distinguish items/results).
Hope that helps.