I saw this question: Pass Scrapy Spider a list of URLs to crawl via .txt file. It changes the list of start URLs. I want to scrape webpages for each domain (read from a file) and put the results for each into a separate file named after the domain. I have scraped data for one website, but I specified the start URL and allowed_domains in the spider itself. How can I change this to use an input file?
Update 1:
This is the code that I tried:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field

class AppleItem(Item):
    reference_link = Field()
    rss_link = Field()

class AppleSpider(CrawlSpider):
    name = 'apple'
    allowed_domains = []
    start_urls = []

    def __init__(self):
        for line in open('./domains.txt', 'r').readlines():
            self.allowed_domains.append(line)
            self.start_urls.append('http://%s' % line)

    rules = [Rule(SgmlLinkExtractor(allow=()), follow=True, callback='parse_item')]

    def parse_item(self, response):
        sel = HtmlXPathSelector(response)
        rsslinks = sel.select('//a[contains(@href, "pdf")]/@href').extract()
        items = []
        for rss in rsslinks:
            item = AppleItem()
            item['reference_link'] = response.url
            item['rss_link'] = rsslinks
            items.append(item)
        filename = response.url.split("/")[-2]
        open(filename+'.csv', 'wb').write(items)
I get an error when I run this: AttributeError: 'AppleSpider' object has no attribute '_rules'
You can use the __init__ method of the spider class to read the file and overwrite start_urls and allowed_domains. Suppose we have a file domains.txt with one domain per line.
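A minimal sketch of that approach, assuming domains.txt holds bare domain names, one per line. Two details matter: super().__init__() must be called, because CrawlSpider's own __init__ is what compiles the rules attribute into _rules, so overriding __init__ without it produces exactly the AttributeError quoted above; and each line should be stripped, since readlines() keeps the trailing newline (the code above appends "example.com\n" to allowed_domains).

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class AppleSpider(CrawlSpider):
    name = 'apple'
    rules = [Rule(SgmlLinkExtractor(allow=()), follow=True, callback='parse_item')]

    def __init__(self, *args, **kwargs):
        # Let CrawlSpider compile self.rules into self._rules first;
        # skipping this call is what raises the AttributeError.
        super(AppleSpider, self).__init__(*args, **kwargs)
        self.allowed_domains = []
        self.start_urls = []
        with open('./domains.txt') as f:
            for line in f:
                domain = line.strip()  # drop the trailing newline
                if domain:
                    self.allowed_domains.append(domain)
                    self.start_urls.append('http://%s' % domain)

Separately, the last line of parse_item will fail once the spider runs, because file.write() expects a string, not a list of items. A sketch of per-domain output using the standard csv module; the filename logic (taking the host part of the URL) is illustrative:

import csv
from scrapy.selector import HtmlXPathSelector

def parse_item(self, response):  # replaces parse_item in the spider above
    sel = HtmlXPathSelector(response)
    rsslinks = sel.select('//a[contains(@href, "pdf")]/@href').extract()
    # 'http://example.com/page'.split('/')[2] gives 'example.com',
    # so every page of a domain appends to that domain's own CSV file.
    filename = response.url.split('/')[2] + '.csv'
    with open(filename, 'ab') as f:
        writer = csv.writer(f)
        for rss in rsslinks:
            writer.writerow([response.url, rss])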