I recently started building web scrapers with scrapy, and originally I deployed my scrapy projects locally using scrapyd.
The scrapy project I built relies on reading data from a CSV file in order to run:
def search(self, response):
    # read subscriber IDs from the input CSV and schedule one request per row
    with open('data.csv', 'rb') as fin:
        reader = csv.reader(fin)
        for row in reader:
            subscriberID = row[0]
            newEffDate = datetime.datetime.now()
            counter = 0
            yield scrapy.Request(
                url="https://www.healthnet.com/portal/provider/protected/patient/results.action?__checkbox_viewCCDocs=true&subscriberId=" + subscriberID + "&formulary=formulary",
                callback=self.find_term,
                meta={
                    'ID': subscriberID,
                    'newDate': newEffDate,
                    'counter': counter
                }
            )
It writes the scraped data out to another CSV file:
for x in data:
    # append one row per scraped record to the output CSV
    with open('missing.csv', 'ab') as fout:
        csvwriter = csv.writer(fout, delimiter=',')
        csvwriter.writerow([oldEffDate.strftime("%m/%d/%Y"), subscriberID, ipa])
return
We are in the initial stages of building an application that needs to access and run these scrapy spiders, so I decided to host my scrapyd instance on an AWS EC2 Linux instance. Deploying to AWS was straightforward (http://bgrva.github.io/blog/2014/04/13/deploy-crawler-to-ec2-with-scrapyd/).
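For reference, my deploy setup follows the standard scrapyd-client pattern; the deploy target in my scrapy.cfg looks roughly like this (the hostname, target name, and settings module here are placeholders):

[settings]
default = projectX.settings

[deploy:aws-target]
url = http://my-ec2.amazonaws.com:6800/
project = projectX

Deploying is then just scrapyd-deploy aws-target -p projectX from the project directory.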
How do I get input data to, and scraped output data back from, a scrapyd instance running on an AWS EC2 Linux instance?
EDIT: I'm assuming passing a file would look something like this:
curl http://my-ec2.amazonaws.com:6800/schedule.json -d project=projectX -d spider=spider2b -d in=file_path
Is this correct? How would I grab the output from this spider run? Does this approach have security issues?
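To make the question more concrete, here is roughly how I imagine the spider would pick up that argument. My understanding is that scrapyd forwards any extra -d parameters sent to schedule.json as spider arguments; in_file below is just a placeholder name I chose, since in itself is a Python keyword:

import csv
import scrapy


class Spider2b(scrapy.Spider):
    name = 'spider2b'

    def __init__(self, in_file=None, *args, **kwargs):
        super(Spider2b, self).__init__(*args, **kwargs)
        # path of the input CSV, passed via
        # curl .../schedule.json -d project=projectX -d spider=spider2b -d in_file=file_path
        self.in_file = in_file

    def search(self, response):
        # read the input CSV from whatever path was passed at schedule time
        with open(self.in_file, 'rb') as fin:
            for row in csv.reader(fin):
                subscriberID = row[0]
                # ... build and yield requests as in the original search()

That's only a sketch of what I have in mind; I'm mainly after the recommended way to get files in and out of the scrapyd box.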