Is it possible to access the data while scrapy is crawling? I have a script that finds a specific keyword and writes the keyword to a .csv file, along with the link where it was found. However, I have to wait for scrapy to finish crawling, and only then does it output the data to the .csv file.
I'm also trying to change my user agent randomly, but it's not working. If two questions in one aren't allowed, I will post this as a separate question.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from scrapy.spiders import Spider
from scrapy import log
from FinalSpider.items import Page
from FinalSpider.settings import USER_AGENT_LIST
import random


class FinalSpider(Spider):
    name = "FinalSpider"
    allowed_domains = ['url.com']
    # careful: xrange(62L, 62L) is empty, so this generates no start URLs;
    # the upper bound must be greater than the lower bound
    start_urls = ['url.com=%d' % n
                  for n in xrange(62L, 62L)]

    def parse(self, response):
        item = Page()
        item['URL'] = response.url
        item['Stake'] = ''.join(response.xpath('//div[@class="class"]//span[@class="class" or @class="class"]/text()').extract())
        # 'Stake' is the field populated above ('cur' is never set on this item)
        if item['Stake'] in [u'50,00', u'100,00']:
            return item


# 30% useragent change
class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        if random.choice(xrange(1, 100)) <= 30:
            log.msg('Changing UserAgent')
            ua = random.choice(USER_AGENT_LIST)
            if ua:
                request.headers.setdefault('User-Agent', ua)
            log.msg('>>>> UserAgent changed')
You are not obliged to output your collected items (aka "data") into a csv file; you can simply run scrapy with:
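scrapy crawl FinalSpider

Here FinalSpider is the name attribute defined in your spider class.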
This will output the logs to the terminal, but to store just the items in a csv file I assume you are doing something like this:
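scrapy crawl FinalSpider -o items.csv

The -o flag exports the scraped items to the given file, with the format inferred from the .csv extension.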
Now if you want to store both the logs and the items, I suggest you put this into your settings.py file:
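LOG_FILE = 'finalspider.log'  # any path you like; scrapy writes its log output here

Now you can see what the spider is doing while it runs just by checking that file.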
For your problem with the random user agent, please check how to activate scrapy middlewares.
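As a minimal sketch, assuming your RandomUserAgentMiddleware class is importable at FinalSpider.middlewares.RandomUserAgentMiddleware (a hypothetical path; point it at wherever you actually defined the class), activating it in settings.py would look like this:

DOWNLOADER_MIDDLEWARES = {
    # hypothetical dotted path; adjust to the real location of your class
    'FinalSpider.middlewares.RandomUserAgentMiddleware': 400,
    # disable the built-in middleware so it doesn't set the header before yours runs
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}

The number controls the middleware's position in the chain, and mapping the built-in UserAgentMiddleware to None disables it; since your middleware only calls setdefault, it has no effect when a User-Agent header is already present.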