Scrapy - access data while crawling and randomly change user agent

Posted 2019-06-13 13:19

Is it possible to access the data while Scrapy is crawling? I have a script that finds a specific keyword and writes both the keyword and the link where it was found to a .csv file. However, I have to wait for Scrapy to finish crawling, and only then does it actually write the data to the .csv file.

I'm also trying to change my user agent randomly, but it's not working. If two questions in one post aren't allowed, I will post this as a separate question.

#!/usr/bin/env python
# -*- coding: utf-8 -*- 
from scrapy.spiders import Spider
from scrapy import log
from FinalSpider.items import Page
from FinalSpider.settings import USER_AGENT_LIST

import random
 
 
class FinalSpider(Spider):
    name = "FinalSpider"
    allowed_domains = ['url.com']
    start_urls = ['url.com=%d' % (n)
                  # note: xrange(62L, 62L) is empty, so no URLs are generated;
                  # the upper bound must be greater than the lower bound
                  for n in xrange(62L, 62L)]


    def parse(self, response):
        item = Page()

        item['URL'] = response.url
        item['Stake'] = ''.join(response.xpath('//div[@class="class"]//span[@class="class" or @class="class"]/text()').extract())
        # the original checked item['cur'], a field that is never set
        if item['Stake'] in [u'50,00', u'100,00']:
            return item

# rotate the user agent on roughly 30% of requests
class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        if random.randint(1, 100) <= 30:
            ua = random.choice(USER_AGENT_LIST)
            if ua:
                # setdefault() never overrides the User-Agent header that
                # Scrapy has already set, so assign the header directly
                request.headers['User-Agent'] = ua
                log.msg('UserAgent changed to %s' % ua)

1 Answer

女痞 · Answered 2019-06-13 13:51

You are not obliged to output your collected items (aka "data") into a csv file; you can simply run scrapy with:

scrapy crawl myspider

This will output the logs to the terminal. To store just the items in a csv file, I assume you are doing something like this:

scrapy crawl myspider -o items.csv

Now, if you want to store both the logs and the items, I suggest you put this into your settings.py file:

LOG_FILE = "logfile.log"

Now you can watch the spider's progress while it runs just by checking that file.
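If you want the items themselves on disk as they are scraped, rather than only the logs, an item pipeline can write and flush each row immediately. Here is a minimal sketch using only the standard library; the class name, output filename, and field names mirror the question's `Page` item but are otherwise assumptions, not part of the original project:

```python
import csv

class LiveCsvPipeline(object):
    """Append each scraped item to a CSV file the moment it arrives."""

    def open_spider(self, spider):
        self.file = open('items_live.csv', 'w')
        self.writer = csv.writer(self.file)
        self.writer.writerow(['URL', 'Stake'])  # header row

    def process_item(self, item, spider):
        self.writer.writerow([item['URL'], item['Stake']])
        self.file.flush()  # make the row visible on disk right away
        return item

    def close_spider(self, spider):
        self.file.close()
```

You would then enable it in settings.py with something like `ITEM_PIPELINES = {'FinalSpider.pipelines.LiveCsvPipeline': 300}` (module path assumed), and `tail -f items_live.csv` shows rows appearing while the crawl is still running.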

For your problem with the random user agent, please check how to activate Scrapy middlewares.
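Concretely, a custom downloader middleware does nothing until it is registered in settings.py. A sketch of what that registration could look like, assuming the class lives in a `FinalSpider/middlewares.py` module (the module path and the sample user-agent strings here are assumptions):

```python
# settings.py
DOWNLOADER_MIDDLEWARES = {
    # disable Scrapy's built-in user-agent middleware...
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    # ...and register the custom one in its slot (priority 400)
    'FinalSpider.middlewares.RandomUserAgentMiddleware': 400,
}

# list the middleware picks from (sample values)
USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (X11; Linux x86_64)',
]
```

Disabling the built-in middleware matters here: otherwise it may keep re-applying the project-wide `USER_AGENT`, hiding your random choice.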
