Scrapy - access data while crawling and randomly change user agent

Posted 2019-06-13 13:19

Is it possible to access the data while Scrapy is crawling? I have a script that finds a specific keyword and writes both the keyword and the link where it was found to a .csv file. However, I have to wait for Scrapy to finish crawling, and only then does it actually write the data to the .csv file.

I'm also trying to change my user agent randomly, but it's not working. If two questions in one post aren't allowed, I will post this as a separate question.

#!/usr/bin/env python
# -*- coding: utf-8 -*- 
from scrapy.spiders import Spider
from scrapy import log
from FinalSpider.items import Page
from FinalSpider.settings import USER_AGENT_LIST

import random
 
 
class FinalSpider(Spider):
    name = "FinalSpider"
    allowed_domains = ['url.com']
    start_urls = ['url.com=%d' % (n)
                  # note: xrange(62L, 62L) is empty, so no URLs are generated;
                  # the upper bound must be greater than the lower bound
                  for n in xrange(62L, 62L)]


    def parse(self, response):
        item = Page()

        item['URL'] = response.url
        item['Stake'] = ''.join(response.xpath('//div[@class="class"]//span[@class="class" or @class="class"]/text()').extract())
        # the original checked item['cur'], a field that is never set
        if item['Stake'] in [u'50,00', u'100,00']:
            return item

# rotate the user agent on roughly 30% of requests
class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        if random.randint(1, 100) <= 30:
            ua = random.choice(USER_AGENT_LIST)
            if ua:
                # setdefault() never overrides the User-Agent header that
                # Scrapy has already set, so assign the header directly
                request.headers['User-Agent'] = ua
                log.msg('UserAgent changed to %s' % ua)

1 Answer

女痞 · Answered 2019-06-13 13:51

You are not obliged to output your collected items (aka "data") into a csv file; you can simply run scrapy with:

scrapy crawl myspider

This will output the logs to the terminal. To store just the items in a csv file, I assume you are doing something like this:

scrapy crawl myspider -o items.csv

Now, if you want to store both the logs and the items, I suggest you put this into your settings.py file:

LOG_FILE = "logfile.log"

Now you can watch the spider's progress while it runs just by checking that file.
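If you want the items themselves on disk as they are scraped, rather than only the logs, an item pipeline can write and flush each row immediately. Here is a minimal sketch using only the standard library; the class name, output filename, and field names mirror the question's `Page` item but are otherwise assumptions, not part of the original project:

```python
import csv

class LiveCsvPipeline(object):
    """Append each scraped item to a CSV file the moment it arrives."""

    def open_spider(self, spider):
        self.file = open('items_live.csv', 'w')
        self.writer = csv.writer(self.file)
        self.writer.writerow(['URL', 'Stake'])  # header row

    def process_item(self, item, spider):
        self.writer.writerow([item['URL'], item['Stake']])
        self.file.flush()  # make the row visible on disk right away
        return item

    def close_spider(self, spider):
        self.file.close()
```

You would then enable it in settings.py with something like `ITEM_PIPELINES = {'FinalSpider.pipelines.LiveCsvPipeline': 300}` (module path assumed), and `tail -f items_live.csv` shows rows appearing while the crawl is still running.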

For your problem with the random user agent, please check how to activate Scrapy middlewares.
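Concretely, a custom downloader middleware does nothing until it is registered in settings.py. A sketch of what that registration could look like, assuming the class lives in a `FinalSpider/middlewares.py` module (the module path and the sample user-agent strings here are assumptions):

```python
# settings.py
DOWNLOADER_MIDDLEWARES = {
    # disable Scrapy's built-in user-agent middleware...
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    # ...and register the custom one in its slot (priority 400)
    'FinalSpider.middlewares.RandomUserAgentMiddleware': 400,
}

# list the middleware picks from (sample values)
USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (X11; Linux x86_64)',
]
```

Disabling the built-in middleware matters here: otherwise it may keep re-applying the project-wide `USER_AGENT`, hiding your random choice.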
