Scrapy Crawl URLs in Order

2019-01-02 18:28发布

So, my problem is relatively simple. I have one spider crawling multiple sites, and I need it to return the data in the order I write it in my code. It's posted below.

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from mlbodds.items import MlboddsItem

class MLBoddsSpider(BaseSpider):
   name = "sbrforum.com"
   allowed_domains = ["sbrforum.com"]
   start_urls = [
       "http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/",
       "http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/",
       "http://www.sbrforum.com/mlb-baseball/odds-scores/20110330/"
   ]

   def parse(self, response):
       hxs = HtmlXPathSelector(response)
       sites = hxs.select('//div[@id="col_3"]//div[@id="module3_1"]//div[@id="moduleData4952"]')
       items = []
       for site in sites:
           item = MlboddsItem()
           item['header'] = site.select('//div[@class="scoreboard-bar"]//h2//span[position()>1]//text()').extract()# | /*//table[position()<2]//tr//th[@colspan="2"]//text()').extract()
           item['game1'] = site.select('/*//table[position()=1]//tr//td[@class="tbl-odds-c2"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c4"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c6"]//text()').extract()
           items.append(item)
       return items

The results are returned in a random order, for example it returns the 29th, then the 28th, then the 30th. I've tried changing the scheduler order from DFO to BFO, just in case that was the problem, but that didn't change anything.

Thanks in advance.

10条回答
几人难应
2楼-- · 2019-01-02 18:33

There is a much easier way to make scrapy follow the order of starts_url: you can just uncomment and change the concurrent requests in settings.py to 1.

Configure maximum concurrent requests performed by Scrapy (default: 16) 
CONCURRENT_REQUESTS = 1
查看更多
旧时光的记忆
3楼-- · 2019-01-02 18:36

start_urls defines urls which are used in start_requests method. Your parse method is called with a response for each start urls when the page is downloaded. But you cannot control loading times - the first start url might come the last to parse.

A solution -- override start_requests method and add to generated requests a meta with priority key. In parse extract this priority value and add it to the item. In the pipeline do something based in this value. (I don't know why and where you need these urls to be processed in this order).

Or make it kind of synchronous -- store these start urls somewhere. Put in start_urls the first of them. In parse process the first response and yield the item(s), then take next url from your storage and make a request for it with callback for parse.

查看更多
与风俱净
4楼-- · 2019-01-02 18:38

I doubt if it's possible to achieve what you want unless you play with scrapy internals. There are some similar discussions on scrapy google groups e.g.

http://groups.google.com/group/scrapy-users/browse_thread/thread/25da0a888ac19a9/1f72594b6db059f4?lnk=gst

One thing that can also help is setting CONCURRENT_REQUESTS_PER_SPIDER to 1, but it won't completely ensure the order either because the downloader has its own local queue for performance reasons, so the best you can do is prioritize the requests but not ensure its exact order.

查看更多
梦寄多情
5楼-- · 2019-01-02 18:47

The google group discussion suggests using priority attribute in Request object. Scrapy guarantees the urls are crawled in DFO by default. But it does not ensure that the urls are visited in the order they were yielded within your parse callback.

Instead of yielding Request objects you want to return an array of Requests from which objects will be popped till it is empty.

Can you try something like that?

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from mlbodds.items import MlboddsItem

class MLBoddsSpider(BaseSpider):
   name = "sbrforum.com"
   allowed_domains = ["sbrforum.com"]

   def start_requests(self):
       start_urls = reversed( [
           "http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/",
           "http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/",
           "http://www.sbrforum.com/mlb-baseball/odds-scores/20110330/"
       ] )

       return [ Request(url = start_url) for start_url in start_urls ]

   def parse(self, response):
       hxs = HtmlXPathSelector(response)
       sites = hxs.select('//div[@id="col_3"]//div[@id="module3_1"]//div[@id="moduleData4952"]')
       items = []
       for site in sites:
           item = MlboddsItem()
           item['header'] = site.select('//div[@class="scoreboard-bar"]//h2//span[position()>1]//text()').extract()# | /*//table[position()<2]//tr//th[@colspan="2"]//text()').extract()
           item['game1'] = site.select('/*//table[position()=1]//tr//td[@class="tbl-odds-c2"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c4"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c6"]//text()').extract()
           items.append(item)
       return items
查看更多
萌妹纸的霸气范
6楼-- · 2019-01-02 18:47

Personally I like @user1460015's implementation after I managed to have my own work around solution.

My solution is to use subprocess of Python to call scrapy url by url until all urls have been took care of.

In my code, if user does not specify he/she wants to parse the urls sequentially, we can start the spider in a normal way.

process = CrawlerProcess({'USER_AGENT': 'Mozilla/4.0 (compatible; \
    MSIE 7.0; Windows NT 5.1)'})
process.crawl(Spider, url = args.url)
process.start()

If a user specifies it needs to be done sequentially, we can do this:

for url in urls:
    process = subprocess.Popen('scrapy runspider scrapper.py -a url='\
        + url + ' -o ' + outputfile)
    process.wait()

Note that: this implementation does not handle errors.

查看更多
孤独寂梦人
7楼-- · 2019-01-02 18:48

Off course, you can control it. The top secret is the method how to feed the greedy Engine/Schedulor. You requirement is just a little one. Please see I add a list named "task_urls".

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http.request import Request
from dirbot.items import Website

class DmozSpider(BaseSpider):
   name = "dmoz"
   allowed_domains = ["sbrforum.com"]
   start_urls = [
       "http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/",
   ]
   task_urls = [
       "http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/",
       "http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/",
       "http://www.sbrforum.com/mlb-baseball/odds-scores/20110330/"
   ]
   def parse(self, response): 

       hxs = HtmlXPathSelector(response)
       sites = hxs.select('//div[@id="col_3"]//div[@id="module3_1"]//div[@id="moduleData4952"]')
       items = []
       for site in sites:
           item = Website()
           item['header'] = site.select('//div[@class="scoreboard-bar"]//h2//span[position()>1]//text()').extract()# | /*//table[position()<2]//tr//th[@colspan="2"]//text()').extract()
           item['game1'] = site.select('/*//table[position()=1]//tr//td[@class="tbl-odds-c2"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c4"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c6"]//text()').extract()
           items.append(item)
       # Here we feed add new request
       self.task_urls.remove(response.url)
       if self.task_urls:
           r = Request(url=self.task_urls[0], callback=self.parse)
           items.append(r)

       return items

If you want some more complicated case, please see my project: https://github.com/wuliang/TiebaPostGrabber

查看更多
登录 后发表回答