Scrapy Crawl URLs in Order (answers, page 2)

So, my problem is relatively simple. I have one spider crawling multiple sites, and I need it to return the data in the order I write it in my code. It's posted below.

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from mlbodds.items import MlboddsItem

class MLBoddsSpider(BaseSpider):
   name = "sbrforum.com"
   allowed_domains = ["sbrforum.com"]
   start_urls = [
       "http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/",
       "http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/",
       "http://www.sbrforum.com/mlb-baseball/odds-scores/20110330/"
   ]

   def parse(self, response):
       hxs = HtmlXPathSelector(response)
       sites = hxs.select('//div[@id="col_3"]//div[@id="module3_1"]//div[@id="moduleData4952"]')
       items = []
       for site in sites:
           item = MlboddsItem()
           item['header'] = site.select('//div[@class="scoreboard-bar"]//h2//span[position()>1]//text()').extract()# | /*//table[position()<2]//tr//th[@colspan="2"]//text()').extract()
           item['game1'] = site.select('/*//table[position()=1]//tr//td[@class="tbl-odds-c2"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c4"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c6"]//text()').extract()
           items.append(item)
       return items

The results are returned in a random order, for example it returns the 29th, then the 28th, then the 30th. I've tried changing the scheduler order from DFO to BFO, just in case that was the problem, but that didn't change anything.

Thanks in advance.

Tags: python order scrapy

10 answers

余生请多指教

#2 · 2019-01-02 18:49

I believe the hxs.select('...') call you make scrapes the data from each page in the order it appears there. Either that, or Scrapy is going through your start_urls in an arbitrary order. To force it to go through them in a predefined order (and mind you, this won't scale if you need to crawl more sites), you can try this:

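# These lines belong inside the spider class.
from scrapy.http import Request

start_urls = ["url1.html"]

# Scrapy calls parse() for start_urls by default, so the first callback uses that name.
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('blah')
    items = []
    for site in sites:
        item = MlboddsItem()
        item['header'] = site.select('blah').extract()
        item['game1'] = site.select('blah').extract()
        items.append(item)
    # list.append() returns None, so append the next Request first, then return the list.
    items.append(Request('url2.html', callback=self.parse2))
    return items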

Then write a parse2 that does the same thing but appends a Request for url3.html with callback=self.parse3, and so on. This is horrible coding style, but I'm just throwing it out there in case you need a quick hack.

深知你不懂我心

#3 · 2019-01-02 18:50

Scrapy's Request now has a priority attribute (see http://doc.scrapy.org/en/latest/topics/request-response.html#request-objects). If a callback yields several Requests and you want a particular one processed first, you can set its priority:

def parse(self, response):
    url = "http://www.example.com/first"
    yield Request(url=url, callback=self.parse_data, priority=1)
    url = "http://www.example.com/second"
    yield Request(url=url, callback=self.parse_data)

Scrapy will schedule the request with priority 1 first: the default priority is 0, and requests with higher values are scheduled earlier.
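
To apply this to the original question, one option is to give each start URL a priority that falls with its position in the list. The sketch below is plain Python with no Scrapy import; with_descending_priority is a hypothetical helper name, and the commented lines only suggest how it might be wired into a spider's start_requests(). Keep in mind that priority influences the order in which requests leave the scheduler; with several requests in flight at once, responses may still arrive in a different order, so limiting concurrency (for example CONCURRENT_REQUESTS = 1 in the settings) is probably needed as well.

```python
def with_descending_priority(urls):
    """Pair each URL with a priority so that earlier URLs get higher values.

    Hypothetical helper, not part of Scrapy.
    """
    total = len(urls)
    return [(url, total - index) for index, url in enumerate(urls)]


# Possible use inside a spider (assumes `from scrapy.http import Request`):
#
#     def start_requests(self):
#         for url, priority in with_descending_priority(self.start_urls):
#             yield Request(url, callback=self.parse, priority=priority)

pairs = with_descending_priority([
    "http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/",
    "http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/",
    "http://www.sbrforum.com/mlb-baseball/odds-scores/20110330/",
])
for url, priority in pairs:
    print(priority, url)
```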

深知你不懂我心

#4 · 2019-01-02 18:56

This solution is sequential, and it is similar to @wuliang's.

I started with @Alexis de Tréglodé's method but ran into a problem. The fact that your start_requests() method returns a list of Requests,
return [ Request(url = start_url) for start_url in start_urls ]
is what causes the output to be non-sequential (asynchronous).

If start_requests() returns a single Request instead, the remaining URLs can be kept in a separate other_urls list and requested one at a time, which meets the requirement. other_urls can also be extended with URLs scraped from other pages.

from scrapy import log
from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from practice.items import MlboddsItem

log.start()

class PracticeSpider(BaseSpider):
    name = "sbrforum.com"
    allowed_domains = ["sbrforum.com"]

    other_urls = [
            "http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/",
            "http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/",
            "http://www.sbrforum.com/mlb-baseball/odds-scores/20110330/",
           ]

    def start_requests(self):
        log.msg('Starting Crawl!', level=log.INFO)
        start_urls = "http://www.sbrforum.com/mlb-baseball/odds-scores/20110327/"
        return [Request(start_urls, meta={'items': []})]

    def parse(self, response):
        log.msg("Begin Parsing", level=log.INFO)
        log.msg("Response from: %s" % response.url, level=log.INFO)
        hxs = HtmlXPathSelector(response)
        sites = hxs.select("//*[@id='moduleData8460']")
        items = response.meta['items']
        for site in sites:
            item = MlboddsItem()
            item['header'] = site.select('//div[@class="scoreboard-bar"]//h2//span[position()>1]//text()').extract()
            item['game1'] = site.select('/*//table[position()=1]//tr//td[@class="tbl-odds-c2"]//text()').extract()
            items.append(item)

        # here we .pop(0) the next URL in line
        if self.other_urls:
            return Request(self.other_urls.pop(0), meta={'items': items})

        return items
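
The control flow of the spider above can be illustrated without Scrapy. The following is only a sketch: Request is a namedtuple stand-in for scrapy.http.Request, and ChainedSpider, run, and fetch are hypothetical names invented for the illustration. It shows why the order is preserved: only one request exists at a time, and each parse step either hands the accumulated items to the next request or returns them.

```python
from collections import namedtuple

# Stand-in for scrapy.http.Request with only the fields this sketch needs.
Request = namedtuple("Request", ["url", "meta"])


class ChainedSpider:
    def __init__(self, urls):
        self.pending = list(urls)  # URLs still to visit, in the desired order

    def start_request(self):
        # A single first request, carrying an empty list of items.
        return Request(self.pending.pop(0), {"items": []})

    def parse(self, request, body):
        items = request.meta["items"]
        items.append({"url": request.url, "body": body})
        if self.pending:
            # Pass the items collected so far on to the next request in line.
            return Request(self.pending.pop(0), {"items": items})
        return items


def run(spider, fetch):
    """Minimal stand-in for the crawl loop: one request in flight at a time."""
    result = spider.start_request()
    while isinstance(result, Request):
        result = spider.parse(result, fetch(result.url))
    return result


# `fetch` is a fake downloader here; it just upper-cases the URL.
items = run(ChainedSpider(["day1", "day2", "day3"]), fetch=lambda url: url.upper())
print([item["url"] for item in items])
```

The price of this design is that the crawl is fully serial, so it gives up the concurrency Scrapy would otherwise provide.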

何处买醉

#5 · 2019-01-02 18:57

Disclaimer: I haven't worked with Scrapy specifically.

The scraper may be queueing and requeueing requests based on timeouts and HTTP errors. Wouldn't it be a lot easier to get the date from the response page?

I.e. add another hxs.select statement that grabs the date (I just had a look, and it is definitely in the response data), add that to the item dict, and sort the items on it.

This is probably a more robust approach, rather than relying on order of scrapes...
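
A minimal sketch of that idea follows. It makes one substitution that should be stated plainly: because the page markup for the date is not shown in this thread, the date is taken from the YYYYMMDD segment of the URL (which a callback could read from response.url) rather than from an hxs.select on the page. The 'url' and 'date' keys and both function names are assumptions made for the illustration; they are not fields of MlboddsItem.

```python
import re
from datetime import datetime

# Matches a trailing eight-digit path segment such as /20110329/
DATE_IN_URL = re.compile(r"/(\d{8})/?$")


def date_from_url(url):
    """Return the date encoded as YYYYMMDD in the last path segment of `url`."""
    match = DATE_IN_URL.search(url)
    if match is None:
        raise ValueError("no YYYYMMDD segment in %r" % url)
    return datetime.strptime(match.group(1), "%Y%m%d").date()


def sort_items_by_date(items):
    """Return a new list of item dicts ordered by their 'date' key."""
    return sorted(items, key=lambda item: item["date"])


# Items in the arrival order described in the question: 29th, 28th, 30th.
arrived = [
    {"url": "http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/"},
    {"url": "http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/"},
    {"url": "http://www.sbrforum.com/mlb-baseball/odds-scores/20110330/"},
]
for item in arrived:
    item["date"] = date_from_url(item["url"])

ordered = sort_items_by_date(arrived)
print([item["date"].isoformat() for item in ordered])
```

Sorting after the crawl keeps the spider fully concurrent, at the cost of holding the items until all of them have arrived (for example in an item pipeline or a post-processing step).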

