Scrapy Crawl URLs in Order

2019-01-02 18:28发布

So, my problem is relatively simple. I have one spider crawling multiple sites, and I need it to return the data in the order I write it in my code. It's posted below.

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from mlbodds.items import MlboddsItem

class MLBoddsSpider(BaseSpider):
   name = "sbrforum.com"
   allowed_domains = ["sbrforum.com"]
   start_urls = [
       "http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/",
       "http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/",
       "http://www.sbrforum.com/mlb-baseball/odds-scores/20110330/"
   ]

   def parse(self, response):
       hxs = HtmlXPathSelector(response)
       sites = hxs.select('//div[@id="col_3"]//div[@id="module3_1"]//div[@id="moduleData4952"]')
       items = []
       for site in sites:
           item = MlboddsItem()
           item['header'] = site.select('//div[@class="scoreboard-bar"]//h2//span[position()>1]//text()').extract()# | /*//table[position()<2]//tr//th[@colspan="2"]//text()').extract()
           item['game1'] = site.select('/*//table[position()=1]//tr//td[@class="tbl-odds-c2"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c4"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c6"]//text()').extract()
           items.append(item)
       return items

The results are returned in a random order, for example it returns the 29th, then the 28th, then the 30th. I've tried changing the scheduler order from DFO to BFO, just in case that was the problem, but that didn't change anything.

Thanks in advance.

10条回答
余生请多指教
2楼-- · 2019-01-02 18:49

I believe the

hxs.select('...')

you make will scrape the data from the site in the order it appears. Either that or scrapy is going through your start_urls in an arbitrary order. To force it to go through them in a predefined order, and mind you, this won't work if you need to crawl more sites, then you can try this:

start_urls = ["url1.html"]

def parse1(self, response):
    hxs = HtmlXPathSelector(response)
   sites = hxs.select('blah')
   items = []
   for site in sites:
       item = MlboddsItem()
       item['header'] = site.select('blah')
       item['game1'] = site.select('blah')
       items.append(item)
   return items.append(Request('url2.html', callback=self.parse2))

then write a parse2 that does the same thing but appends a Request for url3.html with callback=self.parse3. This is horrible coding style, but I'm just throwing it out in case you need a quick hack.

查看更多
深知你不懂我心
3楼-- · 2019-01-02 18:50

Scrapy 'Request' has a priority attribute now.http://doc.scrapy.org/en/latest/topics/request-response.html#request-objects If you have many 'Request' in a function and want to process a particular request first, you can set

def parse(self,response): url = http://www.example.com/first yield Request(url=url,callback = self.parse_data,priority=1) url = http://www.example.com/second yield Request(url=url,callback = self.parse_data)

Scrapy will process the one with priority 1 first.

查看更多
深知你不懂我心
4楼-- · 2019-01-02 18:56

The solution is sequential.
This solution is similar to @wuliang

I started with @Alexis de Tréglodé method but reached a problem:
The fact that your start_requests() method returns a list of URLS
return [ Request(url = start_url) for start_url in start_urls ]
is causing the output to be non-sequential (asynchronous)

If the return is a single response then by creating an alternative other_urls can fulfill the requirements. Also, other_urls can be used to add-into URLs scraped from other webpages.

from scrapy import log
from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from practice.items import MlboddsItem

log.start()

class PracticeSpider(BaseSpider):
    name = "sbrforum.com"
    allowed_domains = ["sbrforum.com"]

    other_urls = [
            "http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/",
            "http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/",
            "http://www.sbrforum.com/mlb-baseball/odds-scores/20110330/",
           ]

    def start_requests(self):
        log.msg('Starting Crawl!', level=log.INFO)
        start_urls = "http://www.sbrforum.com/mlb-baseball/odds-scores/20110327/"
        return [Request(start_urls, meta={'items': []})]

    def parse(self, response):
        log.msg("Begin Parsing", level=log.INFO)
        log.msg("Response from: %s" % response.url, level=log.INFO)
        hxs = HtmlXPathSelector(response)
        sites = hxs.select("//*[@id='moduleData8460']")
        items = response.meta['items']
        for site in sites:
            item = MlboddsItem()
            item['header'] = site.select('//div[@class="scoreboard-bar"]//h2//span[position()>1]//text()').extract()
            item['game1'] = site.select('/*//table[position()=1]//tr//td[@class="tbl-odds-c2"]//text()').extract()
            items.append(item)

        # here we .pop(0) the next URL in line
        if self.other_urls:
            return Request(self.other_urls.pop(0), meta={'items': items})

        return items
查看更多
何处买醉
5楼-- · 2019-01-02 18:57

Disclaimer: haven't worked with scrapy specifically

The scraper may be queueing and requeueing requests based on timeouts and HTTP errors, it would be a lot easier if you can get at the date from the response page?

I.e. add another hxs.select statement that grabs the date (just had a look, it is definitely in the response data), and add that to the item dict, sort items based on that.

This is probably a more robust approach, rather than relying on order of scrapes...

查看更多
登录 后发表回答