So, my problem is relatively simple. I have one spider crawling multiple sites, and I need it to return the data in the same order as I write the URLs in my code. My code is posted below.
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from mlbodds.items import MlboddsItem

class MLBoddsSpider(BaseSpider):
    name = "sbrforum.com"
    allowed_domains = ["sbrforum.com"]
    start_urls = [
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/",
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/",
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110330/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@id="col_3"]//div[@id="module3_1"]//div[@id="moduleData4952"]')
        items = []
        for site in sites:
            item = MlboddsItem()
            item['header'] = site.select('//div[@class="scoreboard-bar"]//h2//span[position()>1]//text()').extract()  # | /*//table[position()<2]//tr//th[@colspan="2"]//text()').extract()
            item['game1'] = site.select('/*//table[position()=1]//tr//td[@class="tbl-odds-c2"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c4"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c6"]//text()').extract()
            items.append(item)
        return items
The results are returned in a random order; for example, it returns the 29th, then the 28th, then the 30th. I've tried changing the scheduler order from DFO to BFO, in case that was the problem, but that didn't change anything.
Thanks in advance.
I believe the select() calls you make will scrape the data from the site in the order it appears. Either that, or Scrapy is going through your start_urls in an arbitrary order. To force it to go through them in a predefined order (and mind you, this won't work if you need to crawl many more sites), you can chain the requests: start with only the first URL, have parse() append a Request for url2.html with callback=self.parse2, then write a parse2 that does the same thing but appends a Request for url3.html with callback=self.parse3. This is horrible coding style, but I'm just throwing it out in case you need a quick hack.
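A minimal sketch of that chaining hack, assuming the question's three URLs and the old BaseSpider API; extract_items() is a hypothetical stub standing in for the existing selection logic:

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from mlbodds.items import MlboddsItem

class ChainedOddsSpider(BaseSpider):
    name = "sbrforum.com.chained"  # hypothetical spider name
    allowed_domains = ["sbrforum.com"]
    # Start from the first date only; later dates are chained in explicitly
    start_urls = ["http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/"]

    def parse(self, response):
        items = self.extract_items(response)
        # The next page is only requested after this one has been parsed
        items.append(Request("http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/",
                             callback=self.parse2))
        return items

    def parse2(self, response):
        items = self.extract_items(response)
        items.append(Request("http://www.sbrforum.com/mlb-baseball/odds-scores/20110330/",
                             callback=self.parse3))
        return items

    def parse3(self, response):
        return self.extract_items(response)

    def extract_items(self, response):
        # Stub: put the hxs.select(...) logic from the question here
        hxs = HtmlXPathSelector(response)
        items = []
        # ... build MlboddsItem objects from hxs ...
        return items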
Scrapy's Request has a priority attribute now: http://doc.scrapy.org/en/latest/topics/request-response.html#request-objects If you yield many Requests in a function and want a particular request processed first, you can set its priority:

def parse(self, response):
    url = 'http://www.example.com/first'
    yield Request(url=url, callback=self.parse_data, priority=1)

    url = 'http://www.example.com/second'
    yield Request(url=url, callback=self.parse_data)
Scrapy will process the one with priority 1 first.
The solution is sequential. This solution is similar to @wuliang's. I started with @Alexis de Tréglodé's method but reached a problem: the fact that your start_requests() method returns a list of URLs,

return [Request(url=start_url) for start_url in start_urls]

is causing the output to be non-sequential (asynchronous). If the return is a single Request instead, then keeping an alternative other_urls list, and popping the next URL from it only after the current response has been parsed, fulfills the requirement. Also, other_urls can be used to add in URLs scraped from other webpages.
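A sketch of that approach, assuming the question's URLs (the spider name is hypothetical and the item-extraction logic is elided):

from scrapy.spider import BaseSpider
from scrapy.http import Request

class SequentialOddsSpider(BaseSpider):
    name = "sbrforum.com.sequential"  # hypothetical spider name
    allowed_domains = ["sbrforum.com"]
    start_urls = ["http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/"]
    # Remaining URLs, in the order they should be crawled
    other_urls = [
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/",
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110330/",
    ]

    def parse(self, response):
        # ... process the response and yield items here, as in the question ...
        if self.other_urls:
            # pop(0) preserves the written order; the next Request is only
            # scheduled once this response has been handled
            yield Request(self.other_urls.pop(0), callback=self.parse)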
Disclaimer: I haven't worked with Scrapy specifically.

The scraper may be queueing and requeueing requests based on timeouts and HTTP errors; it would be a lot easier if you could get at the date from the response page itself. I.e., add another hxs.select statement that grabs the date (just had a look, it is definitely in the response data), add that to the item dict, and sort the items on it afterwards.

This is probably a more robust approach than relying on the order of the scrapes...
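A sketch of that idea. Rather than guessing at the page markup, this variant pulls the date from response.url (each of the question's URLs ends in its date, e.g. .../odds-scores/20110328/), and it assumes an extra date field has been added to MlboddsItem:

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//div[@id="col_3"]//div[@id="module3_1"]//div[@id="moduleData4952"]')
    items = []
    for site in sites:
        item = MlboddsItem()
        # Each URL ends in its date, e.g. .../odds-scores/20110328/
        item['date'] = response.url.rstrip('/').rsplit('/', 1)[-1]
        # ... the header/game1 selects from the question go here ...
        items.append(item)
    return items

Once the crawl finishes, the exported items can be sorted on that field, e.g. sorted(all_items, key=lambda i: i['date']).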