Help! Reading the source code of Scrapy is not easy for me. I have a very long start_urls list, about 3,000,000 URLs stored in a file, so I build start_urls like this:
    import codecs

    def read_urls_from_file(file_path):
        with codecs.open(file_path, u"r", encoding=u"GB18030") as f:
            for line in f:
                try:
                    url = line.strip()
                    yield url
                except Exception:
                    print u"read line:%s from file failed!" % line
                    continue
        print u"file read finish!"

    start_urls = read_urls_from_file(u"XXXX")   # used as the spider's start_urls
Meanwhile, my spider's callback functions look like this:
    def parse(self, response):
        self.log("Visited %s" % response.url)
        return Request(url="http://www.baidu.com", callback=self.just_test1)

    def just_test1(self, response):
        self.log("Visited %s" % response.url)
        return Request(url="http://www.163.com", callback=self.just_test2)

    def just_test2(self, response):
        self.log("Visited %s" % response.url)
        return []
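Putting it together, the whole spider is roughly like this (the class name and spider name are placeholders, not my real values, and read_urls_from_file is the helper defined above in the same module):

    from scrapy.spider import BaseSpider   # newer Scrapy versions call this scrapy.Spider
    from scrapy.http import Request

    class MySpider(BaseSpider):
        name = "my_spider"   # placeholder name

        # generator over the 3,000,000 URLs in the file
        start_urls = read_urls_from_file(u"XXXX")

        def parse(self, response):
            self.log("Visited %s" % response.url)
            return Request(url="http://www.baidu.com", callback=self.just_test1)

        def just_test1(self, response):
            self.log("Visited %s" % response.url)
            return Request(url="http://www.163.com", callback=self.just_test2)

        def just_test2(self, response):
            self.log("Visited %s" % response.url)
            return []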
My questions are:

- In what order does the downloader use the URLs? Will the requests made by just_test1 and just_test2 be handled by the downloader only after all of the start_urls have been used? (I have run some tests, and the answer seems to be No.)
- What decides the order? Why is it this order, and how can we control it? (To make this concrete, see the sketch after this list.)
- Is this a good way to deal with so many URLs that are already in a file? What else could I do?
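For the second question, the only knob I have found so far is the priority argument of Request; I am guessing that a higher value means the request is scheduled earlier, but I have not verified that this is the intended way to control the order:

    def just_test1(self, response):
        self.log("Visited %s" % response.url)
        # priority is an existing Request argument; my (unverified) understanding
        # is that requests with a higher priority are dequeued first
        return Request(url="http://www.163.com",
                       callback=self.just_test2,
                       priority=10)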
Thank you very much!!!
Thanks for the answers. But I am still a bit confused: by default, Scrapy uses a LIFO queue for storing pending requests.
- The requests made by the spiders' callback functions are given to the scheduler. Who does the same thing for the start_urls requests? The spider's start_requests() function only generates an iterator, without giving the real requests.
- Will all the requests (start_urls' and the callbacks') be in the same request queue? How many queues are there in Scrapy? (My attempted settings change is sketched below.)
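From the answers I gather that this LIFO behaviour comes from the scheduler's queues, and that they can apparently be swapped via the SCHEDULER_MEMORY_QUEUE / SCHEDULER_DISK_QUEUE settings. Is something like the following (in settings.py) the right way to get FIFO, breadth-first order? The module path is my guess and may differ between Scrapy versions:

    # settings.py -- my unverified attempt at making the scheduler FIFO
    # (newer Scrapy versions spell the module "scrapy.squeues")
    SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'
    SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'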