I have created a new Scrapy spider that is extremely slow. It only scrapes around two pages per second, whereas the other Scrapy crawlers that I have created have been crawling a lot faster.
I was wondering what could cause this issue and how I could possibly fix it. The code is not very different from my other spiders, and I'm not sure whether it is related to the issue, but I'll add it if you think it may be involved.
In fact, I have the impression that the requests are not being made asynchronously. I have never run into this kind of problem before, and I am fairly new to Scrapy.
EDIT
Here's the spider:
import scrapy

# Item is assumed to be the project's item class from its items module
class DatamineSpider(scrapy.Spider):
    name = "Datamine"
    allowed_domains = ["domain.com"]
    start_urls = (
        'http://www.example.com/en/search/results/smth/smth/r101/m2108m',
    )

    def parse(self, response):
        # Follow every listing link on the index page
        for href in response.css('.searchListing_details .search_listing_title .searchListing_title a::attr("href")'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_stuff)

        # Follow the pagination "next" link
        next_page = response.css('.pagination .next a::attr("href")')
        next_url = response.urljoin(next_page.extract()[0])
        yield scrapy.Request(next_url, callback=self.parse)

    def parse_stuff(self, response):
        item = Item()
        item['value'] = float(response.xpath('//*[text()="Price" and not(@class)]/../../div[2]/span/text()').extract()[0].split(' ')[1].replace(',', ''))
        item['size'] = float(response.xpath('//*[text()="Area" and not(@class)]/../../div[2]/text()').extract()[0].split(' ')[0].replace(',', '.'))
        try:
            item['yep'] = float(response.xpath('//*[text()="yep" and not(@class)]/../../div[2]/text()').extract()[0])
        except IndexError:
            print "NO YEP"
        else:
            yield item
There are only two likely causes, given that your spiders suggest you're quite careful/experienced:
- Your target site's response time is very high, i.e. the server responds slowly.
- Every index page links to only 1-2 listing pages (the ones that you parse with parse_stuff()).
Highly likely the latter is the reason. A response time of half a second is reasonable, which means that by following the pagination (next) link you will effectively be crawling 2 index pages per second. Since you're crawling (I guess) a single domain, your maximum concurrency will be ~ min(CONCURRENT_REQUESTS, CONCURRENT_REQUESTS_PER_DOMAIN), which is typically 8 with the default settings. But you won't be able to utilise that concurrency because you don't create listing URLs fast enough. If your .searchListing_details .search_listing_title .searchListing_title a::attr("href") expression yields just a single URL, the rate at which you create listing URLs is just 2/second, whereas to fully utilise your downloader at a concurrency level of 8 you should be creating at least 7 listing URLs per index page.
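To make the arithmetic concrete, here is a back-of-envelope sketch of that reasoning; the 0.5 s response time and the 1-2 links per index page are assumptions taken from the discussion above, not measured values:

# Back-of-envelope estimate of the crawl rate under the assumptions above.
CONCURRENT_REQUESTS = 16            # Scrapy default
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # Scrapy default

index_response_time = 0.5    # seconds per index page (assumed)
links_per_index_page = 1.5   # listing links found per index page (assumed)

# Only one domain is being crawled, so this is the usable concurrency.
effective_concurrency = min(CONCURRENT_REQUESTS, CONCURRENT_REQUESTS_PER_DOMAIN)

# The pagination chain is sequential: the next index page is only requested
# after the previous one has arrived.
index_pages_per_second = 1.0 / index_response_time
listing_urls_per_second = index_pages_per_second * links_per_index_page

print("effective concurrency: %d" % effective_concurrency)        # 8
print("index pages per second: %.1f" % index_pages_per_second)    # 2.0
print("listing URLs per second: %.1f" % listing_urls_per_second)  # 3.0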
The only good solution is to "shard" the index and start crawling e.g. multiple categories in parallel by setting many non-overlapping start_urls. For example, you might want to crawl TVs, washing machines, stereos or whatever other categories in parallel. If you have 4 such categories and Scrapy "clicks" their 'next' button 2 times a second for each of them, you will be crawling 8 index pages per second and generating listing URLs correspondingly faster, so, roughly speaking, you would utilise your downloader much better.
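For example, a category-sharded spider could declare something like this (the category paths below are invented for illustration; only the structure matters):

# Hypothetical category-sharded start URLs -- the category paths are made up;
# use your site's real category index pages instead.
start_urls = (
    'http://www.domain.com/en/search/results/tvs/smth/r101/m2108m',
    'http://www.domain.com/en/search/results/washing-machines/smth/r101/m2108m',
    'http://www.domain.com/en/search/results/stereos/smth/r101/m2108m',
    'http://www.domain.com/en/search/results/fridges/smth/r101/m2108m',
)
# parse() and parse_stuff() stay as they are: each start URL follows its own
# pagination chain, so the four "next" chains run in parallel.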
P.S. next_page.extract()[0] is equivalent to next_page.extract_first().
Update after discussing this offline: Yes... I don't see anything extra-weird on this website apart from the fact that it's slow (either due to throttling or due to their server capacity). Here are some specific tricks to go faster. Hit the indices 4x as fast by setting 4 start_urls instead of 1:
start_urls = (
    'http://www.domain.com/en/search/results/smth/sale/r176/m3685m',
    'http://www.domain.com/en/search/results/smth/smth/r176/m3685m/offset_200',
    'http://www.domain.com/en/search/results/smth/smth/r176/m3685m/offset_400',
    'http://www.domain.com/en/search/results/smth/smth/r176/m3685m/offset_600',
)
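If the offset pattern continues for deeper pages, you could also generate these URLs instead of hard-coding them; a small sketch, assuming the offset_N suffix and the step of 200 hold beyond what's shown above:

# Sketch: generate the offset-sharded start URLs programmatically.
# Assumes the site keeps the offset_N suffix with a step of 200.
BASE = 'http://www.domain.com/en/search/results/smth/smth/r176/m3685m'

start_urls = tuple(
    [BASE] + ['%s/offset_%d' % (BASE, offset) for offset in range(200, 800, 200)]
)
# -> the base URL plus offset_200, offset_400, offset_600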
Then use higher concurrency to allow more URLs to be loaded in parallel. Essentially "deactivate" CONCURRENT_REQUESTS_PER_DOMAIN by setting it to a large value, e.g. 1000, and then tune your concurrency by setting CONCURRENT_REQUESTS to 30. By default your concurrency is limited to 8 by CONCURRENT_REQUESTS_PER_DOMAIN, which, in your case where the response time for listing pages is >1.2 s, means a maximum crawling speed of about 6 listing pages per second. So call your spider like this:
scrapy crawl MySpider -s CONCURRENT_REQUESTS_PER_DOMAIN=1000 -s CONCURRENT_REQUESTS=30
and it should do better.
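If you'd rather not pass these on the command line for every run, the same values can be pinned in the spider itself via custom_settings; a minimal sketch, assuming the numbers suggested above (they may still need tuning for your site):

class DatamineSpider(scrapy.Spider):
    name = "Datamine"
    allowed_domains = ["domain.com"]

    # Same values as the -s flags above, kept with the spider so every run
    # picks them up automatically.
    custom_settings = {
        'CONCURRENT_REQUESTS': 30,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1000,
    }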
One more thing. I observe on your target site that you can get all the information you need, including Price, Area and yep, from the index pages themselves, without having to "hit" any listing pages. This would instantly 10x your crawling speed, since you wouldn't need to download all those listing pages with the for href... loop. Just parse the listings from the index page.
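A rough sketch of what that could look like. The row selector and the relative XPaths below are guesses based on your existing selectors, since only you can see the index page's markup; the point is merely that parse() yields items directly instead of scheduling a request per listing:

def parse(self, response):
    # Hypothetical: each search result row is assumed to be a
    # '.searchListing_details' block that already contains the fields.
    for row in response.css('.searchListing_details'):
        price_text = row.xpath('.//*[text()="Price" and not(@class)]/../../div[2]/span/text()').extract_first()
        area_text = row.xpath('.//*[text()="Area" and not(@class)]/../../div[2]/text()').extract_first()
        yep_text = row.xpath('.//*[text()="yep" and not(@class)]/../../div[2]/text()').extract_first()
        if price_text and area_text and yep_text:
            item = Item()
            item['value'] = float(price_text.split(' ')[1].replace(',', ''))
            item['size'] = float(area_text.split(' ')[0].replace(',', '.'))
            item['yep'] = float(yep_text)
            yield item

    # Still follow the pagination chain.
    next_href = response.css('.pagination .next a::attr("href")').extract_first()
    if next_href:
        yield scrapy.Request(response.urljoin(next_href), callback=self.parse)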