Scrapy spider: difference between Crawled pages and Scraped items

Posted 2019-04-09 09:14

I'm writing a Scrapy CrawlSpider that reads a list of ads on the first page, takes some info like the listing thumbnails and ad URLs, then yields a request to each of those ad URLs to scrape their details.

It was working and paginating apparently fine in my test environment, but today, trying a complete run, I noticed this in the log:

Crawled 3852 pages (at 228 pages/min), scraped 256 items (at 15 items/min)

I don't understand the reason for this big difference between crawled pages and scraped items. Can anybody help me figure out where those items are getting lost?

My spider code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request

# Assumed: MyItem and MyItemLoader live in this project's items module
# (the question uses them without showing their definitions).
from myproject.items import MyItem, MyItemLoader


class MySpider(CrawlSpider):
    name = "myspider"
    allowed_domains = ["myspider.com", "myspider.co"]
    start_urls = [
        "http://www.myspider.com/offers/myCity/typeOfAd/?search=fast",
    ]

    # Pagination: follow every extracted link and parse each page
    rules = (
        Rule(SgmlLinkExtractor(), callback='parse_start_url', follow=True),
    )

    # First page: collect listing summaries and request each detail page
    def parse_start_url(self, response):
        hxs = HtmlXPathSelector(response)

        next_page = hxs.select("//a[@class='pagNext']/@href").extract()
        offers = hxs.select("//div[@class='hlist']")

        for offer in offers:
            item = MyItem()
            item['url'] = offer.select('.//span[@class="location"]/a/@href').extract()[0]
            item['thumb'] = offer.select('.//div[@class="itemFoto"]/div/a/img/@src').extract()[0]

            # Carry the partially filled item to the detail-page callback
            request = Request(item['url'], callback=self.second_page)
            request.meta['myItem'] = item
            yield request

        if next_page:
            yield Request(next_page[0], callback=self.parse_start_url)

    def second_page(self, response):
        item = response.meta['myItem']

        loader = MyItemLoader(item=item, response=response)
        loader.add_xpath('address', '//span[@itemprop="streetAddress"]/text()')
        return loader.load_item()

1 Answer

chillily · 2019-04-09 09:26

Let's say you go to your first start_urls page (you actually only have one) and on this page there is only one anchor link (<a>). So your spider crawls the href URL in this link and you get control in your callback, parse_start_url. Inside this page you have 5000 divs with an hlist class. And let's suppose all 5000 of those subsequent URLs come back 404, not found; a sketch for surfacing such failures follows the tallies below.

In this case you would have:

  • Pages crawled: 5001
  • Items scraped: 0
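One way to make those failed detail requests visible, instead of letting them silently shrink the item count, is to attach an errback to each Request. This is a minimal, hypothetical sketch of the pattern, not the asker's spider: the spider name and the detail URL are placeholders. Download errors (timeouts, DNS failures) go to the errback, and with the default HttpError middleware enabled, non-2xx responses such as 404 are routed there as well:

from scrapy.contrib.spiders import CrawlSpider
from scrapy.http import Request


class ErrbackDemoSpider(CrawlSpider):
    name = "errback_demo"  # placeholder name
    start_urls = ["http://www.myspider.com/offers/myCity/typeOfAd/?search=fast"]

    def parse_start_url(self, response):
        # Pretend we extracted one detail URL from the listing page and
        # attach an errback so a failure gets logged rather than lost.
        yield Request("http://www.myspider.com/offers/some-detail-page",
                      callback=self.parse_detail,
                      errback=self.on_detail_error)

    def parse_detail(self, response):
        self.log("Got detail page: %s" % response.url)

    def on_detail_error(self, failure):
        # Every 404, timeout or DNS error on a detail request lands here,
        # which explains crawled pages that never become scraped items.
        self.log("Detail request failed: %s" % repr(failure))

In the asker's spider, the same errback=... argument could simply be added to the Request built inside parse_start_url.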

Let's take another example: on your start URL page you have 5000 anchors, but none (as in zero) of those pages have any divs with a class attribute of hlist; a logging guard for this case follows the tallies below.

In this case you would have:

  • Pages crawled: 5001
  • Items scraped: 0
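To catch this second case, you can log whenever a crawled page matches no offers at all. A minimal sketch of a guard that could sit at the top of the question's parse_start_url (only the if block is new; the loop body is elided and mirrors the question):

def parse_start_url(self, response):
    hxs = HtmlXPathSelector(response)
    offers = hxs.select("//div[@class='hlist']")

    if not offers:
        # Crawled, but contributing zero items: counting these log
        # lines accounts for the crawled-vs-scraped gap.
        self.log("No 'hlist' offers found on %s" % response.url)

    for offer in offers:
        # ... build and yield the item exactly as in the question ...
        pass

Counting those log lines against the crawled-page total tells you how many pages your item selector simply never matched.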

Your answer lies in the DEBUG log output: it records the status of every crawled page and every item scraped from one, so you can see which of the two scenarios above is happening.
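To get that output, raise the log level and write it to a file; LOG_LEVEL and LOG_FILE are standard Scrapy settings:

# settings.py
LOG_LEVEL = 'DEBUG'     # log every 'Crawled (status)' line per request
LOG_FILE = 'crawl.log'  # write to a file so the log can be searched

Then search the file for Crawled (404) (or other non-200 statuses) and compare against the DEBUG: Scraped from ... lines to see where the pages stop turning into items.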
