Twisted Python Failure - Scrapy Issues

2020-03-05 03:21发布

I am trying to use SCRAPY to scrape this website's search reqults for any search query - http://www.bewakoof.com .

The website uses AJAX (in the form of XHR) to display the search results. I managed to track the XHR, and you notice it in my code as below (inside the for loop, wherein i am storing the URL to temp, and incrementing 'i' in the loop)-:

from twisted.internet import reactor
from scrapy.crawler import CrawlerProcess, CrawlerRunner
import scrapy
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
from scrapy.settings import Settings
import datetime
from multiprocessing import Process, Queue
import os
from scrapy.http import Request
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher
from scrapy.signalmanager import SignalManager
import re

query='shirt'
query1=query.replace(" ", "+")  


class DmozItem(scrapy.Item):

    productname = scrapy.Field()
    product_link = scrapy.Field()
    current_price = scrapy.Field()
    mrp = scrapy.Field()
    offer = scrapy.Field()
    imageurl = scrapy.Field()
    outofstock_status = scrapy.Field()


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["http://www.bewakoof.com"]

    def start_requests(self):

        task_urls = [
        ]
        i=1
        for i in range(1,2):
            temp=( "http://www.bewakoof.com/search/searchload/search_text/" + query + "/page_num/" + str(i) )
            task_urls.append(temp)
            i=i+1

        start_urls = (task_urls)
        p=len(task_urls)
        print 'hi'
        return [ Request(url = start_url) for start_url in start_urls ]
        print 'hi'

    def parse(self, response):
        print 'hi'
        print response
        items = []
        for sel in response.xpath('//html/body/div[@class="main-div-of-product-item"]'):
            item = DmozItem()
            item['productname'] = str(sel.xpath('div[1]/span[@class="lazyImage"]/span[1]/a/img[@id="main_image"]/@title').extract())[17:-6]
            item['product_link'] = "http://www.bewakoof.com"+str(sel.xpath('div[1]/span[@class="lazyImage"]/span[1]/a/img[@id="main_image"]/@href').extract())[3:-2]
            item['current_price']='Rs. ' + str(sel.xpath('div[1]/div[@class="product_info"]/div[@class="product_price_nomrp"]/span[1]/text()').extract())[3:-2]

            item['mrp'] = item['current_price']

            item['offer'] = str('No additional offer available')

            item['imageurl'] = str(sel.xpath('div[1]/span[@class="lazyImage"]/span[1]/a/img[@id="main_image"]/@data-original').extract())[3:-2]
            item['outofstock_status'] = str('In Stock')
            items.append(item)


spider1 = DmozSpider()
settings = Settings()
settings.set("PROJECT", "dmoz")
settings.set("DOWNLOAD_DELAY" , 5)
crawler = CrawlerProcess(settings)
crawler.crawl(spider1)
crawler.start()

Now, as I execute this, I get unexpected errors-:

2015-07-09 11:46:01 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-07-09 11:46:01 [scrapy] INFO: Optional features available: ssl, http11
2015-07-09 11:46:01 [scrapy] INFO: Overridden settings: {'DOWNLOAD_DELAY': 5}
2015-07-09 11:46:02 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-07-09 11:46:02 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-07-09 11:46:02 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-07-09 11:46:02 [scrapy] INFO: Enabled item pipelines: 
hi
2015-07-09 11:46:02 [scrapy] INFO: Spider opened
2015-07-09 11:46:02 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-07-09 11:46:02 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-07-09 11:46:03 [scrapy] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-09 11:46:09 [scrapy] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-09 11:46:13 [scrapy] DEBUG: Gave up retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-09 11:46:13 [scrapy] ERROR: Error downloading <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1>: [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-09 11:46:13 [scrapy] INFO: Closing spider (finished)
2015-07-09 11:46:13 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
 'downloader/exception_type_count/twisted.web._newclient.ResponseFailed': 3,
 'downloader/request_bytes': 780,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 7, 9, 6, 16, 13, 793446),
 'log_count/DEBUG': 4,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2015, 7, 9, 6, 16, 2, 890066)}
2015-07-09 11:46:13 [scrapy] INFO: Spider closed (finished)

If you correctly see my code, I have also set the DOWNLOAD_DELAY=5, still it gives the same errors as to when I didn't keep it. I also increased DOWNLOAD_DELAY=10, still it gives the same errors. I have read many questions related to this on Stack Overflow, also on GitHub , but none of them seem to help.

I read in one ofthe answers, that TOR with Polipo, can help. But, I am a bit doubtful for using it, because I don't know whether is it legal to use the combination of TOR with Polipo to scrape websites using Scrapy? (I don't want to run into trouble with any legal issues.) That is the reason why I didn't prefer to use it. So, in case if it is legal, please provide the code for my SPECIFIC CASE, using TOR and POLIPO.

Or rather, if that is illegal, Help me resolve it without using them.

Please help me resolve these errors!

EDIT:

This is my updated code-:

from twisted.internet import reactor
from scrapy.crawler import CrawlerProcess, CrawlerRunner
import scrapy
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
from scrapy.settings import Settings
import datetime
from multiprocessing import Process, Queue
import os
from scrapy.http import Request
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher
from scrapy.signalmanager import SignalManager
import re

query='shirt'
query1=query.replace(" ", "+")  


class DmozItem(scrapy.Item):

    productname = scrapy.Field()
    product_link = scrapy.Field()
    current_price = scrapy.Field()
    mrp = scrapy.Field()
    offer = scrapy.Field()
    imageurl = scrapy.Field()
    outofstock_status = scrapy.Field()




class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["http://www.bewakoof.com"]

    def _monkey_patching_HTTPClientParser_statusReceived(self):

        from scrapy.xlib.tx._newclient import HTTPClientParser, ParseError
        old_sr = HTTPClientParser.statusReceived
        def statusReceived(self, status):
            try:
                return old_sr(self, status)
            except ParseError, e:
                if e.args[0] == 'wrong number of parts':
                    return old_sr(self, status + ' OK')
                raise
        statusReceived.__doc__ == old_sr.__doc__
        HTTPClientParser.statusReceived = statusReceived




    def start_requests(self):

        task_urls = [
        ]
        i=1
        for i in range(1,2):
            temp = "http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1"
            task_urls.append(temp)
            i=i+1

        start_urls = (task_urls)
        p=len(task_urls)
        print 'hi'
        self._monkey_patching_HTTPClientParser_statusReceived()
        return [ Request(url = start_url) for start_url in start_urls ]
        print 'hi'

    def parse(self, response):
        print 'hi'
        print response
        items = []
        for sel in response.xpath('//html/body/div[@class="main-div-of-product-item"]'):
            item = DmozItem()
            item['productname'] = str(sel.xpath('div[1]/span[@class="lazyImage"]/span[1]/a/img[@id="main_image"]/@title').extract())[17:-6]
            item['product_link'] = "http://www.bewakoof.com"+str(sel.xpath('div[1]/span[@class="lazyImage"]/span[1]/a/img[@id="main_image"]/@href').extract())[3:-2]
            item['current_price']='Rs. ' + str(sel.xpath('div[1]/div[@class="product_info"]/div[@class="product_price_nomrp"]/span[1]/text()').extract())[3:-2]

            item['mrp'] = item['current_price']

            item['offer'] = str('No additional offer available')

            item['imageurl'] = str(sel.xpath('div[1]/span[@class="lazyImage"]/span[1]/a/img[@id="main_image"]/@data-original').extract())[3:-2]
            item['outofstock_status'] = str('In Stock')
            items.append(item)

        print (items)

spider1 = DmozSpider()
settings = Settings()
settings.set("PROJECT", "dmoz")
settings.set("DOWNLOAD_DELAY" , 5)
crawler = CrawlerProcess(settings)
crawler.crawl(spider1)
crawler.start()

And this is my updated output, as displayed on the terminal-:

2015-07-10 13:06:00 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-07-10 13:06:00 [scrapy] INFO: Optional features available: ssl, http11
2015-07-10 13:06:00 [scrapy] INFO: Overridden settings: {'DOWNLOAD_DELAY': 5}
2015-07-10 13:06:01 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-07-10 13:06:01 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-07-10 13:06:01 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-07-10 13:06:01 [scrapy] INFO: Enabled item pipelines: 
hi
2015-07-10 13:06:01 [scrapy] INFO: Spider opened
2015-07-10 13:06:01 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-07-10 13:06:01 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-07-10 13:06:02 [scrapy] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-10 13:06:08 [scrapy] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-10 13:06:12 [scrapy] DEBUG: Gave up retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-10 13:06:12 [scrapy] ERROR: Error downloading <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1>: [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-10 13:06:13 [scrapy] INFO: Closing spider (finished)
2015-07-10 13:06:13 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
 'downloader/exception_type_count/twisted.web._newclient.ResponseFailed': 3,
 'downloader/request_bytes': 780,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 7, 10, 7, 36, 13, 11023),
 'log_count/DEBUG': 4,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2015, 7, 10, 7, 36, 1, 114912)}
2015-07-10 13:06:13 [scrapy] INFO: Spider closed (finished)

So, as you see the errors are still the same! :( . So, please help me resolve this!

UPDATED-:

This the output when I try to catch the exception that @JoeLinux suggested to do-:

>>> try:
...     fetch("http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1")
... except Exception as e:
...     e
... 
2015-07-10 17:51:13 [scrapy] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-10 17:51:14 [scrapy] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-10 17:51:15 [scrapy] DEBUG: Gave up retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
ResponseFailed([<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>],)
>>> print e.reasons[0].getTraceback()
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/internet/posixbase.py", line 614, in _doReadOrWrite
    why = selectable.doRead()
  File "/usr/lib/python2.7/dist-packages/twisted/internet/tcp.py", line 214, in doRead
    return self._dataReceived(data)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/tcp.py", line 220, in _dataReceived
    rval = self.protocol.dataReceived(data)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/endpoints.py", line 114, in dataReceived
    return self._wrappedProtocol.dataReceived(data)
--- <exception caught here> ---
  File "/usr/lib/python2.7/dist-packages/twisted/web/_newclient.py", line 1523, in dataReceived
    self._parser.dataReceived(bytes)
  File "/usr/lib/python2.7/dist-packages/twisted/web/_newclient.py", line 382, in dataReceived
    HTTPParser.dataReceived(self, data)
  File "/usr/lib/python2.7/dist-packages/twisted/protocols/basic.py", line 571, in dataReceived
    why = self.lineReceived(line)
  File "/usr/lib/python2.7/dist-packages/twisted/web/_newclient.py", line 271, in lineReceived
    self.statusReceived(line)
  File "/usr/lib/python2.7/dist-packages/twisted/web/_newclient.py", line 409, in statusReceived
    raise ParseError("wrong number of parts", status)
twisted.web._newclient.ParseError: ('wrong number of parts', 'HTTP/1.1 500')

2条回答
虎瘦雄心在
2楼-- · 2020-03-05 03:22

I got the same error

[<twisted.python.failure.Failure twisted.web._newclient.ParseError: (u'wrong number of parts', 'HTTP/1.1 302')>]

and now it works.

I think you could try this:

  • in method _monkey_patching_HTTPClientParser_statusReceived, change from scrapy.xlib.tx._newclient import HTTPClientParser, ParseError to from twisted.web._newclient import HTTPClientParser, ParseError;

  • in method start_requests, call _monkey_patching_HTTPClientParser_statusReceived for every request in start_urls, for example: def start_requests(self): for url in self.start_urls: self._monkey_patching_HTTPClientParser_statusReceived() yield Request(url, dont_filter=True)

Hope it helps.

查看更多
迷人小祖宗
3楼-- · 2020-03-05 03:29

I was able to replicate your situation in scrapy shell. Here is the error I received in the interactive shell:

$ scrapy shell 
...
>>> try:
>>>    fetch("http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1")
>>> except Exception as e:
>>>    e
2015-07-09 13:53:37-0400 [default] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-09 13:53:38-0400 [default] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-09 13:53:38-0400 [default] DEBUG: Gave up retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
>>> print e.reasons[0].getTraceback()
...
twisted.web._newclient.ParseError: ('wrong number of parts', 'HTTP/1.1 500')

Note that where I put ..., there are lines of text that aren't as important. That last line shows "wrong number of parts". After a little googling, I found this issue:

Error download page: twisted.python.failure.Failure 'scrapy.xlib.tx._newclient.ParseError'

The best that was suggested there was a monkeypatch. Read through the thread and give that a shot.

查看更多
登录 后发表回答