How to use Scrapy with an internet connection through a proxy with authentication

My internet connection goes through a proxy that requires authentication. When I try to run the Scrapy library on even the simplest example, for instance:

scrapy shell http://stackoverflow.com

Everything looks fine until I request something with an XPath selector; then the response is the following:

>>> hxs.select('//title')
[<HtmlXPathSelector xpath='//title' data=u'<title>ERROR: Cache Access Denied</title'>]

And if I try to run any spider created inside a project, I get the following error:

C:\Users\Victor\Desktop\test\test>scrapy crawl test
2012-08-11 17:38:02-0400 [scrapy] INFO: Scrapy 0.16.5 started (bot: test)
2012-08-11 17:38:02-0400 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetCon
sole, CloseSpider, WebService, CoreStats, SpiderState
2012-08-11 17:38:02-0400 [scrapy] DEBUG: Enabled downloader middlewares: HttpAut
hMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, De
faultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpProxyMiddlewa
re, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-08-11 17:38:02-0400 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMi
ddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddle
ware
2012-08-11 17:38:02-0400 [scrapy] DEBUG: Enabled item pipelines:
2012-08-11 17:38:02-0400 [test] INFO: Spider opened
2012-08-11 17:38:02-0400 [test] INFO: Crawled 0 pages (at 0 pages/min), scraped
0 items (at 0 items/min)
2012-08-11 17:38:02-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:602
4
2012-08-11 17:38:02-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6081
2012-08-11 17:38:47-0400 [test] DEBUG: Retrying <GET http://automation.whatismyi
p.com/n09230945.asp> (failed 1 times): TCP connection timed out: 10060: Se produ
jo un error durante el intento de conexión ya que la parte conectada no respondi
ó adecuadamente tras un periodo de tiempo, o bien se produjo un error en la cone
xión establecida ya que el host conectado no ha podido responder..
2012-08-11 17:39:02-0400 [test] INFO: Crawled 0 pages (at 0 pages/min), scraped
0 items (at 0 items/min)
...
2012-08-11 17:39:29-0400 [test] INFO: Closing spider (finished)
2012-08-11 17:39:29-0400 [test] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
  'downloader/exception_type_count/twisted.internet.error.TCPTimedOutError': 3,
     'downloader/request_bytes': 732,
     'downloader/request_count': 3,
     'downloader/request_method_count/GET': 3,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2012, 8, 11, 21, 39, 29, 908000),
     'log_count/DEBUG': 9,
     'log_count/ERROR': 1,
     'log_count/INFO': 5,
     'scheduler/dequeued': 3,
     'scheduler/dequeued/memory': 3,
     'scheduler/enqueued': 3,
     'scheduler/enqueued/memory': 3,
     'start_time': datetime.datetime(2012, 8, 11, 21, 38, 2, 876000)}
2012-08-11 17:39:29-0400 [test] INFO: Spider closed (finished)

It appears that my proxy is the problem. If anybody knows a way to use Scrapy with an authenticating proxy, please let me know.

Answer 1:

Scrapy supports proxies by using HttpProxyMiddleware:

This middleware sets the HTTP proxy to use for requests, by setting the proxy meta value to Request objects. Like the Python standard library modules urllib and urllib2, it obeys the following environment variables:

http_proxy

https_proxy

no_proxy
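
As a rough illustration of how these variables are read, the sketch below sets http_proxy and https_proxy for the current process and inspects them with the standard library's urllib.request.getproxies(). It is written for Python 3; the question's Scrapy 0.16 ran on Python 2, where the equivalent function is urllib.getproxies(). The host proxy.example.com, the port 3128, the USERNAME:PASSWORD credentials, and the helper names are placeholders chosen for illustration, not values from the question. Embedding credentials in the proxy URL is a common convention, but whether a given Scrapy version turns them into a Proxy-Authorization header should be checked against its HttpProxyMiddleware documentation.

```python
import os
from urllib.parse import urlsplit
from urllib.request import getproxies

# Placeholder proxy URL (assumption): replace host, port and credentials.
PROXY_URL = "http://USERNAME:PASSWORD@proxy.example.com:3128"


def set_proxy_env(proxy_url):
    """Export the proxy for HTTP and HTTPS requests made by this process."""
    os.environ["http_proxy"] = proxy_url
    os.environ["https_proxy"] = proxy_url


def describe_proxy(scheme):
    """Return the proxy parts that urllib picks up for the given scheme."""
    parts = urlsplit(getproxies()[scheme])
    return {
        "host": parts.hostname,
        "port": parts.port,
        "username": parts.username,
        "password": parts.password,
    }


if __name__ == "__main__":
    set_proxy_env(PROXY_URL)
    print(describe_proxy("http"))
```

Setting the variables in the shell before running scrapy crawl (with set on Windows or export on Unix-like systems) has the same effect for the whole process.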

Also see:

Using Scrapy with proxies
Enabling HttpProxyMiddleware in scrapyd

Answer 2:

Repeating the answer by Mahmoud M. Abdel-Fattah, because the page is not available now. Credit goes to him; I only made slight modifications.

If middlewares.py already exists, add the following code to it.

import base64

class ProxyMiddleware(object):
    # Override process_request so every outgoing request goes through the proxy
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"

        # Use the following lines if your proxy requires authentication
        proxy_user_pass = "USERNAME:PASSWORD"
        # Set up basic authentication for the proxy. b64encode, unlike
        # encodestring, adds no trailing newline and still exists in Python 3.9+
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode()).decode()
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
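
To see what the Proxy-Authorization header is expected to contain, here is a small standalone sketch (Python 3, no Scrapy needed) of encoding HTTP Basic credentials. The helper name basic_proxy_auth is made up for illustration. It uses base64.b64encode rather than the older base64.encodestring, because encodestring appends a newline to its output and was removed in Python 3.9. It also decodes the result to text, so the header does not end up containing a bytes repr such as b'...'.

```python
import base64


def basic_proxy_auth(username, password):
    """Build the value of a Proxy-Authorization header (HTTP Basic scheme)."""
    credentials = "{}:{}".format(username, password).encode("utf-8")
    # b64encode returns bytes without a trailing newline; decode to text.
    token = base64.b64encode(credentials).decode("ascii")
    return "Basic " + token


if __name__ == "__main__":
    header = basic_proxy_auth("USERNAME", "PASSWORD")
    print(header)
    # Round trip: decoding the token gives back "USERNAME:PASSWORD".
    print(base64.b64decode(header.split(" ", 1)[1]).decode("utf-8"))
```

In the middleware, the returned string is what would be assigned to request.headers['Proxy-Authorization'].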

In the settings.py file, add the following code:

DOWNLOADER_MIDDLEWARES = {
    'project_name.middlewares.ProxyMiddleware': 100,
}

This should work the same way as setting http_proxy. However, in my case I am trying to access a URL over HTTPS, so I need to set https_proxy, which I am still investigating. Any lead on that would be a great help.