Scrapy and proxies

Published 2019-01-04 21:07

How do you utilize proxy support with the python web-scraping framework Scrapy?

Tags: python scrapy
7 Answers
趁早两清
#2 · 2019-01-04 21:40

That would be:

export http_proxy=http://user:password@proxy:port
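To confirm that Python actually sees the variable, you can check `urllib.request.getproxies()`, the standard-library function Scrapy's `HttpProxyMiddleware` uses to pick up `*_proxy` environment settings. A minimal sketch, with a placeholder proxy address:

```python
import os
from urllib.request import getproxies

# Placeholder proxy URL with credentials embedded; substitute your own.
os.environ["http_proxy"] = "http://user:password@proxy.example.com:8080"

# getproxies() maps scheme -> proxy URL from the *_proxy environment
# variables, so the value set above should show up under "http".
print(getproxies().get("http"))
```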

够拽才男人
#3 · 2019-01-04 21:45

On Windows I combined a couple of the previous answers and it worked. I simply did:

C:\>  set http_proxy=http://username:password@proxy:port

and then I launched my program:

C:\...\RightFolder> scrapy crawl dmoz

where "dmoz" is the spider name (I'm using it because it's the one from the official tutorial, and if you're here you have probably started from that tutorial).

闹够了就滚
#4 · 2019-01-04 21:51

Since I had trouble setting the environment variable in /etc/environment, here is what I put in my spider (Python):

import os

os.environ["http_proxy"] = "http://localhost:12345"
Ridiculous、
#5 · 2019-01-04 21:55

Single Proxy

  1. Enable HttpProxyMiddleware in your settings.py, like this:

    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 1
    }
    
  2. Pass the proxy to the request via request.meta:

    request = Request(url="http://example.com")
    request.meta['proxy'] = "http://host:port"
    yield request
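One gotcha with step 2: the proxy value should carry an explicit scheme (e.g. `http://host:port`). A small helper along these lines can normalize entries before they go into `request.meta` — the function name here is my own, not part of Scrapy:

```python
from urllib.parse import urlparse

def normalize_proxy(proxy):
    """Return a proxy URL with an explicit scheme, suitable for
    request.meta['proxy'] (e.g. 'http://host:port')."""
    if "://" not in proxy:
        proxy = "http://" + proxy  # assume plain HTTP when no scheme is given
    parsed = urlparse(proxy)
    if not parsed.hostname or not parsed.port:
        raise ValueError("expected host:port, got %r" % proxy)
    return proxy
```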
    

You can also choose a proxy address at random if you have an address pool, like this:

Multiple Proxies

import random

from scrapy import Request
from scrapy.spiders import Spider  # called BaseSpider in old Scrapy versions


class MySpider(Spider):
    name = "my_spider"

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.proxy_pool = ['proxy_address1', 'proxy_address2', ..., 'proxy_addressN']

    def parse(self, response):
        ...parse code...
        if something:
            yield self.get_request(url)

    def get_request(self, url):
        req = Request(url=url)
        if self.proxy_pool:
            req.meta['proxy'] = random.choice(self.proxy_pool)
        return req
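The random selection inside get_request is easy to pull out and sanity-check on its own. A sketch (the helper name is mine, not Scrapy's):

```python
import random

def choose_proxy(proxy_pool, rng=random):
    """Pick a proxy at random from the pool; return None when the pool
    is empty so the request can go out without a proxy."""
    if not proxy_pool:
        return None
    return rng.choice(proxy_pool)
```

In get_request you would then set req.meta['proxy'] only when choose_proxy returns a value.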
Juvenile、少年°
#6 · 2019-01-04 21:57

There is a nice Scrapy proxy middleware someone has already written: https://github.com/aivarsk/scrapy-proxies

叼着烟拽天下
#7 · 2019-01-04 21:58

From the Scrapy FAQ:

Does Scrapy work with HTTP proxies?

Yes. Support for HTTP proxies is provided (since Scrapy 0.8) through the HTTP Proxy downloader middleware. See HttpProxyMiddleware.

The easiest way to use a proxy is to set the environment variable http_proxy. How this is done depends on your shell.

C:\>set http_proxy=http://proxy:port
csh% setenv http_proxy http://proxy:port
sh$ export http_proxy=http://proxy:port

If you want to use an HTTPS proxy to visit HTTPS sites, set the environment variable https_proxy instead:

C:\>set https_proxy=https://proxy:port
csh% setenv https_proxy https://proxy:port
sh$ export https_proxy=https://proxy:port