Scrapy and proxies

Published 2019-01-04 21:07

How do you utilize proxy support with the python web-scraping framework Scrapy?

Tags: python scrapy
7 Answers
趁早两清
#2 · 2019-01-04 21:40

That would be:

export http_proxy=http://user:password@proxy:port
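To confirm that Python actually sees the variable, you can check `urllib.request.getproxies()`, the standard-library function Scrapy's `HttpProxyMiddleware` uses to pick up `*_proxy` environment settings. A minimal sketch, with a placeholder proxy address:

```python
import os
from urllib.request import getproxies

# Placeholder proxy URL with credentials embedded; substitute your own.
os.environ["http_proxy"] = "http://user:password@proxy.example.com:8080"

# getproxies() maps scheme -> proxy URL from the *_proxy environment
# variables, so the value set above should show up under "http".
print(getproxies().get("http"))
```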

够拽才男人
#3 · 2019-01-04 21:45

On Windows I combined a couple of the previous answers and it worked. I simply did:

C:\>  set http_proxy=http://username:password@proxy:port

and then I launched my program:

C:\...\RightFolder> scrapy crawl dmoz

where "dmoz" is the spider name (I'm using it because it's the one from the official tutorial, and if you're here you have probably started from that tutorial).

闹够了就滚
#4 · 2019-01-04 21:51

Since I had trouble setting the environment variable in /etc/environment, here is what I put in my spider (Python):

import os

os.environ["http_proxy"] = "http://localhost:12345"
Ridiculous、
#5 · 2019-01-04 21:55

Single Proxy

  1. Enable HttpProxyMiddleware in your settings.py, like this:

    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 1
    }
    
  2. Pass the proxy to the request via request.meta:

    request = Request(url="http://example.com")
    request.meta['proxy'] = "http://host:port"
    yield request
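One gotcha with step 2: the proxy value should carry an explicit scheme (e.g. `http://host:port`). A small helper along these lines can normalize entries before they go into `request.meta` — the function name here is my own, not part of Scrapy:

```python
from urllib.parse import urlparse

def normalize_proxy(proxy):
    """Return a proxy URL with an explicit scheme, suitable for
    request.meta['proxy'] (e.g. 'http://host:port')."""
    if "://" not in proxy:
        proxy = "http://" + proxy  # assume plain HTTP when no scheme is given
    parsed = urlparse(proxy)
    if not parsed.hostname or not parsed.port:
        raise ValueError("expected host:port, got %r" % proxy)
    return proxy
```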
    

You can also choose a proxy address at random if you have an address pool, like this:

Multiple Proxies

import random

from scrapy import Request
from scrapy.spiders import Spider  # called BaseSpider in old Scrapy versions


class MySpider(Spider):
    name = "my_spider"

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.proxy_pool = ['proxy_address1', 'proxy_address2', ..., 'proxy_addressN']

    def parse(self, response):
        ...parse code...
        if something:
            yield self.get_request(url)

    def get_request(self, url):
        req = Request(url=url)
        if self.proxy_pool:
            req.meta['proxy'] = random.choice(self.proxy_pool)
        return req
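The random selection inside get_request is easy to pull out and sanity-check on its own. A sketch (the helper name is mine, not Scrapy's):

```python
import random

def choose_proxy(proxy_pool, rng=random):
    """Pick a proxy at random from the pool; return None when the pool
    is empty so the request can go out without a proxy."""
    if not proxy_pool:
        return None
    return rng.choice(proxy_pool)
```

In get_request you would then set req.meta['proxy'] only when choose_proxy returns a value.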
Juvenile、少年°
#6 · 2019-01-04 21:57

There is a nice Scrapy proxy middleware someone has already written: https://github.com/aivarsk/scrapy-proxies

叼着烟拽天下
#7 · 2019-01-04 21:58

From the Scrapy FAQ:

Does Scrapy work with HTTP proxies?

Yes. Support for HTTP proxies is provided (since Scrapy 0.8) through the HTTP Proxy downloader middleware. See HttpProxyMiddleware.

The easiest way to use a proxy is to set the environment variable http_proxy. How this is done depends on your shell.

C:\>set http_proxy=http://proxy:port
csh% setenv http_proxy http://proxy:port
sh$ export http_proxy=http://proxy:port

If you want to use an HTTPS proxy to visit HTTPS sites, set the environment variable https_proxy instead:

C:\>set https_proxy=https://proxy:port
csh% setenv https_proxy https://proxy:port
sh$ export https_proxy=https://proxy:port