Python: How to use requests library to access a ur

Posted 2019-04-16 13:40

Question:

As it says in the title, I am trying to access a URL through several different proxies sequentially (using a for loop). Right now this is my code:

import requests
import json
with open('proxies.txt') as proxies:
    for line in proxies:
        proxy=json.loads(line)
        with open('urls.txt') as urls:
            for line in urls:
                url=line.rstrip()
                data=requests.get(url, proxies={'http':line})
                data1=data.text
                print data1

and my urls.txt file:

http://api.exip.org/?call=ip

and my proxies.txt file:

{"https": "84.22.41.1:3128"}
{"http":"194.126.181.47:81"}
{"http":"218.108.170.170:82"}

which I got from www.hidemyass.com.

For some reason, the output is

68.6.34.253
68.6.34.253
68.6.34.253

as if it is accessing that website through my own router ip address. In other words, it is not trying to access through the proxies I give it, it is just looping through and using my own over and over again. What am I doing wrong?

Answer 1:

There are two obvious problems right here:

data=requests.get(url, proxies={'http':line})

First, because you have a for line in urls: inside the for line in proxies:, line is going to be the current URL here, not the current proxy. And besides, even if you weren't reusing line, it would be the JSON string representation, not the dict you decoded from JSON.

Then, if you fix that to use proxy, you're still wrapping it in another dict: instead of something like {'https': '84.22.41.1:3128'}, you're passing {'http': {'https': '84.22.41.1:3128'}}. And that obviously isn't a valid value.

To fix both of those problems, just do this:

data=requests.get(url, proxies=proxy)

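Meanwhile, what happens when you have an HTTPS URL, but the current proxy only has an 'http' entry? There is no proxy for that scheme, so you're not going to use the proxy at all. You probably want to skip such combinations, like this (urlparse is a standard-library module in Python 2; in Python 3 the same function lives in urllib.parse):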

if urlparse.urlparse(url).scheme not in proxy:
    continue
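
Putting both fixes and the scheme check together, the whole loop might look roughly like the sketch below. It is written for Python 3 (the question's code is Python 2), and the names load_proxies, fetch_all, main and the timeout value are illustrative choices, not anything from the question. The session is passed in as an argument so the helpers can be exercised without touching the network.

```python
import json
from urllib.parse import urlparse


def load_proxies(lines):
    """Parse one JSON object per line, e.g. {"http": "194.126.181.47:81"}."""
    return [json.loads(line) for line in lines if line.strip()]


def fetch_all(session, urls, proxies, timeout=10):
    """Yield (proxy, url, body) for every URL whose scheme the proxy covers."""
    for proxy in proxies:
        for url in urls:
            # Skip pairs where the proxy dict has no entry for the URL's
            # scheme; otherwise the request would not go through the proxy.
            if urlparse(url).scheme not in proxy:
                continue
            # Pass the decoded dict itself, not the raw line and not a
            # dict wrapped inside another dict.
            response = session.get(url, proxies=proxy, timeout=timeout)
            yield proxy, url, response.text


def main(proxies_path="proxies.txt", urls_path="urls.txt"):
    import requests  # third-party; imported here so the helpers work without it

    with open(proxies_path) as f:
        proxies = load_proxies(f)
    with open(urls_path) as f:
        urls = [line.strip() for line in f if line.strip()]
    with requests.Session() as session:
        for proxy, url, body in fetch_all(session, urls, proxies):
            print(proxy, url, body)
```

Calling main() with the two files from the question would then print one line per usable proxy/URL pair. Reading urls.txt once up front also avoids reopening it for every proxy.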


Answer 2:

According to this thread, you need to specify the proxies dictionary as {"protocol" : "ip:port"}, so your proxies file should look like

{"https": "84.22.41.1:3128"}
{"http": "194.126.181.47:81"}
{"http": "218.108.170.170:82"}

EDIT: You're reusing line for both URLs and proxies. It's fine to reuse line in the inner loop, but you should be passing proxies=proxy: you've already parsed the JSON and don't need to build another dictionary. Also, as abarnert says, you should check that the protocol of the URL you're requesting matches that of the proxy. The proxies are specified as a dictionary so that the proxy matching the URL's protocol can be looked up.
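
To make the lookup idea concrete, here is a deliberately simplified sketch of choosing a proxy by URL scheme. It is not requests' actual code (the real lookup also understands more specific keys), and pick_proxy and the https:// variant of the URL are made up for the illustration.

```python
from urllib.parse import urlparse


def pick_proxy(url, proxies):
    """Return the proxy registered for the URL's scheme, or None if there is none."""
    return proxies.get(urlparse(url).scheme)


proxies = {"http": "194.126.181.47:81", "https": "84.22.41.1:3128"}

http_choice = pick_proxy("http://api.exip.org/?call=ip", proxies)
https_choice = pick_proxy("https://api.exip.org/?call=ip", proxies)
# A dict with only an "https" entry offers nothing for an http:// URL.
no_choice = pick_proxy("http://api.exip.org/?call=ip", {"https": "84.22.41.1:3128"})

print(http_choice, https_choice, no_choice)
```

The last case is the one to watch for: when nothing matches the scheme, the request is made without a proxy.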



Answer 3:

Directly copied from another answer of mine.

Well, actually you can. I've done this with a few lines of code, and it works pretty well.

import requests


class Client:

    def __init__(self):
        self._session = requests.Session()
        self.proxies = None

    def set_proxy_pool(self, proxies, auth=None, https=True):
        """Randomly choose a proxy for every GET/POST request        
        :param proxies: list of proxies, like ["ip1:port1", "ip2:port2"]
        :param auth: if proxy needs auth
        :param https: default is True, pass False if you don't need https proxy
        """
        from random import choice

        if https:
            self.proxies = [{'http': p, 'https': p} for p in proxies]
        else:
            self.proxies = [{'http': p} for p in proxies]

        def get_with_random_proxy(url, **kwargs):
            proxy = choice(self.proxies)
            kwargs['proxies'] = proxy
            if auth:
                kwargs['auth'] = auth
            return self._session.original_get(url, **kwargs)

        def post_with_random_proxy(url, *args, **kwargs):
            proxy = choice(self.proxies)
            kwargs['proxies'] = proxy
            if auth:
                kwargs['auth'] = auth
            return self._session.original_post(url, *args, **kwargs)

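        # Save the originals only once, so that calling set_proxy_pool()
        # a second time does not wrap the wrappers and recurse forever.
        if not hasattr(self._session, 'original_get'):
            self._session.original_get = self._session.get
            self._session.original_post = self._session.post
        self._session.get = get_with_random_proxy
        self._session.post = post_with_random_proxy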

    def remove_proxy_pool(self):
        self.proxies = None
        self._session.get = self._session.original_get
        self._session.post = self._session.original_post
        del self._session.original_get
        del self._session.original_post

    # You can define whatever operations using self._session

I use it like this:

client = Client()
client.set_proxy_pool(['112.25.41.136', '180.97.29.57'])

It's simple, but actually works for me.
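
If replacing methods on the session feels too invasive, the same random-choice idea can be sketched with two plain functions. This is only an alternative illustration, not the code above: build_proxy_pool and get_with_random_proxy are hypothetical names, and the session argument is assumed to be a requests.Session (or anything with a compatible get method).

```python
import random


def build_proxy_pool(addresses, https=True):
    """Turn ["ip:port", ...] into per-request proxy dicts."""
    if https:
        return [{"http": a, "https": a} for a in addresses]
    return [{"http": a} for a in addresses]


def get_with_random_proxy(session, pool, url, **kwargs):
    """Call session.get with a proxy drawn at random from the pool."""
    kwargs["proxies"] = random.choice(pool)
    return session.get(url, **kwargs)
```

Nothing on the session is modified, so there is nothing to restore afterwards; the cost is that every call site has to pass the pool explicitly.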