In my script, requests.get never returns:
import requests

print("requesting..")

# This call never returns!
r = requests.get(
    "http://www.justdial.com",
    proxies={'http': '222.255.169.74:8080'},
)
print(r.ok)
What could be the possible reason(s)? Any remedy? What is the default timeout that get uses?
What is the default timeout that get uses?
The default timeout is None, which means it'll wait (hang) until the connection is closed.
What happens when you pass in a timeout value?
r = requests.get(
    'http://www.justdial.com',
    proxies={'http': '222.255.169.74:8080'},
    timeout=5
)
From the requests documentation:
You can tell Requests to stop waiting for a response after a given
number of seconds with the timeout parameter:
>>> requests.get('http://github.com', timeout=0.001)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
requests.exceptions.Timeout: HTTPConnectionPool(host='github.com', port=80): Request timed out. (timeout=0.001)
Note:
timeout is not a time limit on the entire response download; rather,
an exception is raised if the server has not issued a response for
timeout seconds (more precisely, if no bytes have been received on the
underlying socket for timeout seconds).
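Note that because timeout only bounds socket inactivity, a server that keeps trickling out bytes can hold a request open far longer than the timeout value. As a sketch (not from the requests docs), you could enforce a total wall-clock deadline by streaming the body and checking elapsed time yourself; the get_with_deadline helper and the limits here are illustrative:

import time
import requests

def get_with_deadline(url, deadline=10, **kwargs):
    # The per-socket timeout still guards against a completely silent
    # peer; the wall-clock check guards against a server that dribbles
    # out bytes just often enough to keep resetting the socket timeout.
    start = time.monotonic()
    with requests.get(url, stream=True, timeout=5, **kwargs) as r:
        body = b''
        for chunk in r.iter_content(chunk_size=8192):
            if time.monotonic() - start > deadline:
                raise TimeoutError(f'{url} exceeded {deadline}s total')
            body += chunk
    return body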
It happens a lot to me that requests.get() takes a very long time to return even when the timeout is 1 second. There are a few ways to overcome this problem:
1. Use the TimeoutSauce internal class
From: https://github.com/kennethreitz/requests/issues/1928#issuecomment-35811896
import requests
from requests.adapters import TimeoutSauce
class MyTimeout(TimeoutSauce):
    def __init__(self, *args, **kwargs):
        # Fall back to a 5-second connect/read timeout whenever no
        # explicit value was passed (kwargs.get avoids a KeyError if
        # the keyword is missing entirely).
        if kwargs.get('connect') is None:
            kwargs['connect'] = 5
        if kwargs.get('read') is None:
            kwargs['read'] = 5
        super(MyTimeout, self).__init__(*args, **kwargs)

requests.adapters.TimeoutSauce = MyTimeout
This code should set the read timeout equal to the connect timeout, which is the timeout value you pass on your Session.get() call. (Note that I haven't actually tested this code, so it may need some quick debugging; I just wrote it straight into the GitHub window.)
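If you'd rather not monkey-patch an internal class, you can get the same effect with a transport adapter. This is a sketch of my own, not part of the requests API: the TimeoutHTTPAdapter name and the 5-second default are illustrative.

import requests
from requests.adapters import HTTPAdapter

class TimeoutHTTPAdapter(HTTPAdapter):
    # Hypothetical adapter: injects a default timeout when the caller
    # passes none, leaving explicit timeout arguments untouched.
    def __init__(self, *args, default_timeout=5, **kwargs):
        self.default_timeout = default_timeout
        super().__init__(*args, **kwargs)

    def send(self, request, **kwargs):
        if kwargs.get('timeout') is None:
            kwargs['timeout'] = self.default_timeout
        return super().send(request, **kwargs)

session = requests.Session()
adapter = TimeoutHTTPAdapter(default_timeout=(3.05, 27))
session.mount('http://', adapter)
session.mount('https://', adapter)
r = session.get('http://www.justdial.com')  # now has a default timeout

Mounting the adapter on a Session keeps the default scoped to that session instead of changing requests' behaviour globally.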
2. Use a fork of requests from kevinburke: https://github.com/kevinburke/requests/tree/connect-timeout
From its documentation: https://github.com/kevinburke/requests/blob/connect-timeout/docs/user/advanced.rst
If you specify a single value for the timeout, like this:
r = requests.get('https://github.com', timeout=5)
The timeout value will be applied to both the connect and the read
timeouts. Specify a tuple if you would like to set the values
separately:
r = requests.get('https://github.com', timeout=(3.05, 27))
NOTE: The change has since been merged into the main Requests project.
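Now that tuple timeouts are in mainline requests, you can also handle the two failure modes separately. A minimal sketch using the URL from the question:

import requests

try:
    r = requests.get('http://www.justdial.com', timeout=(3.05, 27))
except requests.exceptions.ConnectTimeout:
    print('could not connect within 3.05 seconds')
except requests.exceptions.ReadTimeout:
    print('server went silent for 27 seconds mid-response')
else:
    print(r.ok)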
3. Use eventlet or signal, as already mentioned in the similar question:
Timeout for python requests.get entire response
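For completeness, a sketch of the eventlet variant from that question; monkey-patching makes requests' sockets cooperative so the green-thread timeout can interrupt the whole call:

import eventlet
eventlet.monkey_patch()

import requests

try:
    # Abort the entire request (connect + download) after 10 seconds.
    with eventlet.Timeout(10):
        r = requests.get('http://www.justdial.com')
    print(r.ok)
except eventlet.Timeout:
    print('request did not finish within 10 seconds')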
I reviewed all the answers and came to the conclusion that the problem still exists. On some sites, requests may hang indefinitely, and using multiprocessing seems to be overkill. Here's my approach (Python 3.5+):
import asyncio
import aiohttp

async def get_http(url):
    # conn_timeout/read_timeout bound the connect and read phases; any
    # failure (timeout, DNS error, ...) makes the coroutine return None.
    async with aiohttp.ClientSession(conn_timeout=1, read_timeout=3) as client:
        try:
            async with client.get(url) as response:
                content = await response.text()
                return content, response.status
        except Exception:
            pass

loop = asyncio.get_event_loop()
task = loop.create_task(get_http('http://example.com'))
loop.run_until_complete(task)
result = task.result()

if result is not None:
    content, status = result
    if status == 200:
        print(content)
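If you also want a hard cap on the request's total wall-clock time (not just the connect and read phases), you could wrap the coroutine in asyncio.wait_for. A sketch building on get_http above, with an arbitrary 10-second limit:

async def get_http_capped(url, limit=10):
    # Cancel the whole request if it hasn't finished within `limit` seconds.
    try:
        return await asyncio.wait_for(get_http(url), timeout=limit)
    except asyncio.TimeoutError:
        return None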