I wrote an article about this and have pasted it here:
Requests' secret: pool_connections and pool_maxsize
Requests is one of the best-known, if not the best-known, third-party libraries among Python programmers. With its simple API and high performance, people tend to use requests instead of the standard library's urllib2 for HTTP requests. However, people who use requests every day may not know its internals, and today I want to introduce two of them: pool_connections and pool_maxsize.
Let's start with Session:
import requests
s = requests.Session()
s.get('https://www.google.com')
It's pretty simple. You probably know that requests' Session can persist cookies. Cool. But did you know that Session has a mount method?
mount(prefix, adapter)
Registers a connection adapter to a prefix.
Adapters are sorted in descending order by key length.
No? Well, in fact you've already used this method whenever you initialized a Session object:
class Session(SessionRedirectMixin):
    def __init__(self):
        ...
        # Default connection adapters.
        self.adapters = OrderedDict()
        self.mount('https://', HTTPAdapter())
        self.mount('http://', HTTPAdapter())
Now comes the interesting part. If you've read Ian Cordasco's article Retries in Requests, you should know that HTTPAdapter can be used to provide retry functionality. But what is an HTTPAdapter really? Quoting from the docs:
class requests.adapters.HTTPAdapter(pool_connections=10, pool_maxsize=10, max_retries=0, pool_block=False)
The built-in HTTP Adapter for urllib3.
Provides a general-case interface for Requests sessions to contact HTTP and HTTPS urls by implementing the Transport Adapter interface. This class will usually be created by the Session class under the covers.
Parameters:
* pool_connections – The number of urllib3 connection pools to cache.
* pool_maxsize – The maximum number of connections to save in the pool.
* max_retries (int) – The maximum number of retries each connection should attempt. Note, this applies only to failed DNS lookups, socket connections and connection timeouts, never to requests where data has made it to the server. By default, Requests does not retry failed connections. If you need granular control over the conditions under which we retry a request, import urllib3’s Retry class and pass that instead.
* pool_block – Whether the connection pool should block for connections.
Usage:
>>> import requests
>>> s = requests.Session()
>>> a = requests.adapters.HTTPAdapter(max_retries=3)
>>> s.mount('http://', a)
If the above documentation confuses you, here's my explanation: what HTTPAdapter does is simply provide different configurations for different requests according to the target URL. Remember the code above?
self.mount('https://', HTTPAdapter())
self.mount('http://', HTTPAdapter())
It creates two HTTPAdapter objects with the default arguments pool_connections=10, pool_maxsize=10, max_retries=0, pool_block=False, and mounts them to https:// and http:// respectively. This means the configuration of the first HTTPAdapter() is used when you send a request to https://xxx, and that of the second HTTPAdapter() is used for requests to http://xxx. Though in this case the two configurations are the same, requests to http and https are still handled separately. We'll see what that means later.
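To make the per-prefix idea concrete, here is a minimal sketch that makes no network calls. It relies on Session.mount and Session.get_adapter from requests; the prefix https://example.com and the variable names are only placeholders chosen for illustration.

```python
import requests
from requests.adapters import HTTPAdapter

s = requests.Session()

# The two adapters that Session.__init__ mounted by default.
default_https = s.get_adapter('https://www.zhihu.com')
default_http = s.get_adapter('http://www.zhihu.com')

# A dedicated adapter for one (placeholder) host prefix.
special = HTTPAdapter(pool_connections=1, pool_maxsize=1)
s.mount('https://example.com', special)

# The session picks an adapter by matching the URL against mounted prefixes.
picked_special = s.get_adapter('https://example.com/some/path')
picked_default = s.get_adapter('https://www.zhihu.com')

print(picked_special is special)        # the dedicated adapter is used
print(picked_default is default_https)  # other https URLs keep the default
print(default_https is default_http)    # http and https have separate adapters
```

Each adapter owns its own pools, so the settings passed to one adapter do not affect requests routed through another.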
As I said, the main purpose of this article is to explain pool_connections and pool_maxsize.
First let's look at pool_connections. Yesterday I raised a question on Stack Overflow because I wasn't sure my understanding was correct, and the answer cleared up my uncertainty. HTTP, as we all know, is based on the TCP protocol. An HTTP connection is also a TCP connection, which is identified by a tuple of five values:
(<protocol>, <src addr>, <src port>, <dest addr>, <dest port>)
Say you've established an HTTP/TCP connection with www.example.com. Assuming the server supports Keep-Alive, the next time you send a request to www.example.com/a or www.example.com/b, you can just use the same connection because none of the five values change. In fact, requests' Session does this for you automatically and will reuse connections as long as it can.
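As a rough illustration of why the path does not matter for reuse, the sketch below reduces a URL to the part that identifies the remote endpoint. The helper pool_key and the DEFAULT_PORTS table are my own simplification; the key urllib3 really uses contains more fields than this.

```python
from urllib.parse import urlsplit

# Assumed default ports for the two schemes discussed in this article.
DEFAULT_PORTS = {'http': 80, 'https': 443}

def pool_key(url):
    """Reduce a URL to (scheme, host, port); the path plays no role."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS[parts.scheme]
    return (parts.scheme, parts.hostname, port)

key_a = pool_key('http://www.example.com/a')
key_b = pool_key('http://www.example.com/b')
key_tls = pool_key('https://www.example.com/a')

print(key_a == key_b)    # same endpoint: the connection can be reused
print(key_a == key_tls)  # different scheme and port: a separate connection
```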
The question is, what determines whether you can reuse an old connection? Yes, pool_connections!
pool_connections – The number of urllib3 connection pools to cache.
I know, I know, I don't want to bring in so many terms either; this is the last one, I promise. To keep it simple: one connection pool corresponds to one host. That's all it is.
Here's an example (unrelated lines are omitted):
s = requests.Session()
s.mount('https://', HTTPAdapter(pool_connections=1))
s.get('https://www.baidu.com')
s.get('https://www.zhihu.com')
s.get('https://www.baidu.com')
"""output
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): www.baidu.com
DEBUG:requests.packages.urllib3.connectionpool:"GET / HTTP/1.1" 200 None
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): www.zhihu.com
DEBUG:requests.packages.urllib3.connectionpool:"GET / HTTP/1.1" 200 2621
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): www.baidu.com
DEBUG:requests.packages.urllib3.connectionpool:"GET / HTTP/1.1" 200 None
"""
HTTPAdapter(pool_connections=1) is mounted to https://, which means only one connection pool persists at a time. After calling s.get('https://www.baidu.com'), the cached connection pool is connectionpool('https://www.baidu.com'). Then s.get('https://www.zhihu.com') comes along, and the session finds that it cannot use the previously cached connection because it's not the same host (one connection pool corresponds to one host, remember?). Therefore the session has to create a new connection pool, or connection if you like. Since pool_connections=1, the session cannot hold two connection pools at the same time, so it abandons the old one, connectionpool('https://www.baidu.com'), and keeps the new one, connectionpool('https://www.zhihu.com'). The next get goes the same way. This is why we see three Starting new HTTPS connection messages in the log.
What if we set pool_connections to 2?
s = requests.Session()
s.mount('https://', HTTPAdapter(pool_connections=2))
s.get('https://www.baidu.com')
s.get('https://www.zhihu.com')
s.get('https://www.baidu.com')
"""output
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): www.baidu.com
DEBUG:requests.packages.urllib3.connectionpool:"GET / HTTP/1.1" 200 None
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): www.zhihu.com
DEBUG:requests.packages.urllib3.connectionpool:"GET / HTTP/1.1" 200 2623
DEBUG:requests.packages.urllib3.connectionpool:"GET / HTTP/1.1" 200 None
"""
Great, now we only created connections twice and saved the time of establishing one connection.
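The two runs above can be reproduced offline with a toy model. As far as I know, urllib3 keeps its pools in a least-recently-used container; the PoolCache class below is only a stand-in for that idea, not urllib3's code, and all of its names are made up for this sketch.

```python
from collections import OrderedDict

class PoolCache:
    """Toy LRU cache of per-host pools (a stand-in, not urllib3's code)."""

    def __init__(self, pool_connections):
        self.capacity = pool_connections
        self.pools = OrderedDict()
        self.created = []  # hosts for which a new pool had to be built

    def pool_for(self, host):
        if host in self.pools:
            self.pools.move_to_end(host)  # mark as most recently used
            return self.pools[host]
        pool = object()  # placeholder for a real connection pool
        self.pools[host] = pool
        self.created.append(host)
        if len(self.pools) > self.capacity:
            self.pools.popitem(last=False)  # evict the least recently used
        return pool

def simulate(pool_connections, hosts):
    cache = PoolCache(pool_connections)
    for host in hosts:
        cache.pool_for(host)
    return cache.created

hosts = ['www.baidu.com', 'www.zhihu.com', 'www.baidu.com']
one = simulate(1, hosts)
two = simulate(2, hosts)
print(len(one))  # with pool_connections=1, every request builds a new pool
print(len(two))  # with pool_connections=2, the third request reuses a pool
```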
Finally, pool_maxsize.
First and foremost, you should care about pool_maxsize only if you use Session in a multithreaded environment, such as making concurrent requests from multiple threads using the same Session.
Actually, pool_maxsize is an argument for initializing urllib3's HTTPConnectionPool, which is exactly the connection pool we mentioned above. HTTPConnectionPool is a container for a collection of connections to a specific host, and pool_maxsize is the number of connections to save that can be reused. If you're running your code in one thread, it's neither possible nor necessary to create multiple connections to the same host, because the requests library is blocking, so HTTP requests are always sent one after another.
Things are different if there are multiple threads.
def thread_get(url):
    s.get(url)
s = requests.Session()
s.mount('https://', HTTPAdapter(pool_connections=1, pool_maxsize=2))
t1 = Thread(target=thread_get, args=('https://www.zhihu.com',))
t2 = Thread(target=thread_get, args=('https://www.zhihu.com/question/36612174',))
t1.start();t2.start()
t1.join();t2.join()
t3 = Thread(target=thread_get, args=('https://www.zhihu.com/question/39420364',))
t4 = Thread(target=thread_get, args=('https://www.zhihu.com/question/21362402',))
t3.start();t4.start()
"""output
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): www.zhihu.com
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (2): www.zhihu.com
DEBUG:requests.packages.urllib3.connectionpool:"GET /question/36612174 HTTP/1.1" 200 21906
DEBUG:requests.packages.urllib3.connectionpool:"GET / HTTP/1.1" 200 2606
DEBUG:requests.packages.urllib3.connectionpool:"GET /question/21362402 HTTP/1.1" 200 57556
DEBUG:requests.packages.urllib3.connectionpool:"GET /question/39420364 HTTP/1.1" 200 28739
"""
See? It established two connections for the same host www.zhihu.com. As I said, this can only happen in a multithreaded environment.
In this case, we created a connection pool with pool_maxsize=2, and there are never more than two connections at the same time, so that's enough.
We can see that requests from t3 and t4 did not create new connections; they reused the old ones.
What if the pool isn't big enough?
s = requests.Session()
s.mount('https://', HTTPAdapter(pool_connections=1, pool_maxsize=1))
t1 = Thread(target=thread_get, args=('https://www.zhihu.com',))
t2 = Thread(target=thread_get, args=('https://www.zhihu.com/question/36612174',))
t1.start()
t2.start()
t1.join();t2.join()
t3 = Thread(target=thread_get, args=('https://www.zhihu.com/question/39420364',))
t4 = Thread(target=thread_get, args=('https://www.zhihu.com/question/21362402',))
t3.start();t4.start()
t3.join();t4.join()
"""output
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): www.zhihu.com
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (2): www.zhihu.com
DEBUG:requests.packages.urllib3.connectionpool:"GET /question/36612174 HTTP/1.1" 200 21906
DEBUG:requests.packages.urllib3.connectionpool:"GET / HTTP/1.1" 200 2606
WARNING:requests.packages.urllib3.connectionpool:Connection pool is full, discarding connection: www.zhihu.com
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (3): www.zhihu.com
DEBUG:requests.packages.urllib3.connectionpool:"GET /question/39420364 HTTP/1.1" 200 28739
DEBUG:requests.packages.urllib3.connectionpool:"GET /question/21362402 HTTP/1.1" 200 57556
WARNING:requests.packages.urllib3.connectionpool:Connection pool is full, discarding connection: www.zhihu.com
"""
Now, with pool_maxsize=1, the warning comes as expected:
Connection pool is full, discarding connection: www.zhihu.com
We can also notice that since only one connection can be saved in this pool, a new connection is created again for t3 or t4. Obviously this is very inefficient. That's why urllib3's documentation says:
If you’re planning on using such a pool in a multithreaded environment, you should set the maxsize of the pool to a higher number, such as the number of threads.
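The behaviour in the last two experiments can be modelled without threads or network access. The sketch below is a toy, non-blocking pool built on queue.LifoQueue; ToyPool and two_rounds are invented for this illustration and only approximate what urllib3 does. Each round stands for two overlapping requests: both connections are taken out before either is returned.

```python
import queue

class ToyPool:
    """Toy non-blocking pool that keeps at most `maxsize` idle connections."""

    def __init__(self, maxsize):
        self.idle = queue.LifoQueue(maxsize)
        self.created = 0
        self.discarded = 0

    def acquire(self):
        try:
            return self.idle.get(block=False)  # reuse an idle connection
        except queue.Empty:
            self.created += 1                  # "Starting new HTTPS connection"
            return 'conn-%d' % self.created

    def release(self, conn):
        try:
            self.idle.put(conn, block=False)   # save it for later reuse
        except queue.Full:
            self.discarded += 1                # "Connection pool is full, discarding"

def two_rounds(maxsize):
    pool = ToyPool(maxsize)
    for _ in range(2):
        a = pool.acquire()
        b = pool.acquire()
        pool.release(a)
        pool.release(b)
    return pool.created, pool.discarded

print(two_rounds(2))  # (2, 0): both connections are saved and reused
print(two_rounds(1))  # (3, 2): one connection is discarded per round
```

The second result lines up with the log above: three connections started and two "pool is full" warnings.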
Last but not least, HTTPAdapter instances mounted to different prefixes are independent.
s = requests.Session()
s.mount('https://', HTTPAdapter(pool_connections=1, pool_maxsize=2))
s.mount('https://baidu.com', HTTPAdapter(pool_connections=1, pool_maxsize=1))
t1 = Thread(target=thread_get, args=('https://www.zhihu.com',))
t2 = Thread(target=thread_get, args=('https://www.zhihu.com/question/36612174',))
t1.start();t2.start()
t1.join();t2.join()
t3 = Thread(target=thread_get, args=('https://www.zhihu.com/question/39420364',))
t4 = Thread(target=thread_get, args=('https://www.zhihu.com/question/21362402',))
t3.start();t4.start()
t3.join();t4.join()
"""output
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): www.zhihu.com
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (2): www.zhihu.com
DEBUG:requests.packages.urllib3.connectionpool:"GET /question/36612174 HTTP/1.1" 200 21906
DEBUG:requests.packages.urllib3.connectionpool:"GET / HTTP/1.1" 200 2623
DEBUG:requests.packages.urllib3.connectionpool:"GET /question/39420364 HTTP/1.1" 200 28739
DEBUG:requests.packages.urllib3.connectionpool:"GET /question/21362402 HTTP/1.1" 200 57669
"""
The code above is easy to understand, so I won't explain it.
I guess that's all. I hope this article helps you understand requests better. BTW I created a gist here which contains all of the testing code used in this article. Just download it and play with it :)
Appendix
- For https, requests uses urllib3's HTTPSConnectionPool, but it's pretty much the same as HTTPConnectionPool, so I don't differentiate them in this article.
- Session's mount method ensures the longest prefix gets matched first. Its implementation is pretty interesting, so I've posted it here.
def mount(self, prefix, adapter):
    """Registers a connection adapter to a prefix.
    Adapters are sorted in descending order by key length."""
    self.adapters[prefix] = adapter
    keys_to_move = [k for k in self.adapters if len(k) < len(prefix)]
    for key in keys_to_move:
        self.adapters[key] = self.adapters.pop(key)
Note that self.adapters is an OrderedDict.
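To see the ordering in action, here is a standalone sketch that copies the mount logic into a module-level function and adds a lookup helper of my own. The adapter values are plain strings standing in for HTTPAdapter objects.

```python
from collections import OrderedDict

adapters = OrderedDict()

def mount(prefix, adapter):
    """Same reordering trick as Session.mount, on a module-level dict."""
    adapters[prefix] = adapter
    keys_to_move = [k for k in adapters if len(k) < len(prefix)]
    for key in keys_to_move:
        adapters[key] = adapters.pop(key)  # re-insert shorter keys at the end

def lookup(url):
    """Hypothetical helper: return the first adapter whose prefix matches."""
    for prefix, adapter in adapters.items():
        if url.lower().startswith(prefix.lower()):
            return adapter
    raise LookupError('no adapter for %r' % url)

mount('https://', 'default-https')
mount('http://', 'default-http')
mount('https://baidu.com', 'baidu-only')

order = list(adapters)
print(order)  # longest prefix first
print(lookup('https://baidu.com/s'))
print(lookup('https://www.baidu.com'))
```

Note that matching is a plain string prefix test, so in this sketch the prefix https://baidu.com does not cover https://www.baidu.com; that URL falls through to the default https:// entry.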
Thanks to @laike9m for the existing Q&A and article, but the existing answers fail to mention the subtleties of pool_maxsize and its relation to multithreaded code.
Summary
pool_connections is the number of connection pools, one per (host, port, scheme) endpoint, that can be kept alive at a given time. If you want to keep connections to a max of n distinct endpoints open for reuse with a Session, you want pool_connections=n.
pool_maxsize has limited effect for users of requests: because pool_block (in requests.adapters.HTTPAdapter) defaults to False rather than True, it does not cap how many connections are opened, only how many idle ones are kept for reuse.
Detail
As correctly pointed out here, pool_connections is the maximum number of connection pools kept open for the adapter's prefix, each holding the connection(s) to one endpoint. It's best illustrated through an example:
>>> import requests
>>> from requests.adapters import HTTPAdapter
>>>
>>> from urllib3 import add_stderr_logger
>>>
>>> add_stderr_logger() # Turn on requests.packages.urllib3 logging
2018-12-21 20:44:03,979 DEBUG Added a stderr logging handler to logger: urllib3
<StreamHandler <stderr> (NOTSET)>
>>>
>>> s = requests.Session()
>>> s.mount('https://', HTTPAdapter(pool_connections=1))
>>>
>>> # 4 consecutive requests to (github.com, 443, https)
... # A new HTTPS (TCP) connection will be established only on the first conn.
... s.get('https://github.com/requests/requests/blob/master/requests/adapters.py')
2018-12-21 20:44:03,982 DEBUG Starting new HTTPS connection (1): github.com:443
2018-12-21 20:44:04,381 DEBUG https://github.com:443 "GET /requests/requests/blob/master/requests/adapters.py HTTP/1.1" 200 None
<Response [200]>
>>> s.get('https://github.com/requests/requests/blob/master/requests/packages.py')
2018-12-21 20:44:04,548 DEBUG https://github.com:443 "GET /requests/requests/blob/master/requests/packages.py HTTP/1.1" 200 None
<Response [200]>
>>> s.get('https://github.com/urllib3/urllib3/blob/master/src/urllib3/__init__.py')
2018-12-21 20:44:04,881 DEBUG https://github.com:443 "GET /urllib3/urllib3/blob/master/src/urllib3/__init__.py HTTP/1.1" 200 None
<Response [200]>
>>> s.get('https://github.com/python/cpython/blob/master/Lib/logging/__init__.py')
2018-12-21 20:44:06,533 DEBUG https://github.com:443 "GET /python/cpython/blob/master/Lib/logging/__init__.py HTTP/1.1" 200 None
<Response [200]>
Above, the max number of connections is 1; it is the one to (github.com, 443, https). If you want to request a resource from a new (host, port, scheme) triple, the Session will internally drop the existing connection to make room for a new one:
>>> s.get('https://www.rfc-editor.org/info/rfc4045')
2018-12-21 20:46:11,340 DEBUG Starting new HTTPS connection (1): www.rfc-editor.org:443
2018-12-21 20:46:12,185 DEBUG https://www.rfc-editor.org:443 "GET /info/rfc4045 HTTP/1.1" 200 6707
<Response [200]>
>>> s.get('https://www.rfc-editor.org/info/rfc4046')
2018-12-21 20:46:12,667 DEBUG https://www.rfc-editor.org:443 "GET /info/rfc4046 HTTP/1.1" 200 6862
<Response [200]>
>>> s.get('https://www.rfc-editor.org/info/rfc4047')
2018-12-21 20:46:13,837 DEBUG https://www.rfc-editor.org:443 "GET /info/rfc4047 HTTP/1.1" 200 6762
<Response [200]>
You can raise the number to pool_connections=2, then cycle between 3 unique host combinations, and you'll see the same thing in play. (One other thing to note is that the session will retain and send back cookies in this same way.)
Now for pool_maxsize, which is passed to urllib3.poolmanager.PoolManager and ultimately to urllib3.connectionpool.HTTPSConnectionPool. The docstring for maxsize is:
Number of connections to save that can be reused. More than 1 is
useful in multithreaded situations. If block is set to False,
more connections will be created but they will not be saved once
they've been used.
Incidentally, pool_block=False is the default for HTTPAdapter, so block=False is what reaches the pool. This implies that pool_maxsize does not cap the number of connections opened through an HTTPAdapter; it only limits how many are kept for reuse.
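For contrast, here is a rough sketch of what a blocking pool looks like, where maxsize becomes a hard cap. BlockingToyPool is a made-up class for illustration, not urllib3's implementation: the queue is pre-filled with empty slots, and a caller that finds no free slot waits until the timeout expires.

```python
import queue

class BlockingToyPool:
    """Toy blocking pool: at most `maxsize` connections exist at once."""

    def __init__(self, maxsize):
        self.slots = queue.LifoQueue(maxsize)
        for _ in range(maxsize):
            self.slots.put(None)  # None marks a slot with no connection yet
        self.created = 0

    def acquire(self, timeout):
        # Waits for a free slot; raises queue.Empty if none frees up in time.
        conn = self.slots.get(block=True, timeout=timeout)
        if conn is None:
            self.created += 1
            conn = 'conn-%d' % self.created
        return conn

    def release(self, conn):
        self.slots.put(conn, block=False)

pool = BlockingToyPool(maxsize=1)
first = pool.acquire(timeout=0.05)
try:
    pool.acquire(timeout=0.05)  # no free slot while `first` is checked out
    second_got_through = True
except queue.Empty:
    second_got_through = False

pool.release(first)
reused = pool.acquire(timeout=0.05)
print(second_got_through)  # the second caller could not get a connection
print(reused == first)     # after release, the same connection is handed out
```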
Furthermore, requests.Session() is not thread safe; you shouldn't use the same session instance from multiple threads. (See here and here.) If you really want to use sessions across threads, the safer way would be to give each thread its own session instance via threading.local(), then use that session to make requests over multiple URLs:
import threading
import requests

local = threading.local()  # values will be different for separate threads.
vars(local)  # initially empty; a blank class with no attrs.

def get_or_make_session(**adapter_kwargs):
    # `local` will effectively vary based on the thread that is calling it
    print('get_or_make_session() called from id:', threading.get_ident())
    if not hasattr(local, 'session'):
        session = requests.Session()
        adapter = requests.adapters.HTTPAdapter(**adapter_kwargs)
        session.mount('http://', adapter)
        session.mount('https://', adapter)
        local.session = session
    return local.session
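A quick way to convince yourself that each thread really gets its own object is the offline sketch below. FakeSession is a hypothetical stand-in for requests.Session so that nothing touches the network, and worker and seen exist only for this demonstration.

```python
import threading

local = threading.local()

class FakeSession:
    """Hypothetical stand-in for requests.Session, to keep the sketch offline."""

def get_or_make_session():
    # Same pattern as above: create once per thread, then reuse.
    if not hasattr(local, 'session'):
        local.session = FakeSession()
    return local.session

seen = {}

def worker(name):
    first = get_or_make_session()
    second = get_or_make_session()
    seen[name] = (first, second)  # keep references so they can be compared

threads = [threading.Thread(target=worker, args=(n,)) for n in ('t1', 't2')]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(seen['t1'][0] is seen['t1'][1])  # same session within one thread
print(seen['t1'][0] is seen['t2'][0])  # different sessions across threads
```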