What is the practical difference between these two approaches to making HTTP requests?

Posted 2020-06-24 06:18

Question:

I have noticed that there are several ways to initiate HTTP connections for web scraping. I am not sure whether some are more recent, up-to-date ways of coding, or whether they are just different modules with different advantages and disadvantages. More specifically, I am trying to understand the differences between the following two approaches, and which you would recommend.

1) Using urllib3:

import urllib3
from bs4 import BeautifulSoup

http = urllib3.PoolManager()
r = http.urlopen('GET', url, preload_content=False)  # file-like response
soup = BeautifulSoup(r, "html.parser")

2) Using requests:

import requests
from bs4 import BeautifulSoup

html = requests.get(url).content
soup = BeautifulSoup(html, "html5lib")

What sets these two options apart, besides the simple fact that they require importing different modules?

Answer 1:

Under the hood, requests uses urllib3 to do most of the HTTP heavy lifting. When used properly, the two should behave mostly the same unless you need more advanced configuration.
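
You can check the relationship yourself. A minimal sketch, assuming a reasonably recent requests release (where the old vendored copy of urllib3 was replaced by an alias to the standalone package):

import requests
import urllib3

# On modern requests versions, requests.packages.urllib3 points at the
# same module object as the standalone urllib3 import
print(requests.packages.urllib3 is urllib3)  # True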

That said, in your particular example the two snippets are not equivalent:

In the urllib3 example, you're re-using connections, whereas in the requests example you're not. Here's how you can tell:

>>> import requests
>>> requests.packages.urllib3.add_stderr_logger()
2016-04-29 11:43:42,086 DEBUG Added a stderr logging handler to logger: requests.packages.urllib3
>>> requests.get('https://www.google.com/')
2016-04-29 11:45:59,043 INFO Starting new HTTPS connection (1): www.google.com
2016-04-29 11:45:59,158 DEBUG "GET / HTTP/1.1" 200 None
>>> requests.get('https://www.google.com/')
2016-04-29 11:45:59,815 INFO Starting new HTTPS connection (1): www.google.com
2016-04-29 11:45:59,925 DEBUG "GET / HTTP/1.1" 200 None

To start re-using connections the way a urllib3 PoolManager does, you need to create a requests session:

>>> session = requests.session()
>>> session.get('https://www.google.com/')
2016-04-29 11:46:49,649 INFO Starting new HTTPS connection (1): www.google.com
2016-04-29 11:46:49,771 DEBUG "GET / HTTP/1.1" 200 None
>>> session.get('https://www.google.com/')
2016-04-29 11:46:50,548 DEBUG "GET / HTTP/1.1" 200 None

Now it's equivalent to what you were doing with http = PoolManager(). One more note: urllib3 is a lower-level, more explicit library, so you create the pool explicitly and, for example, need to specify your SSL certificate location explicitly. That's an extra line or two of work, but also a fair bit more control if that's what you're looking for.
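
If you need some of that control from requests, much of it is available through transport adapters. A minimal sketch, assuming you want to size the connection pool and enable retries (the specific numbers here are illustrative, not recommendations):

import requests
from requests.adapters import HTTPAdapter

session = requests.session()
# Each HTTPAdapter wraps a urllib3 PoolManager; the pool-sizing and
# retry arguments are passed through to the underlying pool
adapter = HTTPAdapter(pool_connections=10, pool_maxsize=10, max_retries=3)
session.mount('http://', adapter)
session.mount('https://', adapter)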

All said and done, the comparison becomes:

1) Using urllib3:

import urllib3, certifi
from bs4 import BeautifulSoup

http = urllib3.PoolManager(ca_certs=certifi.where())
html = http.request('GET', url).data  # request() preloads the body into .data
soup = BeautifulSoup(html, "html5lib")

2) Using requests:

import requests
from bs4 import BeautifulSoup

session = requests.session()
html = session.get(url).content
soup = BeautifulSoup(html, "html5lib")
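
One follow-up on the requests version: a session also works as a context manager, which closes its pooled connections when the block exits. A minimal sketch (url is assumed to be defined, as above):

import requests
from bs4 import BeautifulSoup

with requests.session() as session:
    html = session.get(url).content  # connections are pooled inside the block
soup = BeautifulSoup(html, "html5lib")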