Python - save requests or BeautifulSoup object loc

I have some code that is quite long, so it takes a long time to run. I want to simply save either the requests object (in this case "name") or the BeautifulSoup object (in this case "soup") locally so that next time I can save time. Here is the code:

from bs4 import BeautifulSoup
import requests

url = 'SOMEURL'
name = requests.get(url)
soup = BeautifulSoup(name.content)

标签： python file beautifulsoup scrape

2条回答

霸刀☆藐视天下

2楼-- · 2020-04-05 09:01

Storing requests locally and restoring them as Beautifoul Soup object latter on

If you are iterating through pages of web site you can store each page with request explained here. Create folder soupCategory in same folder where your script is.

Use any latest user agent for headers

headers = {'user-agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0 Safari/605.1.15'}

def getCategorySoup():
    session = requests.Session()
    retry = Retry(connect=7, backoff_factor=0.5)

    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)

    basic_url = "https://www.somescrappingdomain.com/apartments?adsWithImages=1&page="    
    t0 = time.time() 
    j=0    
    totalPages = 1525 # put your number of pages here        
    for i in range(1,totalPages):         
        url = basic_url+str(i)
        r  = requests.get(url, headers=headers)
        pageName = "./soupCategory/"+str(i)+".html"
        with open(pageName, mode='w', encoding='UTF-8', errors='strict', buffering=1) as f:
            f.write(r.text)        
            print (pageName, end=" ")
    t1 = time.time()
    total = t1-t0
    print ("Total time for getting ",totalPages," category pages is ", round(total), " seconds")
    return

Latter on you can create Beautifoul Soup object as @merlin2011 mentioned with:

with open("/soupCategory/1.html") as f:
  soup = BeautifulSoup(f)

0人赞添加讨论(0) 举报

放我归山

3楼-- · 2020-04-05 09:17

Since name.content is just HTML, you can just dump this to a file and read it back later.

Usually the bottleneck is not the parsing, but instead the network latency of making requests.

from bs4 import BeautifulSoup
import requests

url = 'https://google.com'
name = requests.get(url)

with open("/tmp/A.html", "w") as f:
  f.write(name.content)


# read it back in
with open("/tmp/A.html") as f:
  soup = BeautifulSoup(f)
  # do something with soup

Here is some anecdotal evidence for the fact that bottleneck is in the network.

from bs4 import BeautifulSoup
import requests
import time

url = 'https://google.com'

t1 = time.clock();
name = requests.get(url)
t2 = time.clock();
soup = BeautifulSoup(name.content)
t3 = time.clock();

print t2 - t1, t3 - t2

Output, from running on Thinkpad X1 Carbon, with a fast campus network.

0.11 0.02

0人赞添加讨论(0) 举报

Python - save requests or BeautifulSoup object loc

Storing requests locally and restoring them as Beautifoul Soup object latter on

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间