Only download a part of the document using python

I'm writing a web scraper using python-requests.

Each page is over 1MB, but the actual data I need to extract is very early on in the document's flow, so I'm wasting time downloading a lot of unnecessary data.

If possible I would like to stop the download as soon as the required data appears in the document's source code, in order to save time.

For example, I only want to extract the text in the "abc" div; the rest of the document is useless:

<html>
<head>
<title>My site</title>
</head>
<body>

<div id="abc">blah blah...</div>

<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Mauris fermentum molestie ligula, a pharetra eros mollis ut.</p>
<p>Quisque auctor volutpat lobortis. Vestibulum pellentesque lacus sapien, quis vulputate enim mollis a. Vestibulum ultrices fermentum urna ac sodales.</p>
<p>Nunc sit amet augue at dolor fermentum ultrices. Curabitur faucibus porttitor vehicula. Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>
<p>Etiam sed leo at ipsum blandit dignissim ut a est.</p>

</body>
</html>

Currently I'm simply doing:

r = requests.get(URL)

Tags: python http python-requests

2 Answers

smile是对你的礼貌

#2 · 2019-09-09 22:27

I landed here from the question "Open first N characters of a url file with Python". However, I don't think that is a strict duplicate, since its title doesn't say whether using the requests module is mandatory. Also, the server you are querying may not support byte ranges, for whatever reason. In that case, I'd rather simply talk HTTP directly:

#!/usr/bin/env python3

import socket

TCP_HOST = 'stackoverflow.com'  # This is the host we are going to query
TCP_PORT = 80                   # This is the standard port for plain HTTP
MAX_LIMIT = 1024                # Maximum number of bytes we want to read

# Create the request to talk HTTP/1.1; the headers must end with an empty line (\r\n\r\n)
MESSAGE = (
    "GET /questions/23602412/only-download-a-part-of-the-document-using-python-requests HTTP/1.1\r\n"
    "Host: stackoverflow.com\r\n"
    "User-Agent: Custom/0.0.1\r\n"
    "Accept: */*\r\n"
    "\r\n"
).encode("ascii")  # Sockets send bytes, not str

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)  # Create a TCP socket
s.connect((TCP_HOST, TCP_PORT))  # Connect to remote socket at given address
s.sendall(MESSAGE)               # Send the whole request

# Keep reading from the socket until the limit is reached or the server closes.
# recv() blocks until data arrives, so no sleep is needed before reading.
data = b""
while len(data) < MAX_LIMIT:
    chunk = s.recv(MAX_LIMIT - len(data))
    if not chunk:  # Empty read: the peer closed the connection
        break
    data += chunk

s.close()  # Closing the socket stops the rest of the transfer

# Everyone likes a happy ending!
print(data.decode("utf-8", errors="replace") + "\n")
print("Length of received data:", len(data))

Sample run:

$ python sample.py
HTTP/1.1 200 OK
Cache-Control: private
Content-Type: text/html; charset=utf-8
X-Frame-Options: SAMEORIGIN
X-Request-Guid: 3098c32c-3423-4e8a-9c7e-6dd530acdf8c
Content-Length: 73444
Accept-Ranges: bytes
Date: Fri, 05 Aug 2016 03:21:55 GMT
Via: 1.1 varnish
Connection: keep-alive
X-Served-By: cache-sin6926-SIN
X-Cache: MISS
X-Cache-Hits: 0
X-Timer: S1470367315.724674,VS0,VE246
X-DNS-Prefetch-Control: off
Set-Cookie: prov=c33383b6-3a4d-730f-02b9-0eab064b3487; domain=.stackoverflow.com; expires=Fri, 01-Jan-2055 00:00:00 GMT; path=/; HttpOnly

<!DOCTYPE html>
<html itemscope itemtype="http://schema.org/QAPage">
<head>

<title>http - Only download a part of the document using python requests - Stack Overflow</title>
    <link rel="shortcut icon" href="//cdn.sstatic.net/Sites/stackoverflow/img/favicon.ico?v=4f32ecc8f43d">
    <link rel="apple-touch-icon image_src" href="//cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon.png?v=c78bd457575a">
    <link rel="search" type="application/open

Length of received data: 1024
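
Since the question is about python-requests, the same stop-early idea can probably be sketched with requests itself: pass stream=True so the body is not downloaded up front, read it chunk by chunk with iter_content, and stop once the wanted markers have been seen. This is only a minimal sketch under some assumptions: read_until and fetch_marked_text are hypothetical helper names, the markers are matched as raw bytes (not real HTML parsing), and the 64 KiB cap is an arbitrary safety limit.

```python
def read_until(chunks, start, end, max_bytes=64 * 1024):
    """Consume byte chunks until `end` shows up after `start`.

    Returns the bytes between the two markers, or None if both markers
    were not seen within max_bytes (or before the chunks ran out).
    """
    buf = b""
    for chunk in chunks:
        buf += chunk
        i = buf.find(start)
        if i != -1:
            j = buf.find(end, i + len(start))
            if j != -1:
                # Both markers seen: stop pulling further chunks
                return buf[i + len(start):j]
        if len(buf) >= max_bytes:
            break  # Give up rather than download the whole page
    return None


def fetch_marked_text(url, start, end, chunk_size=1024):
    """Hypothetical helper: stream `url` and stop once the markers are found."""
    import requests  # third-party; imported here so read_until works without it

    # stream=True defers the body download until we iterate over it
    with requests.get(url, stream=True, timeout=10) as response:
        response.raise_for_status()
        return read_until(response.iter_content(chunk_size=chunk_size), start, end)
    # Leaving the with block closes the response, so the unread rest is dropped
```

For the page in the question, a call might look like fetch_marked_text(URL, b'<div id="abc">', b'</div>'), where URL is whatever page is being scraped. A smaller chunk_size stops closer to the marker at the cost of more read calls.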


叼着烟拽天下

#3 · 2019-09-09 22:45

What you want to use here is the HTTP Range header.

See: http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html (specifically the section on Range).
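
As a rough illustration of the Range approach with requests, the sketch below asks for only the first n bytes and falls back to reading just that much when the server ignores the header. The helper names byte_range_header and fetch_first_bytes are made up for this example, and asking for the "identity" encoding is my own assumption, so that the byte offsets refer to the uncompressed page rather than a gzip stream.

```python
def byte_range_header(first, last):
    """Build a Range header asking for bytes first..last (both inclusive)."""
    if first < 0 or last < first:
        raise ValueError("expected 0 <= first <= last")
    return {"Range": "bytes={}-{}".format(first, last)}


def fetch_first_bytes(url, n=1024):
    """Hypothetical helper: return at most the first n bytes of `url`."""
    import requests  # third-party; imported here so the helper above has no dependency

    headers = byte_range_header(0, n - 1)
    # Assumption: ask for an uncompressed body so offsets match the page source
    headers["Accept-Encoding"] = "identity"

    with requests.get(url, headers=headers, stream=True, timeout=10) as response:
        response.raise_for_status()
        # 206 Partial Content means the range was honoured; a plain 200 means
        # the server ignored it, so in both cases read no more than n bytes.
        buf = b""
        for chunk in response.iter_content(chunk_size=n):
            buf += chunk
            if len(buf) >= n:
                break
        return buf[:n]
```

Checking response.status_code for 206 before relying on the result is probably worthwhile, since servers are free to ignore Range and send the full document.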
