I'm writing a web scraper using python-requests.
Each page is over 1MB, but the actual data I need to extract is very early on in the document's flow, so I'm wasting time downloading a lot of unnecessary data.
If possible I would like to stop the download as soon as the required data appears in the document's source code, in order to save time.
For example, I only want to extract the text in the "abc" div; the rest of the document is useless:
<html>
<head>
<title>My site</title>
</head>
<body>
<div id="abc">blah blah...</div>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Mauris fermentum molestie ligula, a pharetra eros mollis ut.</p>
<p>Quisque auctor volutpat lobortis. Vestibulum pellentesque lacus sapien, quis vulputate enim mollis a. Vestibulum ultrices fermentum urna ac sodales.</p>
<p>Nunc sit amet augue at dolor fermentum ultrices. Curabitur faucibus porttitor vehicula. Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>
<p>Etiam sed leo at ipsum blandit dignissim ut a est.</p>
</body>
</html>
Currently I'm simply doing:
r = requests.get(URL)
I landed here from the question "Open first N characters of a url file with Python". However, I don't think that's a strict duplicate, since its title doesn't explicitly say whether using the requests module is mandatory. Also, the server you are requesting from may not support byte ranges, for whatever reason. In that case, I'd rather simply talk HTTP directly.
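As a rough sketch of what talking HTTP directly could look like, the snippet below opens a plain socket, sends a bare HTTP/1.0 GET, and stops reading as soon as a marker shows up in the bytes received so far. The helper names (build_request, read_until, fetch_prefix), the default marker and the size limit are my own assumptions for illustration. It only covers plain HTTP on port 80 (no TLS, no redirects), and the returned bytes still include the response headers.

```python
import socket


def build_request(host, path="/"):
    """Build a minimal HTTP/1.0 GET request as bytes.

    HTTP/1.0 is used so the server does not reply with chunked encoding.
    """
    lines = [
        f"GET {path} HTTP/1.0",
        f"Host: {host}",
        "Connection: close",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def read_until(recv, marker, chunk_size=1024, limit=1_000_000):
    """Call recv(chunk_size) until marker appears, the peer closes, or limit is reached."""
    data = b""
    while marker not in data and len(data) < limit:
        chunk = recv(chunk_size)
        if not chunk:  # empty read: the server closed the connection
            break
        data += chunk
    return data


def fetch_prefix(host, path="/", marker=b"</div>", port=80):
    """Download only as much of the response as is needed to see marker."""
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(build_request(host, path))
        # Leaving the with-block closes the socket, abandoning the rest of the body.
        return read_until(sock.recv, marker)
```

Calling something like fetch_prefix("example.com", "/page.html", b"</div>") would then return the headers plus the start of the body, up to the end of the chunk in which the marker first appeared.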
What you want to use here is the Range HTTP header. See: http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html (specifically the bit on Range).
See also API Docs on Custom Headers
Example:
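A minimal sketch of sending a Range header through requests' custom headers. The helper names range_header and fetch_first_bytes and the 2048-byte default are assumptions for illustration, not part of the requests API. A server that supports ranges should answer 206 Partial Content; one that ignores the header will typically answer 200 and send the whole document anyway.

```python
def range_header(first_byte, last_byte):
    """Build a Range header asking for bytes first_byte..last_byte (inclusive)."""
    return {"Range": f"bytes={first_byte}-{last_byte}"}


def fetch_first_bytes(url, count=2048):
    """Request only the first `count` bytes of url and return (status, body)."""
    # Imported here so range_header stays usable without the third-party package.
    import requests

    r = requests.get(url, headers=range_header(0, count - 1), timeout=10)
    # 206 means the server honoured the range;
    # 200 means it ignored the header and returned the full body.
    return r.status_code, r.content
```

Checking the status code before parsing is worthwhile, since a 200 response means no bandwidth was saved.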