Is there any way of limiting the amount of data CURL will fetch? I'm screen scraping data off a page that is 50kb; however, the data I require is in the top quarter of the page, so I really only need to retrieve the first 10kb.
I'm asking because there is a lot of data I need to monitor which results in me transferring close to 60GB of data per month, when only about 5GB of this bandwidth is relevant.
I am using PHP to process the data, but I am flexible in my data retrieval approach: I can use CURL, WGET, fopen, etc.
One approach I'm considering is:
$fp = fopen("http://www.website.com","r");
fseek($fp,5000);
$data_to_parse = fread($fp,6000);
Does the above mean I will only transfer 6kb from www.website.com, or will fopen load www.website.com into memory, meaning I will still transfer the full 50kb?
You may also be able to accomplish what you're looking for using CURL.
If you look at the documentation for CURLOPT_WRITEFUNCTION you can register a callback that is called whenever data is available for reading from CURL. You could then count the bytes received, and when you've received over 6,000 bytes you can return 0 to abort the rest of the transfer.
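A minimal sketch of that callback approach in PHP might look like the following. The helper names `makeLimitedWriter` and `fetchFirstBytes` are hypothetical, and the 6,000-byte limit is just the figure from the question. cURL hands data to the callback in chunks, so somewhat more than the limit may still cross the wire before the transfer is aborted.

```php
<?php
// Hypothetical helper: builds a write callback that keeps at most $limit bytes
// in $buffer and then tells cURL to stop.
function makeLimitedWriter(int $limit, string &$buffer): callable
{
    return function ($ch, string $chunk) use ($limit, &$buffer): int {
        $remaining = $limit - strlen($buffer);
        $buffer .= substr($chunk, 0, $remaining);

        if (strlen($buffer) >= $limit) {
            // Returning a count other than strlen($chunk) aborts the transfer.
            return 0;
        }
        // Tell cURL the whole chunk was consumed so it keeps going.
        return strlen($chunk);
    };
}

// Hypothetical wrapper: fetches roughly the first $limit bytes of $url.
function fetchFirstBytes(string $url, int $limit): string
{
    $data = '';
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_WRITEFUNCTION, makeLimitedWriter($limit, $data));
    // curl_exec() returns false when the callback aborts; that is expected here.
    curl_exec($ch);
    return $data;
}
```

Calling `fetchFirstBytes("http://www.website.com", 6000)` would then return at most 6,000 bytes. Because an aborted transfer is reported as a failure, this sketch cannot tell a deliberate abort from a real network error; checking `curl_errno($ch)` would be one way to separate the two.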
The libcurl documentation describes the callback a bit more:
Try an HTTP Range request:
If the server supports range requests, it will return a 206 Partial Content response code with a Content-Range header and your requested range of bytes (if it doesn't, it will return 200 and the whole file). See http://benramsey.com/archives/206-partial-content-and-range-requests/ for a nice explanation of range requests.
See also "Resumable downloads when using PHP to send the file?".
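As a rough illustration, a range request with PHP's cURL extension could be sketched like this. `rangeSpec` and `fetchRange` are hypothetical helper names; `CURLOPT_RANGE` expects the range in "first-last" form, with both byte offsets inclusive.

```php
<?php
// Hypothetical helper: turns an offset and a length into an inclusive "first-last" range.
function rangeSpec(int $start, int $length): string
{
    return $start . '-' . ($start + $length - 1);
}

// Hypothetical helper: requests $length bytes starting at $start and reports
// whether the server actually honoured the range.
function fetchRange(string $url, int $start, int $length): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_RANGE, rangeSpec($start, $length));

    $body = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    return [
        // 206 means only the requested bytes were sent; 200 means the whole page came back.
        'partial' => $status === 206,
        'body'    => $body,
    ];
}
```

For the first 10kb of the page this would be `fetchRange("http://www.website.com", 0, 10000)`. If `partial` comes back false, the server ignored the header and the full page was transferred anyway.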
This is more an HTTP than a CURL question, in fact.
As you guessed, the whole page is going to be downloaded if you use fopen. It makes no difference whether you seek to offset 5000 or not.
The best way to achieve what you want would be to use a partial HTTP GET request, as stated in the HTTP/1.1 RFC (http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html):
The details of partial GET requests using Ranges are described here: http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35.2
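If you would rather stay with fopen, one possible sketch is to send the Range header through a stream context. `buildRangeContext` and `readRange` are hypothetical names, and this assumes `allow_url_fopen` is enabled. The byte positions follow the RFC's inclusive first-last form.

```php
<?php
// Hypothetical helper: creates an HTTP stream context that asks for bytes $first..$last.
function buildRangeContext(int $first, int $last)
{
    return stream_context_create([
        'http' => [
            'method' => 'GET',
            'header' => "Range: bytes={$first}-{$last}\r\n",
        ],
    ]);
}

// Hypothetical helper: opens $url with the Range header and reads at most the
// number of bytes that were asked for.
function readRange(string $url, int $first, int $last): string
{
    $fp = fopen($url, 'r', false, buildRangeContext($first, $last));
    if ($fp === false) {
        return '';
    }
    // Cap the read so a server that ignores the header cannot hand us the whole page.
    $data = stream_get_contents($fp, $last - $first + 1);
    fclose($fp);

    return $data === false ? '' : $data;
}
```

Note that if the server ignores the Range header and answers 200, the bytes read here start at the beginning of the page rather than at `$first`, so the response status is worth checking before relying on the offsets.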
It will download the whole page with the fopen call, but then it will only read 6kb from that page. From the PHP manual: