Is there any way of limiting the amount of data CURL will fetch? I'm screen scraping data off a page that is 50 KB, but the data I require is in the top quarter of the page, so I really only need to retrieve the first 10 KB.
I'm asking because there is a lot of data I need to monitor, which results in me transferring close to 60 GB of data per month when only about 5 GB of this bandwidth is relevant.
I am using PHP to process the data, but I am flexible in my data retrieval approach: I can use CURL, WGET, fopen, etc.
One approach I'm considering is:
$fp = fopen("http://www.website.com","r");
fseek($fp,5000);
$data_to_parse = fread($fp,6000);
Does the above mean I will only transfer 6 KB from www.website.com, or will fopen load www.website.com into memory, meaning I will still transfer the full 50 KB?
You may be able to accomplish what you're looking for using CURL as well.
If you look at the documentation for CURLOPT_WRITEFUNCTION you can register a callback that is called whenever data is available for reading from CURL. You could then count the bytes received, and when you've received over 6,000 bytes you can return 0 to abort the rest of the transfer.
The libcurl documentation describes the callback a bit more:
This function gets called by libcurl as soon as there is data received that needs to be saved. Return the number of bytes actually taken care of. If that amount differs from the amount passed to your function, it'll signal an error to the library and it will abort the transfer and return CURLE_WRITE_ERROR.
The callback function will be passed as much data as possible in all invokes, but you cannot possibly make any assumptions. It may be one byte, it may be thousands.
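As a rough illustration of the callback approach described above, here is a minimal PHP sketch. The helper names `make_limited_writer` and `fetch_first_bytes` are made up for this example, and the byte limit is whatever you pass in. It assumes the cURL extension is loaded.

```php
<?php
// Hypothetical helper: builds a write callback that keeps at most $limit bytes
// in $buffer and returns 0 once the limit is reached, which makes libcurl abort.
function make_limited_writer(int $limit, string &$buffer): callable
{
    return function ($ch, string $chunk) use ($limit, &$buffer): int {
        $remaining = $limit - strlen($buffer);
        $buffer .= substr($chunk, 0, max(0, $remaining));
        // Returning a count different from strlen($chunk) aborts the transfer.
        return strlen($buffer) >= $limit ? 0 : strlen($chunk);
    };
}

// Hypothetical wrapper: fetches $url and stops after roughly $limit bytes.
function fetch_first_bytes(string $url, int $limit): string
{
    $buffer = '';
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_WRITEFUNCTION, make_limited_writer($limit, $buffer));
    // curl_exec() returns false here when the callback aborts; that is expected.
    curl_exec($ch);
    return $buffer;
}
```

Usage would look like `$data_to_parse = fetch_first_bytes("http://www.website.com", 6000);`. Note that the server may already have sent more than the limit by the time the transfer is aborted, so the bytes on the wire can be somewhat higher than the bytes kept.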
This is more an HTTP than a CURL question, in fact.
As you guessed, the whole page is going to be downloaded if you use fopen, whether or not you seek to offset 5000.
The best way to achieve what you want would be to use a partial HTTP GET request, as stated in the HTTP/1.1 RFC (http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html):
The semantics of the GET method change
to a "partial GET" if the request
message includes a Range header field.
A partial GET requests that only part
of the entity be transferred, as
described in section 14.35. The
partial GET method is intended to
reduce unnecessary network usage by
allowing partially-retrieved entities
to be completed without transferring
data already held by the client.
The details of partial GET requests using Range headers are described here:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35.2
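If you prefer to stay with fopen, one way to send a Range header is through a stream context. The following is only a sketch under that assumption; `range_for_first_bytes` and `fetch_range_with_streams` are hypothetical names, and it still caps the read in case the server ignores the header.

```php
<?php
// Hypothetical helper: formats a Range value covering the first $length bytes.
function range_for_first_bytes(int $length): string
{
    return 'bytes=0-' . ($length - 1); // byte ranges are inclusive
}

// Hypothetical wrapper: requests only the first $length bytes of $url.
// Returns null if the stream could not be opened or read.
function fetch_range_with_streams(string $url, int $length): ?string
{
    $context = stream_context_create([
        'http' => [
            'method' => 'GET',
            'header' => 'Range: ' . range_for_first_bytes($length) . "\r\n",
        ],
    ]);
    $fp = @fopen($url, 'r', false, $context);
    if ($fp === false) {
        return null;
    }
    // Cap the read as well, in case the server answers 200 with the full body.
    $data = stream_get_contents($fp, $length);
    fclose($fp);
    return $data === false ? null : $data;
}
```

For the 10 KB case in the question, `fetch_range_with_streams("http://www.website.com", 10240)` would send `Range: bytes=0-10239`.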
Try an HTTP Range request:
GET /largefile.html HTTP/1.1
Range: bytes=0-6000
If the server supports range requests, it will return a 206 Partial Content response code with a Content-Range header and your requested range of bytes (if it doesn't, it will return 200 and the whole file). See http://benramsey.com/archives/206-partial-content-and-range-requests/ for a nice explanation of range requests.
See also "Resumable downloads when using PHP to send the file?".
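With PHP's cURL extension, the same request can be sketched using CURLOPT_RANGE, then checking whether the server answered 206 or fell back to 200. The function names and the shape of the returned array below are assumptions made for this example only.

```php
<?php
// Hypothetical helper: parses a Content-Range value such as "bytes 0-6000/51200".
// Returns null when the value does not match that form.
function parse_content_range(string $value): ?array
{
    if (preg_match('~^bytes (\d+)-(\d+)/(\d+|\*)$~', trim($value), $m) !== 1) {
        return null;
    }
    return [
        'first' => (int) $m[1],
        'last'  => (int) $m[2],
        'total' => $m[3] === '*' ? null : (int) $m[3],
    ];
}

// Hypothetical wrapper: asks for bytes $first..$last and reports what came back.
function fetch_range_with_curl(string $url, int $first, int $last): array
{
    $contentRange = null;
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_RANGE, $first . '-' . $last);
    curl_setopt($ch, CURLOPT_HEADERFUNCTION,
        function ($ch, string $line) use (&$contentRange): int {
            if (stripos($line, 'Content-Range:') === 0) {
                $contentRange = parse_content_range(substr($line, strlen('Content-Range:')));
            }
            return strlen($line); // must report every header byte as handled
        });
    $body = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    return [
        'partial' => $status === 206, // false means the server ignored the range
        'range'   => $contentRange,
        'body'    => $body === false ? '' : $body,
    ];
}
```

If `partial` comes back false, the whole file was transferred, so it may be worth combining this with a byte limit on the client side for servers that do not honour ranges.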
It will download the whole page with the fopen call, but then it will only read 6 KB from that page.
From the PHP manual:
Reading stops as soon as one of the following conditions is met:
- length bytes have been read
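Because reading can stop for reasons other than the length being reached, a single fread call on a network stream may return fewer bytes than requested. A small loop, sketched below with the hypothetical name `read_up_to`, keeps reading until the requested length or the end of the stream is reached.

```php
<?php
// Hypothetical helper: reads until $length bytes are collected or the stream ends.
function read_up_to($stream, int $length): string
{
    $data = '';
    while (strlen($data) < $length && !feof($stream)) {
        $chunk = fread($stream, $length - strlen($data));
        if ($chunk === false || $chunk === '') {
            break; // read error or nothing more to read
        }
        $data .= $chunk;
    }
    return $data;
}
```

For example, `read_up_to($fp, 6000)` on an already opened stream returns at most 6000 bytes and leaves the rest unread.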