I'm teaching myself some basic scraping, and I've found that sometimes the URLs that I feed into my code return 404, which gums up all the rest of my code.
So I need a test at the top of the code to check if the URL returns 404 or not.
This would seem like a pretty straightforward task, but Google's not giving me any answers. I worry I'm searching for the wrong stuff.
One blog recommended I use this:
$valid = @fsockopen($url, 80, $errno, $errstr, 30);
and then test to see if $valid is empty or not.
But I think the URL that's giving me problems has a redirect on it, so $valid is coming up empty for all values. Or perhaps I'm doing something else wrong.
I've also looked into a "head request" but I've yet to find any actual code examples I can play with or try out.
Suggestions? And what's this about curl?
If you are using PHP's curl bindings, you can check the error code using curl_getinfo, as such:
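A minimal sketch of that check (the URL here is just a placeholder; swap in whatever you're testing):

    <?php
    $url = 'http://www.example.com/some/page'; // placeholder URL

    $handle = curl_init($url);
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_exec($handle);

    // CURLINFO_HTTP_CODE gives the HTTP status of the response
    $httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
    curl_close($handle);

    if ($httpCode == 404) {
        // handle the missing page here
    }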
If you are looking for the easiest solution, and one you can try in one go on PHP 5, do this:
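Something along these lines (again, the URL is only an example):

    <?php
    @file_get_contents('http://www.example.com/some/page'); // @ hides the warning a 404 triggers

    // PHP fills $http_response_header with the headers of the last request in this scope;
    // it won't be set at all if the host couldn't be reached.
    echo $http_response_header[0]; // e.g. "HTTP/1.1 404 Not Found"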
To catch all errors, 4XX and 5XX, I use this little script:
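A sketch of what that kind of helper can look like, built on cURL with CURLOPT_NOBODY so only the headers come back (the function name and timeout are my own choices):

    <?php
    // Returns true when the URL answers with a 4XX or 5XX status (or can't be reached at all).
    function url_has_error($url)
    {
        $handle = curl_init($url);
        curl_setopt($handle, CURLOPT_NOBODY, true);          // HEAD-style request: headers only
        curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);  // don't echo anything
        curl_setopt($handle, CURLOPT_FOLLOWLOCATION, true);  // follow redirects to the final status
        curl_setopt($handle, CURLOPT_TIMEOUT, 10);

        curl_exec($handle);
        $code = curl_getinfo($handle, CURLINFO_HTTP_CODE);
        curl_close($handle);

        // $code is 0 when the request never got a response (DNS failure, timeout, ...)
        return $code === 0 || $code >= 400;
    }

    var_dump(url_has_error('http://www.example.com/does-not-exist'));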
Addendum: I tested those three methods with performance in mind.
The result, at least in my testing environment:
Curl wins
This test was done under the assumption that only the headers (no body) are needed. Test it yourself:
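A rough harness for running the comparison yourself, assuming the three methods in question are cURL with CURLOPT_NOBODY, get_headers, and file_get_contents (the URL and repetition count are arbitrary):

    <?php
    $url  = 'http://www.example.com/';
    $runs = 50;

    // Ask the stream-based methods to send HEAD requests so no body is transferred,
    // matching what cURL does with CURLOPT_NOBODY.
    stream_context_set_default(['http' => ['method' => 'HEAD']]);

    $methods = [
        'curl (CURLOPT_NOBODY)' => function ($url) {
            $handle = curl_init($url);
            curl_setopt($handle, CURLOPT_NOBODY, true);
            curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
            curl_exec($handle);
            $code = curl_getinfo($handle, CURLINFO_HTTP_CODE);
            curl_close($handle);
            return $code;
        },
        'get_headers' => function ($url) {
            $headers = @get_headers($url);
            return $headers ? $headers[0] : false;
        },
        'file_get_contents' => function ($url) {
            @file_get_contents($url);
            return isset($http_response_header) ? $http_response_header[0] : false;
        },
    ];

    foreach ($methods as $name => $check) {
        $start = microtime(true);
        for ($i = 0; $i < $runs; $i++) {
            $check($url);
        }
        printf("%-22s %.3f s\n", $name, microtime(true) - $start);
    }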
I found this answer here:
Essentially, you use file_get_contents to retrieve the URL, which automatically populates the $http_response_header variable with the response headers, including the status line.
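A small sketch of that approach, pulling the numeric code out of the status line (the URL is a placeholder):

    <?php
    $url = 'http://www.example.com/might-not-exist'; // placeholder

    @file_get_contents($url); // @ suppresses the warning that a 404 response raises

    if (isset($http_response_header)) {
        // $http_response_header[0] is the status line, e.g. "HTTP/1.1 404 Not Found"
        preg_match('{HTTP/\S+\s+(\d{3})}', $http_response_header[0], $match);
        $status = (int) $match[1];

        if ($status === 404) {
            echo "Got a 404, skipping this one.\n";
        }
    }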
As strager suggests, look into using cURL. You may also be interested in setting CURLOPT_NOBODY with curl_setopt to skip downloading the whole page (you just want the headers).