I'm teaching myself some basic scraping and I've found that sometimes the URLs that I feed into my code return 404, which gums up all the rest of my code.
So I need a test at the top of the code to check if the URL returns 404 or not.
This would seem like a pretty straightforward task, but Google's not giving me any answers. I worry I'm searching for the wrong stuff.
One blog recommended I use this:
$valid = @fsockopen($url, 80, $errno, $errstr, 30);
and then test to see if $valid is empty or not.
But I think the URL that's giving me problems has a redirect on it, so $valid is coming up empty for all values. Or perhaps I'm doing something else wrong.
I've also looked into a "head request" but I've yet to find any actual code examples I can play with or try out.
Suggestions? And what's this about curl?
This will give you true if the URL does not return 200 OK:
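A minimal sketch of that check with cURL (the function name url_not_ok is just illustrative):

function url_not_ok($url) {
    $handle = curl_init($url);
    curl_setopt($handle, CURLOPT_NOBODY, true);          // HEAD-style request: headers only
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);  // don't print the response
    curl_exec($handle);
    $httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
    curl_close($handle);
    return $httpCode != 200;                             // TRUE when the URL is not 200 OK
}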
As an additional hint to the great accepted answer:
When using a variation of the proposed solution, I got errors because of the PHP setting 'max_execution_time'. So what I did was the following: first I set the time limit to a higher number of seconds, and at the end I set it back to the value defined in the PHP settings.
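A minimal sketch of that idea (the 60-second value is just an assumption):

// Raise the limit so a slow remote host doesn't abort the script.
set_time_limit(60);

// ... perform the URL check here (e.g. the cURL snippet from the accepted answer) ...

// Restore the limit configured in php.ini afterwards.
set_time_limit((int) ini_get('max_execution_time'));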
Here is a short solution:
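A sketch of what I mean (the 5-second connect timeout is an assumption):

$handle = curl_init($url);
curl_setopt($handle, CURLOPT_CONNECTTIMEOUT, 5);
curl_setopt($handle, CURLOPT_NOBODY, true);   // only fetch the headers
curl_setopt($handle, CURLOPT_HTTPHEADER, array('Accept: application/rdf+xml'));
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
curl_exec($handle);
$httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
curl_close($handle);

if ($httpCode == 404) {
    // the URL returned a 404
}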
In your case, you can change application/rdf+xml to whatever you use.

If you're running PHP 5 you can use:
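For example, with the built-in get_headers() (the URL is a placeholder):

$url = 'http://www.example.com';
print_r(get_headers($url, 1));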
Alternatively, with PHP 4 a user has contributed the following:
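A sketch of what such a PHP 4 fallback might look like, built on fsockopen() and a hand-written HEAD request (the function name get_url_headers and its details are illustrative):

function get_url_headers($url) {
    $parts = parse_url($url);
    $host  = $parts['host'];
    $path  = isset($parts['path']) ? $parts['path'] : '/';
    $port  = isset($parts['port']) ? $parts['port'] : 80;

    $fp = @fsockopen($host, $port, $errno, $errstr, 30);
    if (!$fp) {
        return false; // could not connect at all
    }

    // Ask for the headers only.
    fputs($fp, "HEAD $path HTTP/1.1\r\nHost: $host\r\nConnection: close\r\n\r\n");

    $headers = array();
    while (!feof($fp)) {
        $line = trim(fgets($fp, 1024));
        if ($line == '') {
            break; // blank line ends the header block
        }
        $headers[] = $line;
    }
    fclose($fp);
    return $headers;
}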
Both would produce a result similar to the following (header names and values here are illustrative):
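Array
(
    [0] => HTTP/1.1 200 OK
    [Date] => Sat, 29 May 2004 12:28:14 GMT
    [Server] => Apache/1.3.27 (Unix)
    [Content-Type] => text/html
)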
Therefore you could just check to see that the header response was OK, e.g.:
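A minimal sketch of that check:

$headers = get_headers($url, 1);
if ($headers[0] == 'HTTP/1.1 200 OK') {
    // the URL exists and answered 200
}
// Note: some servers answer with HTTP/1.0, so a looser check such as
// strpos($headers[0], '200') !== false can be more forgiving.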
See the W3C Codes and Definitions list for the full set of status codes.
With strager's code, you can also check CURLINFO_HTTP_CODE for other codes. Some websites do not report a 404; instead they simply redirect to a custom 404 page and return 302 (redirect) or something similar. I used this to check whether an actual file (e.g. robots.txt) existed on the server. Clearly this kind of file would not cause a redirect if it existed, but if it didn't it would redirect to a 404 page, which, as I said before, may not have a 404 code.
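A sketch of that robots.txt check (the URL and variable names are mine):

$handle = curl_init('http://www.example.com/robots.txt');
curl_setopt($handle, CURLOPT_NOBODY, true);           // headers only
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
curl_setopt($handle, CURLOPT_FOLLOWLOCATION, false);  // don't follow a redirect to a custom 404 page
curl_exec($handle);
$code = curl_getinfo($handle, CURLINFO_HTTP_CODE);
curl_close($handle);

$fileExists = ($code == 200);  // anything else (404, 302, ...) counts as missing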