Easy way to test a URL for 404 in PHP?

2018-12-31 21:58发布

I'm teaching myself some basic scraping and I've found that sometimes the URL's that I feed into my code return 404, which gums up all the rest of my code.

So I need a test at the top of the code to check if the URL returns 404 or not.

This would seem like a pretty straightfoward task, but Google's not giving me any answers. I worry I'm searching for the wrong stuff.

One blog recommended I use this:

$valid = @fsockopen($url, 80, $errno, $errstr, 30);

and then test to see if $valid if empty or not.

But I think the URL that's giving me problems has a redirect on it, so $valid is coming up empty for all values. Or perhaps I'm doing something else wrong.

I've also looked into a "head request" but I've yet to find any actual code examples I can play with or try out.

Suggestions? And what's this about curl?

14条回答
君临天下
2楼-- · 2018-12-31 22:03

If you are using PHP's curl bindings, you can check the error code using curl_getinfo as such:

$handle = curl_init($url);
curl_setopt($handle,  CURLOPT_RETURNTRANSFER, TRUE);

/* Get the HTML or whatever is linked in $url. */
$response = curl_exec($handle);

/* Check for 404 (file not found). */
$httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
if($httpCode == 404) {
    /* Handle 404 here. */
}

curl_close($handle);

/* Handle $response here. */
查看更多
伤终究还是伤i
3楼-- · 2018-12-31 22:04

If you are looking for an easiest solution and the one you can try in one go on php5 do

file_get_contents('www.yoursite.com');
//and check by echoing
echo $http_response_header[0];
查看更多
若你有天会懂
4楼-- · 2018-12-31 22:05

To catch all errors : 4XX and 5XX, i use this little script :

function URLIsValid($URL){
    $headers = @get_headers($URL);
    preg_match("/ [45][0-9]{2} /", (string)$headers[0] , $match);
    return count($match) === 0;
}
查看更多
还给你的自由
5楼-- · 2018-12-31 22:08

addendum;tested those 3 methods considering performance.

The result, at least in my testing environment:

Curl wins

This test is done under the consideration that only the headers (noBody) is needed. Test yourself:

$url = "http://de.wikipedia.org/wiki/Pinocchio";

$start_time = microtime(TRUE);
$headers = get_headers($url);
echo $headers[0]."<br>";
$end_time = microtime(TRUE);
echo $end_time - $start_time."<br>";


$start_time = microtime(TRUE);
$response = file_get_contents($url);
echo $http_response_header[0]."<br>";
$end_time = microtime(TRUE);
echo $end_time - $start_time."<br>";

$start_time = microtime(TRUE);
$handle = curl_init($url);
curl_setopt($handle,  CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($handle, CURLOPT_NOBODY, 1); // and *only* get the header 
/* Get the HTML or whatever is linked in $url. */
$response = curl_exec($handle);
/* Check for 404 (file not found). */
$httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
// if($httpCode == 404) {
    // /* Handle 404 here. */
// }
echo $httpCode."<br>";
curl_close($handle);
$end_time = microtime(TRUE);
echo $end_time - $start_time."<br>";
查看更多
呛了眼睛熬了心
6楼-- · 2018-12-31 22:09

I found this answer here:

if(($twitter_XML_raw=file_get_contents($timeline))==false){
    // Retrieve HTTP status code
    list($version,$status_code,$msg) = explode(' ',$http_response_header[0], 3);

    // Check the HTTP Status code
    switch($status_code) {
        case 200:
                $error_status="200: Success";
                break;
        case 401:
                $error_status="401: Login failure.  Try logging out and back in.  Password are ONLY used when posting.";
                break;
        case 400:
                $error_status="400: Invalid request.  You may have exceeded your rate limit.";
                break;
        case 404:
                $error_status="404: Not found.  This shouldn't happen.  Please let me know what happened using the feedback link above.";
                break;
        case 500:
                $error_status="500: Twitter servers replied with an error. Hopefully they'll be OK soon!";
                break;
        case 502:
                $error_status="502: Twitter servers may be down or being upgraded. Hopefully they'll be OK soon!";
                break;
        case 503:
                $error_status="503: Twitter service unavailable. Hopefully they'll be OK soon!";
                break;
        default:
                $error_status="Undocumented error: " . $status_code;
                break;
    }

Essentially, you use the "file get contents" method to retrieve the URL, which automatically populates the http response header variable with the status code.

查看更多
伤终究还是伤i
7楼-- · 2018-12-31 22:11

As strager suggests, look into using cURL. You may also be interested in setting CURLOPT_NOBODY with curl_setopt to skip downloading the whole page (you just want the headers).

查看更多
登录 后发表回答