Easy way to test a URL for 404 in PHP?

Posted 2018-12-31 21:46

Question:

I'm teaching myself some basic scraping and I've found that sometimes the URLs that I feed into my code return 404, which gums up all the rest of my code.

So I need a test at the top of the code to check if the URL returns 404 or not.

This would seem like a pretty straightforward task, but Google's not giving me any answers. I worry I'm searching for the wrong stuff.

One blog recommended I use this:

$valid = @fsockopen($url, 80, $errno, $errstr, 30);

and then test to see if $valid is empty or not.

But I think the URL that's giving me problems has a redirect on it, so $valid is coming up empty for all values. Or perhaps I'm doing something else wrong.

I've also looked into a "head request" but I've yet to find any actual code examples I can play with or try out.

Suggestions? And what's this about curl?

Answer 1:

If you are using PHP's curl bindings, you can check the HTTP status code using curl_getinfo as such:

$handle = curl_init($url);
curl_setopt($handle,  CURLOPT_RETURNTRANSFER, TRUE);

/* Get the HTML or whatever is linked in $url. */
$response = curl_exec($handle);

/* Check for 404 (file not found). */
$httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
if($httpCode == 404) {
    /* Handle 404 here. */
}

curl_close($handle);

/* Handle $response here. */


Answer 2:

If you're running PHP 5, you can use:

$url = 'http://www.example.com';
print_r(get_headers($url, 1));

Alternatively, for PHP 4, a user has contributed the following:

/**
This is a modified version of code from "stuart at sixletterwords dot com", at 14-Sep-2005 04:52. This version tries to emulate the get_headers() function in PHP 4. I think it works fairly well, and is simple. It is not the best emulation available, but it works.

Features:
- supports (and requires) full URLs.
- supports changing of default port in URL.
- stops downloading from socket as soon as end-of-headers is detected.

Limitations:
- only gets the root URL (see line with "GET / HTTP/1.1").
- doesn't support HTTPS (nor the default HTTPS port).
*/

if(!function_exists('get_headers'))
{
    function get_headers($url,$format=0)
    {
        $url=parse_url($url);
        $end = "\r\n\r\n";
        $fp = fsockopen($url['host'], (empty($url['port'])?80:$url['port']), $errno, $errstr, 30);
        if ($fp)
        {
            $out  = "GET / HTTP/1.1\r\n";
            $out .= "Host: ".$url['host']."\r\n";
            $out .= "Connection: Close\r\n\r\n";
            $var  = '';
            fwrite($fp, $out);
            while (!feof($fp))
            {
                $var.=fgets($fp, 1280);
                // stop reading once the blank line that ends the headers has arrived
                if(strpos($var,$end) !== false)
                    break;
            }
            fclose($fp);

            $var=preg_replace("/\r\n\r\n.*\$/",'',$var);
            $var=explode("\r\n",$var);
            if($format)
            {
                $v = array(); // initialise so an empty header set still returns an array
                foreach($var as $i)
                {
                    if(preg_match('/^([a-zA-Z -]+): +(.*)$/',$i,$parts))
                        $v[$parts[1]]=$parts[2];
                }
                return $v;
            }
            else
                return $var;
        }
    }
}

Both would have a result similar to:

Array
(
    [0] => HTTP/1.1 200 OK
    [Date] => Sat, 29 May 2004 12:28:14 GMT
    [Server] => Apache/1.3.27 (Unix)  (Red-Hat/Linux)
    [Last-Modified] => Wed, 08 Jan 2003 23:11:55 GMT
    [ETag] => "3f80f-1b6-3e1cb03b"
    [Accept-Ranges] => bytes
    [Content-Length] => 438
    [Connection] => close
    [Content-Type] => text/html
)

Therefore you could just check to see that the header response was OK, e.g.:

$headers = get_headers($url, 1);
if ($headers[0] == 'HTTP/1.1 200 OK') {
    // valid
}

if ($headers[0] == 'HTTP/1.1 301 Moved Permanently') {
    // moved or redirect page
}
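
Comparing against the full status line is brittle: the same 200 response may arrive as "HTTP/1.0 200 OK" or "HTTP/2 200", and an exact string match would miss it. A minimal sketch that extracts just the numeric code, assuming PHP 7.1 or later; the helper name status_code_from_line is hypothetical, not part of PHP:

```php
<?php
// Hypothetical helper (not a PHP built-in): pull the numeric status code out
// of a status line such as "HTTP/1.1 404 Not Found".
// Returns null when the line is not an HTTP status line.
function status_code_from_line(string $statusLine): ?int
{
    if (preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})\b#', $statusLine, $m)) {
        return (int) $m[1];
    }
    return null;
}

var_dump(status_code_from_line('HTTP/1.1 200 OK'));        // int(200)
var_dump(status_code_from_line('HTTP/1.0 404 Not Found')); // int(404)
var_dump(status_code_from_line('not a status line'));      // NULL
```

With this, the check above could be written as status_code_from_line($headers[0]) === 200.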

W3C Codes and Definitions



Answer 3:

With strager's code, you can also check CURLINFO_HTTP_CODE for other codes. Some websites do not report a 404; instead they redirect to a custom 404 page and return 302 (redirect) or something similar. I used this to check whether an actual file (e.g. robots.txt) existed on the server. Such a file would not cause a redirect if it existed, but if it didn't, the server would redirect to a 404 page, which, as I said, may not carry a 404 code. The function below therefore treats any response outside the 2xx range as "not found".

function is_404($url) {
    $handle = curl_init($url);
    curl_setopt($handle,  CURLOPT_RETURNTRANSFER, TRUE);

    /* Get the HTML or whatever is linked in $url. */
    $response = curl_exec($handle);

    /* Check for 404 (file not found). */
    $httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
    curl_close($handle);

    /* If the document has loaded successfully without any redirection or error */
    if ($httpCode >= 200 && $httpCode < 300) {
        return false;
    } else {
        return true;
    }
}


Answer 4:

As strager suggests, look into using cURL. You may also be interested in setting CURLOPT_NOBODY with curl_setopt to skip downloading the whole page (you just want the headers).
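
A minimal sketch of that suggestion, assuming PHP 7 or later with the curl extension loaded. The function name head_status and the default 10-second timeout are illustrative choices, not something from the answers above. Some servers answer HEAD differently from GET, so treat the result as a hint rather than proof.

```php
<?php
// Hypothetical helper: send a HEAD request (CURLOPT_NOBODY) and return the
// HTTP status code of the final response after following redirects.
// Returns 0 when no HTTP response arrived at all (DNS failure, refused
// connection, timeout).
function head_status(string $url, int $timeout = 10): int
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_NOBODY         => true,     // headers only, skip the body
        CURLOPT_RETURNTRANSFER => true,     // do not echo anything
        CURLOPT_FOLLOWLOCATION => true,     // report the code after redirects
        CURLOPT_TIMEOUT        => $timeout, // overall limit in seconds
    ]);
    curl_exec($ch);
    // The handle is released when $ch goes out of scope.
    return (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
}
```

A caller could then skip a URL when head_status($url) === 404, and treat 0 as "could not connect".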



Answer 5:

If you are looking for the easiest solution, one you can try in one go on PHP 5, do:

file_get_contents('http://www.yoursite.com'); // include the scheme, or PHP looks for a local file
// and check by echoing
echo $http_response_header[0];


Answer 6:

I found this answer here:

if(($twitter_XML_raw=file_get_contents($timeline))===false){
    // Retrieve HTTP status code
    list($version,$status_code,$msg) = explode(' ',$http_response_header[0], 3);

    // Check the HTTP Status code
    switch($status_code) {
        case 200:
                $error_status="200: Success";
                break;
        case 401:
                $error_status="401: Login failure.  Try logging out and back in.  Password are ONLY used when posting.";
                break;
        case 400:
                $error_status="400: Invalid request.  You may have exceeded your rate limit.";
                break;
        case 404:
                $error_status="404: Not found.  This shouldn't happen.  Please let me know what happened using the feedback link above.";
                break;
        case 500:
                $error_status="500: Twitter servers replied with an error. Hopefully they'll be OK soon!";
                break;
        case 502:
                $error_status="502: Twitter servers may be down or being upgraded. Hopefully they'll be OK soon!";
                break;
        case 503:
                $error_status="503: Twitter service unavailable. Hopefully they'll be OK soon!";
                break;
        default:
                $error_status="Undocumented error: " . $status_code;
                break;
    }
}

Essentially, you use file_get_contents() to retrieve the URL, which automatically populates the $http_response_header variable with the response headers; its first element is the status line, which contains the status code.
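
One caveat: when the request follows a redirect, $http_response_header holds the header lines of every response in the chain, so element 0 is the status line of the first response, not the last. A rough sketch that scans for the last status line instead, assuming PHP 7.1 or later; final_status_code is a hypothetical helper and the sample lines are made up for illustration:

```php
<?php
// Hypothetical helper: return the status code of the last response found in
// a list of raw header lines, or null if no status line is present.
function final_status_code(array $headerLines): ?int
{
    $code = null;
    foreach ($headerLines as $line) {
        // Every response in a redirect chain starts with its own "HTTP/..." line.
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $code = (int) $m[1];
        }
    }
    return $code;
}

// Made-up sample shaped like $http_response_header after one redirect.
$sample = [
    'HTTP/1.1 301 Moved Permanently',
    'Location: http://www.example.com/new',
    'HTTP/1.1 404 Not Found',
    'Content-Type: text/html',
];
var_dump(final_status_code($sample)); // int(404)
```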



Answer 7:

Addendum: I tested those three methods for performance.

The result, at least in my testing environment:

Curl wins

This test assumes that only the headers (no body) are needed. Test it yourself:

$url = "http://de.wikipedia.org/wiki/Pinocchio";

$start_time = microtime(TRUE);
$headers = get_headers($url);
echo $headers[0]."<br>";
$end_time = microtime(TRUE);
echo $end_time - $start_time."<br>";


$start_time = microtime(TRUE);
$response = file_get_contents($url);
echo $http_response_header[0]."<br>";
$end_time = microtime(TRUE);
echo $end_time - $start_time."<br>";

$start_time = microtime(TRUE);
$handle = curl_init($url);
curl_setopt($handle,  CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($handle, CURLOPT_NOBODY, 1); // and *only* get the header 
/* Get the HTML or whatever is linked in $url. */
$response = curl_exec($handle);
/* Check for 404 (file not found). */
$httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
// if($httpCode == 404) {
    // /* Handle 404 here. */
// }
echo $httpCode."<br>";
curl_close($handle);
$end_time = microtime(TRUE);
echo $end_time - $start_time."<br>";


Answer 8:

As an additional hint to the great accepted answer:

When using a variation of the proposed solution, I got errors because of the PHP setting 'max_execution_time'. So what I did was the following:

set_time_limit(120);
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_NOBODY, true);
$result = curl_exec($curl);
set_time_limit((int) ini_get('max_execution_time'));
curl_close($curl);

First I set the time limit to a higher number of seconds; at the end I set it back to the value defined in the PHP settings.



Answer 9:

You can use this code too, to see the status of any link:

<?php

function get_url_status($url, $timeout = 10)
{
    $ch = curl_init();
    // set cURL options
    $opts = array(CURLOPT_RETURNTRANSFER => true, // do not output to browser
                CURLOPT_URL => $url,            // set URL
                CURLOPT_NOBODY => true,         // do a HEAD request only
                CURLOPT_TIMEOUT => $timeout);   // set timeout
    curl_setopt_array($ch, $opts);
    curl_exec($ch); // do it!
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE); // find HTTP status
    curl_close($ch); // close handle
    echo $status; // or return $status;
    // example checking
    if ($status == 302) { echo 'HEY, redirection'; }
}

get_url_status('http://yourpage.comm');
?>


Answer 10:

<?php

$url = 'www.something.com';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.4");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
curl_setopt($ch, CURLOPT_ENCODING, "gzip");
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$output = curl_exec($ch);
$httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);


echo $httpcode;
?>


Answer 11:

Here is a short solution.

$handle = curl_init($uri);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($handle, CURLOPT_HTTPHEADER, array("Accept: application/rdf+xml"));
curl_setopt($handle, CURLOPT_NOBODY, true);
curl_exec($handle);
$httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
if ($httpCode == 200 || $httpCode == 303)
{
    echo "you might get a reply";
}
curl_close($handle);

In your case, you can change application/rdf+xml to whatever you use.



Answer 12:

This returns true if the URL does not return 200 OK:

function check_404($url) {
   $headers = get_headers($url, 1);
   return $headers[0] != 'HTTP/1.1 200 OK';
}


Answer 13:

This is just a slice of code; I hope it works for you.

// This slice is meant to sit inside a function, hence the return at the end.
$ch = @curl_init();
@curl_setopt($ch, CURLOPT_URL, 'http://example.com');
@curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1");
@curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
@curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
@curl_setopt($ch, CURLOPT_TIMEOUT, 10);

$response = @curl_exec($ch);
$errno    = @curl_errno($ch);
$error    = @curl_error($ch);

$info = @curl_getinfo($ch);
return $info['http_code'];


Answer 14:

To catch all errors (4XX and 5XX), I use this little script:

function URLIsValid($URL){
    $headers = @get_headers($URL);
    // get_headers() returns false when the request fails outright
    if ($headers === false) {
        return false;
    }
    preg_match("/ [45][0-9]{2} /", (string)$headers[0] , $match);
    return count($match) === 0;
}