Automated URL checking from a MySQL table

Posted 2019-05-31 03:26

Question:

Okay, I have a list of URLs in a MySQL table. I want the script to automatically check each link in the table for 404, and afterward I want it to store whether the URL was 404'd or not, as well as store a time for last checked.

Is this even possible to do automatically, even if no one runs the script? I.e., if no one visits the page for a few days, the test would still run on its own.

If it's possible, how could I go about making a button to do this?

Answer 1:

No need to use cURL: file_get_contents($url) returns false if the request fails (by default that includes 4xx and 5xx responses, after any redirects have been followed), which might be more useful for what you're trying to do. An example:

function urlExists($url)
{
    // Compare against false explicitly: an empty body or "0" would otherwise cast to false.
    return @file_get_contents($url) !== false;
}

Will return true if the request succeeds, false otherwise.


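EDIT: Here is a faster way. It still sends a GET request, but reads only the first byte of the body instead of the whole page:

function urlExists($url)
{
    // offset 0, length 1: stop after the first byte; "!== false" keeps a body starting with "0" from counting as a failure
    return @file_get_contents($url, false, null, 0, 1) !== false;
}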

urlExists('https://stackoverflow.com/iDontExist'); // false

However, in combination with your other question it may be wiser to use something like this:

function url($url)
{
    return @file_get_contents($url);
}

$content = url('https://stackoverflow.com/');

// request has failed (404, 5xx, etc...)
if ($content === false)
{
    // delete or store as "failed" in the DB
}

// request was successful
else
{
    $hash = md5($content); // md5() should be enough but you can also use sha1()

    // store $hash in the DB to keep track of changes
}

Or if you're using PHP 5.1+ you only have to do:

$hash = @md5_file($url);

$hash will be false when the URL fails to load; otherwise it will hold the MD5 hash of the contents.

Graciously stolen from @Jamie. =)

This way you only have to make one request instead of two. =)
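One way the stored hash could be used on the next run, sketched as a pure function so the database side stays out of the picture. The function name and the status labels are made up for illustration; $previousHash stands for whatever hash was stored last time (or null on the first check).

```php
<?php
// Classify one fetch result against the hash stored on the previous run.
// $content is what file_get_contents() returned: a string, or false on failure.
// The labels 'failed', 'new', 'changed' and 'unchanged' are illustrative only.
function classifyFetch($content, ?string $previousHash): array
{
    if ($content === false) {
        // Request failed: keep the old hash so a later success can still be compared.
        return ['status' => 'failed', 'hash' => $previousHash];
    }

    $hash = md5($content);

    if ($previousHash === null) {
        $status = 'new';        // first successful check, nothing to compare against
    } elseif ($hash !== $previousHash) {
        $status = 'changed';    // content differs from the last run
    } else {
        $status = 'unchanged';
    }

    return ['status' => $status, 'hash' => $hash];
}
```

It would be called as classifyFetch(@file_get_contents($url), $storedHash), with the returned hash written back to the DB.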



Answer 2:

You would use a cron job to do this. With a cron job you pick when the script runs, e.g. every hour, every 6 hours, etc.

To check for 404s you can loop through the URLs and call get_headers() on each one, updating a status column each time.
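A minimal sketch of that loop, assuming a hypothetical table urls with columns id, url, is_404 and last_checked (none of these names come from the question) and an existing PDO connection to it. get_headers() follows redirects and returns every status line it saw, so the helper takes the last one as the final status.

```php
<?php
// Extract the final HTTP status code from a get_headers() result.
// A redirect chain produces several "HTTP/..." lines; the last one wins.
function finalStatus(array $headers): ?int
{
    $status = null;
    foreach ($headers as $line) {
        if (is_string($line) && preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $status = (int) $m[1];
        }
    }
    return $status;
}

// Check every row and record the result.
// Table and column names are assumptions made for illustration.
function checkAllUrls(PDO $pdo): void
{
    $update = $pdo->prepare(
        'UPDATE urls SET is_404 = :is_404, last_checked = NOW() WHERE id = :id'
    );

    $rows = $pdo->query('SELECT id, url FROM urls')->fetchAll(PDO::FETCH_ASSOC);
    foreach ($rows as $row) {
        $headers = @get_headers($row['url']);              // false if the request fails outright
        $status  = $headers === false ? null : finalStatus($headers);

        $update->execute([
            ':is_404' => $status === 404 ? 1 : 0,
            ':id'     => $row['id'],
        ]);
    }
}
```

Put in a script that cron starts, for example with a crontab entry such as `0 */6 * * * php /path/to/check_urls.php` (the path is a placeholder), this runs whether or not anyone visits the site. A host that cannot be reached at all yields no status line and is stored here as not-404; a separate column would be needed to record that case.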



Answer 3:

Try using curl:

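// $url <= The URL from your database
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
$curl_response = curl_exec($curl);
if ($curl_response === false)
{
  // Transport error (DNS failure, timeout, ...): there is no HTTP status to inspect.
}
elseif (curl_getinfo($curl, CURLINFO_HTTP_CODE) == 404)
{
  // Save in database.
}
curl_close($curl);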

If you are running on a shared hosting server, look for the possibility of setting up timed actions (cron jobs). Some hosting services have it, some don't.



Answer 4:

I would recommend using curl as well, but make a HEAD request instead of a GET:

<?php
function check_url($url) {
    $c = curl_init();
    curl_setopt($c, CURLOPT_URL, $url);
    curl_setopt($c, CURLOPT_HEADER, 1); // get the header
    curl_setopt($c, CURLOPT_NOBODY, 1); // and *only* get the header
    curl_setopt($c, CURLOPT_RETURNTRANSFER, 1); // get the response as a string from curl_exec(), rather than echoing it
    curl_setopt($c, CURLOPT_FRESH_CONNECT, 1); // open a new connection instead of reusing a cached one
    if (curl_exec($c) === false) { return false; } // transport error (DNS failure, timeout, ...)

    $httpcode = curl_getinfo($c, CURLINFO_HTTP_CODE);
    return $httpcode;
}
?>

Snippet taken from here.

Recurring execution can be achieved with cron on *nix systems.