Crawler script php

Posted 2019-02-27 19:35

Question:

I grabbed a piece of script from here to crawl a website, put it on my server, and it works. The only issue is that if I set the crawl depth to anything above 4, it doesn't work. I'm wondering whether this is due to the server's lack of resources or to the code itself.

<?php

error_reporting(E_ALL); 

function crawl_page($url, $depth)
{
    static $seen = array();
    if (isset($seen[$url]) || $depth === 0) {
        return;
    }
    $seen[$url] = true;

    $dom = new DOMDocument('1.0');
    @$dom->loadHTMLFile($url);

    $anchors = $dom->getElementsByTagName('a');
    foreach ($anchors as $element) {
        $href = $element->getAttribute('href');
        if (0 !== strpos($href, 'http')) {
            $href = rtrim($url, '/') . '/' . ltrim($href, '/');
        }
        crawl_page($href, $depth - 1);
    }
    echo "URL:",$url,PHP_EOL;
    echo  "<br/>";
}
crawl_page("http://www.mangastream.com/", 2);
?>

EDIT:

I turned on error reporting for the script, and all I get is this:

Error 324 (net::ERR_EMPTY_RESPONSE): Unknown error.

Answer 1:

Try making sure you have all error messages on (display_errors, error_reporting). This should give you more insight as to why it's crashing.
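As a rough sketch of what that could look like at the top of the script: the posted code already calls error_reporting(E_ALL), but it never turns on display_errors, and the @ in front of loadHTMLFile suppresses any warnings from that call. The log_errors line is an assumption on my part that the server keeps a PHP error log you can read; whether these runtime overrides take effect depends on how the host is configured.

```php
<?php
// Report every class of error, warning, and notice.
error_reporting(E_ALL);

// Print errors in the response instead of hiding them.
ini_set('display_errors', '1');

// Also show errors raised while PHP itself is starting up.
ini_set('display_startup_errors', '1');

// Assumption: the server has a readable PHP error log. If the process
// dies before any output is sent, the message may only appear there.
ini_set('log_errors', '1');
```

If the browser still shows an empty response with these settings, the server's error log is probably the next place to look.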

Also, keep in mind that crawling can be illegal or against a site's terms of use, depending on what you're going to do with the data.