I am trying to create a simple web crawler in PHP that is capable of crawling .edu domains, given the seed URLs of the parent pages.
I have used simple html dom to implement the crawler, while some of the core logic is implemented by me.
I am posting the code below and will try to explain the problems.
private function initiateChildCrawler($parent_Url_Html) {
    global $CFG;
    static $foundLink;
    static $parentID;
    static $urlToCrawl_InstanceOfChildren;

    $forEachCount = 0;
    foreach ($parent_Url_Html->getHTML()->find('a') as $foundLink) {
        $forEachCount++;
        if ($forEachCount < 500) {
            $foundLink->href = url_to_absolute($parent_Url_Html->getURL(), $foundLink->href);

            if ($this->validateEduDomain($foundLink->href)) {
                // Implement else condition later on
                $parentID = $this->loadSaveInstance->parentExists_In_URL_DB_CRAWL($this->returnParentDomain($foundLink->href));
                if ($parentID != FALSE) {
                    if ($this->loadSaveInstance->checkUrlDuplication_In_URL_DB_CRAWL($foundLink->href) == FALSE) {
                        $urlToCrawl_InstanceOfChildren = new urlToCrawl($foundLink->href);
                        if ($urlToCrawl_InstanceOfChildren->getSimpleDomSource($CFG->finalContext) != FALSE) {
                            $this->loadSaveInstance->url_db_html($urlToCrawl_InstanceOfChildren->getURL(), $urlToCrawl_InstanceOfChildren->getHTML());
                            $this->loadSaveInstance->saveCrawled_To_URL_DB_CRAWL(NULL, $foundLink->href, "crawled", $parentID);
                            /*
                            if ($recursiveCount < 1) {
                                $this->initiateChildCrawler($urlToCrawl_InstanceOfChildren);
                            }
                            */
                        }
                    }
                }
            }
        }
    }
}
As you can see, initiateChildCrawler is called by the initiateParentCrawler function, which passes the parent link to the child crawler. An example of a parent link is www.berkeley.edu, for which the crawler will find all the links on its main page and return all of their HTML content. This continues until the seed URLs are exhausted.
For example:
1. harvard.edu: finds all the links and returns their HTML content (by calling childCrawler), then moves on to the next parent in parentCrawler.
2. berkeley.edu: finds all the links and returns their HTML content (by calling childCrawler).
Other functions are self-explanatory.
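initiateParentCrawler itself is not posted here; in essence it just loops over the seed URLs and hands each loaded page to the child crawler, roughly like this (the $seedUrls parameter and the exact loading steps are illustrative, not my actual code):

private function initiateParentCrawler($seedUrls) {
    global $CFG;
    // For each seed URL (e.g. www.berkeley.edu), load its page and hand it
    // to the child crawler, which then walks the links found on that page.
    foreach ($seedUrls as $seedUrl) {
        $parent = new urlToCrawl($seedUrl);
        if ($parent->getSimpleDomSource($CFG->finalContext) != FALSE) {
            $this->initiateChildCrawler($parent);
        }
    }
}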
Now the problem: after childCrawler completes the foreach loop for each link, the function is unable to exit properly. If I run the script from the CLI, the CLI crashes, while running the script in the browser causes the script to terminate.
But if I set the limit on crawled child links to 10 or something less (by altering the $forEachCount check), the crawler starts working fine.
Please help me in this regard.
Message from CLI:
Problem signature:
  Problem Event Name:      APPCRASH
  Application Name:        php-cgi.exe
  Application Version:     5.3.8.0
  Application Timestamp:   4e537939
  Fault Module Name:       php5ts.dll
  Fault Module Version:    5.3.8.0
  Fault Module Timestamp:  4e537a04
  Exception Code:          c0000005
  Exception Offset:        0000c793
  OS Version:              6.1.7601.2.1.0.256.48
  Locale ID:               1033
  Additional Information 1: 0a9e
  Additional Information 2: 0a9e372d3b4ad19135b953a78882e789
  Additional Information 3: 0a9e
  Additional Information 4: 0a9e372d3b4ad19135b953a78882e789
Flat Loop Example:
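A minimal sketch of such a flat loop, assuming two hypothetical helpers, fetchHtml() and extractLinks(), which stand in for whatever you use to download a page and to collect the absolute links it contains:

$URLStack = array('http://www.berkeley.edu/', 'http://www.harvard.edu/');

while (count($URLStack)) {
    $url  = array_shift($URLStack);   // take the next URL off the stack
    $html = fetchHtml($url);          // hypothetical: download the page
    if ($html === FALSE) {
        continue;                     // skip pages that failed to load
    }
    foreach (extractLinks($html, $url) as $link) {
        $URLStack[] = $link;          // queue every link that was found
    }
    // ... save $url and $html to the database here ...
}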
This will run until all URLs from the stack are processed, so you add (as you somehow already have for the foreach) a counter to prevent it from running for too long. You can make it even more intelligent by not adding URLs to the stack which already exist in it; however, then you need to insert only absolute URLs into the stack. I highly suggest you do that, because there is no need to process a page you've already obtained again (each page probably contains a link to the homepage, for example). If you want to do this, just increment $URLProcessedCount inside the loop so you keep previous entries as well; a sketch with both changes added is below.
Additionally, I suggest you use the PHP DOMDocument extension instead of simple dom, as it is a much more versatile tool; a short example of it follows the loop sketch.
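Here is the same sketch with the counter and the duplicate check added (again assuming the hypothetical fetchHtml()/extractLinks() helpers):

$URLStack          = array('http://www.berkeley.edu/');
$URLProcessed      = array();            // absolute URLs already crawled
$URLProcessedCount = 0;

while (count($URLStack) && $URLProcessedCount < 500) {
    $URLProcessedCount++;                // incremented inside the loop, so
                                         // previously queued entries are kept
    $url            = array_shift($URLStack);
    $URLProcessed[] = $url;

    $html = fetchHtml($url);             // hypothetical helper, as above
    if ($html === FALSE) {
        continue;
    }
    foreach (extractLinks($html, $url) as $link) {
        // only queue absolute URLs that are neither waiting on the stack
        // nor already processed
        if (!in_array($link, $URLStack) && !in_array($link, $URLProcessed)) {
            $URLStack[] = $link;
        }
    }
    // ... save $url and $html to the database here ...
}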
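And a short example of collecting a page's anchors with DOMDocument ($htmlString is assumed to hold the already-downloaded page source):

$doc = new DOMDocument();
libxml_use_internal_errors(TRUE);        // real-world HTML is rarely valid
$doc->loadHTML($htmlString);
libxml_clear_errors();

foreach ($doc->getElementsByTagName('a') as $anchor) {
    $href = $anchor->getAttribute('href');
    // resolve $href to an absolute URL (e.g. with url_to_absolute())
    // before pushing it onto the stack
}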