I am trying to create a simple web crawler in PHP that can crawl .edu domains, given seed URLs of the parent pages.
I have used Simple HTML DOM for the parsing, while some of the core logic is implemented by me.
I am posting the code below and will try to explain the problems.
private function initiateChildCrawler($parent_Url_Html) {
    global $CFG;
    static $foundLink;
    static $parentID;
    static $urlToCrawl_InstanceOfChildren;

    $forEachCount = 0;
    foreach ($parent_Url_Html->getHTML()->find('a') as $foundLink) {
        $forEachCount++;
        if ($forEachCount < 500) {
            // Resolve relative links against the parent page's URL.
            $foundLink->href = url_to_absolute($parent_Url_Html->getURL(), $foundLink->href);

            if ($this->validateEduDomain($foundLink->href)) {
                // Implement else condition later on
                $parentID = $this->loadSaveInstance->parentExists_In_URL_DB_CRAWL($this->returnParentDomain($foundLink->href));
                if ($parentID != FALSE) {
                    // Skip URLs that are already in the crawl database.
                    if ($this->loadSaveInstance->checkUrlDuplication_In_URL_DB_CRAWL($foundLink->href) == FALSE) {
                        $urlToCrawl_InstanceOfChildren = new urlToCrawl($foundLink->href);
                        if ($urlToCrawl_InstanceOfChildren->getSimpleDomSource($CFG->finalContext) != FALSE) {
                            // Store the fetched HTML and mark the URL as crawled.
                            $this->loadSaveInstance->url_db_html($urlToCrawl_InstanceOfChildren->getURL(), $urlToCrawl_InstanceOfChildren->getHTML());
                            $this->loadSaveInstance->saveCrawled_To_URL_DB_CRAWL(NULL, $foundLink->href, "crawled", $parentID);
                            /*
                            if ($recursiveCount < 1) {
                                $this->initiateChildCrawler($urlToCrawl_InstanceOfChildren);
                            }
                            */
                        }
                    }
                }
            }
        }
    }
}
As you can see, initiateChildCrawler is called by the initiateParentCrawler function, which passes each parent link to the child crawler. Example of a parent link: www.berkeley.edu, for which the crawler will find all the links on its main page and return their HTML content. This continues until the seed URLs are exhausted.
For example: 1 - harvard.edu: finds all the links and returns their HTML content (by calling childCrawler), then moves to the next parent in parentCrawler. 2 - berkeley.edu: finds all the links and returns their HTML content (by calling childCrawler).
The other functions are self-explanatory.
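For reference, my validateEduDomain helper boils down to something like this (a simplified sketch; the real method lives on the crawler class, and the exact internals shown here are illustrative):

```php
<?php
// Sketch of the .edu check. Returns true only for absolute URLs whose
// host is "edu" itself or ends in ".edu" (e.g. www.berkeley.edu).
function validateEduDomain($url) {
    $host = parse_url($url, PHP_URL_HOST);
    if ($host === null || $host === false) {
        return false; // relative or malformed URL: no host component
    }
    return (bool) preg_match('/(^|\.)edu$/i', $host);
}

var_dump(validateEduDomain('http://www.berkeley.edu/admissions')); // bool(true)
var_dump(validateEduDomain('http://example.com/page'));            // bool(false)
```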
Now the problem: after childCrawler completes the foreach loop over the links, the function is unable to exit properly. If I run the script from the CLI, the CLI crashes; running the script in the browser causes the script to terminate.
But if I limit the number of child links crawled to 10 or so (by altering the $forEachCount check), the crawler works fine.
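Since a lower limit avoids the crash, I suspect memory exhaustion from Simple HTML DOM, whose node objects hold circular references. This is a sketch of the cleanup I think each iteration would need (assuming the library's clear() method; file_get_html and the URLs are illustrative, not my actual code):

```php
<?php
// Hypothetical sketch: freeing each Simple HTML DOM document after use,
// since its nodes hold circular references that PHP 5.3 may not reclaim.
require_once 'simple_html_dom.php';

$urls = array('http://www.berkeley.edu/', 'http://www.harvard.edu/');
foreach ($urls as $url) {
    $html = file_get_html($url);   // library helper: fetch + parse
    if ($html === false) {
        continue;
    }
    foreach ($html->find('a') as $link) {
        // ... process $link->href as in initiateChildCrawler() ...
    }
    $html->clear();                // break the document's internal references
    unset($html);                  // release the document object itself
}
```

If this is the right diagnosis, the same clear()/unset() pair would go at the end of each iteration in initiateChildCrawler.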
Please help me in this regard.
Message from CLI:
Problem signature:
  Problem Event Name:       APPCRASH
  Application Name:         php-cgi.exe
  Application Version:      5.3.8.0
  Application Timestamp:    4e537939
  Fault Module Name:        php5ts.dll
  Fault Module Version:     5.3.8.0
  Fault Module Timestamp:   4e537a04
  Exception Code:           c0000005
  Exception Offset:         0000c793
  OS Version:               6.1.7601.2.1.0.256.48
  Locale ID:                1033
  Additional Information 1: 0a9e
  Additional Information 2: 0a9e372d3b4ad19135b953a78882e789
  Additional Information 3: 0a9e
  Additional Information 4: 0a9e372d3b4ad19135b953a78882e789