Extracting Site data through Web Crawler outputs e

2019-02-16 13:40发布

站内文章 / PHP

9 0

女痞

女 | 书童

私信

可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I been trying to extract site table text along with its link from the given table to (which is in site1.com) to my php page using a web crawler.

But unfortunately, due to incorrect input of Array index in the php code, it came error as output.

site1.com

<table border="0" cellpadding="0" cellspacing="0" width="100%" class="Table2">
<tbody><tr>
    <td width="1%" valign="top" class="Title2">&nbsp;</td>
    <td width="65%" valign="top" class="Title2">Subject</td>
    <td width="1%" valign="top" class="Title2">&nbsp;</td>
    <td width="14%" valign="top" align="Center" class="Title2">Last Update</td>
    <td width="1%" valign="top" class="Title2">&nbsp;</td>
    <td width="8%" valign="top" align="Center" class="Title2">Replies</td>
    <td width="1%" valign="top" class="Title2">&nbsp;</td>
    <td width="9%" valign="top" align="Center" class="Title2">Views</td>
</tr>
<tr>
    <td width="1%" height="25">&nbsp;</td>
    <td width="64%" height="25" class="FootNotes2"><a href="/files/forum/2017/1/837110.php" target="_top" class="Links2">Serious dedicated study partner for U World</a> - step12013</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="14%" height="25" class="FootNotes2" align="center">02/11/17 01:50</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="8%" height="25" align="Center" class="FootNotes2">10</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="9%" height="25" align="Center" class="FootNotes2">318</td>
</tr>
</tbody>
</table>

The php. web crawler as ::

<?php
    function get_data($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_URL,$url);
    $result=curl_exec($ch);
    curl_close($ch);
    return $result;
    }
    $returned_content = get_data('http://www.usmleforum.com/forum/index.php?forum=1');
    $first_step = explode( '<table class="Table2">' , $returned_content );
    $second_step = explode('</table>', $first_step[0]);
    $third_step = explode('<tr>', $second_step[1]);
    // print_r($third_step);
    foreach ($third_step as $key=>$element) {
    $child_first = explode( '<td class="FootNotes2"' , $element );
    $child_second = explode( '</td>' , $child_first[1] );
    $child_third = explode( '<a href=' , $child_second[0] );
    $child_fourth = explode( '</a>' , $child_third[0] );
    $final = "<a href=".$child_fourth[0]."</a></br>";
?>

<li target="_blank" class="itemtitle">
    <?php echo $final?>
</li>

<?php
    if($key==10){
       break;
        }
    }
?>

Now the Array Index on the above php code can be the culprit. (i guess) If so, can some one please explain me how to make this work.

But what my final requirement from this code is:: to get the above text in second with a link associated to it.

Any help is Appreciated..

回答1:

Using the Simple HTML DOM Parser library, you can use the following code:

<?php
    require('simple_html_dom.php'); // you might need to change this, depending on where you saved the library file.

    $html = file_get_html('http://www.usmleforum.com/forum/index.php?forum=1');

    foreach($html->find('td.FootNotes2 a') as $element) { // find all <a>-elements inside a <td class="FootNotes2">-element
        $element->href = "http://www.usmleforum.com" . $element->href;  // you can also access only certain attributes of the elements (e.g. the url).
        echo $element.'</br>';  // do something with the elements.
    }
?>

回答2:

Instead of writing your own parser solution you could use an existing one like Symfony's DomCrawler component: http://symfony.com/doc/current/components/dom_crawler.html

$crawler = new Crawler($returned_content);
$linkTexts = $crawler->filterXPath('//a')->each(function (Crawler $node, $i) {
    return $node->text();
});

Or if you want to traverse the DOM tree yourself you can use DOMDocument's loadHTML http://php.net/manual/en/domdocument.loadhtml.php

$document = new DOMDocument();
$document->loadHTML($returned_content);
foreach ($document->getElementsByTagName('a') as $link) {
    $text = $link->nodeValue;
}

EDIT:

To get the links you want, the code assumes you have a $returned_content variable with the HTML you want to parse.

// creating a new instance of DOMDocument (DOM = Document Object Model)
$domDocument = new DOMDocument();
// save previous libxml error reporting and set error reporting to internal
// to be able to parse not well formed HTML doc
$previousErrorReporting = libxml_use_internal_errors(true);
$domDocument->loadHTML($returned_content);
libxml_use_internal_errors($previousErrorReporting);
$links = [];
/** @var DOMElement $node */
// getting all <a> element from the HTML
foreach ($domDocument->getElementsByTagName('a') as $node) {
    $parentNode = $node->parentNode;
    // checking if the <a> is under a <td> that has class="FootNotes2"
    $isChildOfAFootNotesTd = $parentNode->nodeName === 'td' && $parentNode->getAttribute('class') === 'FootNotes2';
    // checking if the <a> has class="Links2"
    $isLinkOfLink2Class = $node->getAttribute('class') == 'Links2';
    // as I assumed you wanted links from the <td> this check makes sure that both of the above conditions are fulfilled
    if ($isChildOfAFootNotesTd && $isLinkOfLink2Class) {
        $links[] = [
            'href' => $node->getAttribute('href'),
            'text' => $parentNode->textContent,
        ];
    }
}

print_r($links);

This will create you an array similar to:

Array
(
    [0] => Array
    (
        [href] => /files/forum/2017/1/837242.php
        [text] => Q@Q Drill Time ① - cardio69
    ) 
    [1] => Array
    (
        [href] => /files/forum/2017/1/837356.php
        [text] => study partner in Houston - lacy
    )
    [2] => Array
    (
        [href] => /files/forum/2017/1/837110.php
        [text] => Serious dedicated study partner for U World - step12013
    )
    ...

回答3:

I tried the same code for another site. and it works. Please take a look at it:

<?php
    function get_data($url) {
      $ch = curl_init();
      curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
      curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
      curl_setopt($ch, CURLOPT_URL,$url);
      $result=curl_exec($ch);
      curl_close($ch);
      return $result;
    }
    $returned_content = get_data('http://www.usmle-forums.com/usmle-step-1-forum/');
    $first_step = explode( '<tbody id="threadbits_forum_26">' , $returned_content );
    $second_step = explode('</tbody>', $first_step[1]);
    $third_step = explode('<tr>', $second_step[0]);
    // print_r($third_step);
    foreach ($third_step as $element) {
      $child_first = explode( '<td class="alt1"' , $element );
      $child_second = explode( '</td>' , $child_first[1] );
      $child_third = explode( '<a href=' , $child_second[0] );
      $child_fourth = explode( '</a>' , $child_third[1] );
      echo $final = "<a href=".$child_fourth[0]."</a></br>";
    }
    ?>

I know its too much to ask, but can you please make a code out of these two which make the crawler work.

@jkmak

回答4:

Chopping at html with string functions or regex is not a reliable method. DomDocument and Xpath do a nice job.

Code: (Demo)

$dom=new DOMDocument; 
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach ($xpath->evaluate("//td[@class = 'FootNotes2']/a") as $node) {  // target a tags that have <td class="FootNotes2"> as parent
    $result[]=['href' => $node->getAttribute('href'), 'text' => $node->nodeValue];  // extract/store the href and text values
    if (sizeof($result) == 10) { break; }  // set a limit of 10 rows of data
}
if (isset($result)) {
    echo "<ul>\n";
    foreach ($result as $data) {
        echo "\t<li class=\"itemtitle\"><a href=\"{$data['href']}\" target=\"_blank\">{$data['text']}</a></li>\n";
    }
    echo "</ul>";
}

Sample Input:

$html = <<<HTML
<table border="0" cellpadding="0" cellspacing="0" width="100%" class="Table2">
<tbody><tr>
    <td width="1%" valign="top" class="Title2">&nbsp;</td>
    <td width="65%" valign="top" class="Title2">Subject</td>
    <td width="1%" valign="top" class="Title2">&nbsp;</td>
    <td width="14%" valign="top" align="Center" class="Title2">Last Update</td>
    <td width="1%" valign="top" class="Title2">&nbsp;</td>
    <td width="8%" valign="top" align="Center" class="Title2">Replies</td>
    <td width="1%" valign="top" class="Title2">&nbsp;</td>
    <td width="9%" valign="top" align="Center" class="Title2">Views</td>
</tr>
<tr>
    <td width="1%" height="25">&nbsp;</td>
    <td width="64%" height="25" class="FootNotes2"><a href="/files/forum/2017/1/837110.php" target="_top" class="Links2">Serious dedicated study partner for U World</a> - step12013</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="14%" height="25" class="FootNotes2" align="center">02/11/17 01:50</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="8%" height="25" align="Center" class="FootNotes2">10</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="9%" height="25" align="Center" class="FootNotes2">318</td>
</tr>
<tr>
    <td width="1%" height="25">&nbsp;</td>
    <td width="64%" height="25" class="FootNotes2"><a href="/files/forum/2017/1/837999.php" target="_top" class="Links2">some text</a> - step12013</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="14%" height="25" class="FootNotes2" align="center">02/11/17 01:50</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="8%" height="25" align="Center" class="FootNotes2">10</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="9%" height="25" align="Center" class="FootNotes2">318</td>
</tr>
</tbody>
</table>
HTML;

Output:

<ul>
    <li class="itemtitle"><a href="/files/forum/2017/1/837110.php" target="_blank">Serious dedicated study partner for U World</a></li>
    <li class="itemtitle"><a href="/files/forum/2017/1/837999.php" target="_blank">some text</a></li>
</ul>