Website Scraping Using PHP

Posted 2019-06-02 19:33

I have PHP code that extracts the product categories from this website: http://www.tradeindia.com/. So far I have managed to extract only the category names. How can I also extract the product count shown beside each category, given that it is not wrapped in an element with a class name?

My code:

<?php 
//header('Content-Type: text/html; charset=utf-8'); 
$grep = new DOMDocument();
@$grep->loadHTMLFile("http://www.tradeindia.com/");
$finder = new DOMXPath($grep);
$class = "cate_menu";
$nodes = $finder->query("//*[contains(@class, '$class')]");

$total_L = 0;
foreach ($nodes as $node) {
    // The first child of each matched link is its text node (the category name).
    $span = $node->childNodes;
    echo '<br>' . $span->item(0)->nodeValue . ' : ';
}

?> 

Source code from website:

<td align="left" style="padding-left:8px;color:blue"><a href=/Seller/Agriculture/ class="cate_menu" >Agriculture</a>(100892)</td>
<td align="left" style="padding-left:8px;color:blue"><a href=/Seller/Apparel-Fashion/ class="cate_menu" >Apparel & Fashion</a>(237902)</td>
<td align="left" style="padding-left:8px;color:blue"><a href=/Seller/Automobile/ class="cate_menu" >Automobile</a>(78614)</td>

I need the numbers between brackets.

2 Answers
三岁会撩人
#2 · 2019-06-02 20:10

I'm not an XPath guru, but what I would do is first target that particular table, using the text "CATEGORIES" as the needle, then walk from there to its rows and loop over the rows found. For each category link, the count is the text node that immediately follows it.

Rough example:

$grep = new DOMDocument();
@$grep->loadHTMLFile("http://www.tradeindia.com/");
$finder = new DOMXpath($grep);

$products = array();
$nodes = $finder->query("
    //td[@class='showroom1'][contains(text(), 'CATEGORIES')]
    /parent::tr/parent::table/parent::td/parent::tr
    /following-sibling::tr
    /td[1]/table/tr/td/table/tr
");

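if($nodes->length > 0) {
    foreach($nodes as $tr) {
        // Skip rows that contain no links at all.
        if($finder->evaluate('count(./td/a)', $tr) > 0) {
            foreach($finder->query('./td/a[@class="cate_menu"]', $tr) as $row) {
                $text = $row->nodeValue;
                // The count, e.g. "(100892)", is the text node right after the link.
                $sibling = $finder->query('./following-sibling::text()', $row)->item(0);
                // Guard against links that have no trailing text node.
                $number = $sibling ? $sibling->nodeValue : '';
                $products[] = "$text $number";
            }
        }
    }
}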

echo '<pre>';
print_r($products);
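
The same idea can be sketched without the long table path, by taking the first text node after each `cate_menu` link and keeping only the digits. This is a minimal sketch, not a tested scraper for the live site: it assumes the markup matches the three lines quoted in the question, and the `$html` string, the wrapping `<table>` and the regular expression are illustrative additions.

```php
<?php
// HTML copied from the question (wrapped in a table) so the sketch runs offline.
$html = <<<'HTML'
<table><tr>
<td align="left" style="padding-left:8px;color:blue"><a href=/Seller/Agriculture/ class="cate_menu" >Agriculture</a>(100892)</td>
<td align="left" style="padding-left:8px;color:blue"><a href=/Seller/Apparel-Fashion/ class="cate_menu" >Apparel & Fashion</a>(237902)</td>
<td align="left" style="padding-left:8px;color:blue"><a href=/Seller/Automobile/ class="cate_menu" >Automobile</a>(78614)</td>
</tr></table>
HTML;

$doc = new DOMDocument();
@$doc->loadHTML($html); // @ hides warnings about the sloppy markup
$xpath = new DOMXPath($doc);

$products = array();
foreach ($xpath->query("//a[contains(@class, 'cate_menu')]") as $link) {
    // The count is the first text node directly after the link.
    $textNode = $xpath->query('following-sibling::text()[1]', $link)->item(0);
    if ($textNode === null) {
        continue; // no count next to this link
    }
    // Keep only the digits between the brackets.
    if (preg_match('/\((\d+)\)/', $textNode->nodeValue, $m)) {
        $products[trim($link->nodeValue)] = (int) $m[1];
    }
}

print_r($products);
```

To run this against the real page, `loadHTML($html)` would be swapped for `loadHTMLFile("http://www.tradeindia.com/")` as in the question, provided the live markup still looks like the quoted lines.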

Bombasti
#3 · 2019-06-02 20:16

Since the number is between two brackets, this should be easy. You can use a function like this:

// Returns the text between the first $start and the next $end, or "" if either is missing.
function get_string_between($string, $start, $end) {
    $ini = strpos($string, $start);
    if ($ini === false) return "";
    $ini += strlen($start);
    $stop = strpos($string, $end, $ini);
    if ($stop === false) return "";
    return substr($string, $ini, $stop - $ini);
}

$product = get_string_between($htmlline, "(", ")");

You will need to pass each row of the table to the function separately, though. For example, you could loop through an array of strings holding one row each: foreach($htmllines as $htmlline) or similar.
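
The loop could look something like the sketch below. It assumes the rows have already been split into the `$htmllines` array (filled here with the three lines quoted in the question) and uses a strict-comparison version of `get_string_between`. The `$counts` array and the `class="cate_menu" >` start marker used to pull out the category name are illustrative choices, not part of the original answer.

```php
<?php
// Returns the text between the first $start and the next $end, or "" if either is missing.
function get_string_between($string, $start, $end) {
    $ini = strpos($string, $start);
    if ($ini === false) return "";
    $ini += strlen($start);
    $stop = strpos($string, $end, $ini);
    if ($stop === false) return "";
    return substr($string, $ini, $stop - $ini);
}

// Assumed input: one table cell per string, copied from the question.
$htmllines = array(
    '<td align="left" style="padding-left:8px;color:blue"><a href=/Seller/Agriculture/ class="cate_menu" >Agriculture</a>(100892)</td>',
    '<td align="left" style="padding-left:8px;color:blue"><a href=/Seller/Apparel-Fashion/ class="cate_menu" >Apparel & Fashion</a>(237902)</td>',
    '<td align="left" style="padding-left:8px;color:blue"><a href=/Seller/Automobile/ class="cate_menu" >Automobile</a>(78614)</td>',
);

$counts = array();
foreach ($htmllines as $htmlline) {
    // Category name: the text between the opening tag's end and </a>.
    $name = get_string_between($htmlline, 'class="cate_menu" >', '</a>');
    // Product count: the text between the first "(" and the next ")".
    $counts[$name] = get_string_between($htmlline, "(", ")");
}

print_r($counts);
```

Note that this only works while the first "(" on each line is the one in front of the count. If a line contains an earlier bracket (inline JavaScript, for instance), a narrower start marker such as `</a>(` would be safer.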

Hope this helps.
