Extracting node values using XPath

There is a section of amazon.com from which I want to extract the data (node value only, not the link) for each item.

The values I'm looking for are inside <span class="refinementLink"> and <span class="narrowValue">

<ul data-typeid="n" id="ref_1000">
    <li style="margin-left: -18px">
        <a href="/s/ref=sr_ex_n_0?rh=i%3Aaps%2Ck%3Ahow+to+grow+tomatoes&amp;sort=salesrank&amp;keywords=how+to+grow+tomatoes&amp;ie=UTF8&amp;qid=1327603358">
            <span class="expand">Any Department</span>
        </a>
    </li>
    <li style="margin-left: 8px">
        <strong>Books</strong>
    </li>
    <li style="margin-left: 6px">
        <a href="/s/ref=sr_nr_n_0?rh=k%3Ahow+to+grow+tomatoes%2Cn%3A283155%2Cp_n_feature_browse-bin%3A618073011%2Cn%3A%211000%2Cn%3A48&amp;bbn=1000&amp;sort=salesrank&amp;keywords=how+to+grow+tomatoes&amp;ie=UTF8&amp;qid=1327603358&amp;rnid=1000">
            <span class="refinementLink">Crafts, Hobbies & Home</span><span class="narrowValue">(19)</span>
        </a>
    </li>
    <li style="margin-left: 6px">
        <a href="/s/ref=sr_nr_n_1?rh=k%3Ahow+to+grow+tomatoes%2Cn%3A283155%2Cp_n_feature_browse-bin%3A618073011%2Cn%3A%211000%2Cn%3A10&amp;bbn=1000&amp;sort=salesrank&amp;keywords=how+to+grow+tomatoes&amp;ie=UTF8&amp;qid=1327603358&amp;rnid=1000">
            <span class="refinementLink">Health, Fitness & Dieting</span><span class="narrowValue">(3)</span>
        </a>
    </li>
    <li style="margin-left: 6px">
        <a href="/s/ref=sr_nr_n_2?rh=k%3Ahow+to+grow+tomatoes%2Cn%3A283155%2Cp_n_feature_browse-bin%3A618073011%2Cn%3A%211000%2Cn%3A6&amp;bbn=1000&amp;sort=salesrank&amp;keywords=how+to+grow+tomatoes&amp;ie=UTF8&amp;qid=1327603358&amp;rnid=1000">
            <span class="refinementLink">Cookbooks, Food & Wine</span><span class="narrowValue">(2)</span>
        </a>
    </li>
</ul>

How could I do this with XPath?

The markup comes from an Amazon Kindle search results page.

Currently I am trying:

$rank = array();

$words = $xpath->query('//ul[@id="ref_1000"]/li/a/span[@class="refinementLink"]');
foreach ($words as $word) {
    $rank[] = trim($word->nodeValue);
}
var_dump($rank);

Tags: php xpath html-parsing

3 Answers

Deceive 欺骗

#2 · 2020-04-23 07:08

The following expression should work:

//*[@id='ref_1000']/li/a/span[@class='narrowValue']

For better performance you could provide a direct path to the start of this expression, but the one provided is more flexible (given that you probably need this to work across multiple pages).

Keep in mind, also, that your HTML parser might generate a different result tree than the one produced by Firebug (where I tested). Here's an even more flexible solution:

//*[@id='ref_1000']//span[@class='narrowValue']

Flexibility comes with potential performance (and accuracy) costs, but it's often the only choice when dealing with tag soup.
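
As a rough illustration of how the two expressions behave with PHP's DOM extension, here is a minimal sketch. The $html fragment is a trimmed copy of the markup in the question (hrefs replaced with "#"), and the helper name extractCounts is made up for this example; it is not part of any library.

```php
<?php
// Trimmed copy of the markup from the question (hrefs shortened for readability).
$html = <<<'HTML'
<ul data-typeid="n" id="ref_1000">
    <li><strong>Books</strong></li>
    <li><a href="#"><span class="refinementLink">Crafts, Hobbies &amp; Home</span><span class="narrowValue">(19)</span></a></li>
    <li><a href="#"><span class="refinementLink">Health, Fitness &amp; Dieting</span><span class="narrowValue">(3)</span></a></li>
    <li><a href="#"><span class="refinementLink">Cookbooks, Food &amp; Wine</span><span class="narrowValue">(2)</span></a></li>
</ul>
HTML;

// Hypothetical helper: run one XPath expression and collect the trimmed node values.
function extractCounts(string $html, string $expression): array
{
    libxml_use_internal_errors(true);   // real-world pages are rarely valid markup
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $values = [];
    foreach ($xpath->query($expression) as $node) {
        $values[] = trim($node->nodeValue);
    }
    return $values;
}

// Strict path: every step from the list down to the span is spelled out.
$strict = extractCounts($html, "//*[@id='ref_1000']/li/a/span[@class='narrowValue']");
// Flexible path: any matching span below the list, at any depth.
$flexible = extractCounts($html, "//*[@id='ref_1000']//span[@class='narrowValue']");

print_r($strict);
print_r($flexible);
```

On this sample both expressions select the same three spans. They would only diverge if the parser inserted or dropped an element between the list and the spans.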


叛逆

#3 · 2020-04-23 07:19

If you need to grab the category names:

// Suppress invalid markup warnings
libxml_use_internal_errors(true);

// Load the HTML into a DOMDocument, then convert it to a SimpleXML object
$doc = new DOMDocument();
$doc->strictErrorChecking = false;
$doc->loadHTML($html); // $html is the page source, e.g. a string fetched with cURL
$xml = simplexml_import_dom($doc);

// Find the category nodes
$categories = $xml->xpath("//span[@class='refinementLink']");
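
Note that xpath() returns an array of SimpleXMLElement objects rather than strings, so each element has to be cast before it can be used as text. A minimal sketch, assuming $html holds a small sample shaped like the markup in the question (the sample and the $names array are illustrative only):

```php
<?php
// Small stand-in for the fetched page; the real $html would come from the HTTP response.
$html = '<ul id="ref_1000">'
      . '<li><a href="#"><span class="refinementLink">Crafts, Hobbies &amp; Home</span><span class="narrowValue">(19)</span></a></li>'
      . '<li><a href="#"><span class="refinementLink">Cookbooks, Food &amp; Wine</span><span class="narrowValue">(2)</span></a></li>'
      . '</ul>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xml = simplexml_import_dom($doc);

$names = [];
foreach ($xml->xpath("//span[@class='refinementLink']") as $span) {
    // Casting a SimpleXMLElement to string yields its text content.
    $names[] = trim((string) $span);
}

print_r($names);
```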

EDIT. Using DOMDocument

$doc = new DOMDocument();
$doc->strictErrorChecking = false;
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);

// Select the parent <a> of each category span
$categories = $xpath->query("//span[@class='refinementLink']/..");

foreach ($categories as $category) {
    echo '<pre>';
    // With the markup above, item(0) is a whitespace text node,
    // item(1) is the refinementLink span and item(2) the narrowValue span
    echo $category->childNodes->item(1)->firstChild->nodeValue;
    echo $category->childNodes->item(2)->firstChild->nodeValue;
    echo '</pre>';
    // Crafts, Hobbies & Home(19)
}
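
Indexing into childNodes depends on exactly where the parser places whitespace text nodes. One alternative, sketched here under the assumption that every link contains one refinementLink span and one narrowValue span, is to evaluate XPath expressions relative to each link. The sample $html and the $result array are illustrative only.

```php
<?php
// Small stand-in for the fetched page.
$html = '<ul id="ref_1000">'
      . '<li><a href="#"><span class="refinementLink">Crafts, Hobbies &amp; Home</span><span class="narrowValue">(19)</span></a></li>'
      . '<li><a href="#"><span class="refinementLink">Cookbooks, Food &amp; Wine</span><span class="narrowValue">(2)</span></a></li>'
      . '</ul>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$result = [];
// Select every link under the list that contains a category span.
foreach ($xpath->query("//*[@id='ref_1000']//a[span[@class='refinementLink']]") as $link) {
    // Passing $link as the second argument makes the expression relative to that node.
    $name  = $xpath->evaluate("string(span[@class='refinementLink'])", $link);
    $count = $xpath->evaluate("string(span[@class='narrowValue'])", $link);
    // Strip the parentheses and whitespace around "(19)" and convert to an integer.
    $result[trim($name)] = (int) trim($count, "() \n\t");
}

print_r($result);
```

Keying the array by category name keeps each name paired with its count, which is harder to guarantee when the two spans are collected in separate queries.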


Melony?

#4 · 2020-04-23 07:29

I'd highly recommend you check out the phpQuery library. It's essentially the jQuery selector engine for PHP, so to get at the text you want you could do something like:

// Iterating a phpQuery result yields plain DOM nodes, so re-wrap each one with pq()
foreach (pq('span.refinementLink') as $p) {
  print pq($p)->text() . "\n";
}

That should output something like:

Crafts, Hobbies & Home
Health, Fitness & Dieting
Cookbooks, Food & Wine

It's by far the easiest screen-scraping and DOM-parsing tool I know of for PHP.
