Getting price from Amazon with Xpath

Posted 2019-03-05 14:04

On the following page:

http://www.amazon.com/Jessica-Simpson-Womens-Double-Breasted/dp/B00K65ZMCA/ref=sr_1_4_mc/185-0705108-6790969?s=apparel&ie=UTF8&qid=1413083859&sr=1-4

I am trying to get the price with the expression

'//span[@id="priceblock_ourprice"]'

but the result is an empty variable.

The interesting part is that on other Amazon pages, like this one: http://www.amazon.com/SanDisk-Cruzer-Frustration-Free-Packaging--SDCZ36-032G-AFFP/dp/B007JR532M/ref=sr_1_1?s=pc&ie=UTF8&qid=1413084653&sr=1-1&keywords=usb

I do have an expression that works

'//b[@class="priceLarge"]'

But I don't even know why, because in the source of the page I can't find such a tag. So why does it work? And how do I get the price on the first page? Thanks!

1 Answer
做个烂人
Reply #2 · 2019-03-05 14:27

When scraping with PHP you cannot take what you see in the browser source for granted.

Instead you first need to fetch the content with PHP and then look into the source there:

$url    = 'http://www.amazon.com/ ... ';
$buffer = file_get_contents($url);

The variable $buffer then contains the HTML that you will be scraping.
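file_get_contents() returns false when the request fails, so it can help to check the result before parsing it. The following is only a minimal sketch: the function name fetchHtml(), the User-Agent string, and the timeout are assumptions made for illustration and are not part of the original answer.

```php
<?php
/**
 * Fetch a document and fail loudly instead of scraping an empty buffer.
 * fetchHtml() and the default User-Agent value are hypothetical.
 */
function fetchHtml(string $url, string $userAgent = 'Mozilla/5.0 (compatible; ExampleScraper/1.0)'): string
{
    // The "http" context options apply to http:// and https:// URLs;
    // other stream wrappers (such as local files) ignore them.
    $context = stream_context_create([
        'http' => [
            'user_agent' => $userAgent,
            'timeout'    => 10,
        ],
    ]);

    // file_get_contents() returns false on failure; the warning is silenced
    // here because the failure is reported through the exception below.
    $buffer = @file_get_contents($url, false, $context);
    if ($buffer === false) {
        throw new RuntimeException("Could not fetch $url");
    }

    return $buffer;
}
```

Calling $buffer = fetchHtml($url); then either yields the HTML to inspect or throws, which makes an empty result easier to tell apart from a failed request.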

Doing that with your example links shows that both the first and the second address have a .priceLarge element containing what you are probably looking for:

<span class="priceLarge">$168.00</span>
<b class="priceLarge">$14.99</b>

After finding out where the data you're looking for is, you can create the DOMDocument:

$doc          = new DOMDocument();
$doc->recover = true;
$saved        = libxml_use_internal_errors(true);
$doc->loadHTML($buffer);

You might also be interested in parsing errors:

/** @var array|LibXMLError[] $errors */
$errors = libxml_get_errors();
foreach ($errors as $error) {
    printf(
        "%s: (%d) [%' 3d] #%05d:%' -4d %s\n", get_class($error), $error->level, $error->code, $error->line,
        $error->column, rtrim($error->message)
    );
}
libxml_use_internal_errors($saved);

This is how DOMDocument tells you where problems occurred, for example duplicate ID values.

After loading the buffer into DOMDocument you can create the DOMXPath:

$xp = new DOMXPath($doc);

You will use it to obtain the actual values from the document.
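To illustrate why an expression bound to one tag name works on one page but not the other, here is a small sketch that runs two XPath expressions against the two fragments shown earlier. The helper name queryString() is hypothetical, and the fragments stand in for the full fetched pages.

```php
<?php
/**
 * Evaluate an XPath string expression against an HTML fragment.
 * queryString() is a hypothetical helper for this illustration.
 */
function queryString(string $html, string $expression): string
{
    $doc   = new DOMDocument();
    $saved = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($saved);

    $xp = new DOMXPath($doc);

    return $xp->evaluate($expression);
}

$fragments = [
    'first'  => '<span class="priceLarge">$168.00</span>',
    'second' => '<b class="priceLarge">$14.99</b>',
];

foreach ($fragments as $name => $html) {
    printf(
        "%s: b-only=[%s] any-tag=[%s]\n",
        $name,
        queryString($html, 'string(//b[@class="priceLarge"])'), // tied to <b>
        queryString($html, 'string(//*[@class="priceLarge"])')  // any element
    );
}
```

The expression with * matches whichever element carries the class, so it covers both the <span> and the <b> variant.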

For example, the HTML of those two example addresses has shown that the information you're looking for is in the #priceBlock element, which in both cases contains .listprice and .priceLarge:

$priceBlock = $doc->getElementById('priceBlock');
printf(
    "List Price: %s\nPrice: %s\n"
    , $xp->evaluate('string(.//*[@class="listprice"])', $priceBlock)
    , $xp->evaluate('string(.//*[@class="priceLarge"])', $priceBlock)
);

Which will result in the following output:

List Price: $48.99
Price: $14.99
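getElementById() returns null when no element has the requested id, which can happen if a page uses a different layout. A minimal sketch of the same lookup with that case handled follows; the function name extractPrices() and the returned array keys are assumptions made for illustration.

```php
<?php
/**
 * Extract list price and price from a #priceBlock element, or return null
 * when the document has no such element.
 * extractPrices() is a hypothetical helper, not part of the original answer.
 */
function extractPrices(string $html): ?array
{
    $doc   = new DOMDocument();
    $saved = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($saved);

    $priceBlock = $doc->getElementById('priceBlock');
    if ($priceBlock === null) {
        // The layout differs; look at the fetched HTML again.
        return null;
    }

    $xp = new DOMXPath($doc);

    // Relative paths (.//) are evaluated below $priceBlock only.
    return [
        'list'  => trim($xp->evaluate('string(.//*[@class="listprice"])', $priceBlock)),
        'price' => trim($xp->evaluate('string(.//*[@class="priceLarge"])', $priceBlock)),
    ];
}
```

A caller can then branch on the null result instead of printing empty prices.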

Obtaining a parent section element into a variable, as with $priceBlock in the example, not only allows you to use relative paths in XPath but also helps with debugging in case you're missing some of the more detailed information:

echo $doc->saveHTML($priceBlock);

This outputs the whole <div> that contains all pricing information for example.

If you set up some helper classes for yourself, you can go further and use them to obtain other information from the document that is useful for scraping, like showing all tag/class combinations within the price block:

// you can find StringCollector at the end of the answer
$tagsWithClass = new StringCollector();
foreach ($xp->evaluate('.//*/@class', $priceBlock) as $class) {
    $tagsWithClass->add(sprintf("%s.%s", $class->parentNode->tagName, $class->value));
}
echo $tagsWithClass;

This outputs the list of collected strings and their counts, which here are the tag names with their class attribute values:

table.product (1)
td.priceBlockLabel (3)
span.listprice (1)
td.priceBlockLabelPrice (1)
b.priceLarge (1)
tr.youSavePriceRow (1)
td.price (1)

As you can see, this is from the second example URL, because .priceLarge is on a <b> element.
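If a dedicated helper class feels like too much, a similar tally can be sketched with the built-in array_count_values(). The function name countTagClasses() and the shortened HTML fragment below are assumptions for illustration; the fragment is not the real Amazon markup.

```php
<?php
/**
 * Count tag.class combinations below a context node using only built-ins.
 * countTagClasses() is a hypothetical name.
 */
function countTagClasses(DOMXPath $xp, DOMNode $context): array
{
    $labels = [];
    foreach ($xp->evaluate('.//*/@class', $context) as $class) {
        // $class is a DOMAttr; ownerElement is the element carrying it.
        $labels[] = sprintf('%s.%s', $class->ownerElement->tagName, $class->value);
    }

    // Maps each distinct label to the number of times it occurred.
    return array_count_values($labels);
}

// A shortened, made-up fragment in the shape of the price block.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML(
    '<div id="priceBlock"><table class="product"><tr>'
    . '<td class="priceBlockLabel">List Price:</td>'
    . '<td><span class="listprice">$48.99</span></td></tr><tr>'
    . '<td class="priceBlockLabel">You Save:</td>'
    . '<td class="price">$34.00</td></tr></table></div>'
);
$xp = new DOMXPath($doc);

foreach (countTagClasses($xp, $doc->getElementById('priceBlock')) as $label => $count) {
    printf("%s (%d)\n", $label, $count);
}
```

The trade-off is that the counting logic is no longer reusable as an object with its own string representation, which is what StringCollector provides.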

This is a relatively simple helper. For scraping you can do more, like displaying the whole HTML structure in the form of a tree:

DomTree::dump($priceBlock);

It will give you the following output which allows for better consumption than just DOMDocument::saveHTML($node):

`<div id="priceBlock" class="buying">
  +"\n\n  "
  `<table class="product">
    +<tr>
    | +<td class="priceBlockLabel">
    | | `"List Price:"
    | +"\n    "
    | +<td>
    | | `<span id="listPriceValue" class="listprice">
    | |   `"$48.99"
    | `"\n  "
    +<tr id="actualPriceRow">
    | +<td id="actualPriceLabel" class="priceBlockLabelPrice">
    | | `"Price:"
    | +"\n    "
    | +<td id="actualPriceContent">
    | | +<span id="actualPriceValue">
    | | | `<b class="priceLarge">
    | | |   `"$14.99"
    | | +"\n    "
    | | `<span id="actualPriceExtraMessaging">
    | |   +"\n        \n\n\n    "
    | |   +<span>
    | |   | `"\n        \n    "
    | |   +"\n    \n\n\n\n\n\n\n\n\n\n    \n\n\n\n\n\n \n\n\n\n\n& "
    | |   +<b>
    | |   | `"FREE Shipping"
    | |   +" on orders over $35.\n\n\n\n"
    | |   +<a href="/gp/help/customer/display.html/ref=mk_sss_dp_1/191-4381493-1931545?ie=UTF8&no...">
    | |   | `"Details"
    | |   `"\n\n\n\n\n\n\n\n\n    \n\n    \n    \n\n\n\n\n\n      \n"
    | `"\n"
    +<tr id="dealPriceRow">
    | +<td id="dealPriceLabel" class="priceBlockLabel">
    | | `"Deal Price: "
    | +"\n  "
    | +<td id="dealPriceContent">
    | | +"\n    "
    | | +<span id="dealPriceValue">
    | | +"\n    "
    | | +<span id="dealPriceExtraMessaging">
    | | `"\n  "
    | `"\n"
    +<script>
    | `[XML_CDATA_SECTION_NODE (4)]
    +<tr id="youSaveRow" class="youSavePriceRow">
    | +<td id="youSaveLabel" class="priceBlockLabel">
    | | `"You Save:"
    | +"\n    "
    | +<td id="youSaveContent" class="price">
    | | +<span id="youSaveValue">
    | | | `"$34.00\n        (69%)"
    | | `"\n    "
    | `"\n  "
    `<tr>
      +<td>
      `<td>
        `<span>
          `"o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o..."

You can find it referenced in an answer to Debug a DOMDocument Object in PHP and in another one. The code is available on GitHub as a gist.
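For readers who only need a rough outline, the idea behind such a tree dump can be sketched in a few lines. This is not the code from the gist: dumpNode() is a hypothetical, much-reduced stand-in that indents by depth and does not draw the branch characters shown above.

```php
<?php
/**
 * Render a node and its descendants as an indented outline.
 * dumpNode() is a hypothetical, much-reduced stand-in for DomTree::dump().
 */
function dumpNode(DOMNode $node, int $depth = 0): string
{
    $indent = str_repeat('  ', $depth);

    if ($node instanceof DOMElement) {
        // Show the opening tag with all of its attributes.
        $label = '<' . $node->tagName;
        foreach ($node->attributes as $attr) {
            $label .= sprintf(' %s="%s"', $attr->name, $attr->value);
        }
        $label .= '>';
    } elseif ($node instanceof DOMText) {
        // json_encode() makes whitespace-only text nodes visible ("\n  ").
        $label = json_encode($node->data);
    } else {
        // Comments, processing instructions and so on: show the class name.
        $label = '[' . get_class($node) . ']';
    }

    $out = $indent . $label . "\n";
    if ($node->hasChildNodes()) {
        foreach ($node->childNodes as $child) {
            $out .= dumpNode($child, $depth + 1);
        }
    }

    return $out;
}
```

With a $priceBlock element as obtained earlier, echo dumpNode($priceBlock); prints one line per node, which is often enough to spot where a value sits.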


The StringCollector helper class

/**
 * Class StringCollector
 *
 * Collect strings and count them
 */
class StringCollector implements IteratorAggregate
{
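    // Start with an empty array so iterating works before anything is added.
    private $array = [];

    public function add($string)
    {
        // Taking a reference creates a missing entry as null;
        // incrementing null yields 1.
        $entry = & $this->array[$string];
        $entry++;
    }

    public function getIterator(): Traversable
    {
        return new ArrayIterator($this->array);
    }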

    public function __toString()
    {
        $buffer = '';
        foreach ($this as $string => $count) {
            $buffer .= sprintf("%s (%d)\n", $string, $count);
        }
        return $buffer;
    }
}