In the following page:
http://www.amazon.com/Jessica-Simpson-Womens-Double-Breasted/dp/B00K65ZMCA/ref=sr_1_4_mc/185-0705108-6790969?s=apparel&ie=UTF8&qid=1413083859&sr=1-4
I am trying to get the price with the expression '//span[@id="priceblock_ourprice"]', but the result is an empty variable.
The interesting part is that on other Amazon pages, like this one:
http://www.amazon.com/SanDisk-Cruzer-Frustration-Free-Packaging--SDCZ36-032G-AFFP/dp/B007JR532M/ref=sr_1_1?s=pc&ie=UTF8&qid=1413084653&sr=1-1&keywords=usb
I do have an expression that works: '//b[@class="priceLarge"]'
But I don't even know why, because in the source of the page I can't find such a tag. So why does it work? And how do I get the price on the first page? Thanks!
When scraping with PHP you cannot take what you see in the browser source for granted, because the server may send your script different HTML than it sends your browser.
Instead, you first need to fetch the content with PHP and then look into the source there:
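A minimal sketch of that fetch step, assuming allow_url_fopen is enabled. The function name fetchHtml and the User-Agent value are made up for illustration; cURL would work just as well.

```php
<?php
// Hypothetical helper: return the raw document exactly as PHP receives it.
function fetchHtml(string $url): string
{
    // Some sites answer differently depending on the User-Agent header;
    // the value below is only a placeholder.
    $context = stream_context_create([
        'http' => [
            'method'  => 'GET',
            'header'  => "User-Agent: Mozilla/5.0 (compatible; example-scraper)\r\n",
            'timeout' => 10,
        ],
    ]);

    $buffer = file_get_contents($url, false, $context);
    if ($buffer === false) {
        throw new RuntimeException("Could not fetch $url");
    }

    return $buffer;
}
```

Calling $buffer = fetchHtml($url); and saving the result to a file lets you search the markup PHP actually received instead of what the browser shows.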
The variable $buffer then contains the HTML that you will be scraping.
Doing that with your example links shows that the pages at both the first and the second address have a .priceLarge element that probably contains what you're looking for.
After finding out where the data you're looking for is, you can create the DOMDocument:
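Creating the document from the fetched string could look roughly like this. The helper name loadHtmlDocument is an assumption, and the buffer is expected to be non-empty, since loadHTML() rejects an empty string.

```php
<?php
// Hypothetical helper: turn the fetched HTML string into a DOMDocument.
function loadHtmlDocument(string $buffer): DOMDocument
{
    // Collect parser messages internally instead of emitting PHP warnings;
    // real-world HTML is rarely clean.
    libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($buffer);

    return $doc;
}
```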
You might also be interested in parsing errors, as this is the way DOMDocument tells you where problems occurred, for example duplicate ID values.
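One way to look at those parsing errors is through the libxml error functions. The sketch below uses a hypothetical helper name, htmlParseErrors. Which messages you get, if any, depends on the markup and on the libxml version PHP was built with.

```php
<?php
// Hypothetical helper: load HTML and return libxml's messages as strings.
function htmlParseErrors(string $buffer): array
{
    // Buffer errors internally and remember the previous setting.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $doc = new DOMDocument();
    $doc->loadHTML($buffer);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        // Each entry is a LibXMLError with line, code and message properties.
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    // Leave the error buffer and the setting as we found them.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $messages;
}
```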
After loading the buffer into DOMDocument you can create the DOMXPath:
You will use it to obtain the actual values from the document.
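As a rough illustration, the snippet below creates the DOMXPath and runs one query. The sample markup is invented and only stands in for the fetched page.

```php
<?php
// Sample markup is made up for illustration; real pages will differ.
$buffer = '<html><body><div id="priceBlock"><b class="priceLarge">$19.99</b></div></body></html>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($buffer);

$xpath = new DOMXPath($doc);

// query() returns a DOMNodeList; evaluate() can return a string directly.
$nodes = $xpath->query('//b[@class="priceLarge"]');
$price = $xpath->evaluate('string(//b[@class="priceLarge"])');

echo $nodes->length, " match(es), first price: ", $price, "\n";
```

If query() returns an empty list, as in the question, the expression simply does not match the HTML that PHP received.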
For example, the HTML of those two example addresses has shown that the information you're looking for is in the #priceBlock element, which contains both .listprice and .priceLarge, so both can be queried relative to it.
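The lookup just described can be sketched as follows. The markup is invented and only mimics the structure named above (#priceBlock, .listprice, .priceLarge), so the printed prices are placeholders too.

```php
<?php
// Made-up markup that only mimics the structure described above.
$buffer = <<<'HTML'
<html><body>
<div id="priceBlock">
  <span class="listprice">$25.00</span>
  <b class="priceLarge">$19.99</b>
</div>
</body></html>
HTML;

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($buffer);
$xpath = new DOMXPath($doc);

// Obtain the parent element once ...
$priceBlock = $xpath->query('//*[@id="priceBlock"]')->item(0);

// ... then evaluate relative paths (note the leading dot) against it.
$listPrice = $xpath->evaluate('string(.//*[@class="listprice"])', $priceBlock);
$price     = $xpath->evaluate('string(.//*[@class="priceLarge"])', $priceBlock);

printf("List price: %s\nPrice: %s\n", $listPrice, $price);
```

Note that @class="..." only matches when the attribute is exactly that string; elements carrying several classes need contains() or a similar test.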
If you're missing something, obtaining a parent section element into a variable, as with $priceBlock in the example, not only allows you to use relative paths for XPath but can also help with debugging in case you're missing some of the more detailed information. Outputting that node, for example, gives you the whole <div> that contains all pricing information.
If you set up some helper classes for yourself, you can go further and use this to obtain other useful information from the document for scraping it, like showing all tag/class combinations within the price block:
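A minimal sketch of such a tag/class overview, written as a plain function. It is not the StringCollector class from the answer, and the name countTagClassCombos is made up.

```php
<?php
// Hypothetical helper, not the StringCollector class from the answer.
// Counts "tag.class" combinations below the given context node.
function countTagClassCombos(DOMNode $context): array
{
    $doc    = $context instanceof DOMDocument ? $context : $context->ownerDocument;
    $xpath  = new DOMXPath($doc);
    $counts = [];

    // Every descendant element that carries a class attribute.
    foreach ($xpath->query('.//*[@class]', $context) as $element) {
        $key = $element->nodeName . '.' . $element->getAttribute('class');
        $counts[$key] = ($counts[$key] ?? 0) + 1;
    }

    arsort($counts); // most frequent combination first

    return $counts;
}
```

Calling print_r(countTagClassCombos($priceBlock)); then lists each combination with its count.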
This then outputs the list of collected strings and their counts, which here are the tag names with their class attribute values:
As you can see, this is from the first example URL because .priceLarge is on a <b> element.
This is a relatively simple helper; for scraping you can do more, like displaying the whole HTML structure in the form of a tree.
It will give you the following output, which is easier to read than just DOMDocument::saveHTML($node):
You can find it referenced in an answer to Debug a DOMDocument Object in PHP and in another one. The code is available on GitHub as a gist.
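For a rough idea of what such a tree view involves, here is a small recursive sketch. It is not the code from the gist; the function name dumpElementTree and the output format are assumptions.

```php
<?php
// Hypothetical helper, not the code from the gist mentioned above.
// Returns one indented line per element, e.g. "div#priceBlock".
function dumpElementTree(DOMNode $node, int $depth = 0): string
{
    $out = '';

    if ($node instanceof DOMElement) {
        $label = $node->nodeName;
        if ($node->hasAttribute('id')) {
            $label .= '#' . $node->getAttribute('id');
        }
        if ($node->hasAttribute('class')) {
            $label .= '.' . str_replace(' ', '.', $node->getAttribute('class'));
        }
        $out .= str_repeat('  ', $depth) . $label . "\n";
        $depth++;
    }

    // Text and comment nodes have no children and are skipped silently.
    if ($node->hasChildNodes()) {
        foreach ($node->childNodes as $child) {
            $out .= dumpElementTree($child, $depth);
        }
    }

    return $out;
}
```

Used as echo dumpElementTree($priceBlock); it prints the element structure below that node without the text content.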
The StringCollector helper class