I'm trying to parse some HTML with DOM in PHP, but I'm having some problems. First, in case this changes the solution: the HTML that I have is not a full page, only part of one.
<!-- This is the HTML that I have --><a href='/games/'>
<div id='game'>
<img src='http://images.example.com/games.gif' width='300' height='137' border='0'>
<br><b> Game </b>
</div>
<div id='double'>
<img src='http://images.example.com/double.gif' width='300' height='27' border='0' alt='' title=''>
</div>
</a>
Now I'm trying to get only the div with the id double. I've tried the following code, but it doesn't seem to be working properly. What might I be doing wrong?
//The HTML has been loaded into the variable $html
$dom=new domDocument;
$dom->loadHTML($html);
$dom->preserveWhiteSpace = false;
$keepme = $dom->getElementById('double');
$contents = '<div style="text-align:center">'.$keepme.'</a></div>';
echo $contents;
See the manual page for DomDocument::getElementById for some additional information.
And since someone will mention doing it with a Regular Expression sooner or later, here is the pattern you could use:
/<div id='double'>(.*)<\/div>/simU
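As a rough sketch of how that pattern could be applied with preg_match: the $html variable below is an assumption standing in for the fragment from the question (shortened), and the approach only holds while the target div contains no nested div.

```php
<?php
// Assumption: $html holds the fragment from the question (attributes shortened).
$html = "<a href='/games/'>\n"
      . "<div id='game'>\n<img src='http://images.example.com/games.gif'>\n<br><b> Game </b>\n</div>\n"
      . "<div id='double'>\n<img src='http://images.example.com/double.gif'>\n</div>\n"
      . "</a>";

// s: dot also matches newlines, i: case-insensitive, m: multiline, U: ungreedy
$pattern = "/<div id='double'>(.*)<\/div>/simU";

$inner = null;
if (preg_match($pattern, $html, $matches) === 1) {
    // $matches[1] is whatever sits between the opening and closing div tags.
    $inner = trim($matches[1]);
}

echo $inner, "\n";
```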
In addition, you could just use regular string functions to extract the div part, e.g.
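A minimal sketch of the string-function approach, assuming $html holds the fragment from the question (shortened here) and that the opening tag is written exactly as <div id='double'> with no nested div inside it:

```php
<?php
// Assumption: $html holds the fragment from the question (attributes shortened).
$html = "<a href='/games/'>\n"
      . "<div id='game'>\n<img src='http://images.example.com/games.gif'>\n<br><b> Game </b>\n</div>\n"
      . "<div id='double'>\n<img src='http://images.example.com/double.gif'>\n</div>\n"
      . "</a>";

$open  = "<div id='double'>";
$close = "</div>";

$div = null;
$start = strpos($html, $open);
if ($start !== false) {
    // Look for the first closing tag after the opening tag.
    $end = strpos($html, $close, $start);
    if ($end !== false) {
        // Include the closing tag in the extracted substring.
        $div = substr($html, $start, $end + strlen($close) - $start);
    }
}

echo $div, "\n";
```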
While I agree that you should not use regular expressions or string functions for parsing HTML or XML in general, I find it absolutely okay to do so as long as your only concern is to get this single div from the fragment. Keep it simple.
The fragment is HTML, but to be parsed through DOM it should be XHTML. Every open tag must be closed.
In your case it means you should replace <br> with <br /> and <img ... > with <img ... />.
HTML Tidy should be capable of "correcting" broken and fragmented HTML documents, turning them into something that can be parsed with other tools:
http://devzone.zend.com/article/761
An XML document can only have one element at the root level. Probably, the HTML parser has a similar requirement. Try wrapping the content in a <body/> tag.
Seems it's something else. This page describes what may be the cause. I'd recommend that you use XPath to get the element.
I think DOMDocument::getElementById will not work in your case.
A solution that might work is using an XPath query to extract the element you are looking for.
First of all, let's load the HTML portion, like you first did:
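The loading step could look roughly like this. The $html value is an assumption standing in for the fragment from the question (shortened), and the libxml_use_internal_errors call is an extra precaution, not something the question used, to keep parser warnings about the partial markup from being printed.

```php
<?php
// Assumption: $html holds the fragment from the question (attributes shortened).
$html = "<a href='/games/'>\n"
      . "<div id='game'>\n<img src='http://images.example.com/games.gif'>\n<br><b> Game </b>\n</div>\n"
      . "<div id='double'>\n<img src='http://images.example.com/double.gif'>\n</div>\n"
      . "</a>";

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Serialize the whole document back to a string to check what was parsed.
$serialized = $dom->saveHTML();
var_dump($serialized);
```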
The var_dump is here only to prove that the HTML portion has been loaded successfully; judging from its output, it has.
Then, instantiate the DOMXPath class, and use it to query for the element you want to get. We now have the element you want ;-)
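The DOMXPath step described above could be sketched as follows. The $html value is again an assumed stand-in for the fragment from the question, and the variable names are only illustrative.

```php
<?php
// Assumption: $html holds the fragment from the question (attributes shortened).
$html = "<a href='/games/'>\n"
      . "<div id='game'>\n<img src='http://images.example.com/games.gif'>\n<br><b> Game </b>\n</div>\n"
      . "<div id='double'>\n<img src='http://images.example.com/double.gif'>\n</div>\n"
      . "</a>";

libxml_use_internal_errors(true); // keep parser warnings quiet
$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);

// Select every div whose id attribute equals "double".
$result = $xpath->query("//div[@id='double']");

// Take the first match, or null when nothing was found.
$div = ($result !== false && $result->length > 0) ? $result->item(0) : null;
```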
But, in order to inject its HTML content in another HTML segment, we must first get its HTML content.
I don't remember any "easy" way to do that, but something like this should do the trick:
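One possible way to collect the inner HTML is to serialize each child node of the div in turn. This is a sketch under assumptions: it relies on DOMDocument::saveHTML accepting a node argument (available since PHP 5.3.6), and $html stands in for the fragment from the question.

```php
<?php
// Assumption: $html holds the fragment from the question (attributes shortened).
$html = "<a href='/games/'>\n"
      . "<div id='game'>\n<img src='http://images.example.com/games.gif'>\n<br><b> Game </b>\n</div>\n"
      . "<div id='double'>\n<img src='http://images.example.com/double.gif'>\n</div>\n"
      . "</a>";

libxml_use_internal_errors(true); // keep parser warnings quiet
$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath  = new DOMXPath($dom);
$result = $xpath->query("//div[@id='double']");
$div    = ($result !== false && $result->length > 0) ? $result->item(0) : null;

// Concatenate the serialized form of every child of the div.
$inner = '';
if ($div !== null) {
    foreach ($div->childNodes as $child) {
        $inner .= $dom->saveHTML($child);
    }
}

// Inject the extracted markup into another HTML segment.
$contents = '<div style="text-align:center">' . $inner . '</div>';
echo $contents, "\n";
```

Note that the serializer normalizes the markup, so attribute quoting in $inner may differ from the original fragment.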
And... we have the HTML content of your double <div>.
Now, you just have to do whatever you want with it ;-)