I'm reading an XML file generated by a third-party application that includes the following:
<Cell>
<Comment ss:Author="Mark Baker">
<ss:Data xmlns="http://www.w3.org/TR/REC-html40"><B><Font html:Face="Tahoma" html:Size="8" html:Color="#000000">Mark Baker:</Font></B><Font html:Face="Tahoma" html:Size="8" html:Color="#000000"> Comment 1 - No align</Font></ss:Data>
</Comment>
</Cell>
What I'm trying to do is access the raw data from the Cell->Comment->Data element either "as is" or as an actual block of (X)HTML markup (preferably the latter).
if (isset($cell->Comment)) {
echo 'comment found<br />';
$commentAttributes = $cell->Comment->attributes($namespaces['ss']);
if (isset($commentAttributes->Author)) {
echo 'Author: ',(string)$commentAttributes->Author,'<br />';
}
$commentData = $cell->Comment->children($namespaces['ss']);
var_dump($commentData);
echo '<br />';
}
gives me:
comment found
Author: Mark Baker
object(SimpleXMLElement)#130 (2) { ["@attributes"]=> array(1) { ["Author"]=> string(10) "Mark Baker" } ["Data"]=> object(SimpleXMLElement)#129 (0) { } }
while
if (isset($cell->Comment)) {
echo 'comment found<br />';
$commentAttributes = $cell->Comment->attributes($namespaces['ss']);
if (isset($commentAttributes->Author)) {
echo 'Author: ',(string)$commentAttributes->Author,'<br />';
}
$commentData = $cell->Comment->Data->children();
var_dump($commentData);
echo '<br />';
}
gives me:
comment found
Author: Mark Baker
object(SimpleXMLElement)#129 (2) { ["B"]=> object(SimpleXMLElement)#118 (1) { ["Font"]=> string(11) "Mark Baker:" } ["Font"]=> string(21) " Comment 1 - No align" }
Unfortunately, SimpleXML seems to be treating the whole element as a series of XML nodes. I'm sure I should be able to get this as raw data without complex looping or feeding the element to a DOM parser, perhaps by using the xmlns="http://www.w3.org/TR/REC-html40" namespace to extract it cleanly, but I can't figure out how.
Any help appreciated.
A more complex example of the XML data:
<Cell>
<Comment ss:Author="Mark Baker">
<ss:Data xmlns="http://www.w3.org/TR/REC-html40">
<B><Font html:Face="Tahoma" html:Size="8" html:Color="#000000">Mark Baker:</Font></B><Font html:Face="Tahoma" html:Size="8" html:Color="#000000"> </Font><B><Font html:Face="Tahoma" x:Family="Swiss" html:Size="8" html:Color="#000000">Rich </Font><U><Font html:Face="Tahoma" x:Family="Swiss" html:Size="8" html:Color="#FF0000">Text </Font></U><Font html:Face="Tahoma" x:Family="Swiss" html:Size="8" html:Color="#000000">Comment</Font></B><Font html:Face="Tahoma" html:Size="8" html:Color="#000000"> Center Aligned</Font>
</ss:Data>
</Comment>
</Cell>
If your implementation were to use DOM, I believe you could do the following:
//given $node is <ss:data>
$frag = $node->ownerDocument->createDocumentFragment();
foreach($node->childNodes as $child){
$frag->appendChild($child->cloneNode(true));
}
$string = $node->ownerDocument->saveXML($frag);
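If the existing code is built on SimpleXML, `dom_import_simplexml()` can hand the element over to DOM so the fragment approach above can be applied without re-parsing. What follows is only a minimal sketch: the namespace URI bound to `ss:` and the declarations on the `<Cell>` wrapper are assumptions (the question does not show them), chosen to match what Excel 2003 XML files usually declare. The serialized children may come back carrying their own `xmlns` declarations.

```php
<?php
// Assumption: URI bound to the ss: prefix; it is not shown in the question.
$ss = 'urn:schemas-microsoft-com:office:spreadsheet';

// Sample input; the namespace declarations on <Cell> are assumed as well.
$xml = <<<'XML'
<Cell xmlns="urn:schemas-microsoft-com:office:spreadsheet"
      xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
      xmlns:html="http://www.w3.org/TR/REC-html40">
<Comment ss:Author="Mark Baker">
<ss:Data xmlns="http://www.w3.org/TR/REC-html40"><B><Font html:Face="Tahoma" html:Size="8" html:Color="#000000">Mark Baker:</Font></B><Font html:Face="Tahoma" html:Size="8" html:Color="#000000"> Comment 1 - No align</Font></ss:Data>
</Comment>
</Cell>
XML;

$cell = simplexml_load_string($xml);

// Select <Comment> and <ss:Data> explicitly by namespace URI.
$data = $cell->children($ss)->Comment->children($ss)->Data;

// Switch to the DOM view of the same node.
$node = dom_import_simplexml($data);

// Clone the children into a fragment and serialize only the fragment.
$frag = $node->ownerDocument->createDocumentFragment();
foreach ($node->childNodes as $child) {
    $frag->appendChild($child->cloneNode(true));
}
$innerXml = $node->ownerDocument->saveXML($frag);

echo $innerXml, PHP_EOL;             // markup of the comment
echo strip_tags($innerXml), PHP_EOL; // plain text of the comment
```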
If the HTML inside the <ss:Data> element is considered to be a string literal, it has to be wrapped in a CDATA section, as was already hinted in the comments:
$xml = <<< XML
<Cell>
<Comment ss:Author="Mark Baker">
<ss:Data xmlns="http://www.w3.org/TR/REC-html40">
<![CDATA[
<B><Font html:Face="Tahoma" … html:Color="#000000">
]]>
</ss:Data>
</Comment>
</Cell>
XML;
libxml_use_internal_errors(TRUE);
$cell = simplexml_load_string($xml);
echo $cell->Comment->Data;
If it's not in a CDATA section, it will be parsed as nodes. You'd then be looking for the innerXml of the <ss:Data> element to get that as raw XML. Unfortunately, neither SimpleXML nor DOM has a native way to fetch that directly, so you'd have to use a userland implementation.
Userland implementations of innerXml usually do one of three things: iterate over all the child nodes and concatenate their raw XML, dump the entire tree and string-replace the root node, or create a fragment or import the nodes into another document.
I am not aware of any other way to do that. I'm not sure whether this would be possible with XSLT. XMLReader has a readInnerXml() method, though.
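For the XMLReader route, a rough sketch could look like the following. As before, the `ss:` namespace URI and the declarations on the sample `<Cell>` element are assumptions, since the question does not show them; `$comments` is just an illustrative variable name. The returned string may include namespace declarations on the inner elements.

```php
<?php
// Assumption: URI bound to the ss: prefix; it is not shown in the question.
$ss = 'urn:schemas-microsoft-com:office:spreadsheet';

// Sample input; the namespace declarations on <Cell> are assumed as well.
$xml = <<<'XML'
<Cell xmlns="urn:schemas-microsoft-com:office:spreadsheet"
      xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
      xmlns:html="http://www.w3.org/TR/REC-html40">
<Comment ss:Author="Mark Baker">
<ss:Data xmlns="http://www.w3.org/TR/REC-html40"><B><Font html:Face="Tahoma" html:Size="8" html:Color="#000000">Mark Baker:</Font></B><Font html:Face="Tahoma" html:Size="8" html:Color="#000000"> Comment 1 - No align</Font></ss:Data>
</Comment>
</Cell>
XML;

$reader = new XMLReader();
$reader->XML($xml);

$comments = [];
while ($reader->read()) {
    // Stop on every opening <ss:Data> tag, matched by local name and namespace.
    if ($reader->nodeType === XMLReader::ELEMENT
        && $reader->localName === 'Data'
        && $reader->namespaceURI === $ss
    ) {
        // Everything between the start and end tag, markup included.
        $comments[] = $reader->readInnerXml();
    }
}
$reader->close();

foreach ($comments as $comment) {
    echo strip_tags($comment), PHP_EOL;
}
```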
I've gone with a quick and dirty solution for the time being. In the longer term, I'll switch to using XMLReader (for all the reasons mentioned)... I just don't have the time to rewrite all the existing simpleXML code at the moment.
I've gone with:
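$node = $cell->Comment->Data->asXML();
// Drop the opening <ss:Data xmlns="http://www.w3.org/TR/REC-html40"> tag (49 characters)
// and the closing </ss:Data> tag (10 characters).
$comment = substr($node, 49, -10);
$comment = strip_tags($comment);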
While I'd prefer to keep the HTML markup, that will require additional work, so I'm simply stripping all the markup leaving me with the plain text (which is the critical element).
While this is a far from perfect solution, it does what I need it to do (for the moment), and I can move on to the next item in my "to do" list, having already added a new item of "rewrite using XMLReader" to that list.
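The fixed offsets only hold as long as the opening tag is exactly that `<ss:Data xmlns="…">` string. A slightly less brittle variant of the same "dump and strip the root tag" idea, sketched here with a hypothetical helper named `stripOuterTag()` and a shortened, hard-coded string standing in for the result of `asXML()`, removes whatever the outermost tag pair happens to be:

```php
<?php
/**
 * Hypothetical helper: remove the outermost start and end tag
 * from a serialized XML element and return what is in between.
 */
function stripOuterTag(string $xml): string
{
    // First alternative: the start tag at the very beginning.
    // Second alternative: the end tag at the very end.
    $inner = preg_replace('/^\s*<[^>]+>|<\/[^>]+>\s*$/', '', $xml);

    // preg_replace() returns null on failure; fall back to the input.
    return $inner ?? $xml;
}

// Stand-in for $cell->Comment->Data->asXML(); attributes shortened for readability.
$serialized = '<ss:Data xmlns="http://www.w3.org/TR/REC-html40">'
    . '<B><Font html:Size="8">Mark Baker:</Font></B>'
    . '<Font html:Size="8"> Comment 1 - No align</Font>'
    . '</ss:Data>';

$markup = stripOuterTag($serialized);

echo $markup, PHP_EOL;             // inner markup, kept as is
echo strip_tags($markup), PHP_EOL; // plain text only
```

This is still string surgery rather than parsing, so it shares the general weakness of the quick and dirty approach; it just stops depending on the length of the opening tag.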
Thanks for the help. I'll be sure to revisit this thread when I am doing that rewrite.
So I know your question has come and gone, but I had the same issue and I had to figure out how I wanted to handle it as well. For future generations, here's how I got it.
If you're only accepting (x)HTML:
$data = str_replace('<?xml version="1.0"?>','',$xmlNode->asXML());
If you think someone's going to put in XML and you're OK with that, you'll want to strip only the leading, automatically generated XML declaration:
$data = preg_replace('/^<\?xml version="1.0"\?\>\n/', '',$xmlNode->asXML());
So your code would look like this:
if (isset($cell->Comment)) {
echo 'comment found<br />';
$commentAttributes = $cell->Comment->attributes($namespaces['ss']);
if (isset($commentAttributes->Author)) {
echo 'Author: ',(string)$commentAttributes->Author,'<br />';
}
$commentData = str_replace('<?xml version="1.0"?>','',$cell->Comment->Data->asXML());
echo $commentData;
echo '<br />';
}