Extracting HTML from an XML file using simpleXML

Posted 2019-02-27 15:47

Question:

I'm reading an xml file generated by a 3rd-party application that includes the following:

<Cell>
    <Comment ss:Author="Mark Baker">
        <ss:Data xmlns="http://www.w3.org/TR/REC-html40"><B><Font html:Face="Tahoma" html:Size="8" html:Color="#000000">Mark Baker:</Font></B><Font html:Face="Tahoma" html:Size="8" html:Color="#000000">&#10;Comment 1 - No align</Font></ss:Data>
    </Comment>
</Cell>

What I'm trying to do is access the raw data from the Cell->Comment->Data element either "as is" or as an actual block of (X)HTML markup (preferably the latter).

if (isset($cell->Comment)) {
    echo 'comment found<br />';
    $commentAttributes = $cell->Comment->attributes($namespaces['ss']);
    if (isset($commentAttributes->Author)) {
        echo 'Author: ',(string)$commentAttributes->Author,'<br />';
    }
    $commentData = $cell->Comment->children($namespaces['ss']);
    var_dump($commentData);
    echo '<br />';
}

gives me:

comment found
Author: Mark Baker
object(SimpleXMLElement)#130 (2) { ["@attributes"]=> array(1) { ["Author"]=> string(10) "Mark Baker" } ["Data"]=> object(SimpleXMLElement)#129 (0) { } } 

while

if (isset($cell->Comment)) {
    echo 'comment found<br />';
    $commentAttributes = $cell->Comment->attributes($namespaces['ss']);
    if (isset($commentAttributes->Author)) {
        echo 'Author: ',(string)$commentAttributes->Author,'<br />';
    }
    $commentData = $cell->Comment->Data->children();
    var_dump($commentData);
    echo '<br />';
}

gives me:

comment found
Author: Mark Baker
object(SimpleXMLElement)#129 (2) { ["B"]=> object(SimpleXMLElement)#118 (1) { ["Font"]=> string(11) "Mark Baker:" } ["Font"]=> string(21) " Comment 1 - No align" } 

Unfortunately, simpleXML seems to be treating the whole element as a series of XML nodes. I'm sure I should be able to get this as raw data without complex looping or feeding the element to a DOM parser, perhaps by using the xmlns="http://www.w3.org/TR/REC-html40" namespace to extract it cleanly, but I can't figure out how.

Any help appreciated.

A more complex example of the XML data:

<Cell>
    <Comment ss:Author="Mark Baker">
        <ss:Data xmlns="http://www.w3.org/TR/REC-html40">
            <B><Font html:Face="Tahoma" html:Size="8" html:Color="#000000">Mark Baker:</Font></B><Font html:Face="Tahoma" html:Size="8" html:Color="#000000">&#10;</Font><B><Font html:Face="Tahoma" x:Family="Swiss" html:Size="8" html:Color="#000000">Rich </Font><U><Font html:Face="Tahoma" x:Family="Swiss" html:Size="8" html:Color="#FF0000">Text </Font></U><Font html:Face="Tahoma" x:Family="Swiss" html:Size="8" html:Color="#000000">Comment</Font></B><Font html:Face="Tahoma" html:Size="8" html:Color="#000000"> Center Aligned</Font>
        </ss:Data>
    </Comment>
</Cell>

Answer 1:

If your implementation were to use DOM, I believe you could do the following:

// given $node is the <ss:Data> element as a DOMElement

$frag = $node->ownerDocument->createDocumentFragment();
foreach($node->childNodes as $child){
    $frag->appendChild($child->cloneNode(true));
}
$string = $node->ownerDocument->saveXML($frag);
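If the rest of the code is SimpleXML, the same idea can be reached through dom_import_simplexml(), which hands back the DOM node behind a SimpleXMLElement. The following is only a minimal sketch: the namespace URIs on the sample <Cell> and the innerXml() helper name are assumptions made for illustration, and the real file declares its own namespaces on its root element.

```php
<?php
// Sample data; the xmlns declarations on <Cell> are assumed for this sketch.
$xml = <<<'XML'
<Cell xmlns="urn:schemas-microsoft-com:office:spreadsheet"
      xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
      xmlns:html="http://www.w3.org/TR/REC-html40">
    <Comment ss:Author="Mark Baker">
        <ss:Data xmlns="http://www.w3.org/TR/REC-html40"><B><Font html:Size="8">Mark Baker:</Font></B><Font html:Size="8">&#10;Comment 1 - No align</Font></ss:Data>
    </Comment>
</Cell>
XML;

// Hypothetical helper: serialise every child node (elements and text) in order.
function innerXml(DOMNode $node): string
{
    $out = '';
    foreach ($node->childNodes as $child) {
        $out .= $node->ownerDocument->saveXML($child);
    }
    return $out;
}

$ssNs = 'urn:schemas-microsoft-com:office:spreadsheet'; // assumed URI
$cell = simplexml_load_string($xml);

// Select <ss:Data> by namespace, then switch to its DOM node.
$data = $cell->Comment->children($ssNs)->Data;
$markup = innerXml(dom_import_simplexml($data));

echo $markup, PHP_EOL;
```

Because text nodes are serialised along with the elements, the line break written as &#10; in the source comes out as a literal newline inside the second Font element.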


Answer 2:

If the HTML inside the <ss:Data> element is meant to be a string literal, it has to be wrapped in a CDATA section, as was already hinted in the comments:

$xml = <<< XML
<Cell>
    <Comment ss:Author="Mark Baker">
        <ss:Data xmlns="http://www.w3.org/TR/REC-html40">
            <![CDATA[
                <B><Font html:Face="Tahoma" … html:Color="#000000">
            ]]>
        </ss:Data>
    </Comment>
</Cell>
XML;
libxml_use_internal_errors(TRUE);
$cell = simplexml_load_string($xml);
echo $cell->Comment->Data;

If it's not in a CDATA section, it will be parsed as nodes. In that case you'd be looking for the innerXml of <ss:Data> to get its content as raw XML. Unfortunately, neither SimpleXML nor DOM has a native way to fetch that directly, so you'd have to use a userland implementation.

Userland implementations of innerXml usually take one of three approaches: they iterate over all the child nodes and concatenate their raw XML, they dump the entire tree and strip the root node with string replacement, or they create a fragment or import the nodes into another document.
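As an illustration of the "dump the tree and strip the root node" variant, here is a rough sketch. The helper name innerXmlByStripping() and the namespace URIs on the sample <Cell> are assumptions for the example. It relies on the serialiser escaping ">" inside attribute values, so the first ">" in the output closes the opening tag.

```php
<?php
// Sample data; the xmlns declarations on <Cell> are assumed for this sketch.
$xml = <<<'XML'
<Cell xmlns="urn:schemas-microsoft-com:office:spreadsheet"
      xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
      xmlns:html="http://www.w3.org/TR/REC-html40">
    <Comment ss:Author="Mark Baker">
        <ss:Data xmlns="http://www.w3.org/TR/REC-html40"><B><Font html:Size="8">Mark Baker:</Font></B><Font html:Size="8">&#10;Comment 1 - No align</Font></ss:Data>
    </Comment>
</Cell>
XML;

// Hypothetical helper: serialise the element, then cut off its own tags.
function innerXmlByStripping(SimpleXMLElement $element): string
{
    $xml = $element->asXML();
    // asXML() on the document root prepends an XML declaration; drop it.
    $xml = preg_replace('/^<\?xml[^>]*\?>\s*/', '', $xml);

    $start = strpos($xml, '>');   // end of the opening tag
    $end   = strrpos($xml, '</'); // start of the closing tag
    if ($start === false || $end === false || $end <= $start) {
        return '';                // empty or self-closing element
    }
    return substr($xml, $start + 1, $end - $start - 1);
}

$ssNs = 'urn:schemas-microsoft-com:office:spreadsheet'; // assumed URI
$cell = simplexml_load_string($xml);
$inner = innerXmlByStripping($cell->Comment->children($ssNs)->Data);

echo $inner, PHP_EOL;
```

Searching for the tag boundaries instead of hardcoding offsets keeps the helper independent of the element name and of the length of the namespace URI.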

I am not aware of any other way to do that, and I'm not sure whether it would be possible with XSLT. XMLReader does have a readInnerXml() method, though.
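A rough sketch of the XMLReader route, assuming the same kind of input: the namespace URIs on the sample <Cell> are made up for the example, and the loop simply stops at the first <ss:Data> it meets. Depending on the libxml version, the returned string may repeat namespace declarations on the inner elements, so treat the exact output as unspecified.

```php
<?php
// Sample data; the xmlns declarations on <Cell> are assumed for this sketch.
$xml = <<<'XML'
<Cell xmlns="urn:schemas-microsoft-com:office:spreadsheet"
      xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
      xmlns:html="http://www.w3.org/TR/REC-html40">
    <Comment ss:Author="Mark Baker">
        <ss:Data xmlns="http://www.w3.org/TR/REC-html40"><B><Font html:Size="8">Mark Baker:</Font></B><Font html:Size="8">&#10;Comment 1 - No align</Font></ss:Data>
    </Comment>
</Cell>
XML;

$ssNs = 'urn:schemas-microsoft-com:office:spreadsheet'; // assumed URI

$reader = new XMLReader();
$reader->XML($xml);

$inner = null;
while ($reader->read()) {
    // Stop at the first element named Data in the spreadsheet namespace.
    if ($reader->nodeType === XMLReader::ELEMENT
        && $reader->localName === 'Data'
        && $reader->namespaceURI === $ssNs
    ) {
        $inner = $reader->readInnerXml();
        break;
    }
}
$reader->close();

echo $inner, PHP_EOL;
```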



Answer 3:

I've gone with a quick and dirty solution for the time being. In the longer term, I'll switch to using XMLReader (for all the reasons mentioned)... I just don't have the time to rewrite all the existing simpleXML code at the moment.

I've gone with:

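$node = $cell->Comment->Data->asXML();
// Drop the 49-character opening tag <ss:Data xmlns="http://www.w3.org/TR/REC-html40">
// and the 10-character closing tag </ss:Data>
$comment = substr($node,49,-10);
// Remove the remaining HTML markup, leaving plain text
$comment = strip_tags($comment);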

While I'd prefer to keep the HTML markup, that will require additional work, so I'm simply stripping all the markup leaving me with the plain text (which is the critical element).
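For anyone who also only needs the plain text, one possible alternative to the fixed substr() offsets is to read textContent from the underlying DOM node. This is a sketch under assumptions: the namespace URIs on the sample <Cell> are invented for the example, and it is not what the code above does. Unlike strip_tags() on serialised XML, textContent also returns entities such as &amp; already decoded.

```php
<?php
// Sample data; the xmlns declarations on <Cell> are assumed for this sketch.
$xml = <<<'XML'
<Cell xmlns="urn:schemas-microsoft-com:office:spreadsheet"
      xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
      xmlns:html="http://www.w3.org/TR/REC-html40">
    <Comment ss:Author="Mark Baker">
        <ss:Data xmlns="http://www.w3.org/TR/REC-html40"><B><Font html:Size="8">Mark Baker:</Font></B><Font html:Size="8">&#10;Comment 1 - No align</Font></ss:Data>
    </Comment>
</Cell>
XML;

$ssNs = 'urn:schemas-microsoft-com:office:spreadsheet'; // assumed URI
$cell = simplexml_load_string($xml);
$data = $cell->Comment->children($ssNs)->Data;

// textContent concatenates the text of all descendant nodes, markup removed.
$comment = dom_import_simplexml($data)->textContent;

echo $comment, PHP_EOL;
```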

While this is a far from perfect solution, it does what I need it to do (for the moment), and I can move on to the next item in my "to do" list, having already added a new item of "rewrite using XMLReader" to that list.

Thanks for the help. I'll be sure to revisit this thread when I am doing that rewrite.



Answer 4:

So I know your question has come and gone, but I had the same issue and I had to figure out how I wanted to handle it as well. For future generations, here's how I got it.

If you're only accepting (x)HTML:

$data = str_replace('<?xml version="1.0"?>','',$xmlNode->asXML());

If you think someone's going to put in XML and you're OK with that, you'll only want to remove the first, automatically generated XML declaration:

$data = preg_replace('/^<\?xml version="1.0"\?\>\n/', '',$xmlNode->asXML());

So your code would look like this:

if (isset($cell->Comment)) {
    echo 'comment found<br />';
    $commentAttributes = $cell->Comment->attributes($namespaces['ss']);
    if (isset($commentAttributes->Author)) {
        echo 'Author: ',(string)$commentAttributes->Author,'<br />';
    }
    $commentData = str_replace('<?xml version="1.0"?>','',$cell->Comment->Data->asXML());
    echo $commentData;
    echo '<br />';
}