SimpleXML: parsing HTML leaves out nested elements

Posted 2019-07-22 06:43

Question:

I'm trying to parse a specific HTML document, a sort of dictionary, with about 10000 words and their descriptions. It went well until I noticed that entries in a specific format don't get parsed correctly.

Here is an example:

    <?php
    $html = '
        <p>
            <b>
                <span>zot; zotz </span>
            </b>
            <span>Nista; nula. Isto
                <b>zilch; zip.</b>
            </span>
        </p>
        ';

    $xml = simplexml_load_string($html);

    var_dump($xml);
    ?>

Result of var_dump() is:

    object(SimpleXMLElement)#1 (2) {
      ["b"]=>
      object(SimpleXMLElement)#2 (1) {
        ["span"]=>
        string(10) "zot; zotz "
      }
      ["span"]=>
      string(39) "Nista; nula. Isto

            "
    }

As you can see, SimpleXML kept the text node inside the outer <span> but left out its child <b> element and the text inside it.

I've also tried:

    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xml = simplexml_import_dom($doc);

with the same result.

Since this looked like a common problem when parsing HTML, I tried googling it, but the only place that acknowledges it is this blog post: https://hakre.wordpress.com/2013/07/09/simplexml-and-json-encode-in-php-part-i/ which does not offer a solution.

The posts and answers about parsing HTML on SO are just too general.

Is there a simple way of dealing with this? Or, should I change my strategy?

Answer 1:

Your observation is correct: SimpleXML only gives access to the child element nodes here, not to the individual child text nodes. The solution is to switch to the DOM (DOMDocument), which can access all child nodes, both text and element.

    // first span element
    $span = dom_import_simplexml($xml->span);

    foreach ($span->childNodes as $child) {
        printf(" - %s : %s\n", get_class($child), $child->nodeValue);
    }

This example uses dom_import_simplexml on the more specific <span> element node; the traversal is then done over the children of the corresponding DOMElement object.

The output:

     - DOMText : Nista; nula. Isto

     - DOMElement : zilch; zip.
     - DOMText : 

The first entry is the first text node within the <span> element. It is followed by the <b> element (which in turn contains some text) and then by another text node that consists of whitespace only.
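If the whitespace-only text nodes are not wanted, they can be skipped during the traversal. The following is a minimal sketch of that idea, not part of the original approach: the shortened input string, the $parts array and the use of trim() are assumptions made for illustration.

```php
<?php
// Shortened input (assumption): only the <span> with mixed content.
$xml = simplexml_load_string('<p><span>Nista; nula. Isto
        <b>zilch; zip.</b>
    </span></p>');

$span = dom_import_simplexml($xml->span);

// Collect [node name, trimmed text] pairs in document order.
$parts = [];
foreach ($span->childNodes as $child) {
    $text = trim($child->nodeValue);
    if ($child instanceof DOMText && $text === '') {
        continue; // skip text nodes that contain only whitespace
    }
    $parts[] = [$child->nodeName, $text];
}

print_r($parts);
```

Text nodes are reported with the node name #text, so the element children can still be told apart from the surrounding text.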

The dom_import_simplexml function is especially useful when SimpleXMLElement is too simple for more differentiated access to data within the XML document, as in the case you face here.

The example in full:

    $html = <<<HTML
    <p>
        <b>
            <span>zot; zotz </span>
        </b>
        <span>Nista; nula. Isto
            <b>zilch; zip.</b>
        </span>
    </p>
    HTML;

    $xml = simplexml_load_string($html);

    // first span element
    $span = dom_import_simplexml($xml->span);

    foreach ($span->childNodes as $child) {
        printf(" - %s : %s\n", get_class($child), $child->nodeValue);
    }
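
As a complementary sketch (not taken from the answer above): the nested <b> element should still be reachable through SimpleXML property access even though var_dump() does not list it, and the DOM textContent property returns the text of the whole subtree in one string. The compact single-line input and the variable names $nested, $markup and $text are assumptions made for illustration.

```php
<?php
// Same structure as the example document, written on one line (assumption).
$xml = simplexml_load_string(
    '<p><b><span>zot; zotz </span></b><span>Nista; nula. Isto <b>zilch; zip.</b></span></p>'
);

// The nested element is reachable by name through SimpleXML.
$nested = (string) $xml->span->b;

// asXML() serializes the element together with all of its children.
$markup = $xml->span->asXML();

// textContent concatenates the text of all descendant nodes.
$text = dom_import_simplexml($xml->span)->textContent;

echo $nested, "\n", $markup, "\n", $text, "\n";
```

This only helps when the order of text and element children does not matter; for an ordered walk over mixed content, the childNodes traversal shown above remains the way to go.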