Question:

I am parsing XML in PHP with SimpleXML and have an XML like this:

<xml>
    <element>
        textpart1
            <subelement>subcontent1</subelement>
        textpart2
            <subelement>subcontent2</subelement>
        textpart3
    </element>
</xml>

When I do $xml->element it naturally gives me the whole element, as in all three textparts.

So if I parse this into an array (with a foreach for the children) I get:

0 => textpart1textpart2textpart3, 1 => subcontent1, 2 => subcontent2

I need a way to parse the <element> node so that each textpart that stops at, or begins after a subelement is treated as its own element.

As a result, I am looking for an ordered list that could be expressed as an array like this:

0 => textpart1, 1 => subcontent1, 2 => textpart2, 3 => subcontent2, 4 => textpart3

Is that possible without altering the XML file? Thanks in advance for any hints!

Answer 1:

As others have said, SimpleXML doesn't have any support for accessing individual text nodes as separate entities, so you will need to supplement it with some DOM methods. Thankfully, you can switch between the two at will using dom_import_simplexml and simplexml_import_dom.

The key pieces of DOM functionality you need are:

the DOMElement->childNodes member variable for accessing all nodes directly under a particular element as an iterable list
the DOMNode->nodeType variable for determining if a particular child is a text node or an element
the DOMNode->nodeValue variable to get the actual text

Given those, you can write a function which returns an array with a mixture of SimpleXML objects for child elements, and strings for child text nodes, something like this:

function get_child_elements_and_text_nodes($sx_element)
{
    $return = array();

    $dom_element = dom_import_simplexml($sx_element);
    foreach ( $dom_element->childNodes as $dom_child )
    {
        switch ( $dom_child->nodeType )
        {
            case XML_TEXT_NODE:
                $return[] = $dom_child->nodeValue;
            break;
            case XML_ELEMENT_NODE:
                $return[] = simplexml_import_dom($dom_child);
            break;
        }
    }

    return $return;
}

In your case, you need to recurse down the tree, which makes it a little confusing if you mix DOM and SimpleXML as you go, so you could instead write the recursion entirely in DOM and convert the SimpleXML object before running it:

function recursively_find_text_nodes($dom_element)
{
    $return = array();

    foreach ( $dom_element->childNodes as $dom_child )
    {
        switch ( $dom_child->nodeType )
        {
            case XML_TEXT_NODE:
                $return[] = $dom_child->nodeValue;
            break;
            case XML_ELEMENT_NODE:
                $return = array_merge($return, recursively_find_text_nodes($dom_child));
            break;
        }
    }

    return $return;
}

$text_nodes = recursively_find_text_nodes(dom_import_simplexml($simplexml->element));

Here's a live demo of that last function.
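
The recursive function returns each text node verbatim, including the indentation and line breaks around it. As a rough sketch of how to get the exact list the question asks for, the variant below trims each text node and skips the ones that are only whitespace. The name collect_trimmed_text and the trimming step are assumptions added for illustration; they are not part of the answer above.

```php
<?php
// Hypothetical helper: same recursion as recursively_find_text_nodes(),
// but each text node is trimmed and whitespace-only nodes are dropped.
function collect_trimmed_text(DOMNode $node): array
{
    $parts = [];

    foreach ($node->childNodes as $child) {
        if ($child->nodeType === XML_TEXT_NODE) {
            $text = trim($child->nodeValue);
            if ($text !== '') {
                $parts[] = $text;
            }
        } elseif ($child->nodeType === XML_ELEMENT_NODE) {
            // Descend into child elements so their text keeps its position.
            $parts = array_merge($parts, collect_trimmed_text($child));
        }
    }

    return $parts;
}

$source = <<<XML
<xml>
    <element>
        textpart1
            <subelement>subcontent1</subelement>
        textpart2
            <subelement>subcontent2</subelement>
        textpart3
    </element>
</xml>
XML;

$simplexml = simplexml_load_string($source);
$parts = collect_trimmed_text(dom_import_simplexml($simplexml->element));
print_r($parts);
```

With the XML from the question, $parts should hold textpart1, subcontent1, textpart2, subcontent2, textpart3 in that order.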

Answer 2:

The simple answer is no. SimpleXML does not implement any kind of support for text nodes.
In this case your best and preferred option is to use DOMDocument.
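
A minimal sketch of what the DOMDocument route could look like, assuming the XML from the question is available as a string. It uses DOMXPath with a normalize-space() predicate to skip whitespace-only text nodes; the variable names are illustrative only.

```php
<?php
$source = <<<XML
<xml>
    <element>
        textpart1
            <subelement>subcontent1</subelement>
        textpart2
            <subelement>subcontent2</subelement>
        textpart3
    </element>
</xml>
XML;

$doc = new DOMDocument();
$doc->loadXML($source);

$xpath   = new DOMXPath($doc);
$ordered = [];

// text()[normalize-space()] matches only text nodes that contain
// something other than whitespace; results come back in document order.
foreach ($xpath->query('/*/element//text()[normalize-space()]') as $textNode) {
    $ordered[] = trim($textNode->nodeValue);
}

print_r($ordered);
```

This avoids SimpleXML entirely, which may be simpler when the rest of the code does not depend on SimpleXMLElement objects.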

Answer 3:

You are actually looking for all text nodes that are descendants of the <element> element node. This can be expressed as the following XPath:

/*/element//text()

SimpleXML has an xpath method that executes this query without any errors, but the text nodes it matches are converted to their parent element nodes. This is because of how SimpleXML works and what it was designed for.
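
To see that behaviour for yourself, a small check along these lines should do (a sketch using the XML from the question; the variable names are illustrative only):

```php
<?php
$source = <<<XML
<xml>
    <element>
        textpart1
            <subelement>subcontent1</subelement>
        textpart2
            <subelement>subcontent2</subelement>
        textpart3
    </element>
</xml>
XML;

$xml     = simplexml_load_string($source);
$matches = $xml->xpath('/*/element//text()');

foreach ($matches as $match) {
    // Each match is a SimpleXMLElement for the parent element of a
    // matched text node, not the text node itself.
    echo get_class($match), ' <', $match->getName(), ">\n";
}
```

Every entry is a SimpleXMLElement, one per matched text node, so the same parent element shows up more than once and the individual text parts cannot be told apart.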

Compare with:

Which DOMNodes can be represented by SimpleXMLElement?
SimpleXML access seperated text nodes
Re: [PHP-DEV] SimpleXML->children() and text nodes

However, with some help of the sister-library DOMDocument which can represent text-nodes on their own, it is possible to get it to work:

<?php
/**
 * SimpleXML get Element Content between Child Elements
 * @link https://stackoverflow.com/q/20131226/367456
 */

$buffer = <<<BUFFER
<xml>
    <element>
        textpart1
            <subelement>subcontent1</subelement>
        textpart2
            <subelement>subcontent2</subelement>
        textpart3
    </element>
</xml>
BUFFER;

$xml = simplexml_load_string($buffer);

$xpath = new SimpleXMLXpath($xml);
$result = $xpath->query('/*/element//text()');
print_r($result);

The resulting output is:

Array
(
    [0] => 
        textpart1

    [1] => subcontent1
    [2] => 
        textpart2

    [3] => subcontent2
    [4] => 
        textpart3

)

This is possible because the SimpleXMLXpath class wraps DOMXPath internally and turns each result into a string when it is a text node:

/**
 * Class SimpleXMLXpath
 * 
 * @author hakre <http://hakre.wordpress.com/>
 */
class SimpleXMLXpath
{
    private $xml;

    public function __construct(SimpleXMLElement $xml)
    {
        $this->xml = $xml;
    }

    public function query($expression)
    {
        $context = dom_import_simplexml($this->xml);
        $xpath   = new DOMXPath($context->ownerDocument);
        $result  = [];

        foreach ($xpath->query($expression, $context) as $node) {
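            switch (true) {
                // Text nodes are returned as plain strings.
                case $node instanceof DOMText:
                    $result[] = $node->nodeValue;
                    break; // "continue" inside a switch acts like "break" and warns since PHP 7.3

                // Elements and attributes are handed back as SimpleXML objects.
                case $node instanceof DOMElement:
                case $node instanceof DOMAttr:
                    $result[] = simplexml_import_dom($node);
                    break;
            }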
        }

        return $result;
    }
}