I am parsing XML in PHP with SimpleXML and have an XML like this:
<xml>
<element>
textpart1
<subelement>subcontent1</subelement>
textpart2
<subelement>subcontent2</subelement>
textpart3
</element>
</xml>
When I do $xml->element it naturally gives me the whole element, as in all three textparts.
So if I parse this into an array (with a foreach over the children) I get:
0 => textpart1textpart2textpart3, 1 => subcontent1, 2 => subcontent2
I need a way to parse the <element> node so that each textpart that ends at, or begins after, a subelement is treated as its own element.
As a result I am looking for an ordered list that could be expressed as an array like this:
0 => textpart1, 1 => subcontent1, 2 => textpart2, 3 => subcontent2, 4 => textpart3
Is that possible without altering the XML file? Thanks in advance for any hints!
As others have said, SimpleXML doesn't have any support for accessing individual text nodes as separate entities, so you will need to supplement it with some DOM methods. Thankfully, you can switch between the two at will using dom_import_simplexml and simplexml_import_dom.
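As a minimal sketch of that round trip (the shortened sample XML and the variable names here are only illustrative):

```php
<?php
$sx = simplexml_load_string(
    '<xml><element>textpart1<subelement>subcontent1</subelement>textpart2</element></xml>'
);

// SimpleXML -> DOM: the DOM object refers to the same underlying node
$dom_element = dom_import_simplexml($sx->element);

// DOM exposes every child node, text nodes included
$child_count = $dom_element->childNodes->length;

// DOM -> SimpleXML: switch back whenever the simpler API is enough
$sx_again = simplexml_import_dom($dom_element);
$first_sub = (string) $sx_again->subelement;

echo $child_count, ' ', $first_sub, "\n";
```

Here the DOM view counts two text nodes plus one element, while the SimpleXML view is convenient for reading the subelement by name.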
The key pieces of DOM functionality you need are:
- the DOMElement->childNodes member variable for accessing all nodes directly under a particular element as an iterable list
- the DOMNode->nodeType variable for determining if a particular child is a text node or an element
- the DOMNode->nodeValue variable to get the actual text
Given those, you can write a function which returns an array with a mixture of SimpleXML objects for child elements, and strings for child text nodes, something like this:
function get_child_elements_and_text_nodes($sx_element)
{
    $return = array();
    $dom_element = dom_import_simplexml($sx_element);
    foreach ( $dom_element->childNodes as $dom_child )
    {
        switch ( $dom_child->nodeType )
        {
            case XML_TEXT_NODE:
                $return[] = $dom_child->nodeValue;
                break;
            case XML_ELEMENT_NODE:
                $return[] = simplexml_import_dom($dom_child);
                break;
        }
    }
    return $return;
}
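To get the flat, ordered list of strings the question asks for, the same idea can be reduced to strings only. The following is a sketch under assumptions, not part of the function above: get_trimmed_child_strings is a hypothetical name, and trimming and dropping whitespace-only text nodes is a choice made here because the sample XML puts each textpart on its own line.

```php
<?php
// Hypothetical helper: walks the direct children like the function above,
// but returns trimmed strings only and skips whitespace-only text nodes.
function get_trimmed_child_strings(SimpleXMLElement $sx_element): array
{
    $return = [];
    foreach (dom_import_simplexml($sx_element)->childNodes as $dom_child) {
        if ($dom_child->nodeType !== XML_TEXT_NODE
            && $dom_child->nodeType !== XML_ELEMENT_NODE) {
            continue; // ignore comments, processing instructions, etc.
        }
        // textContent is the text itself for a text node, and the
        // concatenated descendant text for an element
        $text = trim($dom_child->textContent);
        if ($text !== '') {
            $return[] = $text;
        }
    }
    return $return;
}

$xml = simplexml_load_string(
    "<xml>\n<element>\ntextpart1\n<subelement>subcontent1</subelement>\n"
    . "textpart2\n<subelement>subcontent2</subelement>\ntextpart3\n</element>\n</xml>"
);
$parts = get_trimmed_child_strings($xml->element);
print_r($parts);
```

For the XML in the question this yields textpart1, subcontent1, textpart2, subcontent2, textpart3, in document order.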
In your case, you need to recurse down the tree, which makes it a little confusing if you mix DOM and SimpleXML as you go, so you could instead write the recursion entirely in DOM and convert the SimpleXML object before running it:
function recursively_find_text_nodes($dom_element)
{
    $return = array();
    foreach ( $dom_element->childNodes as $dom_child )
    {
        switch ( $dom_child->nodeType )
        {
            case XML_TEXT_NODE:
                $return[] = $dom_child->nodeValue;
                break;
            case XML_ELEMENT_NODE:
                $return = array_merge($return, recursively_find_text_nodes($dom_child));
                break;
        }
    }
    return $return;
}
$text_nodes = recursively_find_text_nodes(dom_import_simplexml($simplexml->element));
Here's a live demo of that last function.
The simple answer is no. SimpleXML does not implement any kind of support for text nodes.
In this case your best and preferred option is to use DOMDocument.
You are actually looking for all text nodes that are descendants of the <element> element node. This can be expressed as the following xpath:
/*/element//text()
Even though SimpleXML has an xpath method that executes this query without any errors, the actual text nodes are converted to their parent element nodes. This is because of how SimpleXML works and what it was designed for.
Compare with:
- Which DOMNodes can be represented by SimpleXMLElement?
- SimpleXML access seperated text nodes
- Re: [PHP-DEV] SimpleXML->children() and text nodes
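The conversion of text nodes to their parent elements can be observed with a short sketch (the shortened sample XML and the variable names are only illustrative):

```php
<?php
$xml = simplexml_load_string(
    '<xml><element>textpart1<subelement>subcontent1</subelement>textpart2</element></xml>'
);

// The query matches three text nodes, but SimpleXML returns the
// element that contains each of them instead of the text node itself.
$names = [];
foreach ($xml->xpath('/*/element//text()') as $match) {
    $names[] = $match->getName();
}
print_r($names);
```

Each match comes back as a SimpleXMLElement for the containing element, so casting a match from <element> to string gives all of its text parts joined together rather than the single part that was matched.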
However, with the help of the sister library DOMDocument, which can represent text nodes on their own, it is possible to get it to work:
<?php
/**
* SimpleXML get Element Content between Child Elements
* @link https://stackoverflow.com/q/20131226/367456
*/
$buffer = <<<BUFFER
<xml>
<element>
textpart1
<subelement>subcontent1</subelement>
textpart2
<subelement>subcontent2</subelement>
textpart3
</element>
</xml>
BUFFER;
$xml = simplexml_load_string($buffer);
$xpath = new SimpleXMLXpath($xml);
$result = $xpath->query('/*/element//text()');
print_r($result);
The resulting output is:
Array
(
[0] =>
textpart1
[1] => subcontent1
[2] =>
textpart2
[3] => subcontent2
[4] =>
textpart3
)
This is possible because of the SimpleXMLXpath class, which wraps DOMXPath internally and stringifies the result in case it's a text node:
/**
 * Class SimpleXMLXpath
 *
 * @author hakre <http://hakre.wordpress.com/>
 */
class SimpleXMLXpath
{
    private $xml;

    public function __construct(SimpleXMLElement $xml)
    {
        $this->xml = $xml;
    }

    public function query($expression)
    {
        $context = dom_import_simplexml($this->xml);
        $xpath = new DOMXPath($context->ownerDocument);
        $result = [];
        foreach ($xpath->query($expression, $context) as $node) {
            // "break" rather than "continue": inside a switch, PHP treats
            // "continue" like "break" and warns about it since PHP 7.3
            switch (true) {
                case $node instanceof DOMText:
                    // text nodes are returned as plain strings
                    $result[] = $node->nodeValue;
                    break;
                case $node instanceof DOMElement:
                case $node instanceof DOMAttr:
                    // elements and attributes go back to SimpleXML
                    $result[] = simplexml_import_dom($node);
                    break;
            }
        }
        return $result;
    }
}
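If no SimpleXML objects are needed in the result, the same query can also be run with DOMDocument and DOMXPath directly. This is a sketch under assumptions rather than part of the class above: the [normalize-space()] predicate and the trim() call are additions made here to drop whitespace-only text nodes and surrounding line breaks.

```php
<?php
$doc = new DOMDocument();
$doc->loadXML(
    "<xml>\n<element>\ntextpart1\n<subelement>subcontent1</subelement>\n"
    . "textpart2\n<subelement>subcontent2</subelement>\ntextpart3\n</element>\n</xml>"
);

$xpath = new DOMXPath($doc);

// [normalize-space()] keeps only text nodes with non-whitespace content
$parts = [];
foreach ($xpath->query('/*/element//text()[normalize-space()]') as $text_node) {
    $parts[] = trim($text_node->nodeValue);
}
print_r($parts);
```

The nodes are returned in document order, so the list interleaves the text parts and the subelement contents as requested in the question.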