Remove starting and ending spaces from XML element

Posted 2019-06-21 13:53

Question:

How can I remove all whitespace before and after the content of an XML field?

<data version="2.0">

  <field> 

     1 

  </field>        

  <field something=" some attribute here... "> 

     2  

  </field>

</data>

Notice the whitespace around 1, 2 and 'some attribute here...'. I want to remove it with PHP.

if(($xml = simplexml_load_file($file)) === false) die();

print_r($xml);

Also, the data doesn't appear to be a string; I need to put (string) before each variable. Why?

Answer 1:

You may want to use something like this:

// Read the raw XML, strip whitespace around tags, then parse the cleaned string.
$str = file_get_contents($file);
$str = preg_replace('~\s*(<([^>]*)>[^<]*</\2>|<[^>]*>)\s*~','$1',$str);
$xml = simplexml_load_string($str,'SimpleXMLElement', LIBXML_NOCDATA);

I haven't tried this, but you can find more on this at http://www.lonhosford.com/lonblog/2011/01/07/php-simplexml-load-xml-file-preserve-cdata-remove-whitespace-between-nodes-and-return-json/.

Note that the spaces between an element's opening and closing tags (<x> _space_ </x>) and inside attribute values (<x attr=" _space_ ">) are actually part of the XML document's data (in contrast with the spaces between adjacent tags, <x> _space_ <y>), so I would suggest making the source itself a bit less messy with spaces.
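
A minimal sketch of what this means in practice: because that whitespace is data, it survives parsing and can be trimmed per value when it is read, instead of rewriting the markup. The sample document and the variable names below are assumptions made for illustration.

```php
<?php
// Hypothetical sample input with padded element text and a padded attribute value.
$xml = simplexml_load_string(
    '<data version="2.0"><field> 1 </field><field something=" some attribute here... "> 2 </field></data>'
);

// Whitespace inside an element or attribute is data, so it survives parsing...
$rawText = (string) $xml->field[0];

// ...and has to be trimmed explicitly when the value is read.
$text      = trim((string) $xml->field[0]);
$attribute = trim((string) $xml->field[1]['something']);

echo $text, "\n", $attribute, "\n";
```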



Answer 2:

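simplexml_load_file() returns a SimpleXMLElement object rather than an array, so you would first have to convert the result to an array. Once you have one, you could do something like this: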

function TrimArray($input){

    if (!is_array($input))
        return trim($input);

    return array_map('TrimArray', $input);
}
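
One possible way to apply this, sketched under assumptions: a SimpleXMLElement is not an array, so the example first converts it with a json_encode()/json_decode() round trip (one common approach, not the only one). The helper name trimArrayValues and the sample XML are hypothetical.

```php
<?php
// Hypothetical helper: trims every string found in a nested array.
function trimArrayValues($input)
{
    if (!is_array($input)) {
        return is_string($input) ? trim($input) : $input;
    }

    return array_map('trimArrayValues', $input);
}

$xml = simplexml_load_string('<data><field> 1 </field><field> 2 </field></data>');

// Convert the SimpleXMLElement into plain nested arrays.
$array = json_decode(json_encode($xml), true);

$trimmed = trimArrayValues($array);
print_r($trimmed);
```

The round trip can lose information for more complex documents, so it is probably best treated as a shortcut for simple, data-only XML.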


Answer 3:

To do that in PHP, you first have to convert the document into a DOMDocument so that you can address the nodes whose whitespace you want to normalize via DOMXPath. The XPath support in SimpleXMLElement is too limited to access text nodes as precisely as this operation requires.

An XPath query that selects all text nodes within leaf elements, plus all attributes, is:

//*[not(*)]/text() | //@*
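
To illustrate what this query selects, here is a small sketch that only counts the matched nodes. It assumes a hypothetical sample document and loads it with DOMDocument directly; the counter names are made up for the example.

```php
<?php
// Hypothetical sample input, including whitespace-only text between elements.
$doc = new DOMDocument();
$doc->loadXML(
    "<data version=\"2.0\">\n  <field> 1 </field>\n  <field something=\" x \"> 2 </field>\n</data>"
);

$xpath = new DOMXPath($doc);

$textNodes  = 0;
$attributes = 0;
foreach ($xpath->query('//*[not(*)]/text() | //@*') as $node) {
    if ($node instanceof DOMAttr) {
        $attributes++;
    } elseif ($node instanceof DOMText) {
        $textNodes++;
    }
}

// Only the text inside the two leaf <field> elements is selected. The
// whitespace-only text directly under <data> is skipped, because <data>
// has child elements and so fails the not(*) test.
echo "text nodes: $textNodes, attributes: $attributes\n";
```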

Given that $xml is a SimpleXMLElement, you could normalize the whitespace as in the following example:

$doc   = dom_import_simplexml($xml)->ownerDocument;
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//*[not(*)]/text()|//@*') as $node) {
    /** @var $node DOMText|DOMAttr */
    $node->nodeValue = trim(preg_replace('~\s+~u', ' ', $node->nodeValue), ' ');
}

You could perhaps extend this to all text nodes (as suggested in related Q&A), but that might require normalizing the document first in some circumstances. As text() in XPath does not distinguish between text nodes and CDATA sections, you might want to skip nodes of that type (DOMCdataSection) or expand them into text nodes when loading the document (use the LIBXML_NOCDATA option for that) to achieve more useful results.
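
The difference between the two CDATA strategies can be sketched as follows. This is an illustration under assumptions: the one-element sample document and the variable names are hypothetical, and the document is loaded with DOMDocument directly.

```php
<?php
$source = '<field>2 <![CDATA[ 34 ]]></field>';

// Default loading keeps the CDATA section as its own node type.
$plain = new DOMDocument();
$plain->loadXML($source);
$keptAsCdata = $plain->documentElement->lastChild instanceof DOMCdataSection;

// With LIBXML_NOCDATA the CDATA content is loaded as ordinary text.
$expanded = new DOMDocument();
$expanded->loadXML($source, LIBXML_NOCDATA);
$cdataLeft = false;
foreach ($expanded->documentElement->childNodes as $child) {
    if ($child instanceof DOMCdataSection) {
        $cdataLeft = true;
    }
}

var_dump($keptAsCdata, $cdataLeft, $expanded->documentElement->textContent);
```

Note that DOMCdataSection extends DOMText, which is why a check against DOMCdataSection specifically is needed when such nodes should be skipped.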


Also, the data doesn't appear to be a string; I need to put (string) before each variable. Why?

Because it's an object of type SimpleXMLElement. If you want the string value of such an object (element), you need to cast it to string. See also the following reference question:

  • Forcing a SimpleXML Object to a string, regardless of context
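
A short sketch of the cast in question, assuming a hypothetical one-field document (the variable names are illustrative only):

```php
<?php
$xml = simplexml_load_string('<data><field> 1 </field></data>');

// Property access returns another SimpleXMLElement, not a string.
$field     = $xml->field;
$isElement = $field instanceof SimpleXMLElement;

// An explicit cast yields the element's text content as a string,
// which can then be passed to string functions such as trim().
$text    = (string) $field;
$trimmed = trim($text);

var_dump($isElement, $text, $trimmed);
```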

And last but not least: don't trust print_r or var_dump when you use them on a SimpleXMLElement, because their output does not show the object's real state. For example, you could override __toString(), which could also solve your issue:

class TrimXMLElement extends SimpleXMLElement
{
    public function __toString()
    {
        return trim(preg_replace('~\s+~u', ' ', parent::__toString()), ' ');
    }
}

$xml = simplexml_load_string($buffer, 'TrimXMLElement');

print_r($xml);

Even though the override takes effect wherever the element is cast to string (e.g. with echo), the output of print_r still does not reflect these changes. So it is better not to rely on print_r; it can never show the whole picture.
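
To make the effect of the override visible without print_r, a minimal sketch could cast a child element explicitly. The class name TrimmedElement and the sample input are hypothetical, chosen so the snippet stands on its own.

```php
<?php
// Hypothetical subclass that normalizes whitespace whenever it is cast to string.
class TrimmedElement extends SimpleXMLElement
{
    public function __toString(): string
    {
        return trim(preg_replace('~\s+~u', ' ', parent::__toString()), ' ');
    }
}

$xml = simplexml_load_string('<data><field>  1   2  </field></data>', 'TrimmedElement');

// Child elements are created with the same class, so the cast uses the override.
$value = (string) $xml->field;

var_dump($value);
```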


Full example code for this answer:

<?php
/**
 * Remove starting and ending spaces from XML elements
 *
 * @link https://stackoverflow.com/a/31793566/367456
 */

$buffer = <<<XML
<data version="2.0">

  <field>

     1

  </field>

  <field something=" some attribute here... ">

     2 <![CDATA[ 34 ]]>

  </field>

</data>
XML;

class TrimXMLElement extends SimpleXMLElement implements JsonSerializable
{
    public function __toString()
    {
        return trim(preg_replace('~\s+~u', ' ', parent::__toString()), ' ');
    }

    public function jsonSerialize(): array
    {
        $array = (array) $this;

        array_walk_recursive($array, function(&$value) {
            if (is_string($value)) {
                $value  = trim(preg_replace('~\s+~u', ' ', $value), ' ');
            }
        });

        return $array;
    }
}

$xml = simplexml_load_string($buffer, 'TrimXMLElement', LIBXML_NOCDATA);

print_r($xml);
echo json_encode($xml);

$xml = simplexml_load_string($buffer, null, LIBXML_NOCDATA);

$doc = dom_import_simplexml($xml)->ownerDocument;
$doc->normalizeDocument();
$doc->normalize();

$xpath = new DOMXPath($doc);
foreach ($xpath->query('//*[not(*)]/text()|//@*') as $node) {
    /** @var $node DOMText|DOMAttr|DOMCdataSection */
    if ($node instanceof DOMCdataSection) {
        continue;
    }
    $node->nodeValue = trim(preg_replace('~\s+~u', ' ', $node->nodeValue), ' ');
}

echo $xml->asXML();