I'm using SimpleXML to try to parse a large XML file with <!ENTITY declarations. Unfortunately, SimpleXML seems too eager to go ahead and expand those entities, and I'd rather it didn't, since the entity symbols are short, easily parseable, and theoretically won't change in newer versions of the file, while the expanded entities are English sentences that may change. Is there any way to tell SimpleXML to knock it off?
I've thought of "pre-parsing" the XML file to strip out the <!ENTITY bits before passing the file contents to the XML parser, but that feels hacky, and since it's a huge file, I'd rather do as little fiddling with it as possible.
(Pardon any mistaken terminology in the above; I haven't done this level of XML work in quite a while.)
It might seem so, but that's not the case (unless you pass the LIBXML_NOENT flag, which I guess you don't, although you don't show your code). SimpleXML does keep the entity reference; it can only hand it back to you through the ->asXML() method, not through its to-string conversion.
Let's walk through an example to demonstrate how it works. I've picked this simple entity from the DTD:
<!ENTITY n "noun (common) (futsuumeishi)">
So let's select the first <pos> element, as it contains an &n; entity:
$xml = simplexml_load_file($file);
$pos = $xml->entry->sense->pos;
The variable $pos is now the SimpleXMLElement of the <pos> element node. Let's output it to see what the parser does with the &n; entity:
echo "SimpleXML value (string): ", $pos , "\n"
, "SimpleXML value (XML) : ", $pos->asXML(), "\n";
Output is:
SimpleXML value (string): noun (common) (futsuumeishi)
SimpleXML value (XML) : <pos>&n;</pos>
As this example shows, the &n; is still there (<pos>&n;</pos>); it is only expanded the moment you access the string value (noun (common) (futsuumeishi)).
This, by the way, is totally OK: the XML specification leaves it up to the parser whether to expand those entities or not. Given what SimpleXML was designed for, expanding them when you read the string value is the expected behavior.
You can even control this behavior by specifying the LIBXML_NOENT option:
$xml = simplexml_load_file($file, NULL, LIBXML_NOENT);
This does what you assumed was happening all along: the entities are now expanded at parse time, and the XML output no longer contains the entity reference:
SimpleXML value (string): noun (common) (futsuumeishi)
SimpleXML value (XML) : <pos>noun (common) (futsuumeishi)</pos>
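To try both modes without the large file, here is a minimal sketch. It assumes a tiny inline document standing in for the real one (the $source string and its <entry>/<sense>/<pos> layout are made up for illustration) and uses simplexml_load_string instead of simplexml_load_file:

```php
<?php
// Hypothetical stand-in for the real file: one internal entity, one <pos> element.
$source = <<<'XML'
<?xml version="1.0"?>
<!DOCTYPE entry [
  <!ENTITY n "noun (common) (futsuumeishi)">
]>
<entry><sense><pos>&n;</pos></sense></entry>
XML;

// Default options: the entity reference is kept in the parsed tree.
$default = simplexml_load_string($source);

// LIBXML_NOENT: the parser substitutes entities while building the tree.
$expanded = simplexml_load_string($source, 'SimpleXMLElement', LIBXML_NOENT);

foreach (['default' => $default, 'LIBXML_NOENT' => $expanded] as $mode => $xml) {
    $pos = $xml->sense->pos;
    echo $mode, " (string): ", (string) $pos, "\n";
    echo $mode, " (XML)   : ", $pos->asXML(), "\n";
}
```

Only the asXML() lines should differ between the two modes; the string value is the expanded text either way.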
So how do you get what you're looking for? An XML parser in PHP that actually has a model for entities is DOMDocument. It is a sister library of SimpleXML; internally both share the same in-memory objects. Here is the output for that same element (more precisely, its only child node) in the two modes, without and with LIBXML_NOENT:
Mode 1:
DOMDocument Class : DOMEntityReference
DOMDocument value(XML) : &n;
DOMDocument ->nodeName : n
Mode 2 (LIBXML_NOENT):
DOMDocument Class : DOMText
DOMDocument value(XML) : noun (common) (futsuumeishi)
DOMDocument ->nodeName : #text
This is created by the following code, which should make clearer what is behind the output above:
$node = dom_import_simplexml($pos);
$doc = $node->ownerDocument;
$entity = $node->firstChild;
echo "DOMDocument Class : ", get_class($entity) , "\n"
, "DOMDocument value(XML) : ", $doc->saveXML($entity), "\n"
, "DOMDocument ->nodeName : ", $entity->nodeName , "\n";
As mentioned, DOM is a sister library: dom_import_simplexml turns $pos into a DOMElement, and its first (and only) child node is the entity reference in question.
Now this starts to make sense: since SimpleXML cannot represent an entity reference, it can only provide the expanded string value or the XML containing the entity.
Otherwise, how would you distinguish the string values of these two elements?
<pos>&n;</pos>
<pos><![CDATA[&n;]]></pos>
So what you ask for makes only limited sense. That doesn't mean it can't be done, though: you can trick SimpleXML into doing it by extending it. Let's say each element that contains only a single entity reference should return that reference; otherwise the standard SimpleXML stringification is used:
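/**
 * Class EntityPreserveXML
 *
 * Returns the raw entity reference (e.g. "&n;") as the string value of an
 * element whose only child is an entity reference.
 */
class EntityPreserveXML extends SimpleXMLElement
{
    /**
     * @return string
     */
    public function __toString()
    {
        // Look at the same node through the DOM, which models entity references.
        $dom = dom_import_simplexml($this);

        // Anything other than "exactly one entity reference child" falls
        // back to the standard SimpleXML string value.
        if (
            !$dom instanceof DOMElement
            || $dom->childNodes->length !== 1
            || !$dom->firstChild instanceof DOMEntityReference
        ) {
            return parent::__toString();
        }

        // Serialize just the entity reference node, which yields "&name;".
        return $dom->ownerDocument->saveXML($dom->firstChild);
    }
}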
Let's run that on our example from above:
require('EntityPreserveXML.php');
$xml = simplexml_load_file($file, 'EntityPreserveXML');
$pos = $xml->entry->sense->pos;
echo "SimpleXML value (string): ", $pos , "\n"
, "SimpleXML value (XML) : ", $pos->asXML(), "\n";
SimpleXML is now using the extended class, which then gives as expected:
SimpleXML value (string): &n;
SimpleXML value (XML) : <pos>&n;</pos>
The &n;, as it is the only child, is now preserved in the to-string conversion of the SimpleXMLElement. But just because this works doesn't mean you should use it: it blurs the boundary between the parsed text content of the XML and the XML markup as represented in the document model.
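To make that boundary problem concrete, here is a small sketch. It repeats the EntityPreserveXML class in compact form so the snippet runs on its own, and it assumes a made-up inline document ($source) with one <pos> holding the entity reference and one holding the literal text "&n;" in a CDATA section:

```php
<?php
// Compact copy of the EntityPreserveXML class, repeated so this sketch is self-contained.
class EntityPreserveXML extends SimpleXMLElement
{
    public function __toString(): string
    {
        $dom = dom_import_simplexml($this);
        if ($dom instanceof DOMElement
            && $dom->childNodes->length === 1
            && $dom->firstChild instanceof DOMEntityReference
        ) {
            return $dom->ownerDocument->saveXML($dom->firstChild);
        }
        return parent::__toString();
    }
}

// Hypothetical document: the first <pos> references the entity,
// the second contains the five literal characters "&n;" as CDATA.
$source = <<<'XML'
<?xml version="1.0"?>
<!DOCTYPE entry [
  <!ENTITY n "noun (common) (futsuumeishi)">
]>
<entry><pos>&n;</pos><pos><![CDATA[&n;]]></pos></entry>
XML;

$xml = simplexml_load_string($source, 'EntityPreserveXML');

$fromEntity = (string) $xml->pos[0]; // element with the entity reference
$fromCdata  = (string) $xml->pos[1]; // element with literal text

// If both strings are equal, the caller can no longer tell the two apart.
var_dump($fromEntity, $fromCdata, $fromEntity === $fromCdata);
```

With the extended class, both elements should stringify to the same characters, which is exactly the ambiguity described above.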
Probably you're just looking for DOMDocument? It's a much more detailed model, in which you can work with DOMEntityReference nodes directly, if there are any.
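As a rough sketch of that DOMDocument-only route: the helper name entityNames, the inline $source document, and the second entity v1 are all assumptions for illustration, not part of the original file. The idea is simply to collect the names of the entity references that are direct children of each <pos> element:

```php
<?php
// Hypothetical stand-in document with two entities (v1 is invented for the example).
$source = <<<'XML'
<?xml version="1.0"?>
<!DOCTYPE entry [
  <!ENTITY n "noun (common) (futsuumeishi)">
  <!ENTITY v1 "Ichidan verb">
]>
<entry><sense><pos>&n;</pos><pos>&v1;</pos></sense></entry>
XML;

/**
 * Return the names of all entity references that are direct children of $element.
 * (Illustrative helper, not a library function.)
 */
function entityNames(DOMElement $element): array
{
    $names = [];
    foreach ($element->childNodes as $child) {
        if ($child instanceof DOMEntityReference) {
            $names[] = $child->nodeName; // the entity name without "&" and ";"
        }
    }
    return $names;
}

$doc = new DOMDocument();
$doc->loadXML($source); // no LIBXML_NOENT, so entity references are kept as nodes

$result = [];
foreach ($doc->getElementsByTagName('pos') as $pos) {
    $result[] = entityNames($pos);
}

print_r($result);
```

This never touches the expanded English text, so the short entity symbols can be used directly as keys.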